Domain-Aware Self-Supervised Learning for Digital Pathology with Multi-Dataset Evaluation
Aneesh Chatrathi & Aniketh Malipeddi
Lay Summary:
We trained a computer model to recognize patterns in medical images without needing human labels, using thousands of pathology slides from different diseases. The model learned to keep each dataset’s information separate while also finding common ground, showing promise for building AI tools that can learn from diverse medical data more efficiently.
Abstract:
Self-supervised learning (SSL) has become an important paradigm for representation learning from unlabeled data, offering a clear advantage in digital pathology, where expert annotations are costly and often inconsistent. In this study, we applied a Vision Transformer (ViT-Small) trained with a DINOv2-style self-distillation framework across three distinct digital pathology datasets: thyroid cell images, bile duct tissue patches, and cervical cancer patches. We enforced balanced sampling, with each dataset contributing 30,000 samples per epoch, to mitigate scale differences among the datasets. Training incorporated multi-crop augmentations, mixed-precision computation, and caching strategies optimized for high-performance execution. The resulting embeddings were assessed through clustering analyses, linear probes, retrieval evaluation, and low-dimensional visualization. Across datasets, the learned representations exhibited strong domain separation: linear probes achieved perfect accuracy within each dataset, while clustering metrics confirmed well-defined structure (Adjusted Rand Index = 1.0, Normalized Mutual Information = 1.0, silhouette ≈ 0.99). A UMAP projection revealed three compact, non-overlapping clusters aligned with dataset origin. Interestingly, centroid cosine similarity exceeded 0.99, indicating that although clusters were distinct, they were also closely aligned in a shared representational space. This pattern suggests that the model captured both dataset-specific boundaries and cross-domain consistency, balancing specialization with generalization. These results show that SSL with a transformer backbone can produce robust and interpretable representations across heterogeneous digital pathology datasets.
The framework provides a reliable baseline for future work in representation learning, particularly in scenarios where multiple datasets must be integrated to support downstream tasks such as classification, clustering, or biomarker discovery.
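For concreteness, the evaluation metrics named above (Adjusted Rand Index, Normalized Mutual Information, silhouette, and centroid cosine similarity) can be sketched with scikit-learn on synthetic embeddings. This is a minimal illustration, not the actual pipeline: the blob geometry, random seed, and variable names are assumptions chosen so that the clusters are distinct yet their centroids point in nearly the same direction, mimicking the pattern reported in the abstract.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the learned embeddings: three well-separated blobs,
# one per dataset (thyroid, bile duct, cervical). A large shared offset
# keeps the per-dataset centroids nearly parallel even though the
# clusters themselves do not overlap. Values are illustrative only.
rng = np.random.default_rng(0)
centers = np.eye(3) * 5 + 50
origin = np.repeat([0, 1, 2], 100)          # dataset-of-origin labels
emb = centers[origin] + rng.normal(scale=0.1, size=(300, 3))

# Unsupervised clustering compared against dataset origin.
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
ari = adjusted_rand_score(origin, pred)
nmi = normalized_mutual_info_score(origin, pred)
sil = silhouette_score(emb, origin)

# Centroid cosine similarity: mean off-diagonal similarity between
# the per-dataset centroids of the embedding space.
cents = np.stack([emb[origin == k].mean(axis=0) for k in range(3)])
cos = cosine_similarity(cents)
centroid_sim = cos[~np.eye(3, dtype=bool)].mean()

print(f"ARI={ari:.2f} NMI={nmi:.2f} silhouette={sil:.2f} "
      f"centroid cosine={centroid_sim:.3f}")
```

On this toy geometry the clustering scores are perfect while the centroid cosine similarity stays above 0.99, which is exactly the combination of separation and alignment the abstract describes.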
Q&A:
Bios: Aneesh Chatrathi, Aniketh Malipeddi
Program Track: Advanced Research
GitHub Username:
AneeshChat - Aneesh Chatrathi
animalipeddi - Aniketh Malipeddi
What was your favorite seminar? Why?
My favorite seminar was Zarif Azher’s talk. I really enjoyed hearing how he built a “Fitbit for cattle” and used AI in agriculture. It was inspiring to see how he connected research with real products, and learning that he just got into Y Combinator made it even more exciting. -Aneesh Chatrathi
Dr. Cornell’s. I found that examining the environmental risk factors of ALS was a unique angle for tackling the causes of the disease. -Aniketh Malipeddi
If you were to summarize your summer internship experience in one sentence, what would it be?
This summer, I explored how self-supervised learning with a transformer backbone and a domain-aware token can capture both dataset-specific structure and shared alignment across digital pathology images. -Aneesh Chatrathi
Over the summer, I built and trained advanced machine learning models on massive cytology datasets at Dartmouth’s EDIT program, developing foundation-level tools for disease classification and gaining hands-on experience in both high-performance computing and biomedical AI research. -Aniketh Malipeddi