Multimodal Colon Cancer Prognosis with Inferred Spatial Transcriptomics and Biologically-Informed LLM Loss
Arjun Chitla, Ali Usman, Saatvik Kesarwani & Navneet Prakash
Lay Summary:
Our project integrates many different types of data, more than a single AI/machine learning model can usually handle, along with workarounds for data that is expensive to obtain, in order to identify which patients are at higher risk of death. Additionally, our project uses an LLM to supervise the model's training, as if a doctor or pathologist were in the room guiding the model towards becoming smarter and correcting it when it makes mistakes.
Abstract:
Over 150,000 Americans are diagnosed with colorectal cancer (CRC) each year, making it the fourth most commonly diagnosed cancer [8]. More than 50,000 patients die annually from CRC, underscoring the need for better disease prognostication. Current prognosis relies on traditional patch-based methods to interpret whole-slide image data; because these methods use only a fraction of the thousands of patches derived from a whole-slide image, they fail to capture the complex biological and spatial heterogeneity of tumours. Furthermore, spatial transcriptomics data, which can help predict survival, is costly and inaccessible for widespread clinical use. The purpose of this study is to improve CRC prognosis by integrating multiple data sources, including whole-slide images (WSI), bulk RNA sequencing, DNA methylation, and inferred spatial transcriptomics, into a multimodal machine learning model. Using data from The Cancer Genome Atlas (TCGA), unimodal encoders for each modality were pretrained with a multilayer perceptron (MLP) survival prediction head. Frozen embeddings for each modality were extracted, passed through a gated attention layer, and concatenated to produce a final risk score. Additionally, large language models (LLMs) were incorporated to refine risk predictions through two approaches: (1) a reward-based learning system using LLM-generated accuracy scores, and (2) TextGrad’s TextLoss, which uses LLM backpropagation to output new risk scores [11]. LLM loss improved model accuracy: our best multimodal model achieved a validation C-index of 0.717 and a test C-index of 0.693, demonstrating that our pipeline is highly accurate and competitive given that inferred transcriptomics were used. Future work will focus on adding inferred modalities, cross-modal pretraining, and incorporating additional clinical features into our model, such as metastasis, lymph node, tumour progression, and cancer staging data, to improve the accuracy of our pipeline.
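For readers unfamiliar with the C-index reported above, the following is a minimal sketch of how a concordance index can be computed from survival times, event indicators, and predicted risk scores. It is an illustration only, not the project's evaluation code: the function name `concordance_index` and the toy data are hypothetical, and pairs with tied survival times are simply skipped, which is a simplification.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Fraction of comparable patient pairs whose predicted risks are ordered correctly.

    times  -- observed follow-up time for each patient
    events -- 1 if death was observed, 0 if the patient was censored
    risks  -- predicted risk score (higher means expected shorter survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the earlier
        # patient had an observed event (assumption: tied times are skipped).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # earlier death was given the higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example with four hypothetical patients (third patient is censored).
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.6, 0.1]
print(concordance_index(times, events, risks))  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the validation and test values of 0.717 and 0.693.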
Q&A:
Bios: Arjun Chitla, Ali Usman, Saatvik Kesarwani, Navneet Prakash
Program Track: Advanced Research
GitHub Username:
ARC-21 -Arjun Chitla
AliMUsman -Ali Usman
Saatvik538 -Saatvik Kesarwani
NavneetPra -Navneet Prakash
What was your favorite seminar? Why?
My favorite seminar was the one by Ms. Monica Dimambro. I remember being interested in how she combined machine learning with statistics, and how she was able to use it to create datasets for low-risk patients, which I thought was really cool because by doing this she was able to tackle a key problem: data unavailability. Also, upon hearing about her paper on risk-stratified suicide prediction, I thought it was great that she used more interpretable models, so that we could analyze the results and attribute them to specific indicators within the low-risk group. -Arjun Chitla
I loved the seminar by Monica and how she’s using natural language processing to characterise suicide risk for veterans. It continually inspires me to see how many ways AI has the potential to help people. -Ali Usman
My favorite seminar during this program was the one led by Monica Dimambro. I found it really interesting because she connected her work in suicide risk prediction with natural language processing (NLP). She showed how NLP can be used to analyze clinical notes and patient records. Since I am really interested in NLP and its applications to different fields, this seminar especially appealed to me. Her seminar stood out because it combined technical depth with meaningful clinical outcomes, which is the intersection I hope to work in. -Saatvik Kesarwani
I enjoyed the seminar with Dennis Hazelett, as I believe it was unique in comparison to the others and discussed a very important part of research that is often overlooked. The tips and skills shared during that seminar would have been great to know even earlier on, to make the writing and documentation process easier in the end. -Navneet Prakash
If you were to summarize your summer internship experience in one sentence, what would it be?
Lots of learning, hard work, banging my head, navigating HPC, using foundation models, learning new kinds of creative ML techniques, collaborating with those of differing skillsets, and again, lots of learning. -Arjun Chitla
Full of learning, chaotic, and definitely a summer to remember! -Ali Usman
My summer internship was a challenging yet rewarding experience in which I gained invaluable knowledge working on an AI-driven medical prognosis project and getting advice from experienced and talented researchers along the way. -Saatvik Kesarwani
This internship entailed a lot of cluelessness, hard work, learning, and discovery for me, and was ultimately a greatly educational and rewarding experience. -Navneet Prakash