Varun Kalidindi & Parry Nall

Breast Cancer Prognosis: Leveraging Cancer Registry Data for Survival Prediction



Lay Summary:

Breast cancer is one of the leading causes of death for women, and doctors often know whether a patient is likely to survive but not how long. By analyzing clinical data, we developed computer models that can better predict survival timelines, helping doctors make timely and personalized treatment decisions.

Abstract:

Breast cancer is the second leading cause of cancer-related deaths among women worldwide. While prior research has mainly focused on classifying patient survival status, the ability to estimate how long patients will survive remains limited, representing a missed opportunity for timely and potentially life-saving interventions. Our project aims to address this gap by constructing an accurate time-to-event prognostication model to guide treatment selection. Using a SEER-derived public clinical registry, we developed and evaluated machine learning survival models. Routinely collected clinical variables including TNM staging, receptor status, regional lymph node findings, tumor grade and size, patient age, and follow-up duration were preprocessed and encoded as model features. We trained and compared two modeling approaches: an ensemble survival learner and a gradient-boosted time-to-event model optimized for survival prediction. Model discrimination was assessed across short-term, medium-term, and long-term horizons using established survival metrics and time-dependent receiver operating characteristic analysis. Interpretability analysis using SHAP showed that measures related to lymph node involvement and regional disease extent had the greatest influence on predictions, while tumor size and local tumor stage contributed less. The gradient-boosted model outperformed the ensemble learner, especially for shorter-term survival predictions. The weaker performance at longer horizons likely reflects limitations of registry data such as inconsistent follow-up, variable reporting, and limited sample diversity. These results suggest that AI-based time-to-event prognostic models can improve access to validated survival risk assessments in clinical settings, supporting more timely and personalized treatment decisions and enhancing healthcare delivery efficiency.



Q&A:


Bios: Varun Kalidindi, Parry Nall

Program Track: Skills Development

GitHub Username:

varunkal -Varun Kalidindi

Parrouge -Parry Nall

What was your favorite seminar? Why?

My favorite seminar was Zarif Asifer’s presentation, where he talked about multimodal modeling, his research experience, and his entrepreneurial journey. By the end of the seminar, I had learned how closely related pathology and entrepreneurship can be, which changed my perspective on what is achievable with AI and medicine. -Varun Kalidindi

My favorite seminar was Zarif Asifer’s talk on multimodal modeling and his perspectives on research and entrepreneurship, because he shared relevant and useful insights in a way that a layperson could understand. -Parry Nall

If you were to summarize your summer internship experience in one sentence, what would it be?

During my summer internship, I developed and tested a machine learning model on relevant clinical data to more accurately predict breast cancer prognosis, and along the way, my perspective changed on what is achievable at the intersection of AI and medicine, two very relevant fields in our world. -Varun Kalidindi

During my Dartmouth internship, I developed and evaluated machine learning models on high dimensional clinical datasets, creating valuable insights from data that provided new ideas on the intersection between AI and medicine. -Parry Nall

Blog Post


Scientific premise:

Our project aims to develop a robust prognostic model that uses machine learning algorithms to predict time to event, measured in survival months, for patients with breast cancer.

Aims/goals:

Predict patient-specific survival times for breast cancer using clinical data.

Develop machine learning models (e.g., XGBoost with a survival objective, Random Survival Forest) that estimate how long a patient is likely to survive based on clinical features.

Help clinicians identify high-risk patients earlier.

Use survival probabilities to support earlier interventions and more targeted care planning for patients with poor prognoses.

Improve explainability of AI predictions in a clinical setting.

Apply tools like survSHAP to interpret model outputs and highlight which features most influence survival.

Ensure the model generalizes across different patient populations.

Validate model performance on multiple subsets of the data to check consistency and reliability.

Our approach: XGBoost & Random Survival Forest

Data Collection and Preprocessing:

We used data from a SEER-derived public clinical registry, which included 4,024 patient records for all analyses. Key variables such as TNM staging, receptor status, regional lymph node findings, tumor grade, tumor size, age, and survival time were processed for analysis.

Categorical features were one-hot encoded where appropriate so that the models could use them as numeric inputs.
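
As a minimal sketch of what one-hot encoding does, the plain-Python example below replaces each categorical column with one 0/1 indicator column per category. The field names (grade, estrogen_status) and the three patient records are hypothetical and are not taken from the registry.

```python
def one_hot_encode(records, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    # Sort the categories so the new columns appear in a stable order.
    categories = sorted({row[column] for row in records})
    encoded = []
    for row in records:
        # Copy every field except the one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Hypothetical patient records; field names and values are illustrative only.
patients = [
    {"age": 52, "grade": "II", "estrogen_status": "Positive"},
    {"age": 61, "grade": "III", "estrogen_status": "Negative"},
    {"age": 47, "grade": "II", "estrogen_status": "Positive"},
]

encoded = patients
for col in ("grade", "estrogen_status"):
    encoded = one_hot_encode(encoded, col)

print(encoded[0])
```

Numeric fields such as age pass through unchanged, while each category becomes its own column that tree-based models can split on.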

Model Development:

To train and evaluate our models, we randomly split the dataset into 80% for training and 20% for testing, while making sure the proportion of patients who experienced the event stayed consistent in both groups.
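
The stratified split described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the exact code used in the project: the function name stratified_split, the seed, and the toy event indicators are all hypothetical.

```python
import random


def stratified_split(events, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with a similar event rate in both sets."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    # Split each stratum (event vs. censored) separately, then combine.
    for label in sorted(set(events)):
        stratum = [i for i, e in enumerate(events) if e == label]
        rng.shuffle(stratum)
        n_test = round(len(stratum) * test_fraction)
        test_idx.extend(stratum[:n_test])
        train_idx.extend(stratum[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical event indicators: 1 = died during follow-up, 0 = censored.
events = [1] * 15 + [0] * 85
train_idx, test_idx = stratified_split(events)

train_rate = sum(events[i] for i in train_idx) / len(train_idx)
test_rate = sum(events[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

In practice the same effect is usually obtained with a library helper, for example scikit-learn's train_test_split with its stratify argument set to the event indicator.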

We compared two different approaches to predicting survival outcomes:

● Random Survival Forest – an ensemble of decision trees built from bootstrapped samples with random feature selection. This method is well suited to uncovering complex, non-linear patterns and interactions in the data.

● XGBoost with Accelerated Failure Time (AFT) loss – a gradient boosting technique that models the logarithm of survival time directly. It can flexibly capture non-linear relationships between patient characteristics and survival outcomes.
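
To make the AFT idea concrete, here is a minimal sketch of the per-patient loss under one common assumption: ln(T) = prediction + sigma * Z, with Z following a standard normal distribution. The choice of distribution, the sigma value, and the example patients are assumptions for illustration; the write-up does not state which settings were used.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def aft_nll(time, event, predicted_log_time, sigma=1.0):
    """Negative log-likelihood of one patient under a normal AFT model.

    Model assumption: ln(T) = predicted_log_time + sigma * Z, with Z ~ N(0, 1).
    """
    z = (math.log(time) - predicted_log_time) / sigma
    if event:
        # Death observed at `time`: use the density of T at that time.
        return -math.log(normal_pdf(z) / (sigma * time))
    # Censored at `time`: we only know the patient survived beyond it.
    return -math.log(1.0 - normal_cdf(z))


# Hypothetical patient who died at 24 months.
good_guess = aft_nll(24.0, True, predicted_log_time=math.log(24.0))
bad_guess = aft_nll(24.0, True, predicted_log_time=math.log(96.0))

# Hypothetical patient still alive when follow-up ended at 60 months.
censored_high = aft_nll(60.0, False, predicted_log_time=math.log(120.0))
censored_low = aft_nll(60.0, False, predicted_log_time=math.log(12.0))

print(good_guess < bad_guess, censored_high < censored_low)
```

The loss is lower when the predicted survival time is close to an observed death, and for censored patients it is lower when the model predicts survival beyond the last follow-up. In XGBoost, this family of losses is selected with the survival:aft objective, with the error distribution and scale set through the aft_loss_distribution and aft_loss_distribution_scale parameters.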

Evaluation and Interpretability:

Evaluation
We measured model performance using two key metrics: the concordance index (C-index) and the time-dependent area under the ROC curve (AUC). These were calculated for short-, medium-, and long-term time horizons to see how well the models distinguished between patients with different outcomes over time.
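
Both metrics can be sketched in a few lines of plain Python. The C-index below follows Harrell's pairwise definition. The horizon-specific AUC is a simplified version that drops patients censored before the horizon, whereas established estimators reweight for censoring instead. The six patients and their risk scores are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: fraction of comparable pairs ranked correctly."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i had an observed event
            # strictly before patient j's last known time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable


def auc_at_horizon(times, events, risk_scores, horizon):
    """Simplified time-dependent AUC at `horizon` (no censoring weights)."""
    # Cases had the event by the horizon; controls were followed beyond it.
    cases = [r for t, e, r in zip(times, events, risk_scores) if e and t <= horizon]
    controls = [r for t, r in zip(times, risk_scores) if t > horizon]
    wins = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical follow-up times (months), event flags, and model risk scores.
times = [10, 20, 30, 40, 50, 60]
events = [1, 1, 0, 1, 0, 1]
risk = [0.9, 0.4, 0.7, 0.5, 0.2, 0.3]

c_index = concordance_index(times, events, risk)
auc_25 = auc_at_horizon(times, events, risk, horizon=25)
print(round(c_index, 3), auc_25)
```

For both metrics, 0.5 corresponds to random ranking and 1.0 to perfect ranking. Libraries such as scikit-survival provide censoring-aware versions (concordance_index_censored and cumulative_dynamic_auc in sksurv.metrics).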

Interpretability
To understand why the models made certain predictions, we used SHAP (SHapley Additive exPlanations) analysis. This helped us identify which features were most influential overall, and also allowed us to create patient-level plots showing how each predictor affected an individual’s survival prediction, both in direction (positive or negative impact) and in strength.
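
The idea behind Shapley-based explanations can be illustrated with a brute-force calculation on a toy model. This is not the SHAP library or the project's actual model: the function toy_model, its coefficients, the feature names, and the baseline patient are all made up for illustration, and enumerating every feature subset is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, patient, baseline):
    """Exact Shapley values for one prediction by enumerating feature subsets.

    Features outside a subset are set to their baseline value.
    """
    features = list(patient)
    n = len(features)

    def value(subset):
        mixed = {f: (patient[f] if f in subset else baseline[f]) for f in features}
        return model(mixed)

    contributions = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_f = value(set(subset) | {f})
                without_f = value(set(subset))
                total += weight * (with_f - without_f)
        contributions[f] = total
    return contributions


def toy_model(x):
    """Hypothetical predicted survival in months (coefficients are made up)."""
    nodes, size = x["positive_nodes"], x["tumor_size_mm"]
    return 120.0 - 4.0 * nodes - 0.5 * size - 0.1 * nodes * size


patient = {"positive_nodes": 5, "tumor_size_mm": 30}
baseline = {"positive_nodes": 1, "tumor_size_mm": 20}

phi = shapley_values(toy_model, patient, baseline)
print(phi)
```

Each value gives the direction and strength of one feature's contribution, and the values sum to the difference between this patient's prediction and the baseline prediction. That additive property is what patient-level SHAP plots visualize.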

Results:

The gradient-boosted AFT model performed better than the Random Survival Forest, particularly for short-term survival prediction, achieving a C-index of 0.7500 compared to 0.5769 for the Random Survival Forest. At the patient level, the model showed excellent discrimination for 2-year outcomes (AUC 0.978), below-chance performance at 5 years (AUC 0.427, under the 0.5 expected from random ranking), and moderate performance at 8 years (AUC 0.725).

SHAP analysis revealed that lymph node involvement and the extent of regional disease were the most influential predictors of survival. Tumor size and local stage also played a role, but with less overall impact.

Deliverables:

A breast cancer prognosis model capable of predicting time to event, providing both overall survival estimates and individualized risk insights.

Conclusion:

Overall, AI-based time-to-event models can deliver interpretable, individualized survival predictions to guide earlier interventions and informed treatment decisions. After starting with a Random Survival Forest, we moved on to XGBoost with an Accelerated Failure Time (AFT) loss to estimate how long a patient with breast cancer might survive, and found that it performed better.

More specifically, the model showed strong short-term prognostic accuracy, making it a promising tool for effective triage and patient prioritization. However, its performance varied across time horizons, highlighting the need for further model refinement.

While our results are encouraging, they were based on a limited dataset. With a larger and more diverse set of patient records (potentially including imaging, genetic, and lifestyle data) we believe machine learning could provide even stronger and more reliable survival predictions.

Looking ahead, we now understand how valuable machine learning can be in the medical field. Tasks that were tedious with traditional clinical methods can now be supported by machine learning, which analyzes vast amounts of data, identifies patterns, and predicts outcomes, enabling faster diagnoses, personalized treatment plans, and improved patient care around the world.