Lucas Bichescu

Machine Learning-Based Classification of Cancer Types and Identification of Gene Biomarkers from RNA-Seq Data




Lay Summary:

Analyzing cancer cells through their cellular blueprint can uncover aspects hidden from imaging. Combining advanced plots with gene importance analyses makes it possible to reliably obtain genetic markers for cancers.

Abstract:

Cancer is a highly complex and heterogeneous disease. Conventional cancer classification using histopathology relies on visual inspection of tissue, often overlooking subtle molecular differences. This can be a detriment when predicting patient outcomes and guiding personalized treatments. In contrast, gene expression profiling captures thousands of genes simultaneously, offering a more quantitative and powerful way to uncover molecular complexities and distinguish cancer subtypes more precisely. With advanced data analysis approaches, the subtle molecular signatures that define distinct cancer subtypes can be identified and used to accurately classify new samples. In this study, I perform an integrative analysis of RNA sequencing data from five cancer types: BRCA, KIRC, LUAD, COAD, and PRAD. I employ multiple standard machine learning models to predict cancer types, demonstrating the predictive power of transcriptomic data. In my findings, Logistic Regression and Support Vector Classifier perform the best, with over 99% accuracy on the given dataset. By integrating statistical analyses such as ANOVA and Recursive Feature Elimination to detect genes with high discriminative potential, together with visualization tools including heatmaps and volcano plots to highlight expression patterns, I aim to provide a comprehensive view of the molecular landscape. This project illustrates how predictive machine learning frameworks can translate complex RNA-seq profiles into accurate classification of cancer subtypes, offering actionable insights for precision oncology.



Q&A:


Bios: Lucas Bichescu

Program Track: Skills Development

GitHub Username:

FireBob69 -Lucas Bichescu

What was your favorite seminar? Why?

I enjoyed Ken Lau’s Hallmarks of Precancer the most. It was fascinating to learn about a branch of cancer research I was largely unaware of and about the many reasons why precancer can form. Thousands of lesions are within each one of us, some of which appear normal but are molecularly abnormal. I was drawn in by Lau’s description of why these lesions form, with causes ranging from biological age to cellular transitions in precancer. -Lucas Bichescu

If you were to summarize your summer internship experience in one sentence, what would it be?

A wonderful experience like no other. -Lucas Bichescu

Blog Post


Introduction:

Cancer is a complex and heterogeneous disease, affecting millions worldwide each year. Traditional methods of cancer classification rely on visual inspection of tissue structure and cell morphology. While useful, this approach can be subjective and often overlooks subtle molecular differences that could be critical in diagnosis and treatment. RNA sequencing changes this by measuring thousands of genes at once, allowing a comprehensive and precise analysis. RNA-seq data is generated by collecting mRNA samples and converting them into complementary DNA using reverse transcriptase. This reveals underlying information such as transcription errors, alternative splicing, and RNA editing that DNA information alone would not be able to show.

Methods

Data Collection:

The dataset was obtained from the UCI Machine Learning Repository and originates from the TCGA Pan-Cancer analysis project. It contains 801 instances and 20,531 gene expression features with placeholder names (gene_1, gene_2, etc.). The instances are split into five cancer types: Breast Invasive Carcinoma (BRCA), Kidney Renal Clear Cell Carcinoma (KIRC), Colon Adenocarcinoma (COAD), Lung Adenocarcinoma (LUAD), and Prostate Adenocarcinoma (PRAD). The classes are generally unbalanced.

Preprocessing:

Before analyzing the RNA-seq data, the counts were log-transformed to compress the wide range of expression values and stabilize variance across genes. Next, each gene was z-scored, that is, standardized relative to its own mean and standard deviation, which puts all genes on the same scale. This helps algorithms compare gene expression values more effectively and can speed up computation.
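The two preprocessing steps can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the base-2 logarithm, the pseudocount of 1 (to avoid taking the log of zero), and the toy count values are not taken from the project, and the function names are hypothetical. Libraries such as scikit-learn offer an equivalent scaler (`StandardScaler`).

```python
import math
import statistics

def log_transform(matrix, pseudocount=1.0):
    """Apply log2(x + pseudocount) to every value (rows = samples, columns = genes)."""
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]

def zscore_genes(matrix):
    """Standardize each gene (column) to mean 0 and standard deviation 1."""
    scaled_columns = []
    for col in zip(*matrix):
        mean = statistics.fmean(col)
        sd = statistics.pstdev(col)
        # A gene with zero variance carries no signal; map it to all zeros.
        scaled_columns.append([(v - mean) / sd if sd > 0 else 0.0 for v in col])
    # Transpose back to rows = samples.
    return [list(row) for row in zip(*scaled_columns)]

# Toy counts (made up): 3 samples x 2 genes.
counts = [[0.0, 1023.0], [3.0, 255.0], [15.0, 63.0]]
logged = log_transform(counts)
scaled = zscore_genes(logged)
```

After scaling, every gene column has mean 0 and standard deviation 1, so a gene with very large raw counts no longer dominates a gene with small ones.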

K-fold Cross validation:

K-fold cross validation is a widely used technique in machine learning for model evaluation, model selection, and hyperparameter optimization. In this approach the dataset is split into K folds of approximately equal size. The model is trained K times, each time using K-1 folds as the training set and the remaining fold as the test set. This allows every data point to be used for both training and evaluation, providing a robust estimate of model performance. For hyperparameter tuning, GridSearch is applied in combination with k-fold CV. A predefined grid of hyperparameter combinations is evaluated across the folds, and the combination that achieves the best average performance is selected. This method makes efficient use of the data and helps reduce the risk of overfitting.
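The procedure described above can be sketched as follows. This is a minimal stdlib-only illustration, not the project's code: the nearest-neighbour classifier is a stand-in chosen because it fits in a few lines, the `n_neighbors` grid and the toy data are made up, and all function names are hypothetical. In practice, scikit-learn's `KFold` and `GridSearchCV` cover the same ground.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_knn(X_train, y_train, n_neighbors=1):
    """Stand-in model: k-nearest neighbours simply stores the training data."""
    return (X_train, y_train, n_neighbors)

def predict_knn(model, x):
    X_train, y_train, n_neighbors = model
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(X_train, y_train)
    )[:n_neighbors]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(X, y, k, **params):
    """Mean accuracy over k folds; each fold is the test set exactly once."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit_knn([X[j] for j in train_idx], [y[j] for j in train_idx], **params)
        hits = sum(predict_knn(model, X[j]) == y[j] for j in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)

def grid_search(X, y, grid, k=5):
    """Return the parameter combination with the best mean CV accuracy."""
    scored = [(cv_accuracy(X, y, k, **params), params) for params in grid]
    best_score, best_params = max(scored, key=lambda item: item[0])
    return best_params, best_score

# Toy data (made up): two well-separated one-dimensional classes.
X = [[i * 0.1] for i in range(10)] + [[5.0 + i * 0.1] for i in range(10)]
y = ["A"] * 10 + ["B"] * 10
grid = [{"n_neighbors": 1}, {"n_neighbors": 3}]
best_params, best_score = grid_search(X, y, grid)
```

One design note: when the same folds are used both to pick hyperparameters and to report the final score, the reported score can be slightly optimistic, so a held-out test set or nested cross validation is often kept aside for the final estimate.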

Gene Significance:

In my initial analyses, I use a combination of descriptive methods: ANOVA, heatmaps, volcano plots, and Recursive Feature Elimination. Together, this series of analyses provides a descriptive tool for identifying gene biomarkers in RNA-seq data.

Cancer-Type Classification Models:

I assess how effectively several machine learning models distinguish between the five cancer types. Models such as Logistic Regression, Support Vector Classification, and others are first benchmarked using k-fold CV estimation. The highest scoring models from this benchmark are chosen for training. This analysis ranks which models are most appropriate for RNA-seq data.

Results:

Gene Significance Plots:

ANOVA (Analysis of Variance) F-Stat:

ANOVA is used to identify genes whose expression differs significantly across multiple cancer types. Its F-statistic is the ratio of the variance between group means to the variance within groups. Genes such as 9175, 9176, and 220 show the highest F-statistic values, marking them as the most important by this measure.
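To make the ratio concrete, here is a small sketch of the one-way ANOVA F-statistic for a single gene, followed by a ranking of genes by F. The expression values, the group labels, and the names `gene_a` and `gene_b` are made up for illustration. scikit-learn's `f_classif` computes the same statistic for all features at once.

```python
from statistics import fmean

def anova_f(values, labels):
    """One-way ANOVA F-statistic: between-group variance / within-group variance."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n_total, n_groups = len(values), len(groups)
    grand_mean = fmean(values)
    # Spread of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values())
    # Spread of each sample around its own group mean.
    ss_within = sum((v - fmean(g)) ** 2 for g in groups.values() for v in g)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    if ms_within == 0:
        return float("inf") if ms_between > 0 else 0.0
    return ms_between / ms_within

# Toy example (made up): 9 samples from 3 cancer types, 2 genes.
labels = ["BRCA"] * 3 + ["KIRC"] * 3 + ["LUAD"] * 3
expression = {
    "gene_a": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],  # group means differ
    "gene_b": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0],  # group means identical
}
f_stats = {gene: anova_f(vals, labels) for gene, vals in expression.items()}
ranked = sorted(f_stats, key=f_stats.get, reverse=True)
```

For `gene_a` the between-group mean square is 13 and the within-group mean square is 1, giving F = 13, while `gene_b` has identical group means and therefore F = 0.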

[Figure: ANOVA F-statistic plot]

Heatmap/Dendrogram:

The heatmap/dendrogram combination uses hierarchical clustering to show visually which genes serve as biomarkers for specific cancer types. The genes shown are those selected by the ANOVA test. The x-axis lists the cancer types in the order KIRC, PRAD, COAD, BRCA, and LUAD. Clusters of genes relevant to each cancer type appear as darker shades of red. For example, some of the most relevant genes for KIRC are genes 1510, 13818, and 16392, while the less important genes 3461 and 7964 are similarly relevant to COAD. COAD shows a very targeted group of highly relevant genes, with genes 3523 and 2037 standing out in dark red. BRCA has few to almost no relevant genes in the dataset.

[Figure: heatmap with dendrogram]

Volcano Plots:

Each volcano plot compares one cancer type against the remaining four. A larger log z-score (further to the right) represents higher expression in the selected cancer type relative to the overall variability. Genes marked on the left are rarely expressed in the selected cancer type and may instead be biomarkers for the remaining four types.

[Figures: five volcano plots]

Genes identified in the previous ANOVA analysis may also show up in these volcano plots, which supports the robustness of these techniques. For example, genes 1858 and 16342 appear in both the ANOVA and the volcano plot analysis of KIRC.
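As a rough illustration of how one point on a one-vs-rest volcano plot can be computed, the sketch below uses a common textbook formulation: the x-value is the difference in mean log-expression between the selected cancer type and the rest, and the y-value is the negative log10 of a two-sided p-value. This is an assumption and may differ from the exact statistic plotted in this project. The p-value uses a normal approximation to Welch's t-test, which is crude for small groups, and the function name and sample values are hypothetical.

```python
import math
from statistics import fmean, variance

def volcano_point(target, rest):
    """Return (effect, neg_log10_p) for one gene: selected cancer type vs the rest.

    `target` and `rest` are log-scale expression values, each with >= 2 samples.
    """
    effect = fmean(target) - fmean(rest)  # difference of log means (log fold change)
    se = math.sqrt(variance(target) / len(target) + variance(rest) / len(rest))
    if se == 0:
        # No variability at all: the comparison is degenerate.
        return effect, (0.0 if effect == 0 else float("inf"))
    z = effect / se
    # Two-sided p-value under a standard normal approximation.
    p = math.erfc(abs(z) / math.sqrt(2))
    p = max(p, 5e-324)  # guard against log10(0) for extreme z
    return effect, -math.log10(p)

# Toy values (made up): one gene measured in the selected type and in the rest.
x_up, y_up = volcano_point([5.0, 6.0, 7.0], [1.0, 2.0, 3.0, 2.0])
x_down, y_down = volcano_point([1.0, 2.0, 3.0, 2.0], [5.0, 6.0, 7.0])
```

Genes that land far to the right with a high y-value are strongly and consistently higher in the selected type, and genes far to the left are strongly lower, which matches how the plots above are read.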

Recursive Feature Elimination (RFE):

RFE repeatedly trains a chosen model, in this case logistic regression, on the dataset and uses the coefficient assigned to each gene at each iteration as a measure of its importance. The 10 least important genes are removed after each iteration, and the process repeats until only 40 are left. Gene 1858 appears in the RFE selection as well as in the descriptive analyses above, further underscoring its association with certain cancer types.

Selected RFE genes: gene_1858 gene_2037 gene_2318 gene_3439 gene_3523 gene_3737 gene_3921 gene_4274 gene_5578 gene_6160 gene_6594 gene_6733 gene_7238 gene_7623 gene_7964 gene_7965 gene_8013 gene_8316 gene_8348 gene_9175 gene_9176 gene_10460 gene_11393 gene_11903 gene_12013 gene_12808 gene_12848 gene_12977 gene_13631 gene_13976 gene_14092 gene_15895 gene_15898 gene_15900 gene_16358 gene_16372 gene_17316 gene_17442 gene_17801 gene_18135
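The elimination loop itself is short, as the sketch below shows. It is a simplified stand-in rather than the project's code: it fits a two-class logistic regression by plain gradient descent (the project has five classes), the learning rate and epoch count are arbitrary, and the data and names `gene_a` to `gene_d` are made up. scikit-learn's `RFE` class, with its `n_features_to_select` and `step` parameters, implements the same idea for real use.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Binary logistic regression by batch gradient descent; returns the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - target
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w

def rfe(X, y, feature_names, n_keep, step):
    """Refit, drop the `step` features with the smallest |weight|, repeat."""
    remaining = list(range(len(feature_names)))
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        weights = fit_logistic(X_sub, y)
        n_drop = min(step, len(remaining) - n_keep)
        weakest = sorted(range(len(remaining)), key=lambda p: abs(weights[p]))[:n_drop]
        remaining = [j for p, j in enumerate(remaining) if p not in weakest]
    return [feature_names[j] for j in remaining]

# Toy data (made up): gene_a and gene_c separate the classes, gene_b and gene_d do not.
names = ["gene_a", "gene_b", "gene_c", "gene_d"]
X = [
    [-1.0, 0.5, 1.0, -0.2], [-1.2, -0.5, 0.9, 0.4],
    [-0.8, 0.3, 1.1, -0.4], [-1.0, -0.3, 1.0, 0.2],
    [1.0, 0.5, -1.0, -0.2], [1.2, -0.5, -0.9, 0.4],
    [0.8, 0.3, -1.1, -0.4], [1.0, -0.3, -1.0, 0.2],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]
selected = rfe(X, y, names, n_keep=2, step=1)
```

Refitting after every removal is what distinguishes RFE from a one-shot ranking: a gene's weight can change once correlated genes have been dropped, so the ranking is recomputed each round.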

K-fold CV Estimation:

[Figures: k-fold CV estimation results]

Model performance:

[Figure: model performance]

All four chosen models perform very accurately, with Logistic Regression and LinearSVC performing the best. These results are consistent with prior findings reported in the literature (e.g. Alanzi et al., 2024).

Discussion:

In tandem, these analyses allow researchers to decipher which genes serve as biomarkers for their respective cancers. Analyses such as ANOVA and RFE give a general picture of the genes that correlate with one of the five cancers. Volcano plots reinforce this by supplying precise information about which genes relate directly to a given cancer type.

Machine learning models are shown to be highly proficient at predicting cancer type from the quantitative data of RNA-seq. The Logistic Regression and Support Vector Classification models display the highest accuracy, F1, and recall scores, each over 99.5%. This demonstrates how effective cancer classification using RNA-seq data can be.
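For reference, the three reported metrics can be computed from true and predicted labels as sketched below. Macro averaging (an unweighted mean over classes) is an assumption here, since the averaging scheme used in the project is not stated, and the labels in the example are made up. scikit-learn's `accuracy_score`, `recall_score`, and `f1_score` provide the same metrics.

```python
def classification_scores(y_true, y_pred):
    """Accuracy plus macro-averaged recall and F1 over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    recalls, f1s = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": accuracy,
        "macro_recall": sum(recalls) / len(labels),
        "macro_f1": sum(f1s) / len(labels),
    }

# Toy predictions (made up): one KIRC sample is misclassified as LUAD.
y_true = ["BRCA", "BRCA", "KIRC", "KIRC", "LUAD", "LUAD"]
y_pred = ["BRCA", "BRCA", "KIRC", "LUAD", "LUAD", "LUAD"]
scores = classification_scores(y_true, y_pred)
```

Because the five classes are unbalanced, macro-averaged recall and F1 are useful companions to accuracy: a model that neglects a small class can still score high accuracy, but its macro scores drop.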

Future Work:

The data provided by the UCI Machine Learning Repository is limited relative to the vast and complex nature of cancer. Expanding this data to include not only more cancer types but also normal tissue may yield a far more accurate and realistic model. To make this process as easy and streamlined as possible for researchers, a next step is to develop a web app with integrated algorithms that displays the important genes directly, without the need to read the plots.

Conclusion:

RNA-seq data combined with machine learning enables highly accurate cancer classification and biomarker discovery. Using preprocessing, ANOVA, RFE, and visual tools like heatmaps and volcano plots, I identified genes that distinguish five cancer types, and Logistic Regression and SVC classified those types with over 99.5% accuracy. Expanding datasets and developing a web platform could further improve performance and make these tools accessible for broader cancer research and precision medicine.