MACHINE LEARNING-BASED CLASSIFICATION OF BREAST CANCER USING GENE EXPRESSION PROFILES
Keywords:
Breast cancer, Gene expression analysis, Machine learning, Support Vector Machine, Biomarker identification
Abstract
Breast cancer remains one of the leading causes of cancer-related mortality worldwide, highlighting the need for accurate diagnostic approaches and improved understanding of its molecular mechanisms. Advances in transcriptomic technologies have enabled large-scale analysis of gene expression profiles, providing valuable opportunities for identifying molecular biomarkers associated with cancer development. In this study, gene expression data were analyzed to identify significant genes differentiating tumor and normal breast tissue samples and to evaluate the effectiveness of machine learning models for breast cancer classification. The dataset consisted of 590 samples, including 529 tumor and 61 normal samples, with expression values measured for 17,814 genes. Differential gene expression analysis using a two-sample t-test was performed to identify informative genes, and the top 500 statistically significant genes were selected as predictive features. Three machine learning models, Logistic Regression, Support Vector Machine (SVM), and Random Forest, were developed to classify tumor and normal samples based on the selected gene expression features. The dataset was divided into training and testing subsets using stratified sampling, and model performance was evaluated using accuracy and receiver operating characteristic area under the curve (ROC–AUC). The results demonstrated strong classification performance, with Logistic Regression and SVM achieving an accuracy of 97.46% and an ROC–AUC of 0.997, while Random Forest achieved an accuracy of 96.61% and an ROC–AUC of 0.994. These findings highlight the potential of combining gene expression analysis with machine learning techniques for breast cancer classification and biomarker discovery.
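The workflow summarized in the abstract (t-test gene ranking, a stratified train/test split, three classifiers, accuracy and ROC–AUC) can be sketched as follows. This is a minimal illustration and not the study's code. It runs on small synthetic data rather than the 590-sample dataset. The function names, the 20% hold-out fraction, the linear SVM kernel, the feature scaling, and all hyperparameters are assumptions, since the abstract specifies none of them. The sketch ranks genes on the training split only, which avoids leaking test information into feature selection; the abstract does not say whether the study did the same. The t-test uses SciPy's default equal-variance form.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_synthetic_expression(n_tumor=180, n_normal=20, n_genes=400,
                              n_informative=30, seed=0):
    """Synthetic stand-in for an expression matrix (NOT the study's data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_tumor + n_normal, n_genes))
    y = np.array([1] * n_tumor + [0] * n_normal)  # 1 = tumor, 0 = normal
    X[y == 1, :n_informative] += 1.5  # shift the first genes in tumor samples
    return X, y


def select_top_genes(X, y, k):
    """Return indices of the k genes with the smallest two-sample t-test p-values."""
    _, p_values = ttest_ind(X[y == 1], X[y == 0], axis=0)
    return np.argsort(p_values)[:k]


def evaluate_models(X, y, k=50, test_size=0.2, seed=0):
    """Stratified split, gene selection on the training split, then fit and score."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    top = select_top_genes(X_tr, y_tr, k)  # ranked on training data only

    # Hyperparameters below are illustrative assumptions, not the study's settings.
    models = {
        "LogisticRegression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)),
        "SVM": make_pipeline(
            StandardScaler(),
            SVC(kernel="linear", probability=True, random_state=seed)),
        "RandomForest": RandomForestClassifier(n_estimators=200, random_state=seed),
    }

    results = {}
    for name, model in models.items():
        model.fit(X_tr[:, top], y_tr)
        pred = model.predict(X_te[:, top])
        proba = model.predict_proba(X_te[:, top])[:, 1]  # P(tumor)
        results[name] = {
            "accuracy": accuracy_score(y_te, pred),
            "roc_auc": roc_auc_score(y_te, proba),
        }
    return results


if __name__ == "__main__":
    X, y = make_synthetic_expression()
    for name, scores in evaluate_models(X, y).items():
        print(f"{name}: accuracy={scores['accuracy']:.4f}, "
              f"roc_auc={scores['roc_auc']:.4f}")
```

To mirror the dimensions reported in the abstract, one would pass a 590 × 17,814 expression matrix with its tumor/normal labels and set `k=500`. The scores printed for the synthetic data say nothing about the reported results.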