Prediction of Breast Cancer Survival by Machine Learning Methods: An Application of Multiple Imputation
Abstract
Background: Breast cancer survival rates in less developed countries are critically low. Machine learning techniques can predict cancer survival with high accuracy, but missing data are the most important limitation on using these techniques to their full potential. In this study, multiple imputation (MI) was implemented and analyzed in detail to impute the missing data of a breast cancer dataset.
Methods: The dataset came from the Omid Treatment and Research Center in Urmia, Iran, covered the period from January 2006 to December 2012, and contained information on 856 women. Algorithms including C5 and repeated incremental pruning to produce error reduction (RIPPER) were applied to the imputed versions of the original dataset and to the non-imputed dataset, to predict survival and to extract clinical rules, respectively.
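The general idea of producing several imputed versions of one incomplete dataset can be sketched as follows. This is only an illustration under stated assumptions: the imputation step is a simple random draw from the observed values of each column (a hot-deck style stand-in), not the procedure used in the study, and the function names, the toy records, and the column meanings are all hypothetical.

```python
import random


def impute_once(rows, rng):
    """Return a copy of rows in which every None is replaced by a random
    draw from the observed (non-missing) values of the same column.

    Assumption: every column has at least one observed value.
    """
    n_cols = len(rows[0])
    observed = [[r[j] for r in rows if r[j] is not None] for j in range(n_cols)]
    return [
        [v if v is not None else rng.choice(observed[j]) for j, v in enumerate(r)]
        for r in rows
    ]


def multiple_imputation(rows, m=5, seed=0):
    """Create m completed versions of the dataset; the original is not modified."""
    rng = random.Random(seed)
    return [impute_once(rows, rng) for _ in range(m)]


# Hypothetical toy records: [age, tumor size in cm, positive lymph nodes]
toy = [
    [45, 2.0, 1],
    [60, None, 3],
    [None, 3.5, 0],
    [52, 1.2, None],
]

versions = multiple_imputation(toy, m=5, seed=42)
```

Each element of `versions` is a complete dataset on which a classifier could then be trained and evaluated separately. Observed values are identical across versions; only the filled-in cells can differ.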
Results: After imputation, the performance of C5 improved on all evaluation criteria: accuracy (84.42%), sensitivity (92.21%), specificity (64%), Kappa statistic (59.06%), and area under the receiver operating characteristic (ROC) curve (0.84).
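For readers less familiar with these criteria, the sketch below shows how accuracy, sensitivity, specificity, and Cohen's kappa follow from the four cells of a binary confusion matrix. The counts are invented for illustration and are not the study's data; the function name is hypothetical, and which outcome counts as the "positive" class is an assumption of the example. The ROC area is left out because it requires predicted scores rather than counts.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute common evaluation criteria from binary confusion-matrix counts.

    tp/fn: positive cases predicted as positive/negative
    fp/tn: negative cases predicted as positive/negative
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate

    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / (n * n)
    kappa = (accuracy - expected) / (1 - expected)

    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```

For these made-up counts the function gives an accuracy of 0.85, a sensitivity of 0.80, a specificity of 0.90, and a kappa of 0.70.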
Conclusion: The dataset of the present study met the requirements for using multiple imputation. The rules extracted after applying MI were more comprehensive and contained more clinical knowledge. However, the clinical value of the extracted rules did not noticeably increase after the missing data were filled in.
Issue: Vol 50 No 3 (2021)
Section: Original Article(s)
DOI: https://doi.org/10.18502/ijph.v50i3.5606
Keywords: Breast neoplasms; Survival; Observer variation; Imputation; Machine learning
Rights and permissions: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.