Original Article

Computational Modeling and Analysis to Predict Intracellular Parasite Epitope Characteristics Using Random Forest Technique


Background: Computational methods are increasingly used to design and evaluate vaccines. This study aimed to develop a computational tool that predicts epitope vaccine candidates for testing in experimental models.

Methods: This study was conducted in 2018 at the School of Allied Medical Sciences and the Center for Research and Training in Skin Diseases and Leprosy, Tehran University of Medical Sciences, Tehran, Iran. Random forest (RF), a classification method, was used to build a computer-based tool for predicting immunogenic peptides. The data were collected from the IEDB, UniProt, and AAindex databases. Overall, 1,264 records were collected and divided into three parts: 70% to train, 15% to validate, and 15% to test the model. Five-fold cross-validation was used to find the optimal hyperparameters of the model. Common performance metrics were used to evaluate the developed model.
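The 70/15/15 partition and five-fold cross-validation described in the Methods can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say whether the split was random or stratified, which seed was used, or which portion of the data the folds were drawn from. Here the split is a seeded random shuffle and the folds are taken from the training portion. The function names `split_70_15_15` and `k_fold` are hypothetical.

```python
import random

def split_70_15_15(n_samples, seed=0):
    """Shuffle record indices and split them into train/validation/test parts.

    Assumption: a plain random split with a fixed seed (not stated in the study).
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(0.70 * n_samples))
    n_val = int(round(0.15 * n_samples))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains, roughly 15%
    return train, val, test

def k_fold(indices, k=5):
    """Yield (fit, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        held_out = folds[i]
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, held_out

# 1,264 is the number of records reported in the Methods.
train, val, test = split_70_15_15(1264)
print(len(train), len(val), len(test))  # 885 190 189

# Each candidate hyperparameter setting would be scored on all five held-out folds.
for fit, held_out in k_fold(train, k=5):
    pass  # train a classifier on `fit`, evaluate it on `held_out`
```

In practice the classifier itself would normally come from an existing random forest library; the sketch covers only the bookkeeping of the partition.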

Results: Twenty-seven features were identified as the more important ones by the RF predictor model and were used to predict the class of peptides. The RF model outperformed the other predictor models (AUC±SE: 0.925±0.029). The developed RF model helps identify the most likely epitopes for further experimental studies.
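The Results report performance as AUC±SE. The sketch below shows one common way such a pair can be computed: the AUC as the Mann-Whitney probability that a positive peptide scores higher than a negative one, and the standard error from the Hanley-McNeil approximation. The abstract does not state how the SE was obtained, so the Hanley-McNeil formula is an assumption, and the labels and scores are made-up toy values, not data from the study.

```python
import math

def auc_mann_whitney(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of the AUC (Hanley-McNeil approximation, assumed here)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)

# Toy example: 1 = immunogenic, 0 = non-immunogenic (illustrative values only).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auc_mann_whitney(labels, scores)
se = hanley_mcneil_se(auc, n_pos=3, n_neg=3)
print(f"AUC±SE: {auc:.3f}±{se:.3f}")
```

With a test set as small as the toy example the standard error is large; it shrinks as the number of positive and negative peptides grows.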

Conclusion: The developed random forest model can predict the immunogenic peptides of intracellular parasites more accurately.

Issue: Vol 49 No 1 (2020)
Section: Original Article(s)
DOI: https://doi.org/10.18502/ijph.v49i1.3059
Keywords: Computational model; Immunogenic peptides; Intracellular parasites

Rights and permissions
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
How to Cite
JAVADI A, KHAMESIPOUR A, MONAJEMI F, GHAZISAEEDI M. Computational Modeling and Analysis to Predict Intracellular Parasite Epitope Characteristics Using Random Forest Technique. Iran J Public Health. 2020;49(1):125-133.