Genomic evaluation of support vector machine and common genomic prediction methods at different prevalences of a threshold phenotype: A simulation study

Abstract

Background and objectives: Many important traits in livestock, including disease resistance and dystocia, show a categorical distribution of phenotypes. These traits matter in animal breeding because of animal welfare concerns and consumer demand for healthy, high-quality products. Therefore, identifying and characterizing the genetic variants that affect threshold traits, such as disease susceptibility, is one of the central objectives of animal genetics. In this regard, genomic selection can play an important role in increasing genetic progress for threshold traits. The objective of the current study was to evaluate the area under the receiver operating characteristic curve (AUROC) of the support vector machine (SVM), GBLUP and Bayes LASSO methods at different rates of binary phenotype distribution in the training set.
Materials and methods: A population of 1000 animals genotyped for 10,000 markers was simulated using QMSim software. Genomic populations were simulated to reflect variation in heritability (0.05 and 0.2), number of QTL (100 and 1000) and linkage disequilibrium (low and high) across 29 chromosomes. To create different rates of the discrete phenotype, animals in the training set were coded as 1 (inappropriate phenotype) if their phenotypic residual was less than the mean of the residuals ($\bar{e}$), $\bar{e} - 1\,SD_e$ or $\bar{e} + 1\,SD_e$ for the first, second and third approaches, respectively; all other individuals were coded as 0 (appropriate phenotype). Three statistical methods were used to analyze the simulated data: SVM, GBLUP and Bayes LASSO.
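The three coding approaches described above can be illustrated with a minimal sketch. It assumes normally distributed residuals drawn with Python's standard library as a stand-in for the QMSim output; the function name `code_binary`, the seed and the sample size are hypothetical choices for illustration, not the authors' implementation.

```python
import random
import statistics


def code_binary(residuals, shift_sd):
    """Code an animal as 1 if its residual is below mean + shift_sd * SD, else 0.

    shift_sd = 0, -1 and +1 correspond to the first, second and third
    approaches described in the text (illustrative helper, not the authors' code).
    """
    mean_e = statistics.fmean(residuals)
    sd_e = statistics.stdev(residuals)
    cutoff = mean_e + shift_sd * sd_e
    return [1 if r < cutoff else 0 for r in residuals]


# Assumption: residuals are standard normal; the study's residuals come from QMSim.
random.seed(1)
residuals = [random.gauss(0.0, 1.0) for _ in range(1000)]

rates = {}
for label, shift in (("mean", 0.0), ("mean - 1 SD", -1.0), ("mean + 1 SD", 1.0)):
    codes = code_binary(residuals, shift)
    rates[label] = sum(codes) / len(codes)
    print(f"{label}: rate of code 1 = {rates[label]:.3f}")
```

Under the normality assumption, the three cutoffs give rates of code 1 of roughly one half, one sixth and five sixths, which is how the approaches produce training sets with different phenotype prevalences.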
Results: Optimal training sets were those whose rate of the inappropriate phenotype was similar to the true rate in the population; this gave the highest AUROC for the SVM, GBLUP and Bayes LASSO methods and was obtained with the $\bar{e} - 1\,SD_e$ threshold in the training set. The highest (0.813) and lowest (0.521) AUROC were both observed for the SVM method. In general, heritability affected the genomic AUROC of all three methods: AUROC increased as heritability increased. Average $r^2$ in the low and high LD scenarios was 0.221 and 0.435, respectively, at a distance of 0.05 cM, and genomic AUROC increased with increasing linkage disequilibrium for the GBLUP, Bayes LASSO and SVM methods. The results of the current study showed that a high level of LD between SNPs and QTLs increased the probability of sampling adjacent markers in re-sampling methods, which in turn improved the performance of SVM. Although GBLUP and Bayes LASSO had higher AUROC in different scenarios, SVM performed better when discrete traits were controlled by a large number of QTLs.
Conclusions: Although the rate of binary phenotype distribution in the training set plays an important role, the genomic AUROC achieved by the SVM method for discrete traits also depends on the genetic basis of the population analyzed and on the cost parameter.
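For readers unfamiliar with the evaluation criterion, the AUROC reported above can be computed as the probability that a randomly chosen animal coded 1 receives a higher predicted score than a randomly chosen animal coded 0, with ties counted as one half. The sketch below is a plain-Python illustration of that pairwise definition; the function name `auroc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
def auroc(labels, scores):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    labels: 0/1 codes; scores: predicted values where higher means more
    likely to be coded 1. Ties between a positive and a negative count 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one animal in each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up values): three of the four positive-negative pairs
# are ranked correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_auc = auroc(toy_labels, toy_scores)
print(toy_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two phenotype classes, which puts the reported range of 0.521 to 0.813 in context.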

