Possibility of application of relative entropy in clustering of some milk governing genes in dairy cattle

Abstract

Background and objectives: Milk plays an important role in human nutrition, and increasing milk yield or changing milk composition has long attracted the attention of animal breeders. It is therefore crucial to study and evaluate the genes underpinning milk production and composition. Information theory is a branch of mathematics that overlaps with communications, biology, and medical engineering. Entropy is a measure of the uncertainty in a set of information. Shannon introduced this concept in his famous 1948 article and applied it to a number of basic problems in coding and data transmission, work that forms the basis of modern information theory. Information theory is used in genetics and bioinformatics and can be applied to many analyses of biological structures and sequences. Computational grouping of genes facilitates genetic, sequence-based and structure-based analyses.
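
As an illustration of entropy as a measure of uncertainty, the following minimal sketch computes the first-order Shannon entropy (in bits per symbol) of a nucleotide string. The function name `shannon_entropy` and the toy sequences are assumptions made for this example; they are not taken from the study, which used MATLAB.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy, in bits per symbol, of the symbol frequencies in seq."""
    counts = Counter(seq)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the symbols that actually occur in seq.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy sequences (illustrative only, not real gene data).
uniform = shannon_entropy("ACGT" * 10)   # four equally frequent bases
skewed = shannon_entropy("AAAAAAAAAC")   # one base dominates
```

A sequence in which the four bases are equally frequent reaches the maximum of 2 bits per symbol, while a sequence dominated by one base has lower entropy, reflecting lower uncertainty.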

Materials and methods: The DNA sequences of 30 genes involved in milk protein production were extracted ad hoc from the NCBI genome database and stored in FASTA format. For each gene and its set of exons, entropy was calculated for orders one to four, which corresponds to using Markov chains up to order 3. The Kullback-Leibler divergence was then calculated from the relative entropy of the genes and exons. The resulting Kullback-Leibler distances for the genes and exon sets were used as input to seven clustering algorithms: Single, Complete, Average, Weighted, Centroid, Median and K-Means. The AdaBoost algorithm was used to aggregate the clustering results. Finally, the AdaBoost output was examined with the GeneMANIA prediction server to assess the results from a gene annotation point of view. All calculations were performed in MATLAB (2015).
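
The distance step described above can be sketched roughly as follows. This is a hedged Python illustration, not the authors' MATLAB code: it estimates k-mer distributions (k = 1 to 4 would correspond to the four entropy orders), and computes a symmetrized Kullback-Leibler divergence between two sequences. The pseudocount smoothing, the symmetrization, the function names and the toy sequences are all assumptions introduced for this example.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_distribution(seq, k, pseudocount=1.0):
    """Smoothed probability of every possible k-mer over ACGT in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # The pseudocount (an assumption) keeps every probability non-zero,
    # so the logarithm in the divergence below is always defined.
    total = sum(counts[m] for m in kmers) + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / total for m in kmers}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[m] * math.log2(p[m] / q[m]) for m in p)


def symmetric_kl(seq_a, seq_b, k):
    """Average of D(p || q) and D(q || p), usable as a dissimilarity."""
    p = kmer_distribution(seq_a, k)
    q = kmer_distribution(seq_b, k)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def distance_matrix(seqs, k):
    """Pairwise symmetrized KL dissimilarities between all sequences."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = symmetric_kl(seqs[i], seqs[j], k)
    return d


# Toy sequences (illustrative only, not real gene data).
toy = ["ACGTACGTACGTACGT", "ACGTACGTACGAACGT", "GGGGGCCCCCGGGGGC"]
d = distance_matrix(toy, k=2)
```

A matrix of this kind could then be passed to a hierarchical clustering routine. For instance, SciPy's `scipy.cluster.hierarchy.linkage` accepts the method names `single`, `complete`, `average`, `weighted`, `centroid` and `median`, although the last two are formally defined for Euclidean distances.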

Results: Examination of the metabolic pathways of the genes, based on their gene annotations, showed that the proposed clustering method yielded correct, logical and fast results. In addition, the method avoids the disadvantages of sequence alignment, allows genes to be considered with their actual length and content, and does not require large amounts of memory for long sequences.

Conclusion: The proposed method could be used alongside other competitive gene clustering methods to group biologically relevant sets of genes. It can also be seen as a predictive method for genes with weak genomic annotations.

Key words: Information theory, Dairy cattle, Kullback-Leibler divergence, Gene clustering



1. Buitenhuis, A.J., Sundekilde, U.K., Poulsen, N., Bertram, H.C., Larsen, L.B., and Sørensen, P. 2013. Estimation of Genetic Parameters and Detection of QTL for Metabolites in Danish Holstein Milk. Journal of Dairy Science. 14: 1-10.
2. Alinaghizadeh, H., Mohammad Abadi, MR., and Zakizadeh, S. 2010. Exon 2 of BMP15 gene polymorphism in Jabal Barez Red Goat. Journal of Agricultural Biotechnology. 2: 69-80.
3. Barazandeh, A., Mohammadabadi, MR., Ghaderi, M., and Nezamabadipour, H. 2016. Genome-wide analysis of CpG islands in some livestock genomes and their relationship with genomic features. Czech Journal of Animal Science. 61: 487-495.
4. Clemente, J.C., Satou, K., and Valiente, G. 2007. Phylogenetic reconstruction from non-genomic data. Bioinformatics. 23: 110–115.
5. Erill, I. 2012. Information Theory and biological sequences: Insights from an evolutionary perspective. Nova Science Publishers, Inc.
6. Freund, Y., and Schapire, R. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences. 55: 119-139.
7. Freund, Y., and Schapire, R. 1996. Experiments with a new boosting algorithm. Paper read at Proceedings of the Thirteenth International Conference on Machine Learning.
8. Forst, C.V., and Schulten, K. 2001. Phylogenetic analysis of metabolic pathways. Journal of Molecular Evolution. 52: 471–489.
9. Ghaderi-Zefrehei, M., Bandi Dastjerdi, A., Bahreini Behzadi, M.R., Samadian, F., and Meamar, M. 2016. Investigation of Information Accumulation in Escherichia coli's DNA Sequence Affecting Mastitis in Dairy Cow Using Information Theory. Journal of Ruminant Research. 4: 2016.
10. Gray, R.M. 2013. Entropy and Information Theory. First Edition. Springer-Verlag New York publisher.
11. Heymans, M., and Singh, A.K. 2003. Deriving phylogenetic trees from the similarity analysis of metabolic pathways. Bioinformatics. 19: 138–146.
12. Javanmard, A., Mohammadabadi, M.R., Zarrigabayi, G.E., Gharahedaghi, A.A., Nassiry, MR., Javadmansh, A., and Asadzadeh, N. 2008. Polymorphism within the intron region of the bovine leptin gene in Iranian Sarabi cattle (Iranian Bos Taurus). Russian Journal of Genetics. 44: 495-497.
13. Jiang, S.C., Tang, C., Zhang, L., and Zhang, A. 2014. A Maximum Entropy Approach to Classifying Gene Array Data Sets. Workshop on Data Mining for Genomics, First SIAM International Conference on Data Mining.
14. Kharrati Koopaei, H., Mohammad Abadi, MR., Ansari Mahyari, S., Tarang, AR., Potki, P., and Esmailizadeh, AK. 2012a. Effect of DGAT1 variants on milk composition traits in Iranian Holstein cattle population. Animal Science Papers and Reports. 30: 231-240.
15. Kharrati Koopaei, H., Mohammadabadi, M.R., Tarang, A., Kharrati Koopaei, M., and Esmailizadeh Koshkoiyeh, A. 2012b. Study of the association between the allelic variations in DGAT1 gene with mastitis in Iranian Holstein cattle. Modern Genetics Journal. 7: 101-104.
16. Kharrati Koopaei, H., Mohammadabadi, M.R., Ansari Mehyari, S., Esmailizadeh, A.K., Tarang, A., and Nikbakhti, M. 2011. Genetic variation of DGAT1 gene and its association with milk production in Iranian Holstein cattle breed population. Iranian Journal of Animal Science Research. 3: 185-192.
17. Khatib, H., Monson, R.L., Schutzkus, V., Kohl, D.M., Rosa, GJM., and Rutledge, J.J. 2008. Mutations in the STAT5A gene are associated with embryonic survival and milk composition in cattle. Journal of Dairy Science. 91: 784–793.
18. Kim, J., Kim, S., Lee, K., and Kwon, Y. 2009. Entropy analysis in yeast DNA. Chaos, Solitons and Fractals. 39: 1565–1571.
19. Kullback, S., and Leibler, R. 1951. On information and sufficiency. The Annals of Mathematical Statistics. 22: 79–86.
20. Lee, L. 2009. Used Kullback-Leibler measure as a new method for the reconstruction of the phylogenetic tree of the Coronavirus and SARS viruses.
21. Lemay, D.G., Lynn, D.J., Martin, W.F., Neville, M.C., Casey, T.M., Rincon, G., Kriventseva, E.V., Barris, W.C., Hinrichs, A.S., Molenaar, A.J., Pollard, K.S., Maqbool, N.J., Singh, K., Murney, R., Zdobnov, E.M., Tellam, R.L., Medrano, J.F., German, J.B., and Rijnkels, M. 2009. The bovine lactation genome: insights into the evolution of mammalian milk. Genome Biology. 10: R43
22. Li, C., and Wang, J. 2005. Relative entropy of DNA and its application. Physica A. 347: 465–471.
23. Liou, C.Y., Tseng, S.H., Cheng, W.C., and Tsai, H.Y. 2013. Structural Complexity of DNA Sequence. Computational and Mathematical Methods in Medicine. 2013: 1-11.
24. Liu, B. 2007. Uncertainty Theory, 2nd ed., Springer-Verlag, Berlin.
25. Machado, J.T. 2012. Shannon entropy analysis of the genome code. Mathematical Problems in Engineering. 2012: 1-12.
26. Mohammad Abadi, M.R., Mohammadi, A. 2010a. Study of beta-lactoglobulin genotypes in native and Holstein cattle of Kerman province. Journal of Animal Productions. 12: 61-67.
27. Mohammadabadi, M.R., Nikbakhti, M., Mirzaee, H.R., Shandi, A., Saghi, D.A., Romanov, M.N., and Moiseyeva, I.G. 2010b. Genetic variability in three native Iranian chicken populations of the Khorasan province based on microsatellite markers. Russian Journal of Genetics. 46: 505-509.
28. Mousavizadeh, A., Mohammad Abadi, MR., Torabi, A., Nassiry, MR., Ghiasi, H., and Esmailizadeh, AK. 2009. Genetic polymorphism at the growth hormone locus in Iranian Talli goats by polymerase chain reaction-single strand conformation polymorphism (PCR-SSCP). Iranian Journal of Biotechnology. 7: 51-53.
29. Monge, R.E., and Crespo, J.L. 2014. Comparison of Complexity Measures for DNA Sequence Analysis. International Work Conference on Bio-inspired Intelligence (IWOBI). Pp: 71-75.
30. Neagoe, I.M., Popescu, D., and Niculescu, V.I.R. 2014. Applications of entropic divergence measures for DNA segmentation into high variable regions of Cryptosporidium spp. GP60 gene. Romanian Reports in Physics. 66: 1078–1087.
31. Pham, T.D., Crane, D.I., Tannock, D., and Beck, D. 2004. Kullback-Leibler Dissimilarity of Markov Models for Phylogenetic Tree Reconstruction. Proceedings of International Symposium on Intelligent Multimedia, Video and Speech Processing. October Pp: 20-22 Hong Kong.
32. Porto-Díaz, L., Bolón-Canedo, V., Alonso-Betanzos, A., and Fontenla-Romero, O. 2011. A study of performance on microarray data sets for a classifier based on information theoretic learning. Neural Networks. 24: 888–896.
33. Ruiz-Marin, M., Matilla-Garcia, M., Cordoba, J.A.G., Susillo-Gonzalez, J.L., Romo-Astorga, A., Gonzalez-Pérez, A., Ruiz, A., and Gayan, J. 2010. An entropy test for single-locus genetic association analysis. BMC Genetics. 11: 19.
34. Shannon, C.E. 1948. A mathematical theory of communication. Bell System Technical Journal. 27: 379–423 and 623–656.
35. Sherwin, B.W. 2010. Entropy and information approaches to genetic diversity and its expression: genomic geography. Entropy. Pp: 1765-1798.
36. Shojaei, M., Mohammad Abadi, MR., Asadi Fozi, M., Dayani, O., Khezri, A., and Akhondi, M. 2010. Association of growth trait and Leptin gene polymorphism in Kermani sheep. Journal of Cell and Molecular Research. 2: 67-73.
37. Sundekilde, U.K., Larsen, L.B., and Bertram, H.C. 2013. NMR-Based Milk Metabolomics. Metabolites. 3: 204-222.
38. Tautz, D., Trick, M., and Dover, G.A. 1986. Cryptic simplicity in DNA is a major source of genetic variation. Nature. 322: 652–656.
39. Vinga, S., and Almeida, J. 2003. Alignment-free sequence comparison: review. Bioinformatics. 19: 513–523.
40. Vinga, S. 2013. Information theory applications for biological sequence analysis. Briefings in Bioinformatics. 15: 376-389.
41. Warde-Farley, D., Donaldson, S.L., Comes, O., Zuberi, K., Badrawi, R., Chao, P., Franz, M., Grouios, C., Kazi, F., Lopes, C.T., Maitland, A., Mostafavi, S., Montojo, J., Shao, Q., Wright, G., Bader, G.D., and Morris, Q. 2010. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Research. 38: W214-W220.
42. Xie, X., Yu, Y., Liu, G., Yuan, Z., and Song, J. 2010. Complexity and Entropy Analysis of DNA Methyltransferase. J Data Mining in Genom Proteomics. Volume 1, Issue 2, 1000105.
43. Zamani, P., Akhondi, M., Mohammadabadi, MR., Saki, A.A., Ershadi, A., Banabazi, M.H., and Abdolmohammadi, AR. 2013. Genetic variation of Mehraban sheep using two intersimple sequence repeat (ISSR) markers. African Journal of Biotechnology. 10: 1812-1817.
44. Zhang, J.L., Zan, L.S., Fang, P., Zhang, F., Shen, GL., and Tian, WQ. 2008. Genetic variation of PRLR gene and association with milk performance traits in dairy cattle. Canadian Journal of Animal Science. 88: 33-39.