Machine learning based imputation techniques for estimating phylogenetic trees from incomplete distance matrices
Abstract
With the rapid growth in the number of newly sequenced genomes, inferring species trees from genes sampled across the whole genome has become a basic task in comparative and evolutionary biology. However, substantial challenges remain in leveraging these large-scale molecular data. One of the foremost is to develop efficient methods that can handle missing data. Popular distance-based methods, such as NJ (neighbor joining) and UPGMA (unweighted pair group method with arithmetic mean), require complete distance matrices without any missing entries. We introduce two highly accurate machine learning based distance imputation techniques. These methods are based on matrix factorization and on autoencoder-based deep learning architectures. We evaluated both methods on a collection of simulated and biological datasets. Experimental results suggest that the proposed methods match or improve upon the best alternative distance imputation techniques. Moreover, these methods scale to large datasets with hundreds of taxa and can handle a substantial amount of missing data. This study shows, for the first time, the power and feasibility of applying deep learning techniques to impute distance matrices. It thus advances the state of the art in phylogenetic tree construction in the presence of missing data. The proposed methods are available as open source at https://github.com/Ananya-Bhattacharjee/ImputeDistances.
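To make the matrix factorization idea concrete, the following is a minimal sketch and not the authors' implementation (their code is in the repository linked above). It assumes the incomplete distance matrix is approximated by a low-rank product of two factor matrices fitted only to the observed off-diagonal entries with stochastic gradient descent, in the style of recommender-system matrix completion. The function name `impute_distances`, its parameters (`rank`, `epochs`, `lr`, `reg`), and the toy five-taxon matrix are all hypothetical choices made for illustration.

```python
import random


def impute_distances(dist, rank=2, epochs=2000, lr=0.01, reg=0.001, seed=0):
    """Fill None entries of a symmetric distance matrix (illustrative sketch).

    Approximates dist[i][j] by the dot product of P[i] and Q[j], fitting the
    factors to the observed off-diagonal entries only.
    """
    n = len(dist)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    P = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    Q = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]

    # Training set: every observed off-diagonal entry.
    observed = [
        (i, j, dist[i][j])
        for i in range(n)
        for j in range(n)
        if i != j and dist[i][j] is not None
    ]

    for _ in range(epochs):
        for i, j, d in observed:
            pred = sum(P[i][k] * Q[j][k] for k in range(rank))
            err = d - pred
            for k in range(rank):
                p, q = P[i][k], Q[j][k]
                # Gradient step on squared error with L2 regularization.
                P[i][k] += lr * (err * q - reg * p)
                Q[j][k] += lr * (err * p - reg * q)

    # Copy the input and fill only the missing entries.
    completed = [row[:] for row in dist]
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] is None:
                est_ij = sum(P[i][k] * Q[j][k] for k in range(rank))
                est_ji = sum(P[j][k] * Q[i][k] for k in range(rank))
                # Average both directions to keep symmetry; clamp at zero.
                value = max(0.0, (est_ij + est_ji) / 2.0)
                completed[i][j] = completed[j][i] = value
    return completed


# Toy example: five taxa, with the distance between taxon 0 and taxon 4 missing.
D = [
    [0.0, 0.3, 0.5, 0.6, None],
    [0.3, 0.0, 0.6, 0.7, 0.8],
    [0.5, 0.6, 0.0, 0.3, 0.6],
    [0.6, 0.7, 0.3, 0.0, 0.5],
    [None, 0.8, 0.6, 0.5, 0.0],
]
completed = impute_distances(D)
print(round(completed[0][4], 3))
```

The completed matrix has no missing entries, so it could then be passed to a distance-based method such as NJ or UPGMA. Observed distances are left untouched; only the missing cells receive estimates, and how accurate those estimates are depends on the rank, the amount of missing data, and the training settings.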
Keywords
#machine learning #distance imputation #phylogenetic trees #missing data #distance matrices #deep learning