Scaffold filling, contig fusion and comparative gene order inference
Abstract
Genome sequencing projects increasingly broaden their phylogenetic scope without finishing the sequence of each genome, so a growing number of genomes are published only in scaffold or contig form. Rearrangement algorithms, however, including gene order-based phylogenetic tools, require whole-genome data on gene order or syntenic block order. How, then, can we use rearrangement algorithms to compare genomes available in scaffold form only? Can the comparative evidence predict the location of unsequenced genes? Our method optimally fills in genes missing from the scaffolds and incorporates the augmented scaffolds directly into the rearrangement algorithms as if they were chromosomes. This is accomplished by an exact, polynomial-time algorithm. We then correct for the number of extra fusion/fission operations required to make scaffolds comparable to full assemblies. We model the relationship between the ratio of missing genes actually absent from the genome to those merely unsequenced, on the one hand, and the increase in genomic distance after scaffold filling, on the other. We estimate the parameters of this model through simulations and by comparing the angiosperm genomes Ricinus communis and Vitis vinifera. The algorithm solves the comparison of genomes with 18,300 genes, including 4,500 missing from one genome, in less than a minute on a MacBook, putting virtually all genomes within range of the method.
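To illustrate why treating scaffolds as chromosomes calls for a fusion/fission correction, the following minimal sketch computes a rearrangement distance between two genomes with identical gene content. It assumes the double-cut-and-join (DCJ) distance as the genomic distance, which the abstract does not specify, and uses the standard adjacency-graph formula d = N - (C + I/2), where N is the number of genes, C the number of cycles and I the number of odd paths. The function names and the toy genomes are hypothetical; this is not the authors' scaffold-filling algorithm, and it handles linear chromosomes only.

```python
def extremity_neighbors(genome):
    """Map each gene extremity to the extremity adjacent to it.

    A genome is a list of linear chromosomes (or scaffolds), each a
    non-empty list of signed integers. A telomere is recorded as None.
    """
    neighbor = {}
    for chrom in genome:
        ends = []
        for g in chrom:
            tail, head = (abs(g), "t"), (abs(g), "h")
            # A forward gene reads tail then head; a reversed gene, head then tail.
            ends.extend((tail, head) if g > 0 else (head, tail))
        neighbor[ends[0]] = None   # left telomere
        neighbor[ends[-1]] = None  # right telomere
        for i in range(1, len(ends) - 1, 2):
            neighbor[ends[i]] = ends[i + 1]
            neighbor[ends[i + 1]] = ends[i]
    return neighbor


def dcj_distance(genome_a, genome_b):
    """DCJ distance N - (C + I/2) for genomes with the same gene content."""
    nbr_a = extremity_neighbors(genome_a)
    nbr_b = extremity_neighbors(genome_b)
    if nbr_a.keys() != nbr_b.keys():
        raise ValueError("genomes must contain exactly the same genes")

    seen = set()
    cycles = 0
    odd_paths = 0
    for start in nbr_a:
        if start in seen:
            continue
        # Collect one connected component by following adjacencies of A and B.
        stack = [start]
        seen.add(start)
        size = 0
        touches_telomere = False
        while stack:
            x = stack.pop()
            size += 1
            for y in (nbr_a[x], nbr_b[x]):
                if y is None:
                    touches_telomere = True
                elif y not in seen:
                    seen.add(y)
                    stack.append(y)
        if not touches_telomere:
            cycles += 1
        elif size % 2 == 1:
            # An odd number of extremities means one telomere from each genome.
            odd_paths += 1

    n_genes = len(nbr_a) // 2
    return n_genes - cycles - odd_paths // 2


# Toy data: one chromosome versus the same gene order split into three scaffolds.
assembled = [[1, 2, 3, 4, 5, 6]]
scaffolds = [[1, 2], [3, 4], [5, 6]]
extra_ops = dcj_distance(assembled, scaffolds)
```

In this toy case the gene order is unchanged, yet the distance equals the two fusions needed to rejoin the three scaffolds. Operations of this kind reflect incomplete assembly rather than evolutionary rearrangement, which is why a correction of the sort described above is needed.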