MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects
Abstract
Second-generation sequencing technologies are precipitating major shifts in which kinds of genomes are sequenced and how they are annotated. While the first generation of genome projects focused on well-studied model organisms, many of today's projects involve exotic organisms whose genomes are largely terra incognita. This complicates their annotation because, unlike in first-generation projects, there are no pre-existing 'gold-standard' gene models with which to train gene-finders. Improvements in genome assembly and the wide availability of mRNA-seq data are also creating opportunities to update and re-annotate previously published genome annotations. Today's genome projects thus need new annotation tools that can meet the challenges and opportunities presented by second-generation sequencing technologies. We present MAKER2, a genome annotation and data management tool designed for second-generation genome projects. MAKER2 is a multi-threaded, parallelized application that can process second-generation datasets of virtually any size. We show that MAKER2 can produce accurate annotations for novel genomes where training data are limited, of low quality, or even non-existent. MAKER2 also provides an easy means to use mRNA-seq data to improve annotation quality, and it can use these data to update legacy annotations, significantly improving their quality. We also show that MAKER2 can evaluate the quality of genome annotations, and identify and prioritize problematic annotations for manual review. MAKER2 is the first annotation engine specifically designed for second-generation genome projects. It scales to datasets of virtually any size, requires little in the way of training data, and can use mRNA-seq data to improve annotation quality. It can also update and manage legacy genome annotation datasets.
