False gene and chromosome losses in genome assemblies caused by GC content variation and repeats
Abstract
Many short-read genome assemblies have been found to be incomplete and contain mis-assemblies. The Vertebrate Genomes Project has been producing new reference genome assemblies with an emphasis on being as complete and error-free as possible, which requires utilizing long reads, long-range scaffolding data, new assembly algorithms, and manual curation. A more thorough evaluation of the recent references relative to prior assemblies can provide a detailed overview of the types and magnitude of improvements. Here we evaluate new vertebrate genome references relative to the previous assemblies for the same species and, in two cases, the same individuals, including a mammal (platypus), two birds (zebra finch, Anna’s hummingbird), and a fish (climbing perch). We find that up to 11% of genomic sequence is entirely missing in the previous assemblies. In the Vertebrate Genomes Project zebra finch assembly, we identify eight new GC- and repeat-rich micro-chromosomes with high gene density. The impact of missing sequences is biased towards GC-rich 5′-proximal promoters and 5′ exon regions of protein-coding genes and long non-coding RNAs. Between 26 and 60% of genes include structural or sequence errors that could lead to misunderstanding of their function when using the previous genome assemblies. Our findings reveal novel regulatory landscapes and protein coding sequences that have been greatly underestimated in previous assemblies and are now present in the Vertebrate Genomes Project reference genomes.