A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

Bioinformatics (Oxford, England) - Vol. 27 No. 21 - Pages 2987-2993 - 2011
Heng Li¹
¹Medical Population Genetics Program, Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA

Abstract

Motivation: Most existing methods for DNA sequence analysis rely on accurate sequences or genotypes. However, in applications of next-generation sequencing (NGS), accurate genotypes may not be easily obtained (e.g. in multi-sample low-coverage sequencing or somatic mutation discovery). These applications press for the development of new methods for analyzing sequence data with uncertainty.

Results: We present a statistical framework for calling SNPs, discovering somatic mutations, inferring population genetical parameters and performing association tests directly based on sequencing data, without explicit genotyping or linkage-based imputation. On real data, we demonstrate that our method achieves accuracy comparable to alternative methods for estimating site allele count, for inferring the allele frequency spectrum and for association mapping. We also highlight the necessity of using symmetric datasets for finding somatic mutations and confirm that, for discovering rare events, mismapping is frequently the leading source of errors.

Availability: http://samtools.sourceforge.net

Contact: [email protected]
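To make the idea of working "without explicit genotyping" concrete, the following is a minimal sketch of one common approach: estimating the alternate allele frequency at a single biallelic site by expectation-maximization over per-sample genotype likelihoods. It is an illustration under stated assumptions, not the implementation distributed with samtools. The function name `em_allele_frequency`, its parameters, and the Hardy-Weinberg prior are assumptions introduced here for the example.

```python
def em_allele_frequency(genotype_likelihoods, init=0.5, tol=1e-8, max_iter=1000):
    """Estimate the alternate allele frequency at one biallelic site.

    genotype_likelihoods: one (L0, L1, L2) tuple per diploid sample, where Lg is
    the likelihood of that sample's reads given g copies of the alternate allele.
    Assumption (not from the abstract): genotypes follow Hardy-Weinberg
    proportions given the current frequency estimate.
    """
    n = len(genotype_likelihoods)
    f = init
    for _ in range(max_iter):
        expected_alt = 0.0
        for l0, l1, l2 in genotype_likelihoods:
            # E-step: unnormalized posterior of each genotype for this sample.
            p0 = (1.0 - f) ** 2 * l0
            p1 = 2.0 * f * (1.0 - f) * l1
            p2 = f * f * l2
            total = p0 + p1 + p2
            # Expected number of alternate alleles carried by this sample.
            expected_alt += (p1 + 2.0 * p2) / total
        # M-step: new frequency is the expected alternate count over 2n chromosomes.
        new_f = expected_alt / (2.0 * n)
        if abs(new_f - f) < tol:
            return new_f
        f = new_f
    return f


if __name__ == "__main__":
    # Hypothetical low-coverage likelihoods: no sample has a certain genotype.
    samples = [(0.9, 0.1, 0.001), (0.3, 0.6, 0.1), (0.8, 0.2, 0.01)]
    print(em_allele_frequency(samples))
```

No hard genotype is ever assigned in this sketch: each sample contributes its expected allele count, weighted by how strongly its reads support each genotype, which is why the approach remains usable when per-sample coverage is too low to call genotypes reliably.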
