Understanding the accuracy of statistical haplotype inference with sequence data of known phase

Genetic Epidemiology - Volume 31, Issue 7 - Pages 659-671 - 2007
Aida M. Andrés1,2, Andrew G. Clark3, Lawrence C. Shimmin4, Eric Boerwinkle5, Christian Ehnholm6, James E. Hixson5
1Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York
2National Human Genome Research Institute, National Institutes of Health, 50 South Drive, Building 50 Room 5527, Bethesda, MD 20892
3Cornell University
4University of Texas Health Science Center at Houston
5Human Genetics Center, University of Texas Health Science Center, Houston, Texas
6Department of Human Genetics, University of Michigan, Ann Arbor, Michigan

Abstract

Statistical methods for haplotype inference from multi‐site genotypes of unrelated individuals have important applications in association studies and population genetics. Understanding the factors that affect the accuracy of this inference is important, but their assessment has been restricted by the limited availability of biological data with known phase. We created hybrid cell lines monosomic for human chromosome 19 and produced complete single‐chromosome sequences of a 48 kb genomic region in 39 individuals of African American (AA) and European American (EA) origin. We used these phase‐known genotypes and coalescent simulations to assess the accuracy of statistical haplotype reconstruction by several algorithms. The accuracy of phase inference was quite low in our biological data even for regions as short as 25–50 kb, suggesting that caution is needed when analyzing reconstructed haplotypes. Moreover, the estimated confidence in phase inference is not reliable enough to allow site‐specific uncertainty information to be incorporated in subsequent analyses. We show that, in samples of certain mixed ancestry (AA and EA populations), the most accurate haplotypes are probably obtained by increasing sample size, that is, by considering the largest, pooled sample, despite the hypothetical problems associated with pooling across those heterogeneous samples. Strategies to improve confidence in reconstructed haplotypes, and realistic alternatives to the analysis of inferred haplotypes, are discussed. Genet. Epidemiol. © 2007 Wiley‐Liss, Inc.
