Clumpak: a program for identifying clustering modes and packaging population structure inferences across K

Molecular Ecology Resources - Volume 15, Issue 5 - Pages 1179-1191 - 2015
Naama M. Kopelman1, Jonathan Mayzel1, Mattias Jakobsson2, Noah A. Rosenberg3, Itay Mayrose1
1Department of Molecular Biology and Ecology of Plants, Tel Aviv University, Ramat Aviv, 69978, Israel
2Department of Evolutionary Biology and SciLife Lab, Uppsala University, Uppsala, 75236, Sweden
3Department of Biology, Stanford University, Stanford, CA 94305, USA

Abstract

The identification of the genetic structure of populations from multilocus genotype data has become a central component of modern population‐genetic data analysis. Application of model‐based clustering programs often entails a number of steps, in which the user considers different modelling assumptions, compares results across different predetermined values of the number of assumed clusters (a parameter typically denoted K), examines multiple independent runs for each fixed value of K, and distinguishes among runs belonging to substantially distinct clustering solutions. Here, we present Clumpak (Cluster Markov Packager Across K), a method that automates the postprocessing of results of model‐based population structure analyses. For analysing multiple independent runs at a single K value, Clumpak identifies sets of highly similar runs, separating distinct groups of runs that represent distinct modes in the space of possible solutions. This procedure, which generates a consensus solution for each distinct mode, is performed by the use of a Markov clustering algorithm that relies on a similarity matrix between replicate runs, as computed by the software Clumpp. Next, Clumpak identifies an optimal alignment of inferred clusters across different values of K, extending a similar approach implemented for a fixed K in Clumpp and simplifying the comparison of clustering results across different K values. Clumpak incorporates additional features, such as implementations of methods for choosing K and comparing solutions obtained by different programs, models, or data subsets. Clumpak, available at http://clumpak.tau.ac.il, simplifies the use of model‐based analyses of population structure in population genetics and molecular ecology.
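
As an illustration of the mode-identification step described above, the following is a minimal sketch of Markov clustering applied to a pairwise similarity matrix between replicate runs. It is not Clumpak's implementation: the function name `markov_cluster`, the `threshold`, `inflation` and `expansion` parameters, and the toy similarity values are all assumptions made for this example, and in practice the similarity matrix would come from Clumpp.

```python
import numpy as np


def markov_cluster(similarity, threshold=0.5, inflation=2.0, expansion=2,
                   max_iter=100, tol=1e-6):
    """Group runs into modes with a basic Markov clustering (MCL) loop.

    All parameter names and defaults are illustrative assumptions,
    not values taken from Clumpak.
    """
    m = np.asarray(similarity, dtype=float).copy()
    m[m < threshold] = 0.0            # assumed pruning of weak edges
    np.fill_diagonal(m, 1.0)          # self-loops keep every column non-empty
    m /= m.sum(axis=0, keepdims=True)  # make columns sum to 1

    for _ in range(max_iter):
        prev = m
        m = np.linalg.matrix_power(m, expansion)  # expansion: random-walk steps
        m = m ** inflation                        # inflation: boost strong flows
        m /= m.sum(axis=0, keepdims=True)         # renormalise columns
        if np.allclose(m, prev, atol=tol):
            break

    # Each row that still carries flow lists the members of one cluster;
    # identical member sets are reported once.
    clusters = []
    for row in m:
        members = tuple(int(i) for i in np.flatnonzero(row > 1e-6))
        if members and members not in clusters:
            clusters.append(members)
    return clusters


# Toy similarity matrix for five replicate runs (made-up values):
# runs 0-2 resemble each other, as do runs 3-4.
runs = np.array([
    [1.00, 0.95, 0.95, 0.20, 0.20],
    [0.95, 1.00, 0.95, 0.20, 0.20],
    [0.95, 0.95, 1.00, 0.20, 0.20],
    [0.20, 0.20, 0.20, 1.00, 0.95],
    [0.20, 0.20, 0.20, 0.95, 1.00],
])
modes = markov_cluster(runs)
print(modes)
```

With these toy values the two blocks of runs should come out as separate modes. A consensus solution per mode would then be obtained by averaging the aligned membership matrices of the runs in each group, a step this sketch does not cover.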
