Global-scale phylogenetic linguistic inference from lexical resources

Scientific Data - Volume 5, Issue 1
Gerhard Jäger¹
¹ Tübingen University, Institute of Linguistics, Wilhelmstr. 19, Tübingen, 72074, Germany

Abstract

Automatic phylogenetic inference plays an increasingly important role in computational historical linguistics. Most pertinent work is currently based on expert cognate judgments. This limits the scope of this approach to a small number of well-studied language families. We used machine learning techniques to compile data suitable for phylogenetic inference from the ASJP database, a collection of almost 7,000 phonetically transcribed word lists over 40 concepts, covering two thirds of the extant world-wide linguistic diversity. First, we estimated Pointwise Mutual Information scores between sound classes using weighted sequence alignment and general-purpose optimization. From this we computed a dissimilarity matrix over all ASJP word lists. This matrix is suitable for distance-based phylogenetic inference. Second, we applied cognate clustering to the ASJP data, using supervised training of an SVM classifier on expert cognacy judgments. Third, we defined two types of binary characters, based on automatically inferred cognate classes and on sound-class occurrences. Several tests are reported demonstrating the suitability of these characters for character-based phylogenetic inference.
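
To make the third step more concrete, the following is a minimal sketch of how cognate classes might be turned into binary presence/absence characters. It is not the author's code. The function name `cognate_characters`, the triple-based input layout, the toy data, and the use of `?` for a language with no recorded word for a concept are all assumptions made for illustration.

```python
from collections import defaultdict


def cognate_characters(entries):
    """Encode cognate classes as binary characters (illustrative sketch).

    entries: iterable of (language, concept, cognate_class) triples.
    Returns (characters, matrix): characters is a sorted list of
    (concept, cognate_class) pairs, and matrix maps each language to a
    string over {'1', '0', '?'} with one symbol per character.
    """
    classes_by_lang = defaultdict(set)   # language -> {(concept, class)}
    concepts_by_lang = defaultdict(set)  # language -> {concept}
    for language, concept, cognate_class in entries:
        classes_by_lang[language].add((concept, cognate_class))
        concepts_by_lang[language].add(concept)

    # One character per (concept, cognate class) pair seen in the data.
    characters = sorted({c for cs in classes_by_lang.values() for c in cs})

    matrix = {}
    for language in sorted(classes_by_lang):
        row = []
        for concept, cognate_class in characters:
            if concept not in concepts_by_lang[language]:
                row.append("?")  # assumption: missing concept -> unknown state
            elif (concept, cognate_class) in classes_by_lang[language]:
                row.append("1")  # language has a word in this cognate class
            else:
                row.append("0")  # concept attested, but in another class
        matrix[language] = "".join(row)
    return characters, matrix


# Hypothetical toy data: class labels are arbitrary identifiers.
toy = [
    ("German", "hand", "A"), ("English", "hand", "A"), ("French", "hand", "B"),
    ("German", "fish", "C"), ("English", "fish", "C"),
]
characters, matrix = cognate_characters(toy)
for language, row in matrix.items():
    print(language, row)
```

Each row of the resulting matrix could then be written out in whatever alignment format a character-based inference tool expects. Sound-class characters would follow the same pattern, with a sound class taking the place of the cognate class.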
