Zipf’s law revisited: Spoken dialog, linguistic units, parameters, and the principle of least effort

Guido M. Linders¹, Max M. Louwerse¹
¹Department of Cognitive Science & Artificial Intelligence, Tilburg University, Tilburg, The Netherlands

Abstract

The ubiquitous inverse relationship between word frequency and word rank is commonly known as Zipf’s law. Its theoretical underpinning, the so-called principle of least effort, states that this inverse relationship reduces effort for both speaker and hearer. Most research has demonstrated the inverse relationship only for written monolog and only for the frequencies and ranks of a single linguistic unit, generally word unigrams. That work reports strong correlations between the power law and the observed frequency distributions, but pays limited to no attention to psychological mechanisms such as the principle of least effort. The current paper extends the existing findings in four ways: by focusing not on written monolog but on a more fundamental form of communication, spoken dialog; by investigating not only word unigrams but also units quantified on syntactic, pragmatic, utterance, and nonverbal communicative levels; by showing that the adequacy of Zipf’s formula seems ubiquitous, whereas the exponent of the power law curve is not; and by placing these findings in the context of Zipf’s principle of least effort through redefining effort in terms of the cognitive resources available for communication. Our findings show that Zipf’s law also applies to a more natural form of communication, spoken dialog; that it applies to a range of linguistic units beyond word unigrams; that the general good fit of Zipf’s law needs to be revisited in light of the parameters of the formula; and that the principle of least effort is a useful theoretical framework for the findings of Zipf’s law.
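
To make the rank-frequency relationship and its exponent concrete, the following is a minimal sketch, not the authors' procedure. It assumes the simple form f(r) ≈ C / r^α and estimates α by ordinary least squares on log rank versus log frequency. The function name `zipf_exponent` and the token-list input are illustrative assumptions; the abstract does not describe the fitting method, and least squares on log-log data is only one option (maximum-likelihood estimation is often preferred for power laws).

```python
import math
from collections import Counter


def zipf_exponent(tokens):
    """Estimate alpha in f(r) ~ C / r**alpha via least squares on log-log data.

    Hypothetical helper for illustration only; not the paper's method.
    """
    counts = Counter(tokens)
    # Sort frequencies from most to least common; rank 1 is the most frequent unit.
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct units to fit a slope")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    # The log-log slope is negative for a decaying distribution;
    # the Zipf exponent is its negation.
    return -sxy / sxx


# Toy data whose frequencies are exactly 60 / rank (60, 30, 20, 15, 12).
toy_tokens = (
    ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15 + ["in"] * 12
)
alpha = zipf_exponent(toy_tokens)
print(f"estimated exponent: {alpha:.3f}")
```

The same function accepts any sequence of hashable units, so part-of-speech tags, dialog-act labels, or gesture codes could be passed in place of words. On the toy data the estimate should come out close to 1, because the frequencies were built to follow 60 / rank exactly.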
