Abstract
Human languages are often claimed to differ fundamentally from other communication systems. But what exactly unites them as a separate category? This article proposes to approach this problem, here termed the Zipfian Challenge, as a standard classification task. A corpus with textual material from diverse writing systems and languages, as well as other symbolic and non-symbolic systems, is provided. This material is then used to train and test binary classification algorithms, which assign the labels “writing” and “non-writing” to character strings in the test sets. Performance is generally high, reaching 98% accuracy for the best algorithms. Human languages turn out to have a statistical fingerprint: large unit inventories, high entropy, and few repetitions of adjacent units. This fingerprint can be used to tease them apart from other symbolic and non-symbolic systems.
- Anthology ID: 2023.conll-1.3
- Volume: Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)
- Month: December
- Year: 2023
- Address: Singapore
- Editors: Jing Jiang, David Reitter, Shumin Deng
- Venue: CoNLL
- Publisher: Association for Computational Linguistics
- Pages: 27–37
- URL: https://aclanthology.org/2023.conll-1.3
- DOI: 10.18653/v1/2023.conll-1.3
- Cite (ACL): Christian Bentz. 2023. The Zipfian Challenge: Learning the statistical fingerprint of natural languages. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), pages 27–37, Singapore. Association for Computational Linguistics.
- Cite (Informal): The Zipfian Challenge: Learning the statistical fingerprint of natural languages (Bentz, CoNLL 2023)
- PDF: https://preview.aclanthology.org/nschneid-patch-1/2023.conll-1.3.pdf
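
The abstract names three properties as the statistical fingerprint of human languages: large unit inventories, high entropy, and few repetitions of adjacent units. The sketch below shows one plausible way to compute such statistics for a character string. It is an illustration only: the function name `fingerprint`, the choice of single characters as units, and the exact definitions (unigram Shannon entropy, share of adjacent pairs that are identical) are assumptions made here, not the feature definitions or classifiers used in the paper.

```python
import math
from collections import Counter


def fingerprint(text: str) -> dict:
    """Return three simple statistics for a string of units (here: characters).

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if not text:
        raise ValueError("text must be non-empty")

    counts = Counter(text)
    n = len(text)

    # Shannon entropy of the unigram distribution, in bits per unit.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())

    # Share of adjacent pairs in which a unit is immediately repeated.
    pairs = n - 1
    repeats = sum(a == b for a, b in zip(text, text[1:]))
    repetition_rate = repeats / pairs if pairs else 0.0

    return {
        "inventory_size": len(counts),  # number of distinct units
        "entropy_bits": entropy,
        "repetition_rate": repetition_rate,
    }
```

For the string "aabb", this gives an inventory size of 2, an entropy of 1.0 bit, and a repetition rate of 2/3. Feature vectors of this kind could then be passed to any binary classifier to separate “writing” from “non-writing” strings.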