How Should We Model the Probability of a Language?
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot
Abstract
Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
- Anthology ID:
- 2026.vardial-1.18
- Volume:
- Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Venues:
- VarDial | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 223–233
- URL:
- https://preview.aclanthology.org/manual-author-scripts/2026.vardial-1.18/
- Cite (ACL):
- Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, and Benoît Sagot. 2026. How Should We Model the Probability of a Language? In Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 223–233, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- How Should We Model the Probability of a Language? (Dent et al., VarDial 2026)
- PDF:
- https://preview.aclanthology.org/manual-author-scripts/2026.vardial-1.18.pdf