How Should We Model the Probability of a Language?

Rasul Dent; Pedro Ortiz Suarez; Thibault Clérice; Benoît Sagot

Abstract

Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.

Anthology ID: 2026.vardial-1.18
Volume: Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
Month: March
Year: 2026
Address: Rabat, Morocco
Venues: VarDial | WS
Publisher: Association for Computational Linguistics
Pages: 223–233
URL: https://preview.aclanthology.org/manual-author-scripts/2026.vardial-1.18/
Cite (ACL): Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, and Benoît Sagot. 2026. How Should We Model the Probability of a Language? In Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects, pages 223–233, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): How Should We Model the Probability of a Language? (Dent et al., VarDial 2026)
PDF: https://preview.aclanthology.org/manual-author-scripts/2026.vardial-1.18.pdf
