Do LLMs model human linguistic variation? A case study in Hindi-English Verb code-mixing

Mukund Choudhary, Madhur Jindal, Gaurja Aeron, Monojit Choudhury


Abstract
Do large language models (LLMs) model linguistic variation? We investigate this question through Hindi-English (Hinglish) verb code-mixing, where speakers can use either a Hindi verb or an English verb with the light verb karna ('do'). Both forms are grammatical, but speakers show unexplained variation in which language they choose for the verb. We compare human preferences on controlled code-mixed minimal pairs to LLM perplexities across model families, sizes, and training language compositions. We find that current LLMs do not reliably classify verb language preferences in line with native speaker judgments, although with targeted supervision some models do predict human preference to an extent. We release native speaker acceptability judgments on 30 verb pairs, perplexity ratios for 4,279 verb pairs across 7 models, and our experimental materials.
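The perplexity-based comparison described in the abstract can be sketched as below. This is a minimal illustration, not the authors' exact pipeline: the log-probability values and the example sentence pair are hypothetical, and a real run would obtain per-token log-probabilities from an actual language model.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sentence from per-token natural-log probabilities."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def preference_ratio(logprobs_hindi_verb, logprobs_english_verb):
    """Perplexity ratio for one code-mixed minimal pair.

    A ratio below 1 means the model assigns lower perplexity to
    (i.e., prefers) the Hindi-verb variant.
    """
    return perplexity(logprobs_hindi_verb) / perplexity(logprobs_english_verb)

# Hypothetical token log-probabilities for one minimal pair, e.g.
# "maine kaam khatam kiya" vs. "maine kaam finish kiya":
hindi_variant = [-1.2, -0.8, -1.5, -0.9]
english_variant = [-1.2, -0.8, -2.3, -0.9]

ratio = preference_ratio(hindi_variant, english_variant)
print(f"{ratio:.3f}")  # < 1 here, so this toy model prefers the Hindi verb
```

Aggregating such ratios over many verb pairs, and comparing them against native speaker acceptability judgments, is the kind of analysis the paper reports for 4,279 pairs and 7 models.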
Anthology ID:
2026.findings-eacl.291
Volume:
Findings of the Association for Computational Linguistics: EACL 2026
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Màrquez
Venue:
Findings
Publisher:
Association for Computational Linguistics
Note:
Pages:
5491–5509
URL:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.291/
Cite (ACL):
Mukund Choudhary, Madhur Jindal, Gaurja Aeron, and Monojit Choudhury. 2026. Do LLMs model human linguistic variation? A case study in Hindi-English Verb code-mixing. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5491–5509, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Do LLMs model human linguistic variation? A case study in Hindi-English Verb code-mixing (Choudhary et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.findings-eacl.291.pdf
Checklist:
2026.findings-eacl.291.checklist.pdf