LLM-Adapted Colombian Spanish Lexicography: Proficiency Control, Hallucination, and Cultural Distortion

Johnatan E. Bonilla


Abstract
We evaluate whether open-source LLMs can produce proficiency-graded English adaptations of entries from the Diccionario de colombianismos (DiCol), a Colombian Spanish lexicographic resource used in language teaching. Three 7–8B instruction-tuned models—Llama 3.1, Qwen2.5, and Mistral—generate Beginner, Intermediate, and Advanced translations for all 8,252 definitions using structured zero-shot prompts identical across levels except for the target CEFR band. Automated metrics show that Intermediate targeting collapses (73–83% classified as Advanced by vocabulary, 𝜒2 > 705, p < .001) and that Advanced outputs expand 4.9–8.2× relative to the source. Expert annotation of a 360-entry stratified sample (𝜅 = 0.61–0.68) identifies hallucination in 19% of entries (Fleiss’ 𝜅 = 0.77 for cultural preservation categories, 97% unanimity among flagged cases). Hallucination concentrates in the Advanced condition (81%, 𝜒2 = 86.6, p < .001) and is associated with higher expansion (U = 16,662, p < .001, r = 0.68), manifesting primarily as generic elaboration and, in a smaller proportion, as Colombia-stereotyping and pragmatic polarity inversion. We discuss these findings through the lens of (CITATION)’s domestication framework and describe the observed pattern as algorithmic domestication.
Anthology ID:
2026.c3nlp-1.5
Volume:
Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Vinodkumar Prabhakaran, Sunipa Dev, Luciana Benotti, Daniel Hershcovich, Yong Cao, Li Zhou, BOlei Ma, Ife Adebara
Venues:
C3NLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
67–75
Language:
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.c3nlp-1.5/
DOI:
Bibkey:
Cite (ACL):
Johnatan E. Bonilla. 2026. LLM-Adapted Colombian Spanish Lexicography: Proficiency Control, Hallucination, and Cultural Distortion. In Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026), pages 67–75, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
LLM-Adapted Colombian Spanish Lexicography: Proficiency Control, Hallucination, and Cultural Distortion (Bonilla, C3NLP 2026)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.c3nlp-1.5.pdf