The Mirage of Diversity: Unmasking the Cultural Vocabulary Ceiling in LLMs

Soumedhik Bharati; Subhrajit Mukherjee; Shibam Mandal

Abstract

Large Language Models are widely used to generate and adapt cultural texts, yet the depth of their cultural representation remains poorly quantified. Intuitively, as a narrative text grows in length, the diversity of its cultural vocabulary should scale proportionately. To formally test this, we evaluate the FairyTaleQA dataset as adapted by three models, and introduce our primary contribution: the Contextual Stereotype Amplification Index (CSAI), an evaluation framework combining LLM-as-a-judge extraction, embedding-based cliché anchoring, and Natural Language Inference (NLI) congruence validation. By mapping the frequency of extracted Culture Specific Items (CSIs) against narrative length using Heaps’ Law ($V = k \cdot T^{\beta}$), we present empirical evidence of a systematic limitation in current systems: they struggle to scale cultural diversity even under explicit cultural prompting. Models rapidly hit a "Cultural Vocabulary Ceiling," constrained to a fixed set of hyper-stereotypical terms. Furthermore, we demonstrate that merely optimizing for higher CSI frequency, as done in prior work, rewards logically broken tokenism. Our CSAI formulation actively penalizes such gratuitous stereotyping, offering a more principled approach to measuring and evaluating cultural homogenization in generative AI systems.
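The Heaps' Law relation named in the abstract can be illustrated with a minimal sketch. It assumes the exponent is estimated by ordinary least squares on the log-log form, log V = log k + β log T; the abstract does not state the fitting procedure, so this is an assumption. The function name `fit_heaps` and the data are hypothetical and synthetic, not results from the paper.

```python
import math


def fit_heaps(lengths, vocab_sizes):
    """Estimate k and beta in V = k * T**beta by least squares on logs.

    Hypothetical helper for illustration; not the authors' code.
    lengths: narrative lengths T (e.g. token counts), all > 0
    vocab_sizes: distinct item counts V observed at each length, all > 0
    """
    xs = [math.log(t) for t in lengths]
    ys = [math.log(v) for v in vocab_sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line in log-log space is beta.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    # Intercept of the regression line is log k.
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Synthetic data generated from V = 2 * T**0.5 (illustrative values only).
lengths = [100, 200, 400, 800, 1600]
vocab_sizes = [2.0 * t ** 0.5 for t in lengths]

k, beta = fit_heaps(lengths, vocab_sizes)
print(f"k = {k:.3f}, beta = {beta:.3f}")
```

Under this reading, a fitted β close to zero would mean the count of distinct items barely grows as texts get longer, which is one way to picture the "ceiling" the abstract describes.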

Anthology ID: 2026.c3nlp-1.7
Volume: Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Vinodkumar Prabhakaran, Sunipa Dev, Luciana Benotti, Daniel Hershcovich, Yong Cao, Li Zhou, Bolei Ma, Ife Adebara
Venues: C3NLP | WS
Publisher: Association for Computational Linguistics
Pages: 101–107
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.c3nlp-1.7/
Cite (ACL): Soumedhik Bharati, Subhrajit Mukherjee, and Shibam Mandal. 2026. The Mirage of Diversity: Unmasking the Cultural Vocabulary Ceiling in LLMs. In Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026), pages 101–107, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): The Mirage of Diversity: Unmasking the Cultural Vocabulary Ceiling in LLMs (Bharati et al., C3NLP 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.c3nlp-1.7.pdf