Quulgan Minggad

2026

MonCulture-Eval: A Hierarchical Benchmark for Evaluating Mongolian Cultural Capabilities of Large Language Models across Scripts and Regions
Quulgan Minggad | Xiao Zinan | Yuan Sun
Findings of the Association for Computational Linguistics: ACL 2026

While Large Language Models (LLMs) have achieved impressive linguistic fluency in low-resource languages, their capacity to process deep cultural nuances remains insufficiently quantified. This paper introduces MonCulture-Eval, a benchmark designed to assess the cultural intelligence of LLMs in the Mongolian context across two writing systems (Traditional and Cyrillic) and three regional sub-cultures (Alxa, Ordos, and Horqin). Curated entirely from primary, non-digitized archives to prevent data contamination, the benchmark employs a three-layer cognitive hierarchy—Factual, Situational, and Values—supplemented by specialized tasks including Riddles, Taboos, and Proverbs. Evaluation of frontier models reveals a severe "Script Gap" and a systematic "Etic Bias," where models sanitize spiritual rituals into secular functional norms.

Diversity in Unity, Theory in Practice: Hierarchical Multitask Benchmarks for Chinese Minority Languages
Yijie Li | Xi Cao | Yuan Sun | Quulgan Minggad | Abdulla Ablikim | Jia Qing Cai Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Despite the rapid advancement of LLMs, their performance on linguistically and culturally diverse minority languages within a unified national context remains underexplored. We present CMiLBench, a collection of hierarchical multitask benchmarks designed to translate theoretical notions of “diversity in unity” into practical evaluation for three representative Chinese minority languages: Tibetan, Mongolian, and Uyghur. CMiLBench comprises 24,663 instances across 5 difficulty levels and 17 tasks spanning foundational ability, cultural specificity, and safety alignment. We adopt existing dataset adaptation, minority knowledge construction, and high-resource benchmark translation to construct CMiLBench. We assess 14 state-of-the-art commercial and open-source LLMs with a hybrid framework that integrates automatic metrics and LLM-as-a-Judge scoring. The comparative experimental results reveal the gap between theoretical capability and practical utility. CMiLBench serves as a foundational and scalable evaluation resource to bridge the digital language divide and promote the informatization and intelligentization of low-resource Chinese minority languages.

Co-authors

Xiao Zinan 1

Venues

ACL 1
Findings 1
