No Language Data Left Behind: A Cross-Cultural Study of CJK Language Datasets in the Hugging Face Ecosystem

Dasol Choi, Woomyoung Park, Youngsook Song


Abstract
Recent advances in Natural Language Processing (NLP) have underscored the crucial role of high-quality datasets in building large language models (LLMs). However, while extensive resources and analyses exist for English, the landscape for East Asian languages, particularly Chinese, Japanese, and Korean (CJK), remains fragmented and underexplored, despite these languages serving over 1.6 billion speakers. To address this gap, we investigate the HuggingFace ecosystem from a cross-linguistic perspective, focusing on how cultural norms, research environments, and institutional practices shape dataset availability and quality. Drawing on more than 3,300 datasets, we employ quantitative and qualitative methods to examine how these factors drive distinct creation and curation patterns across Chinese, Japanese, and Korean NLP communities. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment and subculture-focused emphasis on Japanese collections. By uncovering these patterns, we reveal practical strategies for enhancing dataset documentation, licensing clarity, and cross-lingual resource sharing, guiding more effective and culturally attuned LLM development in East Asia. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.
Anthology ID:
2025.mrl-main.1
Volume:
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
Month:
November
Year:
2025
Address:
Suzhuo, China
Editors:
David Ifeoluwa Adelani, Catherine Arnett, Duygu Ataman, Tyler A. Chang, Hila Gonen, Rahul Raja, Fabian Schmidt, David Stap, Jiayi Wang
Venues:
MRL | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
1–10
Language:
URL:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.mrl-main.1/
DOI:
10.18653/v1/2025.mrl-main.1
Bibkey:
Cite (ACL):
Dasol Choi, Woomyoung Park, and Youngsook Song. 2025. No Language Data Left Behind: A Cross-Cultural Study of CJK Language Datasets in the Hugging Face Ecosystem. In Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025), pages 1–10, Suzhuo, China. Association for Computational Linguistics.
Cite (Informal):
No Language Data Left Behind: A Cross-Cultural Study of CJK Language Datasets in the Hugging Face Ecosystem (Choi et al., MRL 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/name-variant-enfa-fane/2025.mrl-main.1.pdf