We Need to Measure Data Diversity in NLP — Better and Broader

Dong Nguyen, Esther Ploeger


Abstract
Although diversity in NLP datasets has received growing attention, the question of how to measure it remains largely underexplored. This opinion paper examines the conceptual and methodological challenges of measuring data diversity and argues that interdisciplinary perspectives are essential for developing more fine-grained and valid measures.
Anthology ID:
2025.emnlp-main.445
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8823–8832
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.445/
Cite (ACL):
Dong Nguyen and Esther Ploeger. 2025. We Need to Measure Data Diversity in NLP — Better and Broader. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8823–8832, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
We Need to Measure Data Diversity in NLP — Better and Broader (Nguyen & Ploeger, EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.445.pdf
Checklist:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.445.checklist.pdf