EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models

Abhay Gupta, Jacob Cheung, Philip Meng, Shayan Sayyed, Kevin Zhu, Austen Liao, Sean O’Brien


Abstract
The diversity of human language, shaped by social, cultural, and regional influences, presents significant challenges for natural language processing (NLP) systems. Existing benchmarks often overlook intra-language variations, leaving speakers of non-standard dialects underserved. To address this gap, we introduce EnDive (English Diversity), a benchmark that evaluates seven state-of-the-art (SOTA) large language models (LLMs) across tasks in language understanding, algorithmic reasoning, mathematics, and logic. Our framework translates Standard American English (SAE) datasets into five underrepresented dialects using few-shot prompting with verified examples from native speakers, and compares these translations against rule-based methods via fluency assessments, preference tests, and semantic similarity metrics. Human evaluations confirm high translation quality, with average scores of at least 6.02/7 for faithfulness, fluency, and formality. By filtering out near-identical translations, we create a challenging dataset that reveals significant performance disparities: models consistently underperform on dialectal inputs compared to SAE. EnDive thus advances dialect-aware NLP by uncovering model biases and promoting more equitable language technologies.
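The filtering step mentioned above (dropping translations that are near-identical to the SAE source) can be sketched as follows. This is a hypothetical illustration, not the authors' code: EnDive uses semantic similarity metrics, whereas this sketch substitutes a simple surface-level ratio from Python's standard-library `difflib`, and the `0.95` threshold is an assumed value.

```python
# Hypothetical sketch of near-identical translation filtering.
# Assumption: a surface-similarity ratio stands in for the semantic
# similarity metrics actually used by the EnDive pipeline.
from difflib import SequenceMatcher

def is_near_identical(sae: str, dialect: str, threshold: float = 0.95) -> bool:
    """Return True when the dialect translation barely differs from the SAE source."""
    return SequenceMatcher(None, sae.lower(), dialect.lower()).ratio() >= threshold

# Toy SAE/dialect pairs for illustration only.
pairs = [
    ("He is going to the store.", "He finna go to the store."),  # genuinely dialectal: kept
    ("The answer is twelve.", "The answer is twelve."),          # unchanged: filtered out
]
kept = [(sae, dia) for sae, dia in pairs if not is_near_identical(sae, dia)]
```

Retaining only pairs that differ substantially from the source is what makes the resulting benchmark challenging: items where the "translation" is effectively still SAE would not probe dialectal robustness at all.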
Anthology ID:
2025.findings-emnlp.913
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
16830–16855
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.913/
DOI:
10.18653/v1/2025.findings-emnlp.913
Cite (ACL):
Abhay Gupta, Jacob Cheung, Philip Meng, Shayan Sayyed, Kevin Zhu, Austen Liao, and Sean O’Brien. 2025. EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 16830–16855, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
EnDive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models (Gupta et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.913.pdf
Checklist:
 2025.findings-emnlp.913.checklist.pdf