Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum

Ryan Soh-Eun Shim, Barbara Plank


Abstract
There is increasing interest in looking at dialects in NLP. However, most work to date still treats dialects as discrete categories. For instance, evaluative work in variation-oriented NLP for English often works with Indian English or African-American Vernacular English as homogeneous categories, yet even within one variety there is substantial variation. We examine within-dialect variation and show that performance critically varies within categories. We measure speech-to-text performance on Italian dialects, and empirically observe a geographical performance disparity. This disparity correlates substantially (-0.5) with linguistic similarity to the highest performing dialect variety. We cross-examine our results against dialectometry methods, and attribute the performance disparity to a bias, in the speech-to-text model examined, towards dialects that are more similar to the standard variety. We additionally leverage geostatistical methods to predict zero-shot performance at unseen sites, and find that incorporating geographical information substantially improves prediction performance, indicating that there is geographical structure in the performance distribution.
Anthology ID:
2025.findings-naacl.48
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
838–849
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2025.findings-naacl.48/
DOI:
10.18653/v1/2025.findings-naacl.48
Cite (ACL):
Ryan Soh-Eun Shim and Barbara Plank. 2025. Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 838–849, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
Dialetto, ma Quanto Dialetto? Transcribing and Evaluating Dialects on a Continuum (Shim & Plank, Findings 2025)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2025.findings-naacl.48.pdf