Abstract
We introduce RoDia, the first dataset for Romanian dialect identification from speech. The RoDia dataset includes a varied compilation of speech samples from five distinct regions of Romania, covering both urban and rural environments, totaling 2 hours of manually annotated speech data. Along with our dataset, we introduce a set of competitive models to be used as baselines for future research. The top scoring model achieves a macro F1 score of 59.83% and a micro F1 score of 62.08%, indicating that the task is challenging. We thus believe that RoDia is a valuable resource that will stimulate research aiming to address the challenges of Romanian dialect identification. We release our dataset at https://github.com/codrut2/RoDia.
- Anthology ID:
- 2024.findings-naacl.20
- Volume:
- Findings of the Association for Computational Linguistics: NAACL 2024
- Month:
- June
- Year:
- 2024
- Address:
- Mexico City, Mexico
- Editors:
- Kevin Duh, Helena Gomez, Steven Bethard
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 279–286
- URL:
- https://aclanthology.org/2024.findings-naacl.20
- Cite (ACL):
- Rotaru Codruț, Nicolae Ristea, and Radu Ionescu. 2024. RoDia: A New Dataset for Romanian Dialect Identification from Speech. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 279–286, Mexico City, Mexico. Association for Computational Linguistics.
- Cite (Informal):
- RoDia: A New Dataset for Romanian Dialect Identification from Speech (Codruț et al., Findings 2024)
- PDF:
- https://preview.aclanthology.org/ingestion-checklist/2024.findings-naacl.20.pdf
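
The abstract reports both a macro F1 and a micro F1 score. As a minimal sketch of how the two averages differ for a single-label, multi-class task such as dialect identification, the snippet below computes both from scratch. The function name `f1_scores` and the toy labels `"A"`, `"B"`, `"C"` are hypothetical placeholders, not taken from the paper or the RoDia repository, and the paper's own evaluation code may differ.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # predicted class p wrongly
            fn[t] += 1          # missed an instance of class t

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: average the per-class F1 scores, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes first, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels (not RoDia data): class "B" is never predicted.
y_true = ["A", "A", "A", "A", "B", "C"]
y_pred = ["A", "A", "A", "A", "A", "C"]
macro_f1, micro_f1 = f1_scores(y_true, y_pred)
print(f"macro F1 = {macro_f1:.4f}, micro F1 = {micro_f1:.4f}")
```

In this toy case the macro score is pulled down by the class that is never predicted correctly, while the micro score equals plain accuracy. A macro score below the micro score, as in the abstract, is consistent with weaker performance on some classes, though the abstract alone does not say which ones.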