Linear Semantic Segmentation for Low-Resource Spoken Dialects
Kirill Chirkunov, Younes Samih, Abed Alhakim Freihat, Hanan Aldarmaki
Abstract
Semantic segmentation is a core component of discourse analysis, yet existing models are primarily developed and evaluated on high-resource written text, limiting their effectiveness on low-resource conversational varieties. In particular, dialectal Arabic exhibits informal syntax, code-switching, and weakly marked discourse structure that challenge standard semantic segmentation approaches for text. In this paper, we introduce a new multi-genre benchmark (more than 1000 samples) for semantic segmentation in Arabic, focusing on dialectal discourse. The benchmark covers casual telephone conversations, code-switched podcasts, expressive dialogue, and broadcast news, and was annotated and validated by native Arabic annotators. Using this benchmark, we show that segmentation models performing well on MSA news genres degrade on dialectal conversational texts. We further propose a segmentation model that targets local semantic coherence and robustness to discourse discontinuities, consistently outperforming strong baselines on dialectal non-news genres. The benchmark and approach generalize to other low-resource spoken languages.
- Anthology ID:
- 2026.findings-acl.1740
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 34844–34861
- URL:
- https://preview.aclanthology.org/ingestion-form-platform/2026.findings-acl.1740/
- Cite (ACL):
- Kirill Chirkunov, Younes Samih, Abed Alhakim Freihat, and Hanan Aldarmaki. 2026. Linear Semantic Segmentation for Low-Resource Spoken Dialects. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34844–34861, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- Linear Semantic Segmentation for Low-Resource Spoken Dialects (Chirkunov et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingestion-form-platform/2026.findings-acl.1740.pdf