Data-Augmentation-Based Dialectal Adaptation for LLMs

Fahim Faisal, Antonios Anastasopoulos


Abstract
This report presents gmnlp’s participation to the Dialect-Copa shared task at VarDial 2024 (Chifu et al., 2024), which focuses on evaluating the commonsense reasoning capabilities of large language models (LLMs) on South Slavic micro-dialects. The task aims to assess how well LLMs can handle non-standard dialectal varieties, as their performance on standard languages is already well-established. We propose an approach that combines the strengths of different types of language models and leverages data augmentation techniques to improve task performance on three South Slavic dialects: Chakavian, Cherkano, and Torlak. We conduct experiments using a language-family-focused encoder-based model (BERTić) and a domain-agnostic multilingual model (AYA-101). Our results demonstrate that the proposed data augmentation techniques lead to substantial performance gains across all three test datasets in the open-source model category. This work highlights the practical utility of data augmentation and the potential of LLMs in handling non-standard dialectal varieties, contributing to the broader goal of advancing natural language understanding in low-resource and dialectal settings.
Anthology ID:
2024.vardial-1.17
Volume:
Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)
Month:
June
Year:
2024
Address:
Mexico City, Mexico
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Marcos Zampieri, Preslav Nakov, Jörg Tiedemann
Venues:
VarDial | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
197–208
Language:
URL:
https://aclanthology.org/2024.vardial-1.17
DOI:
10.18653/v1/2024.vardial-1.17
Bibkey:
Cite (ACL):
Fahim Faisal and Antonios Anastasopoulos. 2024. Data-Augmentation-Based Dialectal Adaptation for LLMs. In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024), pages 197–208, Mexico City, Mexico. Association for Computational Linguistics.
Cite (Informal):
Data-Augmentation-Based Dialectal Adaptation for LLMs (Faisal & Anastasopoulos, VarDial-WS 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2024.vardial-1.17.pdf
Supplementary material:
 2024.vardial-1.17.SupplementaryMaterial.txt