GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation

Stergios Chatzikyriakidis, Dimitrios Papadakis, Sevasti Ioanna Papaioannou, Erofili Psaltaki


Abstract
We present GRDD+, an extended Greek Dialectal Dataset that supplements the existing GRDD dataset with additional data from Cretan, Cypriot, Pontic, and Northern Greek and adds six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The resulting dataset covers 10 varieties and totals 6,374,939 words. This is the first dataset with this degree of variation and size to date. We conduct fine-tuning experiments to assess the effect of high-quality dialectal data on several LLMs: we fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results with frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
Anthology ID:
2026.lrec-main.245
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
3138–3146
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.245/
Cite (ACL):
Stergios Chatzikyriakidis, Dimitrios Papadakis, Sevasti Ioanna Papaioannou, and Erofili Psaltaki. 2026. GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation. International Conference on Language Resources and Evaluation, main:3138–3146.
Cite (Informal):
GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation (Chatzikyriakidis et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.245.pdf