GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation
Stergios Chatzikyriakidis, Dimitrios Papadakis, Sevasti Ioanna Papaioannou, Erofili Psaltaki
Abstract
We present an extended Greek Dialectal Dataset (GRDD+) that complements the existing GRDD dataset with more data from Cretan, Cypriot, Pontic and Northern Greek, and adds six new varieties: Greco-Corsican, Griko (Southern Italian Greek), Maniot, Heptanesian, Tsakonian, and Katharevusa Greek. The resulting dataset totals 6,374,939 words across 10 varieties. It is the first dataset of this size and dialectal breadth to date. We conduct fine-tuning experiments to assess the effect of high-quality dialectal data on several LLMs. We fine-tune three model architectures (Llama-3-8B, Llama-3.1-8B, Krikri-8B) and compare the results with those of frontier models (Claude-3.7-Sonnet, Gemini-2.5, ChatGPT-5).
- Anthology ID:
- 2026.lrec-main.245
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 3138–3146
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.245/
- Cite (ACL):
- Stergios Chatzikyriakidis, Dimitrios Papadakis, Sevasti Ioanna Papaioannou, and Erofili Psaltaki. 2026. GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation. International Conference on Language Resources and Evaluation, main:3138–3146.
- Cite (Informal):
- GRDD+: An Extended Greek Dialectal Dataset with Cross-Architecture Fine-tuning Evaluation (Chatzikyriakidis et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.245.pdf