CEFR-Cymraeg: A Dataset and Baseline Models for Language Proficiency Assessment in Welsh

Eeshan Waqar, Jonathan Davies, Dawn Knight, Fernando Alva-Manchego


Abstract
We introduce CEFR-Cymraeg, the first dataset annotated with Common European Framework of Reference (CEFR) levels for Welsh. The dataset is built from learning materials for adult learners, carefully extracted from widely used coursebooks and verified by teachers of Welsh as a second language. It spans levels A1 to B2 and includes multiple units of analysis: sentences, dialogues, paragraphs, and documents. In total, 2,658 entries are provided with gold-standard CEFR annotations, making CEFR-Cymraeg a valuable resource for research on language learning and low-resourced Celtic languages. To illustrate its potential applications, we define language proficiency assessment as a multi-class classification task and fine-tune multilingual pre-trained language models. Given the limited size of the dataset, we also experiment with data augmentation. Results show that these models successfully capture proficiency distinctions and generalise well to Welsh, with the best-performing model reaching a weighted F1-score of 0.83. Qualitative analysis confirms that most apparent errors reflect valid pedagogical variation rather than model inconsistencies. CEFR-Cymraeg establishes a benchmark resource for Welsh and opens new opportunities for educational NLP, corpus linguistics, and multilingual proficiency research.
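For readers unfamiliar with the reported metric, the sketch below shows how a weighted F1-score is commonly computed for a multi-class task such as CEFR level prediction: per-class F1 scores are averaged, each weighted by the number of gold examples in that class. The function name `weighted_f1` and the toy labels are assumptions made for illustration. They are not taken from the paper or the dataset, and the authors' evaluation code may differ.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Support-weighted mean of per-class F1 scores (hypothetical helper)."""
    support = Counter(gold)  # number of gold examples per class
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its share of gold labels
    return score


# Toy labels invented for illustration only (not drawn from CEFR-Cymraeg).
gold = ["A1", "A1", "A2", "B1", "B2", "B2"]
pred = ["A1", "A2", "A2", "B1", "B2", "B1"]
print(round(weighted_f1(gold, pred), 3))
```

Weighting by class support means that frequent levels contribute more to the final score than rare ones, which matters when the levels A1 to B2 are unevenly represented.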
Anthology ID:
2026.lrec-main.279
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
3496–3505
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.279/
Cite (ACL):
Eeshan Waqar, Jonathan Davies, Dawn Knight, and Fernando Alva-Manchego. 2026. CEFR-Cymraeg: A Dataset and Baseline Models for Language Proficiency Assessment in Welsh. International Conference on Language Resources and Evaluation, main:3496–3505.
Cite (Informal):
CEFR-Cymraeg: A Dataset and Baseline Models for Language Proficiency Assessment in Welsh (Waqar et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.279.pdf