CEFR-Cymraeg: A Dataset and Baseline Models for Language Proficiency Assessment in Welsh
Eeshan Waqar, Jonathan Davies, Dawn Knight, Fernando Alva-Manchego
Abstract
We introduce CEFR-Cymraeg, the first dataset annotated with Common European Framework of Reference (CEFR) levels for Welsh. The dataset is built from learning materials for adult learners, carefully extracted from widely used coursebooks and verified by teachers of Welsh as a second language. It spans levels A1 to B2 and includes multiple units of analysis: sentences, dialogues, paragraphs, and documents. In total, 2,658 entries are provided with gold-standard CEFR annotations, making CEFR-Cymraeg a valuable resource for research on language learning and low-resourced Celtic languages. To illustrate its potential applications, we define language proficiency assessment as a multi-class classification task and fine-tune multilingual pre-trained language models. Given the limited size of the dataset, we also experiment with data augmentation. Results show that these models successfully capture proficiency distinctions and generalise well to Welsh, with the best-performing model reaching a weighted F1-score of 0.83. Qualitative analysis confirmed that most apparent errors reflected valid pedagogical variation rather than model inconsistencies. CEFR-Cymraeg establishes a benchmark resource for Welsh and opens new opportunities for educational NLP, corpus linguistics, and multilingual proficiency research.
- Anthology ID:
- 2026.lrec-main.279
- Volume:
- Proceedings of the Fifteenth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2026
- Address:
- Palma de Mallorca, Spain
- Editors:
- Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
- Venue:
- LREC
- Publisher:
- ELRA Language Resource Association
- Pages:
- 3496–3505
- URL:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.279/
- Cite (ACL):
- Eeshan Waqar, Jonathan Davies, Dawn Knight, and Fernando Alva-Manchego. 2026. CEFR-Cymraeg: A Dataset and Baseline Models for Language Proficiency Assessment in Welsh. International Conference on Language Resources and Evaluation, main:3496–3505.
- Cite (Informal):
- CEFR-Cymraeg: A Dataset and Baseline Models for Language Proficiency Assessment in Welsh (Waqar et al., LREC 2026)
- PDF:
- https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.279.pdf