Jonathan Davies
2026
CEFR-Cymraeg: A Dataset and Baseline Models for Language Proficiency Assessment in Welsh
Eeshan Waqar | Jonathan Davies | Dawn Knight | Fernando Alva-Manchego
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We introduce CEFR-Cymraeg, the first dataset annotated with Common European Framework of Reference (CEFR) levels for Welsh. The dataset is built from learning materials for adult learners, carefully extracted from widely used coursebooks and verified by teachers of Welsh as a second language. It spans levels A1 to B2 and includes multiple units of analysis: sentences, dialogues, paragraphs, and documents. In total, 2,658 entries are provided with gold-standard CEFR annotations, making CEFR-Cymraeg a valuable resource for research on language learning and low-resourced Celtic languages. To illustrate its potential applications, we define language proficiency assessment as a multi-class classification task and fine-tune multilingual pre-trained language models. Given the limited size of the dataset, we also experiment with data augmentation. Results show that these models successfully capture proficiency distinctions and generalise well to Welsh, with the best-performing model reaching a weighted F1-score of 0.83. Qualitative analysis confirmed that most apparent errors reflected valid pedagogical variation rather than model inconsistencies. CEFR-Cymraeg establishes a benchmark resource for Welsh and opens new opportunities for educational NLP, corpus linguistics, and multilingual proficiency research.
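The abstract reports a weighted F1-score without defining it. Assuming the usual definition (per-class F1 averaged with weights proportional to each class's share of the gold labels), a minimal sketch of the metric for a CEFR-level classifier could look as follows. The function name `weighted_f1` and the toy labels are illustrative only; they are not taken from the paper or its data.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Support-weighted mean of per-class F1 (assumed definition, not the paper's code)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    support = Counter(gold)  # number of gold items per class
    total = len(gold)
    score = 0.0
    for label in sorted(support):
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Weight each class by its share of the gold labels.
        score += f1 * support[label] / total
    return score


# Toy example with made-up labels (hypothetical, not from CEFR-Cymraeg).
gold = ["A1", "A1", "A2", "B1"]
pred = ["A1", "A2", "A2", "B1"]
print(round(weighted_f1(gold, pred), 2))
```

Because the weights follow class frequency, a frequent level such as A1 contributes more to the score than a rare one, which matters when the levels are unevenly represented.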
Proffiliadur: Welsh Language Text Profiling Toolkit
Nicolás Gutiérrez-Rolón | Jonathan Davies | Tomos Williams | Dawn Knight | Fernando Alva-Manchego
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We introduce Proffiliadur, a Python toolkit for text profiling and readability analysis in Welsh. The toolkit computes 141 surface, lexical, morphological, and syntactic indices, designed to capture linguistic variation while incorporating a Welsh-specific tokenisation process that enables accurate morphological analysis and handles phenomena such as initial consonant mutation. Proffiliadur enables systematic assessment of text accessibility and supports applications in education, healthcare, and public communication. We demonstrate the toolkit’s usefulness through two complementary analyses. First, we examine texts written in accordance with the Cymraeg Clîr ("Clear Welsh") principles and compare them with regular Welsh texts. Second, we analyse texts across CEFR proficiency levels to explore how linguistic complexity varies with learner ability. We also evaluate feature-based and neural classification models for automatic complexity detection, showing that interpretable linguistic indices alone achieve strong predictive performance (F1 = 0.94), comparable to a fine-tuned transformer (F1 = 0.97). Proffiliadur provides the first dedicated text profiling toolkit for Welsh, offering reproducible, linguistically grounded measures of readability for a low-resource language.
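To give a sense of what a surface index is, the sketch below computes two common ones, type-token ratio and mean sentence length, using only the Python standard library. It does not use or reproduce the Proffiliadur API: the function name `surface_indices` is hypothetical, and the regex tokeniser is a naive stand-in for the Welsh-specific tokenisation the abstract describes (it does not handle mutation or morphology).

```python
import re
import unicodedata

# Naive tokeniser: word characters, optionally joined by straight apostrophes
# (so "Mae'r" stays one token). An assumption for illustration only.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")
SENTENCE_END_RE = re.compile(r"[.!?]+")


def surface_indices(text):
    """Return a few simple surface indices for a text (hypothetical helper)."""
    # NFC keeps accented letters such as "ŷ" as single code points.
    text = unicodedata.normalize("NFC", text).lower()
    sentences = [s for s in SENTENCE_END_RE.split(text) if s.strip()]
    tokens = TOKEN_RE.findall(text)
    if not tokens or not sentences:
        return {"tokens": 0, "types": 0,
                "type_token_ratio": 0.0, "mean_sentence_length": 0.0}
    return {
        "tokens": len(tokens),
        "types": len(set(tokens)),
        "type_token_ratio": len(set(tokens)) / len(tokens),
        "mean_sentence_length": len(tokens) / len(sentences),
    }


sample = "Mae'r gath yn y tŷ. Mae'r ci yn yr ardd."
print(surface_indices(sample))
```

Indices of this kind are cheap to compute and easy to interpret, which is the property the abstract highlights when comparing feature-based classifiers with a fine-tuned transformer.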