Raúl García-Cerdá
Also published as: Raúl García Cerdá
2025
Building a Lightweight Classifier to Distinguish Closely Related Language Varieties with Limited Supervision: The Case of Catalan vs Valencian
Raúl García-Cerdá | María Miró Maestre | Miquel Canal
Proceedings of the First Workshop on Advancing NLP for Low-Resource Languages
Dialectal variation among closely related languages poses a major challenge in low-resource NLP, as their linguistic similarity increases confusability for automatic systems. We introduce the first supervised classifier to distinguish standard Catalan from its regional variety Valencian. Our lightweight approach fine-tunes a RoBERTa-base model on a manually curated corpus of 20,000 sentences, without any Valencian-specific tools, and achieves 98% accuracy on unseen test data. In a human evaluation of 90 mixed-variety items per reviewer, acceptance rates reached 96.7% for Valencian and 97.7% for Catalan (97.2% overall). We discuss limitations with out-of-distribution inputs and outline future work on confidence calibration and dialect-aware tokenization. Our findings demonstrate that high-impact dialect classification is feasible with minimal resources.
Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models
Alicia Picazo-Izquierdo | Ernesto Luis Estevanell-Valladares | Ruslan Mitkov | Rafael Muñoz Guillena | Raúl García Cerdá