José Cañete
2022
ALBETO and DistilBETO: Lightweight Spanish Language Models
José Cañete | Sebastian Donoso | Felipe Bravo-Marquez | Andrés Carvallo | Vladimir Araujo
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In recent years there have been considerable advances in pre-trained language models, and versions for languages other than English have also been made available. Due to their increasing use, many lightweight versions of these models (with fewer parameters) have also been released to speed up training and inference. However, such lighter models (e.g., ALBERT, DistilBERT) remain scarce for languages other than English. In this paper we present ALBETO and DistilBETO, versions of ALBERT and DistilBERT pre-trained exclusively on Spanish corpora. We train several versions of ALBETO, ranging from 5M to 223M parameters, and one of DistilBETO with 67M parameters. We evaluate our models on the GLUES benchmark, which includes various natural language understanding tasks in Spanish. The results show that our lightweight models achieve results competitive with those of BETO (Spanish-BERT) despite having fewer parameters. More specifically, our largest ALBETO model outperforms all other models on the MLDoc, PAWS-X, XNLI, MLQA, SQAC and XQuAD datasets. However, BETO remains unbeaten for POS and NER. As a further contribution, all models are publicly available to the community for future research.
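Since the abstract notes that all models are publicly released, a minimal sketch of loading one checkpoint with the Hugging Face transformers library may be useful. The hub ID below is an assumption based on the authors' dccuchile organization, not a name stated in the abstract; check the paper's repository for the exact identifiers.

```python
# Minimal sketch: loading a released ALBETO checkpoint for masked-LM inference.
# The hub ID is assumed (dccuchile organization); verify it before use.
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "dccuchile/albert-base-spanish"  # assumed hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

inputs = tokenizer("Los modelos livianos reducen el costo de inferencia.",
                   return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch, sequence_length, vocab_size)
```

The same pattern applies to the DistilBETO checkpoint by swapping the model ID.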
Evaluation Benchmarks for Spanish Sentence Representations
Vladimir Araujo | Andrés Carvallo | Souvik Kundu | José Cañete | Marcelo Mendoza | Robert E. Mercer | Felipe Bravo-Marquez | Marie-Francine Moens | Alvaro Soto
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Due to the success of pre-trained language models, versions for languages other than English have been released in recent years. This implies a need for resources to evaluate these models. In the case of Spanish, there are few ways to systematically assess a model's quality. In this paper, we narrow the gap by building two evaluation benchmarks. Inspired by previous work (Conneau and Kiela, 2018; Chen et al., 2019), we introduce Spanish SentEval and Spanish DiscoEval, aiming to assess the capabilities of stand-alone and discourse-aware sentence representations, respectively. Our benchmarks include a considerable number of pre-existing and newly constructed datasets that address different tasks from various domains. In addition, we evaluate and analyze the most recent pre-trained Spanish language models to exhibit their capabilities and limitations. For example, we find that on discourse evaluation tasks, mBERT, a language model trained on multiple languages, usually provides a richer latent representation than models trained only on Spanish documents. We hope our contribution will motivate a fairer, more comparable, and less cumbersome way to evaluate future Spanish language models.
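For readers unfamiliar with SentEval-style probing, the sketch below shows the kind of stand-alone sentence representation such benchmarks evaluate: mean-pooled hidden states from a pre-trained encoder. The pooling choice is illustrative, not prescribed by the paper; the model ID is mBERT, one of the models the abstract mentions.

```python
# Minimal sketch: extracting a fixed-size sentence embedding by mean pooling
# the encoder's hidden states, the kind of representation a SentEval-style
# probe evaluates. Pooling strategy is an assumption for illustration.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "bert-base-multilingual-cased"  # mBERT
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

def sentence_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)   # mask out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, hidden)

emb = sentence_embedding("El modelo produce una representación de la oración.")
print(emb.shape)  # torch.Size([1, 768])
```

A downstream probe (e.g., a logistic regression classifier) is then trained on these frozen embeddings to measure what the representation encodes.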
Co-authors
- Felipe Bravo-Marquez 2
- Andrés Carvallo 2
- Vladimir Araujo 2
- Sebastian Donoso 1
- Souvik Kundu 1
Venues
- LREC 2