XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond

Francesco Barbieri, Luis Espinosa Anke, Jose Camacho-Collados


Abstract
Language models are ubiquitous in current NLP, and their multilingual capacity has recently attracted considerable attention. However, current analyses have almost exclusively focused on (multilingual variants of) standard benchmarks, and have relied on clean pre-training and task-specific corpora as multilingual signals. In this paper, we introduce XLM-T, a model to train and evaluate multilingual language models in Twitter. In this paper we provide: (1) a new strong multilingual baseline consisting of an XLM-R (Conneau et al. 2020) model pre-trained on millions of tweets in over thirty languages, alongside starter code to subsequently fine-tune on a target task; and (2) a set of unified sentiment analysis Twitter datasets in eight different languages and a XLM-T model trained on this dataset.
Anthology ID:
2022.lrec-1.27
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
SIG:
Publisher:
European Language Resources Association
Note:
Pages:
258–266
Language:
URL:
https://aclanthology.org/2022.lrec-1.27
DOI:
Bibkey:
Cite (ACL):
Francesco Barbieri, Luis Espinosa Anke, and Jose Camacho-Collados. 2022. XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 258–266, Marseille, France. European Language Resources Association.
Cite (Informal):
XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond (Barbieri et al., LREC 2022)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.lrec-1.27.pdf
Code
 cardiffnlp/xlm-t
Data
TweetEval