Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains

Vincent Segonne; Aidan Mannion; Laura Cristina Alonzo Canul; Alexandre Daniel Audibert; Xingyu Liu; Cécile Macaire; Adrien Pupier; Yongxin Zhou; Mathilde Aguiar; Felix E. Herron; Magali Norré; Massih R. Amini; Pierrette Bouillon; Iris Eshkol; Emmanuelle Esperança-Rodier; Thomas François; Lorraine Goeuriot; Jérôme Goulian; Mathieu Lafourcade; Benjamin Lecouteux; François Portet; Fabien Ringeval; Vincent Vandeghinste; Maximin Coavoux; Marco Dinarelli; Didier Schwab

Abstract

Pretrained Language Models (PLMs) are the de facto backbone of most state-of-the-art NLP systems. In this paper, we introduce a family of domain-specific PLMs for French, focusing on three important domains: transcribed speech, medicine, and law. Because these domains often involve processing long documents, we use a transformer architecture based on an efficient attention method (LinFormer) to maximise the models' utility. We evaluate and compare our models to state-of-the-art models on a diverse set of tasks and datasets, some of which are introduced in this paper. We gather the datasets into a new French-language evaluation benchmark for these three domains. We also compare various training configurations: continued pretraining, pretraining from scratch, and single- versus multi-domain pretraining. Extensive domain-specific experiments show that it is possible to attain competitive downstream performance even when pretraining with the approximate LinFormer attention mechanism. For full reproducibility, we release the models and pretraining data, as well as contributed datasets.
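
To illustrate what "approximate" attention means here, the following is a minimal single-head sketch of the general Linformer idea: keys and values are projected along the sequence axis from length n down to a fixed length k before the usual scaled dot-product attention, so the score matrix is n x k rather than n x n. It is not the Jargon implementation. The function names, the random projection matrices (learned in practice), and the toy sizes are all assumptions made for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(value - peak) for value in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def linformer_attention(q, k, v, e_proj, f_proj):
    """Attention with keys and values compressed along the sequence axis.

    q, k, v are n x d matrices; e_proj and f_proj are k_dim x n projections
    (hypothetical names; learned parameters in a real model).
    """
    d = len(q[0])
    k_small = matmul(e_proj, k)               # k_dim x d
    v_small = matmul(f_proj, v)               # k_dim x d
    scores = matmul(q, transpose(k_small))    # n x k_dim instead of n x n
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    weights = softmax_rows(scores)            # each row sums to 1
    return matmul(weights, v_small)           # n x d


def random_matrix(rng, rows, cols):
    """Toy Gaussian matrix standing in for real activations or weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes: sequence length 8, head dimension 4, projected length 2.
rng = random.Random(0)
n, d, k_dim = 8, 4, 2
q, k, v = (random_matrix(rng, n, d) for _ in range(3))
e_proj = random_matrix(rng, k_dim, n)
f_proj = random_matrix(rng, k_dim, n)
out = linformer_attention(q, k, v, e_proj, f_proj)
print(len(out), len(out[0]))
```

The output keeps the n x d shape of full attention, while the cost of the score matrix grows linearly with n for a fixed k. If both projections are set to the n x n identity, the sketch reduces to standard softmax attention.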

Anthology ID: 2024.lrec-main.827
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 9463–9476
URL: https://aclanthology.org/2024.lrec-main.827
Cite (ACL): Vincent Segonne, Aidan Mannion, Laura Cristina Alonzo Canul, Alexandre Daniel Audibert, Xingyu Liu, Cécile Macaire, Adrien Pupier, Yongxin Zhou, Mathilde Aguiar, Felix E. Herron, Magali Norré, Massih R Amini, Pierrette Bouillon, Iris Eshkol-Taravella, Emmanuelle Esperança-Rodier, Thomas François, Lorraine Goeuriot, Jérôme Goulian, Mathieu Lafourcade, et al. 2024. Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9463–9476, Torino, Italia. ELRA and ICCL.
Cite (Informal): Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains (Segonne et al., LREC-COLING 2024)
PDF: https://preview.aclanthology.org/nschneid-patch-2/2024.lrec-main.827.pdf