MultiLegalPile: A 689GB Multilingual Legal Corpus

Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, Daniel Ho


Abstract
Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, so far, few datasets are available for specialized critical domains such as law and the available ones are often small and only in English. To fill this gap, we curate and release MultiLegalPile, a 689GB corpus in 24 languages from 17 jurisdictions. MultiLegalPile includes diverse legal data sources and allows for pretraining NLP models under fair use, with most of the dataset licensed very permissively. We pretrain two RoBERTa models and one Longformer multilingually, and 24 monolingual models on each of the language-specific subsets and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new SotA on LEXTREME and our English models on LexGLUE. We release the dataset, trained models, and all code under the most open licenses possible.
Anthology ID:
2024.acl-long.805
Volume:
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
ACL
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
15077–15094
Language:
URL:
https://aclanthology.org/2024.acl-long.805
DOI:
10.18653/v1/2024.acl-long.805
Bibkey:
Cite (ACL):
Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, and Daniel Ho. 2024. MultiLegalPile: A 689GB Multilingual Legal Corpus. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15077–15094, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
MultiLegalPile: A 689GB Multilingual Legal Corpus (Niklaus et al., ACL 2024)
Copy Citation:
PDF:
https://preview.aclanthology.org/autopr/2024.acl-long.805.pdf