ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Negar Foroutan, Jakhongir Saydaliev, Grace Kim, Antoine Bosselut
Abstract
Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages – often limited to single-domain data, such as the Bible – continue to perform poorly. To address these class-imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2 percentage points, while maintaining performance on high-resource languages.
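For context, below is a minimal sketch of a supervised contrastive loss over language-labeled sentence embeddings, in the style of Khosla et al. (2020). The paper's actual formulation, encoder, batching, and sampling strategy are not reproduced here; the function name, tensor shapes, and the temperature value are illustrative assumptions.

```python
# Sketch of a supervised contrastive (SCL) objective: embeddings of samples
# sharing a language label are pulled together, others pushed apart.
# All names and hyperparameters are illustrative, not taken from the paper.
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (N, D) sentence representations; labels: (N,) language ids."""
    z = F.normalize(embeddings, dim=1)                   # unit-norm features
    sim = z @ z.T / temperature                          # pairwise similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))      # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Positives: other in-batch samples with the same language label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_log_prob = torch.where(pos_mask, log_prob, torch.zeros_like(log_prob))
    # clamp(min=1) avoids division by zero for samples with no in-batch positive.
    mean_pos = pos_log_prob.sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -mean_pos.mean()
```

In a training step, this term would be computed on a batch of encoder outputs and their language ids, e.g. `supervised_contrastive_loss(feats, lang_ids)`; whether and how the paper combines it with a standard classification objective is not specified here.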
- Anthology ID:
- 2026.eacl-long.315
- Volume:
- Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- March
- Year:
- 2026
- Address:
- Rabat, Morocco
- Editors:
- Vera Demberg, Kentaro Inui, Lluís Màrquez
- Venue:
- EACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 6693–6708
- URL:
- https://preview.aclanthology.org/manual-author-scripts/2026.eacl-long.315/
- Cite (ACL):
- Negar Foroutan, Jakhongir Saydaliev, Grace Kim, and Antoine Bosselut. 2026. ConLID: Supervised Contrastive Learning for Low-Resource Language Identification. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6693–6708, Rabat, Morocco. Association for Computational Linguistics.
- Cite (Informal):
- ConLID: Supervised Contrastive Learning for Low-Resource Language Identification (Foroutan et al., EACL 2026)
- PDF:
- https://preview.aclanthology.org/manual-author-scripts/2026.eacl-long.315.pdf