Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot
Abstract
Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora covering previously overlooked languages faster than existing pipelines. To demonstrate the effectiveness of the targeted mining perspective, we introduce a new pipeline that can filter a single snapshot in two hours. We also provide web corpora for several French-based Creoles.
- Anthology ID:
- 2025.findings-emnlp.77
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1460–1473
- URL:
- https://preview.aclanthology.org/author-page-diogo-silva-nova/2025.findings-emnlp.77/
- DOI:
- 10.18653/v1/2025.findings-emnlp.77
- Cite (ACL):
- Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, and Benoît Sagot. 2025. Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1460–1473, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem (Dent et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/author-page-diogo-silva-nova/2025.findings-emnlp.77.pdf