Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem

Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot


Abstract
Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora covering previously overlooked languages faster than existing pipelines. To demonstrate the effectiveness of the targeted mining perspective, we introduce a new pipeline that can filter a single snapshot in two hours. We also provide web corpora for several French-based Creoles.
Anthology ID:
2025.findings-emnlp.77
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1460–1473
URL:
https://preview.aclanthology.org/name-variant-v-g-vinod-vydiswaran/2025.findings-emnlp.77/
DOI:
10.18653/v1/2025.findings-emnlp.77
Cite (ACL):
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, and Benoît Sagot. 2025. Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1460–1473, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem (Dent et al., Findings 2025)
PDF:
https://preview.aclanthology.org/name-variant-v-g-vinod-vydiswaran/2025.findings-emnlp.77.pdf
Checklist:
2025.findings-emnlp.77.checklist.pdf