Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem
Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, Benoît Sagot
Abstract
Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora covering previously overlooked languages faster than existing pipelines. To demonstrate the effectiveness of the targeted mining perspective, we introduce a new pipeline that can filter a single snapshot in two hours. We also provide web corpora for several French-based Creoles.
- Anthology ID:
- 2025.findings-emnlp.77
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2025
- Month:
- November
- Year:
- 2025
- Address:
- Suzhou, China
- Editors:
- Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1460–1473
- URL:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.77/
- DOI:
- 10.18653/v1/2025.findings-emnlp.77
- Cite (ACL):
- Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, and Benoît Sagot. 2025. Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1460–1473, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal):
- Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem (Dent et al., Findings 2025)
- PDF:
- https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.77.pdf