Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem

Rasul Dent; Pedro Ortiz Suarez; Thibault Clérice; Benoît Sagot

doi:10.18653/v1/2025.findings-emnlp.77

Abstract

Automatic language identification is frequently framed as a multi-class classification problem. However, when creating digital corpora for less commonly written languages, it may be more appropriate to consider it a data mining problem. For these varieties, one knows ahead of time that the vast majority of documents are of little interest. By minimizing resources spent on classifying such documents, we can create corpora covering previously overlooked languages faster than existing pipelines. To demonstrate the effectiveness of the targeted mining perspective, we introduce a new pipeline that can filter a single snapshot in two hours. We also provide web corpora for several French-based Creoles.

Anthology ID: 2025.findings-emnlp.77
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1460–1473
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.77/
DOI: 10.18653/v1/2025.findings-emnlp.77
Cite (ACL): Rasul Dent, Pedro Ortiz Suarez, Thibault Clérice, and Benoît Sagot. 2025. Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1460–1473, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem (Dent et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.77.pdf
Checklist: 2025.findings-emnlp.77.checklist.pdf