Abstract
This paper proposes Labadain Crawler, a data collection pipeline tailored to automate and optimize the process of constructing textual corpora from the web, with a specific target to low-resource languages. The system is built on top of Nutch, an open-source web crawler and data extraction framework, and incorporates language processing components such as a tokenizer and a language identification model. The pipeline efficacy is demonstrated through successful testing with Tetun, one of Timor-Leste’s official languages, resulting in the construction of a high-quality Tetun text corpus comprising 321.7k sentences extracted from over 22k web pages. The contributions of this paper include the development of a Tetun tokenizer, a Tetun language identification model, and a Tetun text corpus, marking an important milestone in Tetun text information retrieval.- Anthology ID:
- 2024.lrec-main.390
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- SIG:
- Publisher:
- ELRA and ICCL
- Note:
- Pages:
- 4368–4380
- Language:
- URL:
- https://aclanthology.org/2024.lrec-main.390
- DOI:
- Cite (ACL):
- Gabriel de Jesus and Sérgio Sobral Nunes. 2024. Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4368–4380, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus (de Jesus & Nunes, LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2024.lrec-main.390.pdf