Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus

Gabriel de Jesus; Sérgio Sobral Nunes

Abstract

This paper proposes Labadain Crawler, a data collection pipeline that automates and optimizes the construction of textual corpora from the web, specifically targeting low-resource languages. The system is built on top of Nutch, an open-source web crawler and data extraction framework, and incorporates language processing components such as a tokenizer and a language identification model. The pipeline's efficacy is demonstrated by applying it to Tetun, one of Timor-Leste’s official languages, which resulted in a high-quality Tetun text corpus comprising 321.7k sentences extracted from over 22k web pages. The contributions of this paper are a Tetun tokenizer, a Tetun language identification model, and a Tetun text corpus, marking an important milestone in Tetun text information retrieval.
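
To make the role of the language processing components concrete, the following is a minimal sketch of how crawled page text might be split into sentences and filtered by a language identifier before entering a corpus. It is not the Labadain Crawler implementation: the function names, the regex-based sentence splitter, the marker-word identifier, and the `min_tokens` and `min_hits` thresholds are all assumptions made for illustration, and the marker-word check merely stands in for a trained language identification model.

```python
import re
from typing import Callable, Iterable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def make_marker_identifier(markers: Iterable[str], min_hits: int = 2) -> Callable[[str], bool]:
    """Hypothetical stand-in for a language identification model.

    Accepts a sentence when at least `min_hits` of its tokens are in `markers`.
    """
    marker_set = {m.lower() for m in markers}

    def identify(sentence: str) -> bool:
        # Tokens are runs of letters, optionally joined by an apostrophe or hyphen.
        tokens = re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", sentence.lower())
        return sum(t in marker_set for t in tokens) >= min_hits

    return identify


def build_corpus(
    pages: Iterable[str],
    is_target_language: Callable[[str], bool],
    min_tokens: int = 3,
) -> List[str]:
    """Collect unique sentences that pass the length and language filters."""
    corpus: List[str] = []
    seen = set()
    for page in pages:
        for sentence in split_sentences(page):
            if len(sentence.split()) < min_tokens:
                continue  # too short to identify reliably
            if sentence in seen:
                continue  # drop exact duplicates across pages
            if is_target_language(sentence):
                seen.add(sentence)
                corpus.append(sentence)
    return corpus
```

In this sketch the identifier is passed in as a callable, so the placeholder could be swapped for a trained model without changing the filtering loop.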

Anthology ID: 2024.lrec-main.390
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 4368–4380
URL: https://preview.aclanthology.org/add-emnlp-2024-awards/2024.lrec-main.390/
Cite (ACL): Gabriel de Jesus and Sérgio Sobral Nunes. 2024. Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4368–4380, Torino, Italia. ELRA and ICCL.
Cite (Informal): Data Collection Pipeline for Low-Resource Languages: A Case Study on Constructing a Tetun Text Corpus (de Jesus & Nunes, LREC-COLING 2024)
PDF: https://preview.aclanthology.org/add-emnlp-2024-awards/2024.lrec-main.390.pdf
