No Data to Crawl? Monolingual Corpus Creation from PDF Files of Truly low-Resource Languages in Peru

Gina Bustamante; Arturo Oncevay; Roberto Zariquiey

Abstract

We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, extracting and processing text from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation combines language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks.

Anthology ID: 2020.lrec-1.356
Volume: Proceedings of the Twelfth Language Resources and Evaluation Conference
Month: May
Year: 2020
Address: Marseille, France
Venue: LREC
Publisher: European Language Resources Association
Pages: 2914–2923
Language: English
URL: https://aclanthology.org/2020.lrec-1.356
Cite (ACL): Gina Bustamante, Arturo Oncevay, and Roberto Zariquiey. 2020. No Data to Crawl? Monolingual Corpus Creation from PDF Files of Truly low-Resource Languages in Peru. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2914–2923, Marseille, France. European Language Resources Association.
Cite (Informal): No Data to Crawl? Monolingual Corpus Creation from PDF Files of Truly low-Resource Languages in Peru (Bustamante et al., LREC 2020)
PDF: https://preview.aclanthology.org/ingestion-script-update/2020.lrec-1.356.pdf
