Roberto Zariquiey


2020

No Data to Crawl? Monolingual Corpus Creation from PDF Files of Truly low-Resource Languages in Peru
Gina Bustamante | Arturo Oncevay | Roberto Zariquiey
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation considers language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for natural language processing tasks nowadays.

2018

Toward Universal Dependencies for Shipibo-Konibo
Alonso Vasquez | Renzo Ego Aguirre | Candy Angulo | John Miller | Claudia Villanueva | Željko Agić | Roberto Zariquiey | Arturo Oncevay
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

We present an initial version of the Universal Dependencies (UD) treebank for Shipibo-Konibo, the first South American, Amazonian, Panoan and Peruvian language with a resource built under UD. We describe the linguistic aspects of how the tagset was defined and the treebank was annotated; in addition, we present our specific treatment of linguistic units called clitics. Although the treebank is still under development, it allowed us to perform a typological comparison against Spanish, the predominant language in Peru, and dependency syntax parsing experiments in both monolingual and cross-lingual approaches.