Roberto Zariquiey
2020
No Data to Crawl? Monolingual Corpus Creation from PDF Files of Truly low-Resource Languages in Peru
Gina Bustamante | Arturo Oncevay | Roberto Zariquiey
Proceedings of the 12th Language Resources and Evaluation Conference
We introduce new monolingual corpora for four indigenous and endangered languages from Peru: Shipibo-Konibo, Ashaninka, Yanesha and Yine. Given the total absence of these languages on the web, the extraction and processing of texts from PDF files is relevant in a truly low-resource language scenario. Our procedure for monolingual corpus creation considers language-specific and language-agnostic steps, and focuses on educational PDF files with multilingual sentences, noisy pages and low-structured content. Through an evaluation based on language modelling and character-level perplexity on a subset of manually extracted sentences, we determine that our method allows the creation of clean corpora for the four languages, a key resource for current natural language processing tasks.
2018
Toward Universal Dependencies for Shipibo-Konibo
Alonso Vasquez | Renzo Ego Aguirre | Candy Angulo | John Miller | Claudia Villanueva | Željko Agić | Roberto Zariquiey | Arturo Oncevay
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)
We present an initial version of the Universal Dependencies (UD) treebank for Shipibo-Konibo, the first South American, Amazonian, Panoan and Peruvian language with a resource built under UD. We describe the linguistic aspects of how the tagset was defined and the treebank was annotated; in addition we present our specific treatment of linguistic units called clitics. Although the treebank is still under development, it allowed us to perform a typological comparison against Spanish, the predominant language in Peru, and dependency syntax parsing experiments in both monolingual and cross-lingual approaches.
Co-authors
- Arturo Oncevay 2
- Alonso Vasquez 1
- Renzo Ego Aguirre 1
- Candy Angulo 1
- John Miller 1