2025
The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
Angelina Aspra Aquino | Lester James Validad Miranda | Elsie Marie T. Or
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework. We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures. We provide baseline evaluations using multiple transformer-based models to assess the performance of state-of-the-art dependency parsers on Tagalog. We also highlight challenges in the syntactic analysis of Tagalog given its distinctive grammatical properties, and discuss their implications for the annotation of this treebank. We anticipate that UD-NewsCrawl and our baseline model implementations will serve as valuable resources for advancing computational linguistics research in underrepresented languages like Tagalog.
2024
Envisioning NLP for intercultural climate communication
Steven Bird | Angelina Aquino | Ian Gumbula
Proceedings of the 1st Workshop on Natural Language Processing Meets Climate Change (ClimateNLP 2024)
Climate communication is often seen by the NLP community as an opportunity for machine translation, applied to ever smaller languages. However, over 90% of the world’s linguistic diversity comes from languages with ‘primary orality’ that are mostly spoken in non-Western oral societies. A case in point is the Aboriginal communities of Northern Australia, where we have been conducting workshops on climate communication, revealing shortcomings in existing communication practices along with new opportunities for improving intercultural communication. We present a case study of climate communication in an oral society, including the voices of many local people, and draw several lessons for the research program of NLP in the climate space.
2022
Zero-shot and few-shot approaches for tokenization, tagging, and dependency parsing of Tagalog text
Angelina Aquino | Franz de Leon
Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation
2020
Parsing in the absence of related languages: Evaluating low-resource dependency parsers on Tagalog
Angelina Aquino | Franz de Leon
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)
Cross-lingual and multilingual methods have been widely suggested as options for dependency parsing of low-resource languages; however, these typically require the use of annotated data in related high-resource languages. In this paper, we evaluate the performance of these methods versus monolingual parsing of Tagalog, an Austronesian language which shares little typological similarity with any existing high-resource language. We show that a monolingual model developed on minimal target language data consistently outperforms all cross-lingual and multilingual models when no closely related sources exist for a low-resource language.