Joseph Marvin Imperial


2023

Automatic Readability Assessment for Closely Related Languages
Joseph Marvin Imperial | Ekaterina Kochmar
Findings of the Association for Computational Linguistics: ACL 2023

In recent years, the main focus of research on automatic readability assessment (ARA) has shifted towards using expensive deep learning-based methods with the primary goal of increasing models’ accuracy. This, however, is rarely applicable to low-resource languages, where traditional handcrafted features are still widely used due to the lack of existing NLP tools to extract deeper linguistic representations. In this work, we take a step back from the technical component and focus on how linguistic aspects such as mutual intelligibility or the degree of language relatedness can improve ARA in a low-resource setting. We collect short stories written in three languages of the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment models and explore the interaction of data and features in various cross-lingual setups. Our results show that the inclusion of CrossNGO, a novel specialized feature exploiting n-gram overlap applied to languages with high mutual intelligibility, significantly improves the performance of ARA models compared to the use of off-the-shelf large multilingual language models alone. Consequently, when both linguistic representations are combined, we achieve state-of-the-art results for Tagalog and Cebuano and set baseline scores for ARA in Bikol.
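
The paper defines CrossNGO precisely; as a rough illustration of the general idea only, the sketch below computes a Jaccard-style character n-gram overlap between a target text and level-labeled corpora in a closely related language, producing one score per readability level. The function names, the choice of character trigrams, and the Jaccard measure are illustrative assumptions, not the paper's formulation.

```python
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Character n-gram counts for a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_overlap(text: str, reference_texts: list[str], n: int = 3) -> float:
    """Jaccard-style overlap between a text's n-grams and those of a
    reference corpus (e.g., grade-1 stories in a related language)."""
    target = set(char_ngrams(text, n))
    reference: set[str] = set()
    for ref in reference_texts:
        reference |= set(char_ngrams(ref, n))
    if not target or not reference:
        return 0.0
    return len(target & reference) / len(target | reference)

def crossngo_style_features(text: str, level_corpora: dict[str, list[str]], n: int = 3) -> list[float]:
    """One overlap value per readability level of the related-language
    corpus, usable as extra feature dimensions for an ARA classifier."""
    return [ngram_overlap(text, texts, n) for _, texts in sorted(level_corpora.items())]
```

The resulting overlap scores can then be concatenated with traditional handcrafted features or multilingual-LM embeddings before classification, which is the combination the abstract describes.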

2022

NU HLT at CMCL 2022 Shared Task: Multilingual and Crosslingual Prediction of Human Reading Behavior in Universal Language Space
Joseph Marvin Imperial
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

In this paper, we present a unified model that works for both multilingual and crosslingual prediction of the reading times of words in various languages. The secret behind the success of this model lies in the preprocessing step, where all words are transformed into their universal language representation via the International Phonetic Alphabet (IPA). To the best of our knowledge, this is the first study to favorably exploit this phonological property of language for the two tasks. Various feature types were extracted for model training, covering basic frequencies, n-grams, information-theoretic measures, and psycholinguistically motivated predictors. A fine-tuned Random Forest model obtained the best performance on both tasks, with MAE scores of 3.8031 and 3.9065 for mean first fixation duration (FFDAvg) and mean total reading time (TRTAvg), respectively.
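
As a rough sketch of this pipeline, the snippet below maps words to IPA with the epitran package (an assumption; the paper does not necessarily use this converter) and trains a Random Forest regressor on two toy IPA-derived features. The language code, the length features, and the toy targets are placeholders for the paper's richer frequency, n-gram, information-theoretic, and psycholinguistic predictors.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

import epitran  # grapheme-to-IPA converter; language coverage varies

def ipa_features(words: list[str], lang_code: str = "spa-Latn") -> np.ndarray:
    """Transliterate words into IPA and derive simple features from the
    universal representation (lengths stand in for richer predictors)."""
    epi = epitran.Epitran(lang_code)
    rows = []
    for word in words:
        ipa = epi.transliterate(word)
        rows.append([len(word), len(ipa)])
    return np.array(rows, dtype=float)

# One model trained on IPA-space features pooled across languages can then
# predict reading times (e.g., FFDAvg) for words from any covered language.
X_train = ipa_features(["palabra", "lectura"])   # toy training words
y_train = np.array([210.0, 245.0])               # toy FFDAvg targets (ms)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
```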

A Baseline Readability Model for Cebuano
Joseph Marvin Imperial | Lloyd Lois Antonie Reyes | Michael Antonio Ibanez | Ranz Sapinit | Mohammed Hussien
Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022)

In this study, we developed the first baseline readability model for the Cebuano language. Cebuano is the second most widely used native language in the Philippines, with about 27.5 million speakers. As the baseline, we extracted traditional or surface-based features, syllable patterns based on Cebuano’s documented orthography, and neural embeddings from the multilingual BERT model. Results show that the first two sets of handcrafted linguistic features obtained the best performance when used to train an optimized Random Forest model, reaching approximately 87% across all metrics. The feature sets and algorithm are also similar to those in previous work on readability assessment for Filipino, showing potential for crosslingual application. To encourage more work on readability assessment in Philippine languages such as Cebuano, we open-sourced both code and data.
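
A minimal sketch of this setup is shown below: simple surface features plus rough consonant-vowel pattern counts feed an (untuned) Random Forest classifier. The specific features, the syllable-pattern inventory, and the hyperparameters are placeholders rather than the configuration reported in the paper; real documents and grade labels would come from the released corpus.

```python
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def surface_features(text: str) -> list[float]:
    """Traditional surface-based predictors: counts and average lengths."""
    words = re.findall(r"[a-z]+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    avg_word_len = float(np.mean([len(w) for w in words])) if words else 0.0
    avg_sent_len = len(words) / max(len(sentences), 1)
    return [float(len(words)), avg_word_len, avg_sent_len]

def syllable_pattern_features(text: str) -> list[float]:
    """Rough consonant-vowel bigram counts standing in for the
    orthography-based syllable patterns used in the paper."""
    words = re.findall(r"[a-z]+", text.lower())
    cv = sum(len(re.findall(r"[^aeiou][aeiou]", w)) for w in words)
    vc = sum(len(re.findall(r"[aeiou][^aeiou]", w)) for w in words)
    return [float(cv), float(vc)]

def featurize(texts: list[str]) -> np.ndarray:
    return np.array([surface_features(t) + syllable_pattern_features(t) for t in texts])

# Placeholder documents and grade labels; in practice these come from the corpus.
texts = ["placeholder story text one", "a longer placeholder story text for level two"]
grade_levels = [1, 2]
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(featurize(texts), grade_levels)
```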

2021

BERT Embeddings for Automatic Readability Assessment
Joseph Marvin Imperial
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Automatic readability assessment (ARA) is the task of evaluating the level of ease or difficulty of text documents for a target audience. One of the many open problems in the field is making models trained for the task effective even for low-resource languages. In this study, we propose an alternative way of utilizing the information-rich embeddings of BERT models together with handcrafted linguistic features through a combined method for readability assessment. Results show that the proposed method outperforms classical approaches to readability assessment on English and Filipino datasets, obtaining up to a 12.4% increase in F1 performance. We also show that the general information encoded in BERT embeddings can serve as a substitute feature set for low-resource languages like Filipino, which have limited semantic and syntactic NLP tools for explicitly extracting feature values for the task.
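
A minimal sketch of the combined representation is given below, assuming the Hugging Face transformers API: the [CLS] vector from a BERT checkpoint is concatenated with a couple of toy handcrafted features before training a downstream classifier. The checkpoint name, the pooling choice, the toy features, and the Random Forest classifier are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.ensemble import RandomForestClassifier

MODEL_NAME = "bert-base-multilingual-cased"  # stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
bert = AutoModel.from_pretrained(MODEL_NAME)

def bert_embedding(text: str) -> np.ndarray:
    """[CLS] vector from the last hidden layer as a document representation."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        output = bert(**inputs)
    return output.last_hidden_state[:, 0, :].squeeze(0).numpy()

def handcrafted_features(text: str) -> np.ndarray:
    """Toy surface features standing in for the paper's linguistic feature set."""
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)
    return np.array([float(len(words)), avg_word_len])

def combined_representation(text: str) -> np.ndarray:
    # Concatenate the two views so the classifier sees both.
    return np.concatenate([bert_embedding(text), handcrafted_features(text)])

# Downstream readability classifier over the combined vectors (toy labels).
X = np.stack([combined_representation(t) for t in ["sample text one", "another sample text"]])
y = [0, 1]
clf = RandomForestClassifier(random_state=0).fit(X, y)
```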

Under the Microscope: Interpreting Readability Assessment Models for Filipino
Joseph Marvin Imperial | Ethel Ong
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

Science Mapping of Publications in Natural Language Processing in the Philippines: 2006 to 2020
Rachel Edita O. Roxas | Joseph Marvin Imperial | Angelica H. De La Cruz
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

2020

A Simple Disaster-Related Knowledge Base for Intelligent Agents
Clark Emmanuel Paulo | Arvin Ken Ramirez | David Clarence Reducindo | Rannie Mark Mateo | Joseph Marvin Imperial
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation