Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects

Niyati Bafna, Cristina España-Bonet, Josef Van Genabith, Benoît Sagot, Rachel Bawden


Abstract
Neural language models play an increasingly central role for language processing, given their success for a range of NLP tasks. In this study, we compare some canonical strategies in language modeling for low-resource scenarios, evaluating all models by their (finetuned) performance on a POS-tagging downstream task. We work with five (extremely) low-resource dialects from the Indic dialect continuum (Braj, Awadhi, Bhojpuri, Magahi, Maithili), which are closely related to each other and the standard mid-resource dialect, Hindi. The strategies we evaluate broadly include from-scratch pretraining, and cross-lingual transfer between the dialects as well as from different kinds of off-the- shelf multilingual models; we find that a model pretrained on other mid-resource Indic dialects and languages, with extended pretraining on target dialect data, consistently outperforms other models. We interpret our results in terms of dataset sizes, phylogenetic relationships, and corpus statistics, as well as particularities of this linguistic system.
Anthology ID:
2023.jeptalnrecital-long.3
Volume:
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs
Month:
6
Year:
2023
Address:
Paris, France
Editors:
Christophe Servan, Anne Vilnat
Venue:
JEP/TALN/RECITAL
SIG:
Publisher:
ATALA
Note:
Pages:
28–42
Language:
URL:
https://aclanthology.org/2023.jeptalnrecital-long.3
DOI:
Bibkey:
Cite (ACL):
Niyati Bafna, Cristina España-Bonet, Josef Van Genabith, Benoît Sagot, and Rachel Bawden. 2023. Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs, pages 28–42, Paris, France. ATALA.
Cite (Informal):
Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects (Bafna et al., JEP/TALN/RECITAL 2023)
Copy Citation:
PDF:
https://preview.aclanthology.org/naacl24-info/2023.jeptalnrecital-long.3.pdf