Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects
Niyati Bafna, Cristina España-Bonet, Josef Van Genabith, Benoît Sagot, Rachel Bawden
Abstract
Neural language models play an increasingly central role for language processing, given their success for a range of NLP tasks. In this study, we compare some canonical strategies in language modeling for low-resource scenarios, evaluating all models by their (finetuned) performance on a POS-tagging downstream task. We work with five (extremely) low-resource dialects from the Indic dialect continuum (Braj, Awadhi, Bhojpuri, Magahi, Maithili), which are closely related to each other and the standard mid-resource dialect, Hindi. The strategies we evaluate broadly include from-scratch pretraining, and cross-lingual transfer between the dialects as well as from different kinds of off-the-shelf multilingual models; we find that a model pretrained on other mid-resource Indic dialects and languages, with extended pretraining on target dialect data, consistently outperforms other models. We interpret our results in terms of dataset sizes, phylogenetic relationships, and corpus statistics, as well as particularities of this linguistic system.
- Anthology ID:
- 2023.jeptalnrecital-long.3
- Volume:
- Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs
- Month:
- 6
- Year:
- 2023
- Address:
- Paris, France
- Editors:
- Christophe Servan, Anne Vilnat
- Venue:
- JEP/TALN/RECITAL
- Publisher:
- ATALA
- Pages:
- 28–42
- URL:
- https://aclanthology.org/2023.jeptalnrecital-long.3
- Cite (ACL):
- Niyati Bafna, Cristina España-Bonet, Josef Van Genabith, Benoît Sagot, and Rachel Bawden. 2023. Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects. In Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs, pages 28–42, Paris, France. ATALA.
- Cite (Informal):
- Cross-lingual Strategies for Low-resource Language Modeling: A Study on Five Indic Dialects (Bafna et al., JEP/TALN/RECITAL 2023)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2023.jeptalnrecital-long.3.pdf