2024
DeMuX: Data-efficient Multilingual Learning
Simran Khanuja | Srinivas Gowriraj | Lucio Dery | Graham Neubig
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Pre-trained multilingual models have enabled the deployment of NLP technologies for multiple languages. However, how to optimally fine-tune these models under an annotation budget, such that performance on desired target languages is jointly maximized, remains an open question. In this paper, we introduce DeMuX, a framework that prescribes the exact data points to label from vast amounts of unlabelled multilingual data, which may have unknown degrees of overlap with the target set. Unlike most prior work, our end-to-end framework is language-agnostic, accounts for model representations, and supports multilingual target configurations. Our active learning strategies rely upon distance and uncertainty measures to select task-specific neighbors that are most informative to label, given a model. DeMuX outperforms strong baselines in 84% of the test cases, in the zero-shot setting of disjoint source and target language sets (including multilingual target pools), across three models and four tasks. Notably, in low-budget settings (5-100 examples), we observe gains of 8-11 F1 points. Our code is released at https://github.com/simran-khanuja/demux.
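To make the selection idea concrete, here is a minimal, hypothetical sketch of distance-based selection. It is not DeMuX's exact algorithm (the actual strategies, including the uncertainty-based ones, live in the linked repository); the function name, averaging scheme, and toy dimensions are assumptions.

```python
# Hypothetical sketch of distance-based annotation selection, in the
# spirit of DeMuX: pick the unlabelled points whose model embeddings
# lie closest to the (unlabelled) target-language set. NOT the paper's
# exact algorithm; see https://github.com/simran-khanuja/demux.
import numpy as np

def select_by_distance(pool_emb: np.ndarray,
                       target_emb: np.ndarray,
                       budget: int) -> np.ndarray:
    """Return indices of `budget` pool points closest, on average,
    to the target embeddings (task-specific "neighbors")."""
    # Pairwise Euclidean distances, shape (n_pool, n_target)
    dists = np.linalg.norm(
        pool_emb[:, None, :] - target_emb[None, :, :], axis=-1)
    # Score each pool point by its mean distance to the target set;
    # smaller score = more target-like = more informative to label.
    return np.argsort(dists.mean(axis=1))[:budget]

# Toy usage; real embeddings would come from the multilingual model.
rng = np.random.default_rng(0)
pool = rng.normal(size=(500, 64))    # unlabelled multilingual pool
target = rng.normal(size=(20, 64))   # unlabelled target-language data
print(select_by_distance(pool, target, budget=10))
```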
2023
Evaluating the Diversity, Equity, and Inclusion of NLP Technology: A Case Study for Indian Languages
Simran Khanuja | Sebastian Ruder | Partha Talukdar
Findings of the Association for Computational Linguistics: EACL 2023
In order for NLP technology to be widely applicable, fair, and useful, it needs to serve a diverse set of speakers across the world’s languages, be equitable, i.e., not unduly biased towards any particular language, and be inclusive of all users, particularly in low-resource settings where compute constraints are common. In this paper, we propose an evaluation paradigm that assesses NLP technologies across all three dimensions. While diversity and inclusion have received attention in recent literature, equity is currently unexplored. We propose to address this gap using the Gini coefficient, a well-established metric used for estimating societal wealth inequality. Using our paradigm, we highlight the distressed state of current technologies for Indian (IN) languages (a linguistically large and diverse set, with a varied speaker population), across all three dimensions. To improve upon these metrics, we demonstrate the importance of region-specific choices in model building and dataset creation, and more importantly, propose a novel, generalisable approach to optimal resource allocation during fine-tuning. Finally, we discuss steps to mitigate these biases and encourage the community to employ multi-faceted evaluation when building linguistically diverse and equitable technologies.
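For reference, a standard form of the Gini coefficient, written here over hypothetical per-language utility values u_1, ..., u_n (the paper's exact instantiation over languages may differ), is:

\[
G \;=\; \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} \lvert u_i - u_j \rvert}{2n \sum_{i=1}^{n} u_i}
\]

where G = 0 corresponds to perfect equality across languages and values approaching 1 indicate extreme concentration.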
Multi-lingual and Multi-cultural Figurative Language Understanding
Anubha Kabra | Emmy Liu | Simran Khanuja | Alham Fikri Aji | Genta Winata | Samuel Cahyawijaya | Anuoluwapo Aremu | Perez Ogayo | Graham Neubig
Findings of the Association for Computational Linguistics: ACL 2023
Figurative language permeates human communication, but at the same time is relatively understudied in NLP. Datasets have been created in English to accelerate progress towards measuring and improving figurative language processing in language models (LMs). However, the use of figurative language is an expression of our cultural and societal experiences, making it difficult for these phrases to be universally applicable. In this work, we create a figurative language inference dataset for seven diverse languages associated with a variety of cultures: Hindi, Indonesian, Javanese, Kannada, Sundanese, Swahili and Yoruba. Our dataset reveals that each language relies on cultural and regional concepts for figurative expressions, with the highest overlap between languages originating from the same region. We assess multilingual LMs’ abilities to interpret figurative language in zero-shot and few-shot settings. All languages exhibit a significant deficiency compared to English, with variations in performance reflecting the availability of pre-training and fine-tuning data, emphasizing the need for LMs to be exposed to a broader range of linguistic and cultural variation during training. Data and code are released at https://anonymous.4open.science/r/Multilingual-Fig-QA-7B03/.
GlobalBench: A Benchmark for Global Progress in Natural Language Processing
Yueqi Song | Simran Khanuja | Pengfei Liu | Fahim Faisal | Alissa Ostapenko | Genta Winata | Alham Fikri Aji | Samuel Cahyawijaya | Yulia Tsvetkov | Antonios Anastasopoulos | Graham Neubig
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Despite major advances in NLP, significant disparities in NLP system performance across languages still exist. Arguably, these are due to uneven resource allocation and sub-optimal incentives to work on less-resourced languages. To track and further incentivize the global development of equitable language technology, we introduce GlobalBench. Prior multilingual benchmarks are static and have focused on a limited number of tasks and languages. In contrast, GlobalBench is an ever-expanding collection that aims to dynamically track progress on all NLP datasets in all languages. Rather than solely measuring accuracy, GlobalBench also tracks the estimated per-speaker utility and equity of technology across all languages, providing a multi-faceted view of how language technology is serving the people of the world. Furthermore, GlobalBench is designed to identify the most under-served languages, and rewards research efforts directed towards those languages. At present, the most under-served languages are ones with relatively high populations that are nonetheless overlooked by composite multilingual benchmarks (like Punjabi, Portuguese, and Wu Chinese). Currently, GlobalBench covers 966 datasets in 190 languages, and has 1,128 system submissions spanning 62 languages.
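As a hedged illustration of what a demographic-weighted "per-speaker utility" metric can look like, the sketch below averages per-language performance weighted by speaker population. GlobalBench's exact formulation may differ; the function name, normalization, and toy numbers are assumptions.

```python
# Hypothetical sketch of a demographic-weighted utility metric, in the
# spirit of the per-speaker utility GlobalBench tracks. Not necessarily
# the benchmark's exact formula.

def per_speaker_utility(perf: dict, speakers: dict) -> float:
    """Average system performance, weighted by speaker population,
    so a language's weight reflects how many people it serves."""
    total = sum(speakers[lang] for lang in perf)
    return sum(perf[lang] * speakers[lang] for lang in perf) / total

# Toy usage: performance scores in [0, 1], speaker counts in millions.
perf = {"en": 0.92, "pa": 0.41, "wuu": 0.35}
speakers = {"en": 1500, "pa": 113, "wuu": 83}
print(f"Demographic-weighted utility: {per_speaker_utility(perf, speakers):.3f}")
```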
2021
MergeDistill: Merging Language Models using Pre-trained Distillation
Simran Khanuja | Melvin Johnson | Partha Talukdar
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
2020
GLUECoS: An Evaluation Benchmark for Code-Switched NLP
Simran Khanuja | Sandipan Dandapat | Anirudh Srinivasan | Sunayana Sitaram | Monojit Choudhury
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present GLUECoS, an evaluation benchmark for code-switched languages that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our benchmark includes Language Identification from text, POS tagging, Named Entity Recognition, Sentiment Analysis, Question Answering, and a new task for code-switching: Natural Language Inference. We present results on all these tasks using cross-lingual word embedding models and multilingual models. In addition, we fine-tune multilingual models on artificially generated code-switched data. Although multilingual models perform significantly better than cross-lingual models, our results show that in most tasks, across both language pairs, multilingual models fine-tuned on code-switched data perform best, showing that multilingual models can be further optimized for code-switching tasks.
A New Dataset for Natural Language Inference from Code-mixed Conversations
Simran Khanuja | Sandipan Dandapat | Sunayana Sitaram | Monojit Choudhury
Proceedings of the 4th Workshop on Computational Approaches to Code Switching
Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and a hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises and hypotheses are in code-mixed Hindi-English. We use data from Hindi movies (Bollywood) as premises, and crowd-source hypotheses from Hindi-English bilinguals. We conduct a pilot annotation study and describe the final annotation protocol based on observations from the pilot. Currently, the collected data consists of 400 premises in the form of code-mixed conversation snippets and 2,240 code-mixed hypotheses. We conduct an extensive analysis of the linguistic phenomena commonly observed in the dataset. Finally, we evaluate the dataset using a standard mBERT-based pipeline for NLI and report results.
2019
Dependency Parser for Bengali-English Code-Mixed Data enhanced with a Synthetic Treebank
Urmi Ghosh | Dipti Sharma | Simran Khanuja
Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019)
Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities
Pratik Joshi | Christain Barnes | Sebastin Santy | Simran Khanuja | Sanket Shah | Anirudh Srinivasan | Satwik Bhattamishra | Sunayana Sitaram | Monojit Choudhury | Kalika Bali
Proceedings of the 16th International Conference on Natural Language Processing
In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities. In doing so, we bring to light the successes and failures of past work in this area, the challenges it has faced, and what it has achieved. Throughout the paper, we take a problem-facing approach and describe the essential factors on which the success of such technologies hinges. We present the various aspects in a manner that clarifies and lays out the different tasks involved, which can aid organizations looking to make an impact in this area. We take the example of Gondi, an extremely low-resource Indian language, to reinforce and complement our discussion.