A New Dataset for Natural Language Inference from Code-mixed Conversations

Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury


Abstract
Natural Language Inference (NLI) is the task of inferring the logical relationship, typically entailment or contradiction, between a premise and hypothesis. Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world. In this paper, we present the first dataset for code-mixed NLI, in which both the premises and hypotheses are in code-mixed Hindi-English. We use data from Hindi movies (Bollywood) as premises, and crowd-source hypotheses from Hindi-English bilinguals. We conduct a pilot annotation study and describe the final annotation protocol based on observations from the pilot. Currently, the data collected consists of 400 premises in the form of code-mixed conversation snippets and 2240 code-mixed hypotheses. We conduct an extensive analysis to infer the linguistic phenomena commonly observed in the dataset obtained. We evaluate the dataset using a standard mBERT-based pipeline for NLI and report results.
Anthology ID:
2020.calcs-1.2
Volume:
Proceedings of the 4th Workshop on Computational Approaches to Code Switching
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
CALCS
Publisher:
European Language Resources Association
Pages:
9–16
Language:
English
URL:
https://aclanthology.org/2020.calcs-1.2
Cite (ACL):
Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, and Monojit Choudhury. 2020. A New Dataset for Natural Language Inference from Code-mixed Conversations. In Proceedings of the 4th Workshop on Computational Approaches to Code Switching, pages 9–16, Marseille, France. European Language Resources Association.
Cite (Informal):
A New Dataset for Natural Language Inference from Code-mixed Conversations (Khanuja et al., CALCS 2020)
PDF:
https://preview.aclanthology.org/nodalida-main-page/2020.calcs-1.2.pdf
Data
GLUE, MultiNLI, SNLI, XNLI