Code-Mixed Question Answering Challenge: Crowd-sourcing Data and Techniques
Khyathi Chandu, Ekaterina Loginova, Vishal Gupta, Josef van Genabith, Günter Neumann, Manoj Chinnakotla, Eric Nyberg, Alan W. Black
Abstract
Code-Mixing (CM) is the phenomenon of alternating between two or more languages which is prevalent in bi- and multi-lingual communities. Most NLP applications today are still designed with the assumption of a single interaction language and are most likely to break given a CM utterance with multiple languages mixed at a morphological, phrase or sentence level. For example, popular commercial search engines do not yet fully understand the intents expressed in CM queries. As a first step towards fostering research which supports CM in NLP applications, we systematically crowd-sourced and curated an evaluation dataset for factoid question answering in three CM languages - Hinglish (Hindi+English), Tenglish (Telugu+English) and Tamlish (Tamil+English) which belong to two language families (Indo-Aryan and Dravidian). We share the details of our data collection process, techniques which were used to avoid inducing lexical bias amongst the crowd workers and other CM specific linguistic properties of the dataset. Our final dataset, which is available freely for research purposes, has 1,694 Hinglish, 2,848 Tamlish and 1,391 Tenglish factoid questions and their answers. We discuss the techniques used by the participants for the first edition of this ongoing challenge.
- Anthology ID:
- W18-3204
- Volume:
- Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
- Month:
- July
- Year:
- 2018
- Address:
- Melbourne, Australia
- Editors:
- Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Thamar Solorio, Mona Diab, Julia Hirschberg
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 29–38
- URL:
- https://aclanthology.org/W18-3204
- DOI:
- 10.18653/v1/W18-3204
- Cite (ACL):
- Khyathi Chandu, Ekaterina Loginova, Vishal Gupta, Josef van Genabith, Günter Neumann, Manoj Chinnakotla, Eric Nyberg, and Alan W. Black. 2018. Code-Mixed Question Answering Challenge: Crowd-sourcing Data and Techniques. In Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching, pages 29–38, Melbourne, Australia. Association for Computational Linguistics.
- Cite (Informal):
- Code-Mixed Question Answering Challenge: Crowd-sourcing Data and Techniques (Chandu et al., ACL 2018)
- PDF:
- https://preview.aclanthology.org/ingest-2024-clasp/W18-3204.pdf
- Data
- SQuAD