2018
Code-Mixed Question Answering Challenge: Crowd-sourcing Data and Techniques
Khyathi Chandu | Ekaterina Loginova | Vishal Gupta | Josef van Genabith | Günter Neumann | Manoj Chinnakotla | Eric Nyberg | Alan W. Black
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
Code-Mixing (CM) is the phenomenon of alternating between two or more languages, which is prevalent in bi- and multi-lingual communities. Most NLP applications today are still designed with the assumption of a single interaction language and are most likely to break given a CM utterance with multiple languages mixed at a morphological, phrase or sentence level. For example, popular commercial search engines do not yet fully understand the intents expressed in CM queries. As a first step towards fostering research which supports CM in NLP applications, we systematically crowd-sourced and curated an evaluation dataset for factoid question answering in three CM languages: Hinglish (Hindi+English), Tenglish (Telugu+English) and Tamlish (Tamil+English), which belong to two language families (Indo-Aryan and Dravidian). We share the details of our data collection process, techniques which were used to avoid inducing lexical bias amongst the crowd workers and other CM-specific linguistic properties of the dataset. Our final dataset, which is available freely for research purposes, has 1,694 Hinglish, 2,848 Tamlish and 1,391 Tenglish factoid questions and their answers. We discuss the techniques used by the participants for the first edition of this ongoing challenge.
Transliteration Better than Translation? Answering Code-mixed Questions over a Knowledge Base
Vishal Gupta | Manoj Chinnakotla | Manish Shrivastava
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
Humans can learn multiple languages. If they know a fact in one language, they can answer a question in another language they understand. They can also answer Code-mix (CM) questions: questions which contain both languages. This behavior is attributed to the unique learning ability of humans. Our task aims to study if machines can achieve this. We demonstrate how effectively a machine can answer CM questions. In this work, we adopt a two-phase approach: candidate generation and candidate re-ranking to answer questions. We propose a Triplet-Siamese-Hybrid CNN (TSHCNN) to re-rank candidate answers. We show experiments on the SimpleQuestions dataset. Our network is trained only on English questions provided in this dataset and noisy Hindi translations of these questions, and can answer English-Hindi CM questions effectively without the need for translation into English. Back-transliterated CM questions outperform their lexical and sentence-level translated counterparts by 5% and 35% in accuracy respectively, highlighting the efficacy of our approach in a resource-constrained setting.
Retrieve and Re-rank: A Simple and Effective IR Approach to Simple Question Answering over Knowledge Graphs
Vishal Gupta | Manoj Chinnakotla | Manish Shrivastava
Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)
SimpleQuestions is a commonly used benchmark for single-factoid question answering (QA) over Knowledge Graphs (KG). Existing QA systems rely on various components to solve different sub-tasks of the problem (such as entity detection, entity linking, relation prediction and evidence integration). In this work, we propose a different approach to the problem and present an information retrieval style solution for it. We adopt a two-phase approach: candidate generation and candidate re-ranking to answer questions. We propose a Triplet-Siamese-Hybrid CNN (TSHCNN) to re-rank candidate answers. Our approach achieves an accuracy of 80% which sets a new state-of-the-art on the SimpleQuestions dataset.
2012
Automatic Punjabi Text Extractive Summarization System
Vishal Gupta | Gurpreet Lehal
Proceedings of COLING 2012: Demonstration Papers
Complete Pre Processing Phase of Punjabi Text Extractive Summarization System
Vishal Gupta | Gurpreet Lehal
Proceedings of COLING 2012: Demonstration Papers
Domain Based Classification of Punjabi Text Documents
Nidhi Krail | Vishal Gupta
Proceedings of COLING 2012: Demonstration Papers
Domain Based Punjabi Text Document Clustering
Saurabh Sharma | Vishal Gupta
Proceedings of COLING 2012: Demonstration Papers
Domain Based Classification of Punjabi Text Documents using Ontology and Hybrid Based Approach
Nidhi Krail | Vishal Gupta
Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing
2011
Punjabi Language Stemmer for nouns and proper names
Vishal Gupta | Gurpreet Singh Lehal
Proceedings of the 2nd Workshop on South Southeast Asian Natural Language Processing (WSSANLP)