Divyansh Kaushik


Practical Benefits of Feature Feedback Under Distribution Shift
Anurag Katakkar | Clay H. Yoo | Weiqin Wang | Zachary Lipton | Divyansh Kaushik
Proceedings of the Fifth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

In attempts to develop sample-efficient and interpretable algorithms, researcher have explored myriad mechanisms for collecting and exploiting feature feedback, auxiliary annotations provided for training (but not test) instances that highlight salient evidence. Examples include bounding boxes around objects and salient spans in text. Despite its intuitive appeal, feature feedback has not delivered significant gains in practical problems as assessed on iid holdout sets. However, recent works on counterfactually augmented data suggest an alternative benefit of supplemental annotations, beyond interpretability: lessening sensitivity to spurious patterns and consequently delivering gains in out-of-domain evaluations. We speculate that while existing methods for incorporating feature feedback have delivered negligible in-sample performance gains, they may nevertheless provide out-of-domain benefits. Our experiments addressing sentiment analysis, show that feature feedback methods perform significantly better on various natural out-of-domain datasets despite comparable in-domain evaluations. By contrast, performance on natural language inference remains comparable. Finally, we compare those tasks where feature feedback does (and does not) help.


On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study
Divyansh Kaushik | Douwe Kiela | Zachary C. Lipton | Wen-tau Yih
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions. Researchers hope that models trained on these more challenging datasets will rely less on superficial patterns, and thus be less brittle. However, despite ADC’s intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models. In this paper, we conduct a large-scale controlled study focused on question answering, assigning workers at random to compose questions either (i) adversarially (with a model in the loop); or (ii) in the standard fashion (without a model). Across a variety of models and datasets, we find that models trained on adversarial data usually perform better on other adversarial datasets but worse on a diverse collection of out-of-domain evaluation sets. Finally, we provide a qualitative analysis of adversarial (vs standard) data, identifying key differences and offering guidance for future research.

Dynabench: Rethinking Benchmarking in NLP
Douwe Kiela | Max Bartolo | Yixin Nie | Divyansh Kaushik | Atticus Geiger | Zhengxuan Wu | Bertie Vidgen | Grusha Prasad | Amanpreet Singh | Pratik Ringshia | Zhiyi Ma | Tristan Thrush | Sebastian Riedel | Zeerak Waseem | Pontus Stenetorp | Robin Jia | Mohit Bansal | Christopher Potts | Adina Williams
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.


How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
Divyansh Kaushik | Zachary C. Lipton
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Many recent papers address reading comprehension, where examples consist of (question, passage, answer) tuples. Presumably, a model must combine information from both questions and passages to predict corresponding answers. However, despite intense interest in the topic, with hundreds of published papers vying for leaderboard dominance, basic questions about the difficulty of many popular benchmarks remain unanswered. In this paper, we establish sensible baselines for the bAbI, SQuAD, CBT, CNN, and Who-did-What datasets, finding that question- and passage-only models often perform surprisingly well. On 14 out of 20 bAbI tasks, passage-only models achieve greater than 50% accuracy, sometimes matching the full model. Interestingly, while CBT provides 20-sentence passages, only the last is needed for accurate prediction. By comparison, SQuAD and CNN appear better-constructed.


Making Travel Smarter: Extracting Travel Information From Email Itineraries Using Named Entity Recognition
Divyansh Kaushik | Shashank Gupta | Chakradhar Raju | Reuben Dias | Sanjib Ghosh
Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017

The purpose of this research is to address the problem of extracting information from travel itineraries and discuss the challenges faced in the process. Business-to-customer emails like booking confirmations and e-tickets are usually machine generated by filling slots in pre-defined templates which improve the presentation of such emails but also make the emails more complex in structure. Extracting the relevant information from these emails would let users track their journeys and important updates on applications installed on their devices to give them a consolidated over view of their itineraries and also save valuable time. We investigate the use of an HMM-based named entity recognizer on such emails which we will use to label and extract relevant entities. NER in such emails is challenging as these itineraries offer less useful contextual information. We also propose a rich set of features which are integrated into the model and are specific to our domain. The result from our model is a list of lists containing the relevant information extracted from ones itinerary.