Amanpreet Singh


2022

pdf
SpecNFS: A Challenge Dataset Towards Extracting Formal Models from Natural Language Specifications
Sayontan Ghosh | Amanpreet Singh | Alex Merenstein | Wei Su | Scott A. Smolka | Erez Zadok | Niranjan Balasubramanian
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Can NLP assist in building formal models for verifying complex systems? We study this challenge in the context of parsing Network File System (NFS) specifications. We define a semantic-dependency problem over SpecIR, a representation language we introduce to model sentences appearing in NFS specification documents (RFCs) as IF-THEN statements, and present an annotated dataset of 1,198 sentences. We develop and evaluate semantic-dependency parsing systems for this problem. Evaluations show that even when using a state-of-the-art language model, there is significant room for improvement, with the best models achieving an F1 score of only 60.5 and 33.3 in the named-entity-recognition and dependency-link-prediction sub-tasks, respectively. We also release additional unlabeled data and other domain-related texts. Experiments show that these additional resources increase the F1 measure when used for simple domain-adaption and transfer-learning-based approaches, suggesting fruitful directions for further research

2021

pdf
Dynabench: Rethinking Benchmarking in NLP
Douwe Kiela | Max Bartolo | Yixin Nie | Divyansh Kaushik | Atticus Geiger | Zhengxuan Wu | Bertie Vidgen | Grusha Prasad | Amanpreet Singh | Pratik Ringshia | Zhiyi Ma | Tristan Thrush | Sebastian Riedel | Zeerak Waseem | Pontus Stenetorp | Robin Jia | Mohit Bansal | Christopher Potts | Adina Williams
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.

2020

pdf
Infosys Machine Translation System for WMT20 Similar Language Translation Task
Kamalkumar Rathinasamy | Amanpreet Singh | Balaguru Sivasambagupta | Prajna Prasad Neerchal | Vani Sivasankaran
Proceedings of the Fifth Conference on Machine Translation

This paper describes Infosys’s submission to the WMT20 Similar Language Translation shared task. We participated in Indo-Aryan language pair in the language direction Hindi to Marathi. Our baseline system is byte-pair encoding based transformer model trained with the Fairseq sequence modeling toolkit. Our final system is an ensemble of two transformer models, which ranked first in WMT20 evaluation. One model is designed to learn the nuances of translation of this low resource language pair by taking advantage of the fact that the source and target languages are same alphabet languages. The other model is the result of experimentation with the proportion of back-translated data to the parallel data to improve translation fluency.

2019

pdf
Neural Network Acceptability Judgments
Alex Warstadt | Amanpreet Singh | Samuel R. Bowman
Transactions of the Association for Computational Linguistics, Volume 7

This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence. We introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. As baselines, we train several recurrent neural network models on acceptability classification, and find that our models outperform unsupervised models by Lau et al. (2016) on CoLA. Error-analysis on specific grammatical phenomena reveals that both Lau et al.’s models and ours learn systematic generalizations like subject-verb-object order. However, all models we test perform far below human level on a wide range of grammatical constructions.

2018

pdf
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang | Amanpreet Singh | Julian Michael | Felix Hill | Omer Levy | Samuel Bowman
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Human ability to understand language is general, flexible, and robust. In contrast, most NLU models above the word level are designed for a specific task and struggle with out-of-domain data. If we aspire to develop models with understanding beyond the detection of superficial correspondences between inputs and outputs, then it is critical to develop a unified model that can execute a range of linguistic tasks across different domains. To facilitate research in this direction, we present the General Language Understanding Evaluation (GLUE, gluebenchmark.com): a benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models. For some benchmark tasks, training data is plentiful, but for others it is limited or does not match the genre of the test set. GLUE thus favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks. While none of the datasets in GLUE were created from scratch for the benchmark, four of them feature privately-held test data, which is used to ensure that the benchmark is used fairly. We evaluate baselines that use ELMo (Peters et al., 2018), a powerful transfer learning technique, as well as state-of-the-art sentence representation models. The best models still achieve fairly low absolute scores. Analysis with our diagnostic dataset yields similarly weak performance over all phenomena tested, with some exceptions.