Shankar Kumar


2021

Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models
Felix Stahlberg | Shankar Kumar
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

Synthetic data generation is widely known to boost the accuracy of neural grammatical error correction (GEC) systems, but existing methods often lack diversity or are too simplistic to generate the broad range of grammatical errors made by human writers. In this work, we use error type tags from automatic annotation tools such as ERRANT to guide synthetic data generation. We compare several models that can produce an ungrammatical sentence given a clean sentence and an error type tag. We use these models to build a new, large synthetic pre-training data set with error tag frequency distributions matching a given development set. Our synthetic data set yields large and consistent gains, improving the state-of-the-art on the BEA-19 and CoNLL-14 test sets. We also show that our approach is particularly effective in adapting a GEC system, trained on mixed native and non-native English, to a native English test set, even surpassing real training data consisting of high-quality sentence pairs.
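
The idea of matching error tag frequencies to a development set can be illustrated with a minimal sketch. It assumes the development set has already been annotated with ERRANT-style tags; the function names, the toy tag list, and the sampling scheme are hypothetical and are not taken from the paper. The tagged corruption model itself is not shown: the sketch only decides which tag each clean sentence would be paired with.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def tag_distribution(dev_tags: List[str]) -> Dict[str, float]:
    """Relative frequency of each error tag observed in a development set."""
    counts = Counter(dev_tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def assign_tags(
    clean_sentences: List[str], distribution: Dict[str, float], seed: int = 0
) -> List[Tuple[str, str]]:
    """Pair each clean sentence with a tag sampled from the dev-set distribution.

    A tagged corruption model (not shown, assumed to exist) would then
    generate an ungrammatical sentence for each (sentence, tag) pair.
    """
    rng = random.Random(seed)
    tags = list(distribution)
    weights = [distribution[t] for t in tags]
    return [(s, rng.choices(tags, weights=weights, k=1)[0]) for s in clean_sentences]


# Toy example: ERRANT-style tags as they might be extracted from a dev set.
dev_tags = ["R:VERB", "R:VERB", "M:DET", "U:PREP"]
distribution = tag_distribution(dev_tags)
pairs = assign_tags(["She walks home .", "I like tea ."], distribution)
```

Sampling one tag per sentence is a simplification made here for brevity; a real pipeline might instead allocate tags in exact proportion over the whole corpus.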

Data Strategies for Low-Resource Grammatical Error Correction
Simon Flachs | Felix Stahlberg | Shankar Kumar
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

Grammatical Error Correction (GEC) is a task that has been extensively investigated for the English language. However, for low-resource languages the best practices for training GEC systems have not yet been systematically determined. We investigate how best to take advantage of existing data sources for improving GEC systems for languages with limited quantities of high-quality training data. We show that methods for generating artificial training data for GEC can benefit from including morphological errors. We also demonstrate that noisy error correction data gathered from Wikipedia revision histories and the language learning website Lang8 are valuable data sources. Finally, we show that GEC systems pre-trained on noisy data sources can be fine-tuned effectively using small amounts of high-quality, human-annotated data.

2020

Seq2Edits: Sequence Transduction Using Span-level Edit Operations
Felix Stahlberg | Shankar Kumar
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We propose Seq2Edits, an open-vocabulary approach to sequence editing for natural language processing (NLP) tasks with a high degree of overlap between input and output texts. In this approach, each sequence-to-sequence transduction is represented as a sequence of edit operations, where each operation either replaces an entire source span with target tokens or keeps it unchanged. We evaluate our method on five NLP tasks (text normalization, sentence fusion, sentence splitting & rephrasing, text simplification, and grammatical error correction) and report competitive results across the board. For grammatical error correction, our method speeds up inference by up to 5.2x compared to full sequence models because inference time depends on the number of edits rather than the number of target tokens. For text normalization, sentence fusion, and grammatical error correction, our approach improves explainability by associating each edit operation with a human-readable tag.
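
The span-level edit representation described above can be sketched as follows. This is only an illustration under stated assumptions: each edit is modeled here as a (tag, span end position, replacement) triple, the tag names are made up for the example, and `apply_edits` is a hypothetical helper rather than the authors' implementation.

```python
from typing import List, Optional, Tuple

# One edit: (human-readable tag, exclusive end of the source span,
# replacement tokens, or None to keep the span unchanged).
Edit = Tuple[str, int, Optional[List[str]]]


def apply_edits(source: List[str], edits: List[Edit]) -> List[str]:
    """Rebuild the target sequence by walking the source span by span."""
    output: List[str] = []
    start = 0  # each span begins where the previous one ended
    for tag, end, replacement in edits:
        if end < start or end > len(source):
            raise ValueError(f"invalid span end {end} for edit tagged {tag!r}")
        if replacement is None:
            output.extend(source[start:end])  # keep the source span as is
        else:
            output.extend(replacement)  # replace the whole span
        start = end
    return output


source = "He go to school yesterday .".split()
edits: List[Edit] = [
    ("SELF", 1, None),               # keep "He"
    ("VERB_TENSE", 2, ["went"]),     # replace "go" with "went"
    ("SELF", 6, None),               # keep the rest of the sentence
]
corrected = apply_edits(source, edits)
```

Three edits cover six source tokens in this example, which is the intuition behind the abstract's remark that inference time depends on the number of edits rather than the number of target tokens.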

Data Weighted Training Strategies for Grammatical Error Correction
Jared Lichtarge | Chris Alberti | Shankar Kumar
Transactions of the Association for Computational Linguistics, Volume 8

Recent progress in the task of Grammatical Error Correction (GEC) has been driven by addressing data sparsity, both through new methods for generating large and noisy pretraining data and through the publication of small and higher-quality finetuning data in the BEA-2019 shared task. Building upon recent work in Neural Machine Translation (NMT), we make use of both kinds of data by deriving example-level scores on our large pretraining data based on a smaller, higher-quality dataset. In this work, we perform an empirical study to discover how to best incorporate delta-log-perplexity, a type of example scoring, into a training schedule for GEC. In doing so, we perform experiments that shed light on the function and applicability of delta-log-perplexity. Models trained on scored data achieve state-of-the-art results on common GEC test sets.

2019

Corpora Generation for Grammatical Error Correction
Jared Lichtarge | Chris Alberti | Shankar Kumar | Noam Shazeer | Niki Parmar | Simon Tong
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Grammatical Error Correction (GEC) has been recently modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. The first method extracts source-target pairs from Wikipedia edit histories with minimal filtration heuristics, while the second method introduces noise into Wikipedia sentences via round-trip translation through bridge languages. Both strategies yield similarly sized parallel corpora containing around 4B tokens. We employ an iterative decoding strategy that is tailored to the loosely supervised nature of our constructed corpora. We demonstrate that neural GEC models trained using either type of corpora give similar performance. Fine-tuning these models on the Lang-8 corpus and ensembling allows us to surpass the state of the art on both the CoNLL '14 benchmark and the JFLEG task. We present systematic analysis that compares the two approaches to data generation and highlights the effectiveness of ensembling.

2015

Multilingual Open Relation Extraction Using Cross-lingual Projection
Manaal Faruqui | Shankar Kumar
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2010

Expected Sequence Similarity Maximization
Cyril Allauzen | Shankar Kumar | Wolfgang Macherey | Mehryar Mohri | Michael Riley
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Model Combination for Machine Translation
John DeNero | Shankar Kumar | Ciprian Chelba | Franz Och
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2009

Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices
Shankar Kumar | Wolfgang Macherey | Chris Dyer | Franz Och
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

2008

Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation
Roy Tromble | Shankar Kumar | Franz Och | Wolfgang Macherey
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Improving Word Alignment with Bridge Languages
Shankar Kumar | Franz J. Och | Wolfgang Macherey
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

2005

Local Phrase Reordering Models for Statistical Machine Translation
Shankar Kumar | William Byrne
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

2004

A Smorgasbord of Features for Statistical Machine Translation
Franz Josef Och | Daniel Gildea | Sanjeev Khudanpur | Anoop Sarkar | Kenji Yamada | Alex Fraser | Shankar Kumar | Libin Shen | David Smith | Katherine Eng | Viren Jain | Zhen Jin | Dragomir Radev
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

Minimum Bayes-Risk Decoding for Statistical Machine Translation
Shankar Kumar | William Byrne
Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004

2003

A Weighted Finite State Transducer Implementation of the Alignment Template Model for Statistical Machine Translation
Shankar Kumar | William Byrne
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

2002

Minimum Bayes-Risk Word Alignments of Bilingual Texts
Shankar Kumar | William Byrne
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)