Karen Livescu


2023

pdf
SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks
Suwon Shon | Siddhant Arora | Chyi-Jiunn Lin | Ankita Pasad | Felix Wu | Roshan S Sharma | Wei-Lun Wu | Hung-yi Lee | Karen Livescu | Shinji Watanabe
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community, but have not received as much attention as lower-level tasks like speech and speaker recognition. In this work, we introduce several new annotated SLU benchmark tasks based on freely available speech data, which complement existing benchmarks and address gaps in the SLU evaluation landscape. We contribute four tasks: question answering and summarization involve inference over longer speech sequences; named entity localization addresses the speech-specific task of locating the targeted content in the signal; dialog act classification identifies the function of a given speech utterance. In order to facilitate the development of SLU models that leverage the success of pre-trained speech representations, we will release a new benchmark suite, including for each task (i) curated annotations for a relatively small fine-tuning set, (ii) reproducible pipeline (speech recognizer + text model) and end-to-end baseline models and evaluation metrics, (iii) baseline model performance in various types of systems for easy comparisons. We present the details of data collection and annotation and the performance of the baseline models. We also analyze the sensitivity of pipeline models’ performance to the speech recognition accuracy, using more than 20 publicly availablespeech recognition models.

2022

pdf
TTIC’s WMT-SLT 22 Sign Language Translation System
Bowen Shi | Diane Brentari | Gregory Shakhnarovich | Karen Livescu
Proceedings of the Seventh Conference on Machine Translation (WMT)

We describe TTIC’s model submission to WMT-SLT 2022 task on sign language translation (Swiss-German Sign Language (DSGS) - German). Our model consists of an I3D backbone for image encoding and a Transformerbased encoder-decoder model for sequence modeling. The I3D is pre-trained with isolated sign recognition using the WLASL dataset. The model is based on RGB images alone and does not rely on the pre-extracted human pose. We explore a few different strategies for model training in this paper. Our system achieves 0.3 BLEU score and 0.195 Chrf score on the official test set.

pdf
On the Use of External Data for Spoken Named Entity Recognition
Ankita Pasad | Felix Wu | Suwon Shon | Karen Livescu | Kyu Han
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Spoken language understanding (SLU) tasks involve mapping from speech signals to semantic labels. Given the complexity of such tasks, good performance is expected to require large labeled datasets, which are difficult to collect for each new task and domain. However, recent advances in self-supervised speech representations have made it feasible to consider learning SLU models with limited labeled data. In this work, we focus on low-resource spoken named entity recognition (NER) and address the question: Beyond self-supervised pre-training, how can we use external speech and/or text data that are not annotated for the task? We consider self-training, knowledge distillation, and transfer learning for end-to-end (E2E) and pipeline (speech recognition followed by text NER) approaches. We find that several of these approaches improve performance in resource-constrained settings beyond the benefits from pre-trained representations. Compared to prior work, we find relative improvements in F1 of up to 16%. While the best baseline model is a pipeline approach, the best performance using external data is ultimately achieved by an E2E model. We provide detailed comparisons and analyses, developing insights on, for example, the effects of leveraging external data on (i) different categories of NER errors and (ii) the switch in performance trends between pipeline and E2E models.

pdf bib
Self-supervised Representation Learning for Speech Processing
Hung-yi Lee | Abdelrahman Mohamed | Shinji Watanabe | Tara Sainath | Karen Livescu | Shang-Wen Li | Shu-wen Yang | Katrin Kirchhoff
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorial Abstracts

There is a trend in the machine learning community to adopt self-supervised approaches to pre-train deep networks. Self-supervised representation learning (SSL) utilizes proxy supervised learning tasks, for example, distinguishing parts of the input signal from distractors, or generating masked input segments conditioned on the unmasked ones, to obtain training data from unlabeled corpora. BERT and GPT in NLP and SimCLR and BYOL in CV are famous examples in this direction. These approaches make it possible to use a tremendous amount of unlabeled data available on the web to train large networks and solve complicated tasks. Thus, SSL has the potential to scale up current machine learning technologies, especially for low-resourced, under-represented use cases, and democratize the technologies. Recently self-supervised approaches for speech processing are also gaining popularity. There are several workshops in relevant topics hosted at ICML 2020 (https://icml-sas.gitlab.io/), NeurIPS 2020 (https://neurips-sas-2020.github.io/), and AAAI 2022 (https://aaai-sas-2022.github.io/). However, there is no previous tutorial about a similar topic based on the authors’ best knowledge. Due to the growing popularity of SSL, and the shared mission of the areas in bringing speech and language technologies to more use cases with better quality and scaling the technologies for under-represented languages, we propose this tutorial to systematically survey the latest SSL techniques, tools, datasets, and performance achievement in speech processing. The proposed tutorial is highly relevant to the special theme of ACL about language diversity. One of the main focuses of the tutorial is leveraging SSL to reduce the dependence of speech technologies on labeled data, and to scale up the technologies especially for under-represented languages and use cases.

pdf
Baked-in State Probing
Shubham Toshniwal | Sam Wiseman | Karen Livescu | Kevin Gimpel
Findings of the Association for Computational Linguistics: EMNLP 2022

Neural language models have been analyzed for their linguistic and extra-linguistic knowledge via probing. Of particular interest has been the following question: how much can a language model trained only on form learn about meaning? Recent work has demonstrated via probing classifiers that in the setting of simple procedural text, where by “meaning” we mean the underlying world state, language models have a non-trivial performance on world state tracking. However, our proposed evaluation based on model predictions shows differing results, suggesting that these models are either not capturing the world state or not using it. How do these results change if the model has access to the world state? We explore this alternate setting with access to the underlying world state only during training and investigate ways of “baking in” the state knowledge along with the primary task of language modeling. Our proposed approaches allow for state probing during inference simply via text prompts, avoiding any probing classifier machinery. In terms of performance, we show that baking in the state knowledge during training leads to significant improvements in state tracking performance and text generation quality,

pdf
Searching for fingerspelled content in American Sign Language
Bowen Shi | Diane Brentari | Greg Shakhnarovich | Karen Livescu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Natural language processing for sign language video—including tasks like recognition, translation, and search—is crucial for making artificial intelligence technologies accessible to deaf individuals, and is gaining research interest in recent years. In this paper, we address the problem of searching for fingerspelled keywords or key phrases in raw sign language videos. This is an important task since significant content in sign language is often conveyed via fingerspelling, and to our knowledge the task has not been studied before. We propose an end-to-end model for this task, FSS-Net, that jointly detects fingerspelling and matches it to a text sequence. Our experiments, done on a large public dataset of ASL fingerspelling in the wild, show the importance of fingerspelling detection as a component of a search and retrieval model. Our model significantly outperforms baseline methods adapted from prior work on related tasks.

pdf
Substructure Distribution Projection for Zero-Shot Cross-Lingual Dependency Parsing
Freda Shi | Kevin Gimpel | Karen Livescu
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present substructure distribution projection (SubDP), a technique that projects a distribution over structures in one domain to another, by projecting substructure distributions separately. Models for the target domain can then be trained, using the projected distributions as soft silver labels. We evaluate SubDP on zero shot cross-lingual dependency parsing, taking dependency arcs as substructures: we project the predicted dependency arc distributions in the source language(s) to target language(s), and train a target language parser on the resulting distributions. Given an English tree bank as the only source of human supervision, SubDP achieves better unlabeled attachment score than all prior work on the Universal Dependencies v2.2 (Nivre et al., 2020) test set across eight diverse target languages, as well as the best labeled attachment score on six languages. In addition, SubDP improves zero shot cross-lingual dependency parsing with very few (e.g., 50) supervised bitext pairs, across a broader range of target languages.

pdf
Open-Domain Sign Language Translation Learned from Online Video
Bowen Shi | Diane Brentari | Gregory Shakhnarovich | Karen Livescu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Existing work on sign language translation – that is, translation from sign language videos into sentences in a written language – has focused mainly on (1) data collected in a controlled environment or (2) data in a specific domain, which limits the applicability to real-world settings. In this paper, we introduce OpenASL, a large-scale American Sign Language (ASL) - English dataset collected from online video sites (e.g., YouTube).OpenASL contains 288 hours of ASL videos in multiple domains from over 200 signers and is the largest publicly available ASL translation dataset to date. To tackle the challenges of sign language translation in realistic settings and without glosses, we propose a set of techniques including sign search as a pretext task for pre-training and fusion of mouthing and handshape features. The proposed techniques produce consistent and large improvements in translation quality, over baseline models basedon prior work.

2021

pdf
Substructure Substitution: Structured Data Augmentation for NLP
Haoyue Shi | Karen Livescu | Kevin Gimpel
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

pdf
On Generalization in Coreference Resolution
Shubham Toshniwal | Patrick Xia | Sam Wiseman | Karen Livescu | Kevin Gimpel
Proceedings of the Fourth Workshop on Computational Models of Reference, Anaphora and Coreference

While coreference resolution is defined independently of dataset domain, most models for performing coreference resolution do not transfer well to unseen domains. We consolidate a set of 8 coreference resolution datasets targeting different domains to evaluate the off-the-shelf performance of models. We then mix three datasets for training; even though their domain, annotation guidelines, and metadata differ, we propose a method for jointly training a single model on this heterogeneous data mixture by using data augmentation to account for annotation differences and sampling to balance the data quantities. We find that in a zero-shot setting, models trained on a single dataset transfer poorly while joint training yields improved overall performance, leading to better generalization in coreference resolution models. This work contributes a new benchmark for robust coreference resolution and multiple new state-of-the-art results.

2020

pdf
Discrete Latent Variable Representations for Low-Resource Text Classification
Shuning Jin | Sam Wiseman | Karl Stratos | Karen Livescu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

While much work on deep latent variable models of text uses continuous latent variables, discrete latent variables are interesting because they are more interpretable and typically more space efficient. We consider several approaches to learning discrete latent variable models for text in the case where exact marginalization over these variables is intractable. We compare the performance of the learned representations as features for low-resource document and sentence classification. Our best models outperform the previous best reported results with continuous representations in these low-resource settings, while learning significantly more compressed representations. Interestingly, we find that an amortized variant of Hard EM performs particularly well in the lowest-resource regimes.

pdf
PeTra: A Sparsely Supervised Memory Model for People Tracking
Shubham Toshniwal | Allyson Ettinger | Kevin Gimpel | Karen Livescu
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We propose PeTra, a memory-augmented neural network designed to track entities in its memory slots. PeTra is trained using sparse annotation from the GAP pronoun resolution dataset and outperforms a prior memory model on the task while using a simpler architecture. We empirically compare key modeling choices, finding that we can simplify several aspects of the design of the memory module while retaining strong performance. To measure the people tracking capability of memory models, we (a) propose a new diagnostic evaluation based on counting the number of unique entities in text, and (b) conduct a small scale human evaluation to compare evidence of people tracking in the memory logs of PeTra relative to a previous approach. PeTra is highly effective in both evaluations, demonstrating its ability to track people in its memory despite being trained with limited annotation.

pdf
On the Role of Supervision in Unsupervised Constituency Parsing
Haoyue Shi | Karen Livescu | Kevin Gimpel
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We analyze several recent unsupervised constituency parsing models, which are tuned with respect to the parsing F1 score on the Wall Street Journal (WSJ) development set (1,700 sentences). We introduce strong baselines for them, by training an existing supervised parsing model (Kitaev and Klein, 2018) on the same labeled examples they access. When training on the 1,700 examples, or even when using only 50 examples for training and 5 for development, such a few-shot parsing approach can outperform all the unsupervised parsing methods by a significant margin. Few-shot parsing can be further improved by a simple data augmentation method and self-training. This suggests that, in order to arrive at fair conclusions, we should carefully consider the amount of labeled data used for model development. We propose two protocols for future work on unsupervised parsing: (i) use fully unsupervised criteria for hyperparameter tuning and model selection; (ii) use as few labeled examples as possible for model development, and compare to few-shot parsing trained on the same labeled examples.

pdf
Learning to Ignore: Long Document Coreference with Bounded Memory Neural Networks
Shubham Toshniwal | Sam Wiseman | Allyson Ettinger | Karen Livescu | Kevin Gimpel
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Long document coreference resolution remains a challenging task due to the large memory and runtime requirements of current models. Recent work doing incremental coreference resolution using just the global representation of entities shows practical benefits but requires keeping all entities in memory, which can be impractical for long documents. We argue that keeping all entities in memory is unnecessary, and we propose a memory-augmented neural network that tracks only a small bounded number of entities at a time, thus guaranteeing a linear runtime in length of document. We show that (a) the model remains competitive with models with high memory and computational requirements on OntoNotes and LitBank, and (b) the model learns an efficient memory management strategy easily outperforming a rule-based strategy

pdf
A Cross-Task Analysis of Text Span Representations
Shubham Toshniwal | Haoyue Shi | Bowen Shi | Lingyu Gao | Karen Livescu | Kevin Gimpel
Proceedings of the 5th Workshop on Representation Learning for NLP

Many natural language processing (NLP) tasks involve reasoning with textual spans, including question answering, entity recognition, and coreference resolution. While extensive research has focused on functional architectures for representing words and sentences, there is less work on representing arbitrary spans of text within sentences. In this paper, we conduct a comprehensive empirical evaluation of six span representation methods using eight pretrained language representation models across six tasks, including two tasks that we introduce. We find that, although some simple span representations are fairly reliable across tasks, in general the optimal span representation varies by task, and can also vary within different facets of individual tasks. We also find that the choice of span representation has a bigger impact with a fixed pretrained encoder than with a fine-tuned encoder.

2019

pdf
Pre-training on high-resource speech recognition improves low-resource speech-to-text translation
Sameer Bansal | Herman Kamper | Karen Livescu | Adam Lopez | Sharon Goldwater
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data are available. Through an ablation study, we find that the pre-trained encoder (acoustic model) accounts for most of the improvement, despite the fact that the shared language in these tasks is the target language text, not the source language audio. Applying this insight, we show that pre-training on ASR helps ST even when the ASR language differs from both source and target ST languages: pre-training on French ASR also improves Spanish-English ST. Finally, we show that the approach improves performance on a true low-resource task: pre-training on a combination of English ASR and French ASR improves Mboshi-French ST, where only 4 hours of data are available, from 3.5 to 7.1 BLEU.

pdf
Visually Grounded Neural Syntax Acquisition
Haoyue Shi | Jiayuan Mao | Kevin Gimpel | Karen Livescu
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present the Visually Grounded Neural Syntax Learner (VG-NSL), an approach for learning syntactic representations and structures without any explicit supervision. The model learns by looking at natural images and reading paired captions. VG-NSL generates constituency parse trees of texts, recursively composes representations for constituents, and matches them with images. We define concreteness of constituents by their matching scores with images, and use it to guide the parsing of text. Experiments on the MSCOCO data set show that VG-NSL outperforms various unsupervised parsing approaches that do not use visual grounding, in terms of F1 scores against gold parse trees. We find that VGNSL is much more stable with respect to the choice of random initialization and the amount of training data. We also find that the concreteness acquired by VG-NSL correlates well with a similar measure defined by linguists. Finally, we also apply VG-NSL to multiple languages in the Multi30K data set, showing that our model consistently outperforms prior unsupervised approaches.

2018

pdf
Parsing Speech: a Neural Approach to Integrating Lexical and Acoustic-Prosodic Information
Trang Tran | Shubham Toshniwal | Mohit Bansal | Kevin Gimpel | Karen Livescu | Mari Ostendorf
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

In conversational speech, the acoustic signal provides cues that help listeners disambiguate difficult parses. For automatically parsing spoken utterances, we introduce a model that integrates transcribed text and acoustic-prosodic features using a convolutional neural network over energy and pitch trajectories coupled with an attention-based recurrent neural network that accepts text and prosodic features. We find that different types of acoustic-prosodic features are individually helpful, and together give statistically significant improvements in parse and disfluency detection F1 scores over a strong text-only baseline. For this study with known sentence boundaries, error analyses show that the main benefit of acoustic-prosodic features is in sentences with disfluencies, attachment decisions are most improved, and transcription errors obscure gains from prosody.

pdf
Variational Sequential Labelers for Semi-Supervised Learning
Mingda Chen | Qingming Tang | Karen Livescu | Kevin Gimpel
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We introduce a family of multitask variational methods for semi-supervised sequence labeling. Our model family consists of a latent-variable generative model and a discriminative labeler. The generative models use latent variables to define the conditional probability of a word given its context, drawing inspiration from word prediction objectives commonly used in learning word embeddings. The labeler helps inject discriminative information into the latent space. We explore several latent variable configurations, including ones with hierarchical structure, which enables the model to account for both label-specific and word-specific information. Our models consistently outperform standard sequential baselines on 8 sequence labeling datasets, and improve further with unlabeled data.

2017

pdf
Learning to Embed Words in Context for Syntactic Tasks
Lifu Tu | Kevin Gimpel | Karen Livescu
Proceedings of the 2nd Workshop on Representation Learning for NLP

We present models for embedding words in the context of surrounding words. Such models, which we refer to as token embeddings, represent the characteristics of a word that are specific to a given context, such as word sense, syntactic category, and semantic role. We explore simple, efficient token embedding models based on standard neural network architectures. We learn token embeddings on a large amount of unannotated text and evaluate them as features for part-of-speech taggers and dependency parsers trained on much smaller amounts of annotated data. We find that predictors endowed with token embeddings consistently outperform baseline predictors across a range of context window and training set sizes.

2016

pdf
Charagram: Embedding Words and Sentences via Character n-grams
John Wieting | Mohit Bansal | Kevin Gimpel | Karen Livescu
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf
Mapping Unseen Words to Task-Trained Embedding Spaces
Pranava Swaroop Madhyastha | Mohit Bansal | Kevin Gimpel | Karen Livescu
Proceedings of the 1st Workshop on Representation Learning for NLP

2015

pdf
Deep Multilingual Correlation for Improved Word Embeddings
Ang Lu | Weiran Wang | Mohit Bansal | Kevin Gimpel | Karen Livescu
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
From Paraphrase Database to Compositional Paraphrase Model and Back
John Wieting | Mohit Bansal | Kevin Gimpel | Karen Livescu
Transactions of the Association for Computational Linguistics, Volume 3

The Paraphrase Database (PPDB; Ganitkevitch et al., 2013) is an extensive semantic resource, consisting of a list of phrase pairs with (heuristic) confidence estimates. However, it is still unclear how it can best be used, due to the heuristic nature of the confidences and its necessarily incomplete coverage. We propose models to leverage the phrase pairs from the PPDB to build parametric paraphrase models that score paraphrase pairs more accurately than the PPDB’s internal scores while simultaneously improving its coverage. They allow for learning phrase embeddings as well as improved word embeddings. Moreover, we introduce two new, manually annotated datasets to evaluate short-phrase paraphrasing models. Using our paraphrase model trained using PPDB, we achieve state-of-the-art results on standard word and bigram similarity tasks and beat strong baselines on our new short phrase paraphrase tasks.

2014

pdf bib
Revisiting Word Neighborhoods for Speech Recognition
Preethi Jyothi | Karen Livescu
Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM

pdf
Tailoring Continuous Word Representations for Dependency Parsing
Mohit Bansal | Kevin Gimpel | Karen Livescu
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2013

pdf bib
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing
David Yarowsky | Timothy Baldwin | Anna Korhonen | Karen Livescu | Steven Bethard
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

2012

pdf
Discriminative Pronunciation Modeling: A Large-Margin, Feature-Rich Approach
Hao Tang | Joseph Keshet | Karen Livescu
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2010

pdf
Domain Adaptation with Unlabeled Data for Dialog Act Tagging
Anna Margolis | Karen Livescu | Mari Ostendorf
Proceedings of the 2010 Workshop on Domain Adaptation for Natural Language Processing

2008

pdf bib
Invited talk: Phonological Models in Automatic Speech Recognition
Karen Livescu
Proceedings of the Tenth Meeting of ACL Special Interest Group on Computational Morphology and Phonology

2004

pdf
Feature-based Pronunciation Modeling for Speech Recognition
Karen Livescu | James Glass
Proceedings of HLT-NAACL 2004: Short Papers