Dan Garrette


2022

Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Jonathan H. Clark | Dan Garrette | Iulia Turc | John Wieting
Transactions of the Association for Computational Linguistics, Volume 10

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.
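The abstract above only sketches the architecture. A minimal, hypothetical PyTorch sketch of the core idea (character-level embeddings, strided-convolution downsampling, a deep transformer over the shortened sequence) might look like the following. All class names and hyperparameters are illustrative, byte-level embeddings stand in for the paper's codepoint hashing, and this is not the released implementation.

```python
# Illustrative sketch of a tokenization-free encoder in the spirit of Canine:
# embed raw characters, downsample the sequence, run a deep transformer on the
# shorter sequence. Assumes PyTorch; all names/sizes are hypothetical.
import torch
import torch.nn as nn

class DownsampledCharEncoder(nn.Module):
    def __init__(self, dim=256, rate=4, depth=6, heads=4):
        super().__init__()
        self.char_embed = nn.Embedding(256, dim)              # byte-level stand-in for codepoint hashing
        self.downsample = nn.Conv1d(dim, dim, kernel_size=rate, stride=rate)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.deep_stack = nn.TransformerEncoder(layer, depth)  # deep stack over the shortened sequence

    def forward(self, byte_ids):                               # byte_ids: (batch, seq_len)
        x = self.char_embed(byte_ids)                          # (batch, seq_len, dim)
        x = self.downsample(x.transpose(1, 2)).transpose(1, 2) # (batch, seq_len // rate, dim)
        return self.deep_stack(x)                              # contextual representations

enc = DownsampledCharEncoder()
ids = torch.tensor([list("no tokenizer needed".encode("utf-8"))])
print(enc(ids).shape)  # (1, 4, 256): 19 bytes downsampled by a factor of 4
```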

2021

Frequency Effects on Syntactic Rule Learning in Transformers
Jason Wei | Dan Garrette | Tal Linzen | Ellie Pavlick
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Pre-trained language models perform well on a variety of linguistic tasks that require symbolic reasoning, raising the question of whether such models implicitly represent abstract symbols and rules. We investigate this question using the case study of BERT’s performance on English subject–verb agreement. Unlike prior work, we train multiple instances of BERT from scratch, allowing us to perform a series of controlled interventions at pre-training time. We show that BERT often generalizes well to subject–verb pairs that never occurred in training, suggesting a degree of rule-governed behavior. We also find, however, that performance is heavily influenced by word frequency, with experiments showing that both the absolute frequency of a verb form and its frequency relative to the alternate inflection are causally implicated in the predictions BERT makes at inference time. Closer analysis of these frequency effects reveals that BERT’s behavior is consistent with a system that correctly applies the SVA rule in general but struggles to overcome strong training priors and to estimate agreement features (singular vs. plural) on infrequent lexical items.

XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation
Sebastian Ruder | Noah Constant | Jan Botha | Aditya Siddhant | Orhan Firat | Jinlan Fu | Pengfei Liu | Junjie Hu | Dan Garrette | Graham Neubig | Melvin Johnson
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Machine learning has brought striking advances in multilingual natural language processing capabilities over the past year. For example, the latest techniques have improved the state-of-the-art performance on the XTREME multilingual benchmark by more than 13 points. While a sizeable gap to human-level performance remains, improvements have been easier to achieve in some tasks than in others. This paper analyzes the current state of cross-lingual transfer learning and summarizes some lessons learned. In order to catalyze meaningful progress, we extend XTREME to XTREME-R, which consists of an improved set of ten natural language understanding tasks, including challenging language-agnostic retrieval tasks, and covers 50 typologically diverse languages. In addition, we provide a massively multilingual diagnostic suite and fine-grained multi-dataset evaluation capabilities through an interactive public leaderboard to gain a better understanding of such models.

2020

Improving Multilingual Models with Language-Clustered Vocabularies
Hyung Won Chung | Dan Garrette | Kiat Chuan Tan | Jason Riesa
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

State-of-the-art multilingual models depend on vocabularies that cover all of the languages the model will expect to see at inference time, but the standard methods for generating those vocabularies are not ideal for massively multilingual applications. In this work, we introduce a novel procedure for multilingual vocabulary generation that combines the separately trained vocabularies of several automatically derived language clusters, thus balancing the trade-off between cross-lingual subword sharing and language-specific vocabularies. Our experiments show improvements across languages on the key multilingual benchmark tasks TyDi QA (+2.9 F1), XNLI (+2.1%), and WikiAnn NER (+2.8 F1), as well as a factor-of-8 reduction in out-of-vocabulary rate, all without increasing the size of the model or data.
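As a rough illustration of the procedure described in the abstract, the following hypothetical Python sketch trains a separate subword vocabulary per language cluster and unions the results. Here `train_subword_vocab` is a toy stand-in for a real subword learner (e.g., BPE or a unigram LM), and the cluster assignments are assumed inputs rather than the paper's automatically derived clusters.

```python
# Hypothetical sketch: per-cluster vocabularies combined into one multilingual vocabulary.
from collections import Counter

def train_subword_vocab(corpus_lines, vocab_size):
    """Toy stand-in for a subword learner: keep the most frequent whitespace tokens."""
    counts = Counter(tok for line in corpus_lines for tok in line.split())
    return {tok for tok, _ in counts.most_common(vocab_size)}

def clustered_vocab(corpora_by_language, clusters, per_cluster_size):
    """corpora_by_language: {lang: [lines]}; clusters: [[lang, ...], ...] (assumed given)."""
    vocab = set()
    for cluster in clusters:
        lines = [line for lang in cluster for line in corpora_by_language[lang]]
        # Each cluster gets its own separately trained vocabulary; the final
        # vocabulary is their union, balancing sharing against language specificity.
        vocab |= train_subword_vocab(lines, per_cluster_size)
    return vocab
```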

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Jonathan H. Clark | Eunsol Choi | Michael Collins | Dan Garrette | Tom Kwiatkowski | Vitaly Nikolaev | Jennimaria Palomaki
Transactions of the Association for Computational Linguistics, Volume 8

Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA—a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology—the set of linguistic features each language expresses—such that we expect models performing well on this set to generalize across a large number of the world’s languages. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, and the data is collected directly in each language without the use of translation.

2019

How Multilingual is Multilingual BERT?
Telmo Pires | Eva Schlinger | Dan Garrette
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper, we show that Multilingual BERT (M-BERT), released by Devlin et al. (2018) as a single language model pre-trained from monolingual corpora in 104 languages, is surprisingly good at zero-shot cross-lingual model transfer, in which task-specific annotations in one language are used to fine-tune the model for evaluation in another language. To understand why, we present a large number of probing experiments, showing that transfer is possible even to languages in different scripts, that transfer works best between typologically similar languages, that monolingual corpora can train models for code-switching, and that the model can find translation pairs. From these results, we can conclude that M-BERT does create multilingual representations, but that these representations exhibit systematic deficiencies affecting certain language pairs.

2018

Part-of-Speech Tagging for Code-Switched, Transliterated Texts without Explicit Language Identification
Kelsey Ball | Dan Garrette
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Code-switching, the use of more than one language within a single utterance, is ubiquitous in much of the world, but remains a challenge for NLP largely due to the lack of representative data for training models. In this paper, we present a novel model architecture that is trained exclusively on monolingual resources, but can be applied to unseen code-switched text at inference time. The model accomplishes this by jointly maintaining separate word representations for each of the possible languages, or scripts in the case of transliteration, allowing each to contribute to inferences without forcing the model to commit to a language. Experiments on Hindi-English part-of-speech tagging demonstrate that our approach outperforms standard models when training on monolingual text without transliteration, and testing on code-switched text with alternate scripts.
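A hypothetical sketch of the key modeling idea, where each token keeps a representation under every candidate language and both contribute to tagging without a hard language decision, is shown below. The embedding tables, encoder, and dimensions are illustrative PyTorch stand-ins and do not reproduce the paper's model or training setup.

```python
# Illustrative sketch: per-language embeddings for every token, combined before tagging.
import torch
import torch.nn as nn

class DualLanguageTagger(nn.Module):
    def __init__(self, vocab_l1, vocab_l2, dim=64, n_tags=17):
        super().__init__()
        self.embed_l1 = nn.Embedding(vocab_l1, dim)   # e.g., Hindi (native or transliterated script)
        self.embed_l2 = nn.Embedding(vocab_l2, dim)   # e.g., English
        self.encoder = nn.LSTM(2 * dim, dim, bidirectional=True, batch_first=True)
        self.classify = nn.Linear(2 * dim, n_tags)

    def forward(self, ids_l1, ids_l2):
        # Each token is represented under both languages; concatenation lets the
        # model weigh the two views instead of committing to a language ID.
        x = torch.cat([self.embed_l1(ids_l1), self.embed_l2(ids_l2)], dim=-1)
        h, _ = self.encoder(x)
        return self.classify(h)                        # per-token POS tag scores

tagger = DualLanguageTagger(vocab_l1=5000, vocab_l2=5000)
l1_ids = torch.randint(0, 5000, (1, 6))   # token ids under the language-1 vocabulary
l2_ids = torch.randint(0, 5000, (1, 6))   # the same tokens under the language-2 vocabulary
print(tagger(l1_ids, l2_ids).shape)       # (1, 6, 17)
```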

2017

Automatic Compositor Attribution in the First Folio of Shakespeare
Maria Ryskina | Hannah Alpert-Abrams | Dan Garrette | Taylor Berg-Kirkpatrick
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Compositor attribution, the clustering of pages in a historical printed document by the individual who set the type, is a bibliographic task that relies on analysis of orthographic variation and inspection of visual details of the printed page. In this paper, we introduce a novel unsupervised model that jointly describes the textual and visual features needed to distinguish compositors. Applied to images of Shakespeare’s First Folio, our model predicts attributions that agree with the manual judgements of bibliographers with an accuracy of 87%, even on text that is the output of OCR.

STREAMLInED Challenges: Aligning Research Interests with Shared Tasks
Gina-Anne Levow | Emily M. Bender | Patrick Littell | Kristen Howell | Shobhana Chelliah | Joshua Crowgey | Dan Garrette | Jeff Good | Sharon Hargus | David Inman | Michael Maxwell | Michael Tjalve | Fei Xia
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

2016

An Unsupervised Model of Orthographic Variation for Historical Document Transcription
Dan Garrette | Hannah Alpert-Abrams
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2015

Unsupervised Code-Switching for Multilingual Historical Document Transcription
Dan Garrette | Hannah Alpert-Abrams | Taylor Berg-Kirkpatrick | Dan Klein
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning
Dan Garrette | Chris Dyer | Jason Baldridge | Noah A. Smith
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

2014

Weakly-Supervised Bayesian Learning of a CCG Supertagger
Dan Garrette | Chris Dyer | Jason Baldridge | Noah A. Smith
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

2013

Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages
Dan Garrette | Jason Mielens | Jason Baldridge
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Learning a Part-of-Speech Tagger from Two Hours of Annotation
Dan Garrette | Jason Baldridge
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Montague Meets Markov: Deep Semantics with Probabilistic Logical Form
Islam Beltagy | Cuong Chau | Gemma Boleda | Dan Garrette | Katrin Erk | Raymond Mooney
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2012

Type-Supervised Hidden Markov Models for Part-of-Speech Tagging with Incomplete Tag Dictionaries
Dan Garrette | Jason Baldridge
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

2011

Integrating Logical Representations with Probabilistic Information using Markov Logic
Dan Garrette | Katrin Erk | Raymond Mooney
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

2009

An Extensible Toolkit for Computational Semantics
Dan Garrette | Ewan Klein
Proceedings of the Eighth International Conference on Computational Semantics