Jonathan H. Clark

Also published as: Jonathan Clark


pdf bib
Proceedings of the Workshop on Multilingual Information Access (MIA)
Akari Asai | Eunsol Choi | Jonathan H. Clark | Junjie Hu | Chia-Hsuan Lee | Jungo Kasai | Shayne Longpre | Ikuya Yamada | Rui Zhang
Proceedings of the Workshop on Multilingual Information Access (MIA)

MIA 2022 Shared Task: Evaluating Cross-lingual Open-Retrieval Question Answering for 16 Diverse Languages
Akari Asai | Shayne Longpre | Jungo Kasai | Chia-Hsuan Lee | Rui Zhang | Junjie Hu | Ikuya Yamada | Jonathan H. Clark | Eunsol Choi
Proceedings of the Workshop on Multilingual Information Access (MIA)

We present the results of the Workshop on Multilingual Information Access (MIA) 2022 Shared Task, evaluating cross-lingual open-retrieval question answering (QA) systems in 16 typologically diverse languages. In this task, we adapted two large-scale cross-lingual open-retrieval QA datasets in 14 typologically diverse languages, and newly annotated open-retrieval QA data in 2 underrepresented languages: Tagalog and Tamil. Four teams submitted their systems. The best constrained system uses entity-aware contextualized representations for document retrieval, thereby achieving an average F1 score of 31.6, which is 4.1 F1 absolute higher than the challenging baseline. The best system obtains particularly significant improvements in Tamil (20.8 F1), whereas most of the other systems yield nearly zero scores. The best unconstrained system achieves 32.2 F1, outperforming our baseline by 4.5 points.

Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Jonathan H. Clark | Dan Garrette | Iulia Turc | John Wieting
Transactions of the Association for Computational Linguistics, Volume 10

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model’s ability to adapt. In this paper, we present Canine, a neural encoder that operates directly on character sequences—without explicit tokenization or vocabulary—and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, Canine combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. Canine outperforms a comparable mBert model by 5.7 F1 on TyDi QA, a challenging multilingual benchmark, despite having fewer model parameters.


XOR QA: Cross-lingual Open-Retrieval Question Answering
Akari Asai | Jungo Kasai | Jonathan Clark | Kenton Lee | Eunsol Choi | Hannaneh Hajishirzi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Multilingual question answering tasks typically assume that answers exist in the same language as the question. Yet in practice, many languages face both information scarcity—where languages have few reference articles—and information asymmetry—where questions reference concepts from other cultures. This work extends open-retrieval question answering to a cross-lingual setting enabling questions from one language to be answered via answer content from another language. We construct a large-scale dataset built on 40K information-seeking questions across 7 diverse non-English languages that TyDi QA could not find same-language answers for. Based on this dataset, we introduce a task framework, called Cross-lingual Open-Retrieval Question Answering (XOR QA), that consists of three new tasks involving cross-lingual document retrieval from multilingual and English resources. We establish baselines with state-of-the-art machine translation systems and cross-lingual pretrained models. Experimental results suggest that XOR QA is a challenging task that will facilitate the development of novel techniques for multilingual question answering. Our data and code are available at

Learning to Recognize Dialect Features
Dorottya Demszky | Devyani Sharma | Jonathan Clark | Vinodkumar Prabhakaran | Jacob Eisenstein
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in “He ∅ running”. In this paper, we introduce the task of dialect feature detection, and present two multitask learning approaches, both based on pretrained transformers. For most dialects, large-scale annotated corpora for these features are unavailable, making it difficult to train recognizers. We train our models on a small number of minimal pairs, building on how linguists typically define dialect features. Evaluation on a test set of 22 dialect features of Indian English demonstrates that these models learn to recognize many features with high accuracy, and that a few minimal pairs can be as effective for training as thousands of labeled examples. We also demonstrate the downstream applicability of dialect feature detection both as a measure of dialect density and as a dialect classifier.


TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages
Jonathan H. Clark | Eunsol Choi | Michael Collins | Dan Garrette | Tom Kwiatkowski | Vitaly Nikolaev | Jennimaria Palomaki
Transactions of the Association for Computational Linguistics, Volume 8

Confidently making progress on multilingual modeling requires challenging, trustworthy evaluations. We present TyDi QA—a question answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. The languages of TyDi QA are diverse with regard to their typology—the set of linguistic features each language expresses—such that we expect models performing well on this set to generalize across a large number of the world’s languages. We present a quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora. To provide a realistic information-seeking task and avoid priming effects, questions are written by people who want to know the answer, but don’t know the answer yet, and the data is collected directly in each language without the use of translation.

CapWAP: Image Captioning with a Purpose
Adam Fisch | Kenton Lee | Ming-Wei Chang | Jonathan Clark | Regina Barzilay
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The traditional image captioning task uses generic reference captions to provide textual information about images. Different user populations, however, will care about different visual aspects of images. In this paper, we propose a new task, Captioning with A Purpose (CapWAP). Our goal is to develop systems that can be tailored to be useful for the information needs of an intended population, rather than merely provide generic information about an image. In this task, we use question-answer (QA) pairs—a natural expression of information need—from users, instead of reference captions, for both training and post-inference evaluation. We show that it is possible to use reinforcement learning to directly optimize for the intended information need, by rewarding outputs that allow a question answering model to provide correct answers to sampled user questions. We convert several visual question answering datasets into CapWAP datasets, and demonstrate that under a variety of scenarios our purposeful captioning system learns to anticipate and fulfill specific information needs better than its generic counterparts, as measured by QA performance on user questions from unseen images, when using the caption alone as context.


Locally Non-Linear Learning for Statistical Machine Translation via Discretization and Structured Regularization
Jonathan H. Clark | Chris Dyer | Alon Lavie
Transactions of the Association for Computational Linguistics, Volume 2

Linear models, which support efficient learning and inference, are the workhorses of statistical machine translation; however, linear decision rules are less attractive from a modeling perspective. In this work, we introduce a technique for learning arbitrary, rule-local, non-linear feature transforms that improve model expressivity, but do not sacrifice the efficient inference and learning associated with linear models. To demonstrate the value of our technique, we discard the customary log transform of lexical probabilities and drop the phrasal translation probability in favor of raw counts. We observe that our algorithm learns a variation of a log transform that leads to better translation quality compared to the explicit log transform. We conclude that non-linear responses play an important role in SMT, an observation that we hope will inform the efforts of feature engineers.


Scalable Modified Kneser-Ney Language Model Estimation
Kenneth Heafield | Ivan Pouzyrevsky | Jonathan H. Clark | Philipp Koehn
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)


One System, Many Domains: Open-Domain Statistical Machine Translation via Feature Augmentation
Jonathan Clark | Alon Lavie | Chris Dyer
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

In this paper, we introduce a simple technique for incorporating domain information into a statistical machine translation system that significantly improves translation quality when test data comes from multiple domains. Our approach augments (conjoins) standard translation model and language model features with domain indicator features and requires only minimal modifications to the optimization and decoding procedures. We evaluate our method on two language pairs with varying numbers of domains, and observe significant improvements of up to 1.0 BLEU.


The CMU-ARK German-English Translation System
Chris Dyer | Kevin Gimpel | Jonathan H. Clark | Noah A. Smith
Proceedings of the Sixth Workshop on Statistical Machine Translation

Unsupervised Word Alignment with Arbitrary Features
Chris Dyer | Jonathan H. Clark | Alon Lavie | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
Jonathan H. Clark | Chris Dyer | Alon Lavie | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies


Improved Features and Grammar Selection for Syntax-Based MT
Greg Hanneman | Jonathan Clark | Alon Lavie
Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR

LoonyBin: Keeping Language Technologists Sane through Automated Management of Experimental (Hyper)Workflows
Jonathan H. Clark | Alon Lavie
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Many contemporary language technology systems are characterized by long pipelines of tools with complex dependencies. Too often, these workflows are implemented by ad hoc scripts; or, worse, tools are run manually, making experiments difficult to reproduce. These practices are difficult to maintain in the face of rapidly evolving workflows while they also fail to expose and record important details about intermediate data. Further complicating these systems are hyperparameters, which often cannot be directly optimized by conventional methods, requiring users to determine which combination of values is best via trial and error. We describe LoonyBin, an open-source tool that addresses these issues by providing: 1) a visual interface for the user to create and modify workflows; 2) a well-defined mechanism for tracking metadata and provenance; 3) a script generator that compiles visual workflows into shell scripts; and 4) a new workflow representation we call a HyperWorkflow, which intuitively and succinctly encodes small experimental variations within a larger workflow.


An Improved Statistical Transfer System for French-English Machine Translation
Greg Hanneman | Vamshi Ambati | Jonathan H. Clark | Alok Parlikar | Alon Lavie
Proceedings of the Fourth Workshop on Statistical Machine Translation


Toward Active Learning in Data Selection: Automatic Discovery of Language Features During Elicitation
Jonathan Clark | Robert Frederking | Lori Levin
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Data Selection has emerged as a common issue in language technologies. We define Data Selection as the choosing of a subset of training data that is most effective for a given task. This paper describes deductive feature detection, one component of a data selection system for machine translation. Feature detection determines whether features such as tense, number, and person are expressed in a language. The database of the The World Atlas of Language Structures provides a gold standard against which to evaluate feature detection. The discovered features can be used as input to a Navigator, which uses active learning to determine which piece of language data is the most important to acquire next.

Inductive Detection of Language Features via Clustering Minimal Pairs: Toward Feature-Rich Grammars in Machine Translation
Jonathan H. Clark | Robert Frederking | Lori Levin
Proceedings of the ACL-08: HLT Second Workshop on Syntax and Structure in Statistical Translation (SSST-2)