Rachel Wicks


The Effects of Language Token Prefixing for Multilingual Machine Translation
Rachel Wicks | Kevin Duh
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Machine translation traditionally refers to translating from a single source language into a single target language. In recent years, the field has moved towards large neural models either translating from or into many languages. The model must be correctly cued to translate into the correct target language. This is typically done by prefixing language tokens onto the source or target sequence. The location and content of the prefix can vary, and many use different approaches without much justification for one approach over another. As guidance for future researchers and a direction for future work, we present a series of experiments that show how the positioning and type of a target language prefix token affects translation performance. We show that source side prefixes improve performance. Further, we find that the best language information to denote via tokens depends on the supported language set.
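
As a rough illustration of the prefixing described in this abstract, the sketch below prepends a target-language token to either the source or the target token sequence. The function name `add_language_prefix`, the `side` parameter, and the `<2xx>` tag format are assumptions made for this example and are not taken from the paper.

```python
def add_language_prefix(source_tokens, target_tokens, target_lang, side="source"):
    """Prepend a target-language token to the source or the target sequence.

    The "<2xx>" tag format is a hypothetical convention chosen for this sketch.
    """
    tag = f"<2{target_lang}>"
    if side == "source":
        # Source-side prefix: the encoder sees which language to produce.
        return [tag] + list(source_tokens), list(target_tokens)
    if side == "target":
        # Target-side prefix: the tag becomes the first token of the output.
        return list(source_tokens), [tag] + list(target_tokens)
    raise ValueError("side must be 'source' or 'target'")


src, tgt = add_language_prefix(["Hello", "world"], ["Hallo", "Welt"], "de")
print(src)  # ['<2de>', 'Hello', 'world']
print(tgt)  # ['Hallo', 'Welt']
```

Switching `side` to `"target"` moves the same tag onto the target sequence instead, which is the positioning choice the experiments compare.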

Does Sentence Segmentation Matter for Machine Translation?
Rachel Wicks | Matt Post
Proceedings of the Seventh Conference on Machine Translation (WMT)

For the most part, NLP applications operate at the sentence level. Since sentences occur most naturally in documents, they must be extracted and segmented via the use of a segmenter, of which there are a handful of options. There has been some work evaluating the performance of segmenters on intrinsic metrics that look at their ability to recover human-segmented sentence boundaries, but there has been no work looking at the effect of segmenters on downstream tasks. We ask the question, “does segmentation matter?” and attempt to answer it on the task of machine translation. We consider two settings: the application of segmenters to a black-box system whose training segmentation is mostly unknown, as well as the variation in performance when segmenters are applied to the training process, too. We find that the choice of segmenter largely does not matter, so long as its behavior is not one of extreme under- or over-segmentation. For such settings, we provide some qualitative analysis examining their harms, and point the way towards document-level processing.


A unified approach to sentence segmentation of punctuated text in many languages
Rachel Wicks | Matt Post
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The sentence is a fundamental unit of text processing. Yet sentences in the wild are commonly encountered not in isolation, but unsegmented within larger paragraphs and documents. Therefore, the first step in many NLP pipelines is sentence segmentation. Despite its importance, this step is the subject of relatively little research. There are no standard test sets or even methods for evaluation, leaving researchers and engineers without a clear footing for evaluating and selecting models for the task. Existing tools have relatively small language coverage, and efforts to extend them to other languages are often ad hoc. We introduce a modern context-based modeling approach that provides a solution to the problem of segmenting punctuated text in many languages, and show how it can be trained on noisily-annotated data. We also establish a new 23-language multilingual evaluation set. Our approach exceeds high baselines set by existing methods on prior English corpora (WSJ and Brown corpora), and also performs well on average on our new evaluation set. We release our tool, ersatz, as open source.


The Johns Hopkins University Bible Corpus: 1600+ Tongues for Typological Exploration
Arya D. McCarthy | Rachel Wicks | Dylan Lewis | Aaron Mueller | Winston Wu | Oliver Adams | Garrett Nicolai | Matt Post | David Yarowsky
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present findings from the creation of a massively parallel corpus in over 1600 languages, the Johns Hopkins University Bible Corpus (JHUBC). The corpus consists of over 4000 unique translations of the Christian Bible and counting. Our data is derived from scraping several online resources and merging them with existing corpora, combining them under a common scheme that is verse-parallel across all translations. We detail our effort to scrape, clean, align, and utilize this ripe multilingual dataset. The corpus captures the great typological variety of the world’s languages. We catalog this by showing highly similar proportions of representation of Ethnologue’s typological features in our corpus. We also give an example application: projecting pronoun features like clusivity across alignments to richly annotate languages which do not mark the distinction.
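
To make the idea of a verse-parallel scheme concrete, the sketch below pairs up two translations by shared verse identifier. The dictionary layout, the `align_verses` helper, and the `GEN.1.1`-style identifiers are assumptions made for this example and do not describe the corpus's actual file format.

```python
def align_verses(translation_a, translation_b):
    """Pair the verses that two translations have in common.

    Each translation is assumed to be a dict mapping a verse ID to its text.
    Verses missing from either translation are skipped.
    """
    shared_ids = sorted(translation_a.keys() & translation_b.keys())
    return [(vid, translation_a[vid], translation_b[vid]) for vid in shared_ids]


# Toy data for illustration only.
english = {
    "GEN.1.1": "In the beginning God created the heaven and the earth.",
    "GEN.1.2": "And the earth was without form, and void.",
}
german = {
    "GEN.1.1": "Am Anfang schuf Gott Himmel und Erde.",
}

for verse_id, en_text, de_text in align_verses(english, german):
    print(verse_id, "|", en_text, "|", de_text)
```

Keying every translation on the same verse identifiers means any pair of languages can be aligned without a separate sentence-alignment step, at the cost of dropping verses that one side lacks.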