David Yarowsky

2025

pdf bib abs
JHU’s Submission to the AmericasNLP 2025 Shared Task on the Creation of Educational Materials for Indigenous Languages
Tom Lupicki | Lavanya Shankar | Kaavya Chaparala | David Yarowsky
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

This paper presents JHU’s submission to the AmericasNLP shared task on the creation of educational materials for Indigenous languages. The task involves transforming a base sentence given one or more tags that correspond to grammatical features, such as negation or tense. The task also spans four languages: Bribri, Maya, Guaraní, and Nahuatl. We experiment with augmenting prompts to large language models with different information, chain of thought prompting, ensembling large language models by majority voting, and training a pointer-generator network. Our System 1, an ensemble of large language models, achieves the best performance on Maya and Guaraní, building upon the previous successes in leveraging large language models for this task and highlighting the effectiveness of ensembling large language models.

2024

pdf bib abs
Evaluating Large Language Models along Dimensions of Language Variation: A Systematik Invesdigatiom uv Cross-lingual Generalization
Niyati Bafna | Kenton Murray | David Yarowsky
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

While large language models exhibit certain cross-lingual generalization capabilities, they suffer from performance degradation (PD) on unseen closely-related languages (CRLs) and dialects relative to their high-resource language neighbour (HRLN). However, we currently lack a fundamental understanding of what kinds of linguistic distances contribute to PD, and to what extent. Furthermore, studies of cross-lingual generalization are confounded by unknown quantities of CRL language traces in the training data, and by the frequent lack of availability of evaluation data in lower-resource related languages and dialects. To address these issues, we model phonological, morphological, and lexical distance as Bayesian noise processes to synthesize artificial languages that are controllably distant from the HRLN. We analyse PD as a function of underlying noise parameters, offering insights on model robustness to isolated and composed linguistic phenomena, and the impact of task and HRL characteristics on PD. We calculate parameter posteriors on real CRL-HRLN pair data and show that they follow computed trends of artificial languages, demonstrating the viability of our noisers. Our framework offers a cheap solution for estimating task performance on an unseen CRL given HRLN performance using its posteriors, as well as for diagnosing observed PD on a CRL in terms of its linguistic distances from its HRLN, and opens doors to principled methods of mitigating performance degradation.

pdf bib abs
Pointer-Generator Networks for Low-Resource Machine Translation: Don’t Copy That!
Niyati Bafna | Philipp Koehn | David Yarowsky
Proceedings of the Fifth Workshop on Insights from Negative Results in NLP

While Transformer-based neural machine translation (NMT) is very effective in high-resource settings, many languages lack the necessary large parallel corpora to benefit from it. In the context of low-resource (LR) MT between two closely-related languages, a natural intuition is to seek benefits from structural “shortcuts”, such as copying subwords from the source to the target, given that such language pairs often share a considerable number of identical words, cognates, and borrowings. We test Pointer-Generator Networks for this purpose for six language pairs over a variety of resource ranges, and find weak improvements for most settings. However, analysis shows that the model does not show greater improvements for closely-related vs. more distant language pairs, or for lower resource ranges, and that the models do not exhibit the expected usage of the mechanism for shared subwords. Our discussion of the reasons for this behaviour highlights several general challenges for LR NMT, such as modern tokenization strategies, noisy real-world conditions, and linguistic complexities. We call for better scrutiny of linguistically motivated improvements to NMT given the blackbox nature of Transformer models, as well as for a focus on the above problems in the field.

2022

pdf bib abs
Deciphering and Characterizing Out-of-Vocabulary Words for Morphologically Rich Languages
Georgie Botev | Arya D. McCarthy | Winston Wu | David Yarowsky
Proceedings of the 29th International Conference on Computational Linguistics

This paper presents a detailed foundational empirical case study of the nature of out-of-vocabulary words encountered in modern text in a moderate-resource language such as Bulgarian, and a multi-faceted distributional analysis of the underlying word-formation processes that can aid in their compositional translation, tagging, parsing, language modeling, and other NLP tasks. Given that out-of-vocabulary (OOV) words generally present a key open challenge to NLP and machine translation systems, especially toward the lower limit of resource availability, there are useful practical insights, as well as corpus-linguistic insights, from both a detailed manual and automatic taxonomic analysis of the types, multidimensional properties, and processing potential for multiple representative OOV data samples.

pdf bib abs
Known Words Will Do: Unknown Concept Translation via Lexical Relations
Winston Wu | David Yarowsky
Proceedings of the Fifth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2022)

Translating into low-resource languages is challenging due to the scarcity of training data. In this paper, we propose a probabilistic lexical translation method that bridges through lexical relations including synonyms, hypernyms, hyponyms, and co-hyponyms. This method, which only requires a dictionary like Wiktionary and a lexical database like WordNet, enables the translation of unknown vocabulary into low-resource languages for which we may only know the translation of a related concept. Experiments on translating a core vocabulary set into 472 languages, most of them low-resource, show the effectiveness of our approach.

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

pdf bib abs
On the Robustness of Cognate Generation Models
Winston Wu | David Yarowsky
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We evaluate two popular neural cognate generation models’ robustness to several types of human-plausible noise (deletion, duplication, swapping, and keyboard errors, as well as a new type of error, phonological errors). We find that duplication and phonological substitution is least harmful, while the other types of errors are harmful. We present an in-depth analysis of the models’ results with respect to each error type to explain how and why these models perform as they do.

2021

pdf bib abs
On Pronunciations in Wiktionary: Extraction and Experiments on Multilingual Syllabification and Stress Prediction
Winston Wu | David Yarowsky
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

We constructed parsers for five non-English editions of Wiktionary, which combined with pronunciations from the English edition, comprises over 5.3 million IPA pronunciations, the largest pronunciation lexicon of its kind. This dataset is a unique comparable corpus of IPA pronunciations annotated from multiple sources. We analyze the dataset, noting the presence of machine-generated pronunciations. We develop a novel visualization method to quantify syllabification. We experiment on the new combined task of multilingual IPA syllabification and stress prediction, finding that training a massively multilingual neural sequence-to-sequence model with copy attention can improve performance on both high- and low-resource languages, and multi-task training on stress prediction helps with syllabification.

pdf bib
Sequence Models for Computational Etymology of Borrowings
Winston Wu | Kevin Duh | David Yarowsky
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.

2020

pdf bib abs
Neural Transduction for Multilingual Lexical Translation
Dylan Lewis | Winston Wu | Arya D. McCarthy | David Yarowsky
Proceedings of the 28th International Conference on Computational Linguistics

We present a method for completing multilingual translation dictionaries. Our probabilistic approach can synthesize new word forms, allowing it to operate in settings where correct translations have not been observed in text (cf. cross-lingual embeddings). In addition, we propose an approximate Maximum Mutual Information (MMI) decoding objective to further improve performance in both many-to-one and one-to-one word level translation tasks where we use either multiple input languages for a single target language or more typical single language pair translation. The model is trained in a many-to-many setting, where it can leverage information from related languages to predict words in each of its many target languages. We focus on 6 languages: French, Spanish, Italian, Portuguese, Romanian, and Turkish. When indirect multilingual information is available, ensembling with mixture-of-experts as well as incorporating related languages leads to a 27% relative improvement in whole-word accuracy of predictions over a single-source baseline. To seed the completion when multilingual data is unavailable, it is better to decode with an MMI objective.

pdf bib abs
Wiktionary Normalization of Translations and Morphological Information
Winston Wu | David Yarowsky
Proceedings of the 28th International Conference on Computational Linguistics

We extend the Yawipa Wiktionary Parser (Wu and Yarowsky, 2020) to extract and normalize translations from etymology glosses, and morphological form-of relations, resulting in 300K unique translations and over 4 million instances of 168 annotated morphological relations. We propose a method to identify typos in translation annotations. Using the extracted morphological data, we develop multilingual neural models for predicting three types of word formation—clipping, contraction, and eye dialect—and improve upon a standard attention baseline by using copy attention.

pdf bib abs
Measuring the Similarity of Grammatical Gender Systems by Comparing Partitions
Arya D. McCarthy | Adina Williams | Shijia Liu | David Yarowsky | Ryan Cotterell
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

A grammatical gender system divides a lexicon into a small number of relatively fixed grammatical categories. How similar are these gender systems across languages? To quantify the similarity, we define gender systems extensionally, thereby reducing the problem of comparisons between languages’ gender systems to cluster evaluation. We borrow a rich inventory of statistical tools for cluster evaluation from the field of community detection (Driver and Kroeber, 1932; Cattell, 1945), that enable us to craft novel information theoretic metrics for measuring similarity between gender systems. We first validate our metrics, then use them to measure gender system similarity in 20 languages. We then ask whether our gender system similarities alone are sufficient to reconstruct historical relationships between languages. Towards this end, we make phylogenetic predictions on the popular, but thorny, problem from historical linguistics of inducing a phylogenetic tree over extant Indo-European languages. Of particular interest, languages on the same branch of our phylogenetic tree are notably similar, whereas languages from separate branches are no more similar than chance.

We present findings from the creation of a massively parallel corpus in over 1600 languages, the Johns Hopkins University Bible Corpus (JHUBC). The corpus consists of over 4000 unique translations of the Christian Bible and counting. Our data is derived from scraping several online resources and merging them with existing corpora, combining them under a common scheme that is verse-parallel across all translations. We detail our effort to scrape, clean, align, and utilize this ripe multilingual dataset. The corpus captures the great typological variety of the world’s languages. We catalog this by showing highly similar proportions of representation of Ethnologue’s typological features in our corpus. We also give an example application: projecting pronoun features like clusivity across alignments to richly annotate languages which do not mark the distinction.

pdf bib abs
Computational Etymology and Word Emergence
Winston Wu | David Yarowsky
Proceedings of the Twelfth Language Resources and Evaluation Conference

We developed an extensible, comprehensive Wiktionary parser that improves over several existing parsers. We predict the etymology of a word across the full range of etymology types and languages in Wiktionary, showing improvements over a strong baseline. We also model word emergence and show the application of etymology in modeling this phenomenon. We release our parser to further research in this understudied field.

pdf bib abs
An Analysis of Massively Multilingual Neural Machine Translation for Low-Resource Languages
Aaron Mueller | Garrett Nicolai | Arya D. McCarthy | Dylan Lewis | Winston Wu | David Yarowsky
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this work, we explore massively multilingual low-resource neural machine translation. Using translations of the Bible (which have parallel structure across languages), we train models with up to 1,107 source languages. We create various multilingual corpora, varying the number and relatedness of source languages. Using these, we investigate the best ways to use this many-way aligned resource for multilingual machine translation. Our experiments employ a grammatically and phylogenetically diverse set of source languages during testing for more representative evaluations. We find that best practices in this domain are highly language-specific: adding more languages to a training set is often better, but too many harms performance—the best number depends on the source language. Furthermore, training on related languages can improve or degrade performance, depending on the language. As there is no one-size-fits-most answer, we find that it is critical to tailor one’s approach to the source language and its typology.

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological paradigms for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. We have implemented several improvements to the extraction pipeline which creates most of our data, so that it is both more complete and more correct. We have added 66 new languages, as well as new parts of speech for 12 languages. We have also amended the schema in several ways. Finally, we present three new community tools: two to validate data for resource creators, and one to make morphological data available from the command line. UniMorph is based at the Center for Language and Speech Processing (CLSP) at Johns Hopkins University in Baltimore, Maryland. This paper details advances made to the schema, tooling, and dissemination of project resources since the UniMorph 2.0 release described at LREC 2018.

pdf bib abs
Fine-grained Morphosyntactic Analysis and Generation Tools for More Than One Thousand Languages
Garrett Nicolai | Dylan Lewis | Arya D. McCarthy | Aaron Mueller | Winston Wu | David Yarowsky
Proceedings of the Twelfth Language Resources and Evaluation Conference

Exploiting the broad translation of the Bible into the world’s languages, we train and distribute morphosyntactic tools for approximately one thousand languages, vastly outstripping previous distributions of tools devoted to the processing of inflectional morphology. Evaluation of the tools on a subset of available inflectional dictionaries demonstrates strong initial models, supplemented and improved through ensembling and dictionary-based reranking. Likewise, a novel type-to-token based evaluation metric allows us to confirm that models generalize well across rare and common forms alike

pdf bib abs
Multilingual Dictionary Based Construction of Core Vocabulary
Winston Wu | Garrett Nicolai | David Yarowsky
Proceedings of the Twelfth Language Resources and Evaluation Conference

We propose a new functional definition and construction method for core vocabulary sets for multiple applications based on the relative coverage of a target concept in thousands of bilingual dictionaries. Our newly developed core concept vocabulary list derived from these dictionary consensus methods achieves high overlap with existing widely utilized core vocabulary lists targeted at applications such as first and second language learning or field linguistics. Our in-depth analysis illustrates multiple desirable properties of our newly proposed core vocabulary set, including their non-compositionality. We employ a cognate prediction method to recover missing coverage of this core vocabulary in massively multilingual dictionary construction, and we argue that this core vocabulary should be prioritized for elicitation when creating new dictionaries for low-resource languages for multiple downstream tasks including machine translation and language learning.

pdf bib abs
Induced Inflection-Set Keyword Search in Speech
Oliver Adams | Matthew Wiesner | Jan Trmal | Garrett Nicolai | David Yarowsky
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

We investigate the problem of searching for a lexeme-set in speech by searching for its inflectional variants. Experimental results indicate how lexeme-set search performance changes with the number of hypothesized inflections, while ablation experiments highlight the relative importance of different components in the lexeme-set search pipeline and the value of using curated inflectional paradigms. We provide a recipe and evaluation set for the community to use as an extrinsic measure of the performance of inflection generation approaches.

2019

pdf bib abs
Modeling Color Terminology Across Thousands of Languages
Arya D. McCarthy | Winston Wu | Aaron Mueller | William Watson | David Yarowsky
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

There is an extensive history of scholarship into what constitutes a “basic” color term, as well as a broadly attested acquisition sequence of basic color terms across many languages, as articulated in the seminal work of Berlin and Kay (1969). This paper employs a set of diverse measures on massively cross-linguistic data to operationalize and critique the Berlin and Kay color term hypotheses. Collectively, the 14 empirically-grounded computational linguistic metrics we design—as well as their aggregation—correlate strongly with both the Berlin and Kay basic/secondary color term partition (γ = 0.96) and their hypothesized universal acquisition sequence. The measures and result provide further empirical evidence from computational linguistics in support of their claims, as well as additional nuance: they suggest treating the partition as a spectrum instead of a dichotomy.

pdf bib abs
Massively Multilingual Adversarial Speech Recognition
Oliver Adams | Matthew Wiesner | Shinji Watanabe | David Yarowsky
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We report on adaptation of multilingual end-to-end speech recognition models trained on as many as 100 languages. Our findings shed light on the relative importance of similarity between the target and pretraining languages along the dimensions of phonetics, phonology, language family, geographical location, and orthography. In this context, experiments demonstrate the effectiveness of two additional pretraining objectives in encouraging language-independent encoder representations: a context-independent phoneme objective paired with a language-adversarial classification objective.

pdf bib abs
Learning Morphosyntactic Analyzers from the Bible via Iterative Annotation Projection across 26 Languages
Garrett Nicolai | David Yarowsky
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

A large percentage of computational tools are concentrated in a very small subset of the planet’s languages. Compounding the issue, many languages lack the high-quality linguistic annotation necessary for the construction of such tools with current machine learning methods. In this paper, we address both issues simultaneously: leveraging the high accuracy of English taggers and parsers, we project morphological information onto translations of the Bible in 26 varied test languages. Using an iterative discovery, constraint, and training process, we build inflectional lexica in the target languages. Through a combination of iteration, ensembling, and reranking, we see double-digit relative error reductions in lemmatization and morphological analysis over a strong initial system.

2018

pdf bib
A Comparative Study of Extremely Low-Resource Transliteration of the World’s Languages
Winston Wu | David Yarowsky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Creating a Translation Matrix of the Bible’s Names Across 591 Languages
Winston Wu | Nidhi Vyas | David Yarowsky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Creating Large-Scale Multilingual Cognate Tables
Winston Wu | David Yarowsky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Massively Translingual Compound Analysis and Translation Discovery
Winston Wu | David Yarowsky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Improving Low Resource Machine Translation using Morphological Glosses (Non-archival Extended Abstract)
Steven Shearing | Christo Kirov | Huda Khayrallah | David Yarowsky
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib abs
Marrying Universal Dependencies and Universal Morphology
Arya D. McCarthy | Miikka Silfverberg | Ryan Cotterell | Mans Hulden | David Yarowsky
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languages—UD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatibility of tags, each project’s annotations could be used to validate the other’s. Additionally, the availability of both type- and token-level resources would be a boon to tasks such as parsing and homograph disambiguation. To ease this interoperability, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema. We validate our approach by lookup in the UniMorph corpora and find a macro-average of 64.13% recall. We also note incompatibilities due to paucity of data on either side. Finally, we present a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects.

2017

pdf bib abs
Paradigm Completion for Derivational Morphology
Ryan Cotterell | Ekaterina Vylomova | Huda Khayrallah | Christo Kirov | David Yarowsky
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The generation of complex derived word forms has been an overlooked problem in NLP; we fill this gap by applying neural sequence-to-sequence models to the task. We overview the theoretical motivation for a paradigmatic treatment of derivational morphology, and introduce the task of derivational paradigm completion as a parallel to inflectional paradigm completion. State-of-the-art neural models adapted from the inflection task are able to learn the range of derivation patterns, and outperform a non-neural baseline by 16.4%. However, due to semantic, historical, and lexical considerations involved in derivational morphology, future work will be needed to achieve performance parity with inflection-generating systems.

pdf bib abs
Deriving Consensus for Multi-Parallel Corpora: an English Bible Study
Patrick Xia | David Yarowsky
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

What can you do with multiple noisy versions of the same text? We present a method which generates a single consensus between multi-parallel corpora. By maximizing a function of linguistic features between word pairs, we jointly learn a single corpus-wide multiway alignment: a consensus between 27 versions of the English Bible. We additionally produce English paraphrases, word-level distributions of tags, and consensus dependency parses. Our method is language independent and applicable to any multi-parallel corpora. Given the Bible’s unique role as alignable bitext for over 800 of the world’s languages, this consensus alignment and resulting resources offer value for multilingual annotation projection, and also shed potential insights into the Bible itself.

2016

pdf bib abs
Automatic Construction of Morphologically Motivated Translation Models for Highly Inflected, Low-Resource Languages
John Hewitt | Matt Post | David Yarowsky
Conferences of the Association for Machine Translation in the Americas: MT Researchers' Track

Statistical Machine Translation (SMT) of highly inflected, low-resource languages suffers from the problem of low bitext availability, which is exacerbated by large inflectional paradigms. When translating into English, rich source inflections have a high chance of being poorly estimated or out-of-vocabulary (OOV). We present a source language-agnostic system for automatically constructing phrase pairs from foreign-language inflections and their morphological analyses using manually constructed datasets, including Wiktionary. We then demonstrate the utility of these phrase tables in improving translation into English from Finnish, Czech, and Turkish in simulated low-resource settings, finding substantial gains in translation quality. We report up to +2.58 BLEU in a simulated low-resource setting and +1.65 BLEU in a moderateresource setting. We release our morphologically-motivated translation models, with tens of thousands of inflections in each of 8 languages.

pdf bib abs
Remote Elicitation of Inflectional Paradigms to Seed Morphological Analysis in Low-Resource Languages
John Sylak-Glassman | Christo Kirov | David Yarowsky
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Structured, complete inflectional paradigm data exists for very few of the world’s languages, but is crucial to training morphological analysis tools. We present methods inspired by linguistic fieldwork for gathering inflectional paradigm data in a machine-readable, interoperable format from remotely-located speakers of any language. Informants are tasked with completing language-specific paradigm elicitation templates. Templates are constructed by linguists using grammatical reference materials to ensure completeness. Each cell in a template is associated with contextual prompts designed to help informants with varying levels of linguistic expertise (from professional translators to untrained native speakers) provide the desired inflected form. To facilitate downstream use in interoperable NLP/HLT applications, each cell is also associated with a language-independent machine-readable set of morphological tags from the UniMorph Schema. This data is useful for seeding morphological analysis and generation software, particularly when the data is representative of the range of surface morphological variation in the language. At present, we have obtained 792 lemmas and 25,056 inflected forms from 15 languages.

pdf bib abs
Very-large Scale Parsing and Normalization of Wiktionary Morphological Paradigms
Christo Kirov | John Sylak-Glassman | Roger Que | David Yarowsky
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Wiktionary is a large-scale resource for cross-lingual lexical information with great potential utility for machine translation (MT) and many other NLP tasks, especially automatic morphological analysis and generation. However, it is designed primarily for human viewing rather than machine readability, and presents numerous challenges for generalized parsing and extraction due to a lack of standardized formatting and grammatical descriptor definitions. This paper describes a large-scale effort to automatically extract and standardize the data in Wiktionary and make it available for use by the NLP research community. The methodological innovations include a multidimensional table parsing algorithm, a cross-lexeme, token-frequency-based method of separating inflectional form data from grammatical descriptors, the normalization of grammatical descriptors to a unified annotation scheme that accounts for cross-linguistic diversity, and a verification and correction process that exploits within-language, cross-lexeme table format consistency to minimize human effort. The effort described here resulted in the extraction of a uniquely large normalized resource of nearly 1,000,000 inflectional paradigms across 350 languages. Evaluation shows that even though the data is extracted using a language-independent approach, it is comparable in quantity and quality to data extracted using hand-tuned, language-specific approaches.

pdf bib
The SIGMORPHON 2016 Shared Task—Morphological Reinflection
Ryan Cotterell | Christo Kirov | John Sylak-Glassman | David Yarowsky | Jason Eisner | Mans Hulden
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

2010

While the web provides a fantastic linguistic resource, collecting and processing data at web-scale is beyond the reach of most academic laboratories. Previous research has relied on search engines to collect online information, but this is hopelessly inefficient for building large-scale linguistic resources, such as lists of named-entity types or clusters of distributionally similar words. An alternative to processing web-scale text directly is to use the information provided in an N-gram corpus. An N-gram corpus is an efficient compression of large amounts of text. An N-gram corpus states how often each sequence of words (up to length N) occurs. We propose tools for working with enhanced web-scale N-gram corpora that include richer levels of source annotation, such as part-of-speech tags. We describe a new set of search tools that make use of these tags, and collectively lower the barrier for lexical learning and ambiguity resolution at web-scale. They will allow novel sources of information to be applied to long-standing natural language challenges.

2009

pdf bib
Structural, Transitive and Latent Models for Biographic Fact Extraction
Nikesh Garera | David Yarowsky
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf bib
Modeling Latent Biographic Attributes in Conversational Genres
Nikesh Garera | David Yarowsky
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Improving Translation Lexicon Induction from Monolingual Corpora via Dependency Contexts and Part-of-Speech Equivalences
Nikesh Garera | Chris Callison-Burch | David Yarowsky
Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009)

pdf bib
Ranking and Semi-supervised Classification on Large Scale Graphs Using Map-Reduce
Delip Rao | David Yarowsky
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4)

2008

pdf bib
Mining and Modeling Relations between Formal and Informal Chinese Phrases from Web Corpora
Zhifei Li | David Yarowsky
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf bib
Translating Compounds by Learning Component Gloss Translation Models via Multiple Languages
Nikesh Garera | David Yarowsky
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
Minimally Supervised Multilingual Taxonomy and Translation Lexicon Induction
Nikesh Garera | David Yarowsky
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I

pdf bib
Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora
Zhifei Li | David Yarowsky
Proceedings of ACL-08: HLT

pdf bib
Affinity Measures Based on the Graph Laplacian
Delip Rao | David Yarowsky | Chris Callison-Burch
Coling 2008: Proceedings of the 3rd Textgraphs workshop on Graph-based Algorithms for Natural Language Processing

2007

pdf bib
JHU1 : An Unsupervised Approach to Person Name Disambiguation using Web Snippets
Delip Rao | Nikesh Garera | David Yarowsky
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

2006

pdf bib abs
Machine Translation for Languages Lacking Bitext via Multilingual Gloss Transduction
Brock Pytlik | David Yarowsky
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

We propose and evaluate a new paradigm for machine translation of low resource languages via the learned surface transduction and paraphrase of multilingual glosses.

pdf bib abs
Minimally Supervised Morphological Segmentation with Applications to Machine Translation
Jason Riesa | David Yarowsky
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

Inflected languages in a low-resource setting present a data sparsity problem for statistical machine translation. In this paper, we present a minimally supervised algorithm for morpheme segmentation on Arabic dialects which reduces unknown words at translation time by over 50%, total vocabulary size by over 40%, and yields a significant increase in BLEU score over a previous state-of-the-art phrase-based statistical MT system.

pdf bib
Resolving and Generating Definite Anaphora by Modeling Hypernymy using Unlabeled Corpora
Nikesh Garera | David Yarowsky
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

2005

pdf bib
Multi-Field Information Extraction and Cross-Document Fusion
Gideon Mann | David Yarowsky
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

pdf bib
Induction of Fine-Grained Part-of-Speech Taggers via Classifier Combination and Crosslingual Projection
Elliott Drábek | David Yarowsky
Proceedings of the ACL Workshop on Building and Using Parallel Texts

2004

pdf bib
Exploiting Aggregate Properties of Bilingual Dictionaries For Distinguishing Senses of English Words and Inducing English Sense Clusters
Charles Schafer | David Yarowsky
Proceedings of the ACL Interactive Poster and Demonstration Sessions

pdf bib
Improving Bitext Word Alignments via Syntax-based Reordering of English
Elliott Franco Drabek | David Yarowsky
Proceedings of the ACL Interactive Poster and Demonstration Sessions

2003

pdf bib abs
A two-level syntax-based approach to Arabic-English statistical machine translation
Charles Schafer | David Yarowsky
Workshop on Machine Translation for Semitic languages: issues and approaches

We formulate an original model for statistical machine translation (SMT) inspired by characteristics of the Arabic-English translation task. Our approach incorporates part-of-speech tags and linguistically motivated phrase chunks in a 2-level shallow syntactic model of reordering. We implement and evaluate this model, showing it to have advantageous properties and to be competitive with an existing SMT baseline. We also describe cross-categorial lexical translation coercion, an interesting component and side-effect of our approach. Finally, we discuss the novel implementation of decoding for this model which saves much development work by constructing finite-state machine (FSM) representations of translation probability distributions and using generic FSM operations for search. Algorithmic details, examples and results focus on Arabic, and the paper includes discussion on the issues and challenges of Arabic statistical machine translation.

pdf bib
Minimally Supervised Induction of Grammatical Gender
Silviu Cucerzan | David Yarowsky
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Unsupervised Personal Name Disambiguation
Gideon Mann | David Yarowsky
Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003

pdf bib
Statistical Machine Translation Using Coercive Two-Level Syntactic Transduction
Charles Schafer | David Yarowsky
Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing

2002

pdf bib
Inducing Information Extraction Systems for New Languages via Cross-language Projection
Ellen Riloff | Charles Schafer | David Yarowsky
COLING 2002: The 19th International Conference on Computational Linguistics

pdf bib
Modeling Consensus: Classifier Combination for Word Sense Disambiguation
Radu Florian | David Yarowsky
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

pdf bib
Augmented Mixture Models for Lexical Disambiguation
Silviu Cucerzan | David Yarowsky
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

pdf bib
Bootstrapping a Multilingual Part-of-speech Tagger in One Person-day
Silviu Cucerzan | David Yarowsky
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

pdf bib
Language Independent NER using a Unified Model of Internal and Contextual Evidence
Silviu Cucerzan | David Yarowsky
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

pdf bib
Inducing Translation Lexicons via Diverse Similarity Measures and Bridge Languages
Charles Schafer | David Yarowsky
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)

2001

pdf bib
Inducing Multilingual Text Analysis Tools via Robust Projection across Aligned Corpora
David Yarowsky | Grace Ngai | Richard Wicentowski
Proceedings of the First International Conference on Human Language Technology Research

pdf bib
Multipath Translation Lexicon Induction via Bridge Languages
Gideon S. Mann | David Yarowsky
Second Meeting of the North American Chapter of the Association for Computational Linguistics

pdf bib
Inducing Multilingual POS Taggers and NP Bracketers via Robust Projection Across Aligned Corpora
David Yarowsky | Grace Ngai
Second Meeting of the North American Chapter of the Association for Computational Linguistics

pdf bib
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems
Judita Preiss | David Yarowsky
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

pdf bib
The John Hopkins SENSEVAL-2 System Descriptions
David Yarowsky | Silviu Cucerzan | Radu Florian | Charles Schafer | Richard Wicentowski
Proceedings of SENSEVAL-2 Second International Workshop on Evaluating Word Sense Disambiguation Systems

2000

pdf bib
Rule Writing or Annotation: Cost-efficient Resource Usage for Base Noun Phrase Chunking
Grace Ngai | David Yarowsky
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

pdf bib
Minimally Supervised Morphological Analysis by Multimodal Alignment
David Yarowsky | Richard Wicentowski
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

pdf bib
Language Independent, Minimally Supervised Induction of Lexical Probabilities
Silviu Cucerzan | David Yarowsky
Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics

1999

pdf bib
Dynamic Nonlocal Language Modeling via Hierarchical Topic-Based Adaptation
Radu Florian | David Yarowsky
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics

pdf bib
Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence
Silviu Cucerzan | David Yarowsky
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

pdf bib
Taking the load off the conference chairs-towards a digital paper-routing assistant
David Yarowsky | Radu Florian
1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora

1997

pdf bib
A Perspective on Word Sense Disambiguation Methods and Their Evaluation
Philip Resnik | David Yarowsky
Tagging Text with Lexical Semantics: Why, What, and How?

1995

pdf bib
Unsupervised Word Sense Disambiguation Rivaling Supervised Methods
David Yarowsky
33rd Annual Meeting of the Association for Computational Linguistics

1994

pdf bib abs
A Comparison of Corpus-based Techniques for Restoring Accents in Spanish and French Text
David Yarowsky
Second Workshop on Very Large Corpora

This paper will explore and compare three corpus-based techniques for lexical ambiguity resolution, focusing on the problem of restoring missing accents to Spanish and French text. Many of the ambiguities created by missing accents are differences in part of speech: hence one of the methods considered is an N-gram tagger using Viterbi decoding, such as is found in stochastic part-of-speech taggers. A second technique, Bayesian classification, has been successfully applied to word-sense disambiguation and is well suited for some of the semantic ambiguities which arise from missing accents. The third approach, based on decision lists, combines the strengths of the two other methods, incorporating both local syntactic patterns and more distant collocational evidence, and outperforms them both. The problem of accent restoration is particularly well suited for demonstrating and testing the capabilities of the given algorithms because it requires the resolution of both semantic and syntactic ambiguity, and offers an objective ground truth for automatic evaluation. It is also a practical problem with immediate application.

pdf bib
DECISION LISTS FOR LEXICAL AMBIGUITY RESOLUTION: Application to Accent Restoration in Spanish and French
David Yarowsky
32nd Annual Meeting of the Association for Computational Linguistics