Joshua K. Hartshorne
2026
FormosanMT: A Multilingual Parallel Corpus of the Formosan Language Family
Hunter Scheppat | Joshua K. Hartshorne | Sema Koc | Éric Le Ferrand | Emily Prud'hommeaux
Proceedings of the Fifteenth Language Resources and Evaluation Conference
While the quality of machine translation (MT) between widely spoken languages has improved dramatically in recent years, training robust MT systems for languages with fewer resources remains a challenge. Endangered languages, which often lack the speaker population and written tradition needed to create text resources, are at a particular disadvantage. Developing robust MT architectures for very low-resource settings is hampered by the lack of suitable parallel corpora. To address this challenge, we introduce FormosanMT, a set of MT-ready parallel corpora for the Formosan family of endangered languages indigenous to Taiwan. Together, the corpora total nearly 500,000 Formosan-Mandarin and Formosan-English sentence pairs. We share scripts for extracting these corpora from public sources, along with customizable tools for filtering, normalizing, and partitioning the data. In addition, we provide a new tokenizer for Traditional Chinese writing compatible with the popular No Language Left Behind (NLLB) MT architecture, along with updated and improved code for fine-tuning NLLB for any low-resource language pair. Finally, we distribute our fully trained NLLB and OpenNMT models for the Formosan languages to and from both Mandarin and English. In addition to serving as a valuable resource for the Formosan language speaker communities, our data, code, and models will be available to NLP researchers working on endangered and low-resource language MT.
2017
Evaluating Hierarchies of Verb Argument Structure with Hierarchical Clustering
Jesse Mu | Joshua K. Hartshorne | Timothy O’Donnell
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
Verbs can only be used with a few specific arrangements of their arguments (syntactic frames). Most theorists note that verbs can be organized into a hierarchy of verb classes based on the frames they admit. Here we show that such a hierarchy is objectively well-supported by the patterns of verbs and frames in English, since a systematic hierarchical clustering algorithm converges on the same structure as the handcrafted taxonomy of VerbNet, a broad-coverage verb lexicon. We also show that the hierarchies capture meaningful psychological dimensions of generalization by predicting novel verb coercions by human participants. We discuss limitations of a simple hierarchical representation and suggest similar approaches for identifying the representations underpinning verb argument structure.
2014
The VerbCorner Project: Findings from Phase 1 of crowd-sourcing a semantic decomposition of verbs
Joshua K. Hartshorne | Claire Bonial | Martha Palmer
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)