Jennifer Tracey


2020

Basic Language Resources for 31 Languages (Plus English): The LORELEI Representative and Incident Language Packs
Jennifer Tracey | Stephanie Strassel
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

This paper documents and describes the thirty-one basic language resource packs created for the DARPA LORELEI program for use in development and testing of systems capable of providing language-independent situational awareness in emerging scenarios in a low resource language context. Twenty-four Representative Language Packs cover a broad range of language families and typologies, providing large volumes of monolingual and parallel text, smaller volumes of entity and semantic annotations, and a variety of grammatical resources and tools designed to support research into language universals and cross-language transfer. Seven Incident Language Packs provide test data to evaluate system capabilities on a previously unseen low resource language. We discuss the makeup of Representative and Incident Language Packs, the methods used to produce them, and the evolution of their design and implementation over the course of the multi-year LORELEI program. We conclude with a summary of the final language packs including their low-cost publication in the LDC catalog.

2019

Corpus Building for Low Resource Languages in the DARPA LORELEI Program
Jennifer Tracey | Stephanie Strassel | Ann Bies | Zhiyi Song | Michael Arrigo | Kira Griffitt | Dana Delgado | Dave Graff | Seth Kulick | Justin Mott | Neil Kuster
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages

2018

Laying the Groundwork for Knowledge Base Population: Nine Years of Linguistic Resources for TAC KBP
Jeremy Getman | Joe Ellis | Stephanie Strassel | Zhiyi Song | Jennifer Tracey
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Simple Semantic Annotation and Situation Frames: Two Approaches to Basic Text Understanding in LORELEI
Kira Griffitt | Jennifer Tracey | Ann Bies | Stephanie Strassel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

VAST: A Corpus of Video Annotation for Speech Technologies
Jennifer Tracey | Stephanie Strassel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Uzbek-English and Turkish-English Morpheme Alignment Corpora
Xuansong Li | Jennifer Tracey | Stephen Grimes | Stephanie Strassel
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Morphologically rich languages pose problems for machine translation (MT) systems, including word-alignment errors, data sparsity, and multiple affixes. Current word-level alignment models do not distinguish words from morphemes, yielding low-quality alignments that ultimately degrade translation quality. Morpheme-level alignment models can reduce the vocabulary size of morphologically rich languages and mitigate data sparsity, and alignment data based on the smallest meaningful units reveals subtle language features and enhances translation quality. Recent research has shown such morpheme-level alignment (MA) data to be a valuable linguistic resource for SMT, particularly for languages with rich morphology. In support of this research trend, the Linguistic Data Consortium (LDC) created Uzbek-English and Turkish-English alignment data that are manually aligned at the morpheme level. This paper describes the creation of the MA corpora, including the alignment and tagging processes and approaches, and highlights annotation challenges and specific features of morphologically rich languages. The light tagging annotation on the alignment layer adds further value to the MA data, allowing users to flexibly tailor the data for training various MT models.
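
Purely as an illustration of what morpheme-level alignment with light tags on the alignment layer involves, the minimal Python sketch below builds a toy Turkish-English record; the class names, tag labels ("SEM"/"GRA"), and in-memory representation are assumptions made for this example and do not reproduce the actual LDC corpus format.

# Hypothetical sketch only: class names, tag labels, and the toy example are
# illustrative assumptions, not the LDC corpus format.
from dataclasses import dataclass, field

@dataclass
class AlignmentLink:
    src: list        # indices into the source-side morpheme list
    tgt: list        # indices into the target-side token list
    tag: str         # hypothetical "light tag" attached to the alignment link

@dataclass
class AlignedSentencePair:
    src_morphemes: list   # source word forms segmented into morphemes
    tgt_tokens: list      # English tokens
    links: list = field(default_factory=list)

# Toy Turkish-English pair: "evlerde" = ev (house) + ler (plural) + de (locative)
pair = AlignedSentencePair(
    src_morphemes=["ev", "ler", "de"],
    tgt_tokens=["in", "the", "houses"],
    links=[
        AlignmentLink(src=[0], tgt=[2], tag="SEM"),  # ev  <-> houses (stem)
        AlignmentLink(src=[1], tgt=[2], tag="GRA"),  # ler <-> houses (plural marker)
        AlignmentLink(src=[2], tgt=[0], tag="GRA"),  # de  <-> in (locative case)
    ],
)

for link in pair.links:
    src = "+".join(pair.src_morphemes[i] for i in link.src)
    tgt = " ".join(pair.tgt_tokens[j] for j in link.tgt)
    print(f"{src} -> {tgt} [{link.tag}]")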

LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages
Stephanie Strassel | Jennifer Tracey
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we describe the textual linguistic resources in nearly three dozen languages being produced by the Linguistic Data Consortium for DARPA’s LORELEI (Low Resource Languages for Emergent Incidents) Program. The goal of LORELEI is to improve the performance of human language technologies for low-resource languages and enable rapid re-training of such technologies for new languages, with a focus on the deployment of resources in sudden emergencies such as natural disasters. Representative languages have been selected to provide broad typological coverage for training, and surprise incident languages for testing will be selected over the course of the program. Our approach treats the full set of language packs as a coherent whole, maintaining LORELEI-wide specifications, tagsets, and guidelines, while allowing for adaptation to the specific needs created by each language. Each representative language corpus, therefore, both stands on its own as a resource for the specific language and forms part of a large multilingual resource for broader cross-language technology development.

Selection Criteria for Low Resource Language Programs
Christopher Cieri | Mike Maxwell | Stephanie Strassel | Jennifer Tracey
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper documents and describes the criteria used to select languages for study within programs that include low resource languages, whether given that label or a similar one. It focuses on five US common-task Human Language Technology research and development programs in which the authors have provided information or consulting related to the choice of language. The paper does not describe the actual selection process, which is the responsibility of program management and highly specific to a program’s individual goals and context. Instead it concentrates on the data and criteria that have previously been considered relevant, with the thought that future program managers and their consultants may adapt and apply them, with different prioritization, to future programs.

2015

A New Dataset and Evaluation for Belief/Factuality
Vinodkumar Prabhakaran | Tomas By | Julia Hirschberg | Owen Rambow | Samira Shaikh | Tomek Strzalkowski | Jennifer Tracey | Michael Arrigo | Rupayan Basu | Micah Clark | Adam Dalton | Mona Diab | Louise Guthrie | Anna Prokofieva | Stephanie Strassel | Gregory Werner | Yorick Wilks | Janyce Wiebe
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics