2024
Schema Learning Corpus: Data and Annotation Focused on Complex Events
Song Chen | Jennifer Tracey | Ann Bies | Stephanie Strassel
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The Schema Learning Corpus (SLC) is a new linguistic resource designed to support research into the structure of complex events in multilingual, multimedia data. The SLC incorporates large volumes of background data in English, Spanish and Russian, and defines 100 complex events (CEs) across 12 domains, with CE profiles containing information about the typical steps and substeps and expected event categories for the CE. Multiple documents are labeled for each CE, with pointers to evidence in the document for each CE step, plus labeled events and relations along with their arguments across a large tag set. The SLC was designed to support development and evaluation of technology capable of understanding and reasoning about complex real-world events in multimedia, multilingual data streams in order to provide users with a deeper understanding of the potential relationships among seemingly disparate events and actors, and to allow users to make better predictions about how future events are likely to unfold. The Schema Learning Corpus will be made available to the research community through publication in the Linguistic Data Consortium catalog.
Spanless Event Annotation for Corpus-Wide Complex Event Understanding
Ann Bies | Jennifer Tracey | Ann O’Brien | Song Chen | Stephanie Strassel
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present a new approach to event annotation designed to promote whole-corpus understanding of complex events in multilingual, multimedia data as part of the DARPA Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS) Program. KAIROS aims to build technology capable of reasoning about complex real-world events like a specific terrorist attack in order to provide actionable insights to end users. KAIROS systems extract events from a corpus, aggregate information into a coherent semantic representation, and instantiate observed events or predict unseen but expected events using a relevant event schema selected from a generalized schema library. To support development and testing for KAIROS Phase 2B we created a complex event annotation corpus that, instead of individual event mentions anchored in document spans with pre-defined event type labels, comprises a series of temporally ordered event frames populated with information aggregated from the whole corpus and labeled with an unconstrained tag set based on Wikidata Qnodes. The corpus makes a unique contribution to the resource landscape for information extraction, addressing gaps in the availability of multilingual, multimedia corpora for schema-based event representation. The corpus will be made available through publication in the Linguistic Data Consortium (LDC) catalog.
2022
A Study in Contradiction: Data and Annotation for AIDA Focusing on Informational Conflict in Russia-Ukraine Relations
Jennifer Tracey | Ann Bies | Jeremy Getman | Kira Griffitt | Stephanie Strassel
Proceedings of the Thirteenth Language Resources and Evaluation Conference
This paper describes data resources created for Phase 1 of the DARPA Active Interpretation of Disparate Alternatives (AIDA) program, which aims to develop language technology that can help humans manage large volumes of sometimes conflicting information to develop a comprehensive understanding of events around the world, even when such events are described in multiple media and languages. Especially important is the need for the technology to be capable of building multiple hypotheses to account for alternative interpretations of data imbued with informational conflict. The corpus described here is designed to support these goals. It focuses on the domain of Russia-Ukraine relations and contains multimedia source data in English, Russian and Ukrainian, annotated to support development and evaluation of systems that perform extraction of entities, events, and relations from individual multimedia documents, aggregate the information across documents and languages, and produce multiple “hypotheses” about what has happened. This paper describes source data collection, annotation, and assessment.
BeSt: The Belief and Sentiment Corpus
Jennifer Tracey | Owen Rambow | Claire Cardie | Adam Dalton | Hoa Trang Dang | Mona Diab | Bonnie Dorr | Louise Guthrie | Magdalena Markowska | Smaranda Muresan | Vinodkumar Prabhakaran | Samira Shaikh | Tomek Strzalkowski
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present the BeSt corpus, which records cognitive state: who believes what (i.e., factuality), and who has what sentiment towards what. This corpus is inspired by similar source-and-target corpora, specifically MPQA and FactBank. The corpus comprises two genres, newswire and discussion forums, in three languages, Chinese (Mandarin), English, and Spanish. The corpus is distributed through the LDC.
2020
Basic Language Resources for 31 Languages (Plus English): The LORELEI Representative and Incident Language Packs
Jennifer Tracey | Stephanie Strassel
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
This paper documents and describes the thirty-one basic language resource packs created for the DARPA LORELEI program for use in development and testing of systems capable of providing language-independent situational awareness in emerging scenarios in a low resource language context. Twenty-four Representative Language Packs cover a broad range of language families and typologies, providing large volumes of monolingual and parallel text, smaller volumes of entity and semantic annotations, and a variety of grammatical resources and tools designed to support research into language universals and cross-language transfer. Seven Incident Language Packs provide test data to evaluate system capabilities on a previously unseen low resource language. We discuss the makeup of Representative and Incident Language Packs, the methods used to produce them, and the evolution of their design and implementation over the course of the multi-year LORELEI program. We conclude with a summary of the final language packs including their low-cost publication in the LDC catalog.
2019
Corpus Building for Low Resource Languages in the DARPA LORELEI Program
Jennifer Tracey | Stephanie Strassel | Ann Bies | Zhiyi Song | Michael Arrigo | Kira Griffitt | Dana Delgado | Dave Graff | Seth Kulick | Justin Mott | Neil Kuster
Proceedings of the 2nd Workshop on Technologies for MT of Low Resource Languages
2018
Laying the Groundwork for Knowledge Base Population: Nine Years of Linguistic Resources for TAC KBP
Jeremy Getman | Joe Ellis | Stephanie Strassel | Zhiyi Song | Jennifer Tracey
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
Simple Semantic Annotation and Situation Frames: Two Approaches to Basic Text Understanding in LORELEI
Kira Griffitt | Jennifer Tracey | Ann Bies | Stephanie Strassel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
VAST: A Corpus of Video Annotation for Speech Technologies
Jennifer Tracey | Stephanie Strassel
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Uzbek-English and Turkish-English Morpheme Alignment Corpora
Xuansong Li | Jennifer Tracey | Stephen Grimes | Stephanie Strassel
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Morphologically-rich languages pose problems for machine translation (MT) systems, including word-alignment errors, data sparsity and multiple affixes. Current word-level alignment models do not distinguish words from morphemes, thus yielding low-quality alignment and subsequently affecting end translation quality. Models using morpheme-level alignment can reduce the vocabulary size of morphologically-rich languages and overcome data sparsity. Alignment data based on the smallest units reveals subtle language features and enhances translation quality. Recent research shows such morpheme-level alignment (MA) data to be a valuable linguistic resource for SMT, particularly for languages with rich morphology. In support of this research trend, the Linguistic Data Consortium (LDC) created Uzbek-English and Turkish-English alignment data that are manually aligned at the morpheme level. This paper describes the creation of the MA corpora, including the alignment and tagging processes and approaches, highlighting annotation challenges and specific features of languages with rich morphology. The light tagging annotation on the alignment layer adds extra value to the MA data, helping users flexibly tailor the data for training various MT models.
LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages
Stephanie Strassel | Jennifer Tracey
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
In this paper, we describe the textual linguistic resources in nearly three dozen languages being produced by the Linguistic Data Consortium for DARPA’s LORELEI (Low Resource Languages for Emergent Incidents) Program. The goal of LORELEI is to improve the performance of human language technologies for low-resource languages and enable rapid re-training of such technologies for new languages, with a focus on the use case of deployment of resources in sudden emergencies such as natural disasters. Representative languages have been selected to provide broad typological coverage for training, and surprise incident languages for testing will be selected over the course of the program. Our approach treats the full set of language packs as a coherent whole, maintaining LORELEI-wide specifications, tagsets, and guidelines, while allowing for adaptation to the specific needs created by each language. Each representative language corpus, therefore, both stands on its own as a resource for the specific language and forms part of a large multilingual resource for broader cross-language technology development.
Selection Criteria for Low Resource Language Programs
Christopher Cieri | Mike Maxwell | Stephanie Strassel | Jennifer Tracey
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper documents and describes the criteria used to select languages for study within programs that include low resource languages, whether given that label or a similar one. It focuses on five US common-task Human Language Technology research and development programs in which the authors have provided information or consulting related to the choice of language. The paper does not describe the actual selection process, which is the responsibility of program management and highly specific to a program’s individual goals and context. Instead, it concentrates on the data and criteria that have previously been considered relevant, with the thought that future program managers and their consultants may adapt these and apply them with different prioritization to future programs.
2015
A New Dataset and Evaluation for Belief/Factuality
Vinodkumar Prabhakaran | Tomas By | Julia Hirschberg | Owen Rambow | Samira Shaikh | Tomek Strzalkowski | Jennifer Tracey | Michael Arrigo | Rupayan Basu | Micah Clark | Adam Dalton | Mona Diab | Louise Guthrie | Anna Prokofieva | Stephanie Strassel | Gregory Werner | Yorick Wilks | Janyce Wiebe
Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics