2022
The NIEUW Project: Developing Language Resources through Novel Incentives
James Fiumara | Christopher Cieri | Mark Liberman | Chris Callison-Burch | Jonathan Wright | Robert Parker
Proceedings of the 2nd Workshop on Novel Incentives in Data Collection from People: models, implementations, challenges and results within LREC 2022
This paper provides an overview and update on the Linguistic Data Consortium’s (LDC) NIEUW (Novel Incentives and Workflows) project supported by the National Science Foundation and part of LDC’s larger goal of improving the cost, variety, scale, and quality of language resources available for education, research, and technology development. NIEUW leverages the power of novel incentives to elicit linguistic data and annotations from a wide variety of contributors including citizen scientists, game players, and language students and professionals. In order to align appropriate incentives with the various contributors, LDC has created three distinct web portals to bring together researchers and other language professionals with participants best suited to their project needs. These portals include LanguageARC designed for citizen scientists, Machina Pro Linguistica designed for students and language professionals, and LingoBoingo designed for game players. The design, interface, and underlying tools for each web portal were developed to appeal to the different incentives and motivations of their respective target audiences.
2010
Technical Infrastructure at Linguistic Data Consortium: Software and Hardware Resources for Linguistic Data Creation
Kazuaki Maeda | Haejoong Lee | Stephen Grimes | Jonathan Wright | Robert Parker | David Lee | Andrea Mazzucchi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The Linguistic Data Consortium (LDC) at the University of Pennsylvania has participated as a data provider in a variety of government-sponsored programs that support development of Human Language Technologies. As the number of projects has grown, the quantity and variety of the data LDC produces have increased dramatically in recent years. In this paper, we describe the technical infrastructure, both hardware and software, that LDC has built to support these complex, large-scale linguistic data creation efforts. As it would not be possible to cover all aspects of LDC's technical infrastructure in one paper, this paper focuses on recent developments. We also report on our plans for making our custom-built software resources available to the community as open source software, and introduce an initiative to collaborate with software developers outside LDC. We hope that our approaches and software resources will be useful to community members who take on similar challenges.
Wikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population
Heather Simpson | Stephanie Strassel | Robert Parker | Paul McNamee
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
The Text Analysis Conference (TAC) is a series of Natural Language Processing evaluation workshops organized by the National Institute of Standards and Technology. The Knowledge Base Population (KBP) track at TAC 2009, a hybrid descendant of the TREC Question Answering track and the Automated Content Extraction (ACE) evaluation program, is designed to support development of systems that are capable of automatically populating a knowledge base with information about entities mined from unstructured text. An important component of the KBP evaluation is the Entity Linking task, where systems must accurately associate text mentions of unknown Person (PER), Organization (ORG), and Geopolitical (GPE) names with entries in a knowledge base. The Linguistic Data Consortium (LDC) at the University of Pennsylvania creates and distributes linguistic resources including data, annotations, system assessment, tools and specifications for the TAC KBP evaluations. This paper describes the 2009 resource creation efforts, with particular focus on the selection and development of named entity mentions for the Entity Linking task evaluation.
2008
Annotation Tool Development for Large-Scale Corpus Creation Projects at the Linguistic Data Consortium
Kazuaki Maeda | Haejoong Lee | Shawn Medero | Julie Medero | Robert Parker | Stephanie Strassel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The Linguistic Data Consortium (LDC) creates a variety of linguistic resources - data, annotations, tools, standards and best practices - for many sponsored projects. The programming staff at LDC has created the tools and technical infrastructures that support all aspects of these data creation projects: data scouting, data collection, data selection, annotation, search, data tracking and workflow management. This paper introduces a number of samples of the LDC programming staff's work, with particular focus on the recent additions and updates to the suite of software tools developed by LDC. Tools introduced include the GScout Web Data Scouting Tool, LDC Data Selection Toolkit, ACK - Annotation Collection Kit, XTrans Transcription and Speech Annotation Tool, GALE Distillation Toolkit, and the GALE MT Post Editing Workflow Management System.