2010
pdf
abs
Enriching Word Alignment with Linguistic Tags
Xuansong Li
|
Niyu Ge
|
Stephen Grimes
|
Stephanie M. Strassel
|
Kazuaki Maeda
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Incorporating linguistic knowledge into word alignment is becoming increasingly important for current approaches in statistical machine translation research. To improve automatic word alignment and ultimately machine translation quality, an annotation framework is jointly proposed by LDC (Linguistic Data Consortium) and IBM. The framework enriches word alignment corpora to capture contextual, syntactic and language-specific features by introducing linguistic tags to the alignment annotation. Two annotation schemes constitute the framework: alignment and tagging. The alignment scheme aims to identify minimum translation units and translation relations by using minimum-match and attachment annotation approaches. A set of word tags and alignment link tags is designed in the tagging scheme to describe these translation units and relations. The framework produces a solid ground-level alignment base upon which larger translation unit alignment can be automatically induced. To test the soundness of this work, evaluation is performed on a pilot annotation, resulting in inter- and intra-annotator agreement above 90%. To date, LDC has produced manual word alignment and tagging on 32,823 Chinese-English sentences following this framework.
pdf
abs
Enhanced Infrastructure for Creation and Collection of Translation Resources
Zhiyi Song
|
Stephanie Strassel
|
Gary Krug
|
Kazuaki Maeda
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Statistical Machine Translation (MT) systems have achieved impressive results in recent years, due in large part to the increasing availability of parallel text for system training and development. This paper describes recent efforts at Linguistic Data Consortium to create linguistic resources for MT, including corpora, specifications and resource infrastructure. We review LDC's three-pronged approach to parallel text corpus development (acquisition of existing parallel text from known repositories, harvesting and aligning of potential parallel documents from the web, and manual creation of parallel text by professional translators), and describe recent adaptations that have enabled significant expansions in the scope, variety, quality, efficiency and cost-effectiveness of translation resource creation at LDC.
pdf
abs
Transcription Methods for Consistency, Volume and Efficiency
Meghan Lammie Glenn
|
Stephanie M. Strassel
|
Haejoong Lee
|
Kazuaki Maeda
|
Ramez Zakhary
|
Xuansong Li
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
This paper describes recent efforts at Linguistic Data Consortium at the University of Pennsylvania to create manual transcripts as a shared resource for human language technology research and evaluation. Speech recognition and related technologies in particular call for substantial volumes of transcribed speech for use in system development, and for human gold standard references for evaluating performance over time. Over the past several years LDC has developed a number of transcription approaches to support the varied goals of speech technology evaluation programs in multiple languages and genres. We describe each transcription method in detail, and report on the results of a comparative analysis of transcriber consistency and efficiency, for two transcription methods in three languages and five genres. Our findings suggest that transcripts for planned speech are generally more consistent than those for spontaneous speech, and that careful transcription methods result in higher rates of agreement when compared to quick transcription methods. We conclude with a general discussion of factors contributing to transcription quality, efficiency and consistency.
pdf
abs
Technical Infrastructure at Linguistic Data Consortium: Software and Hardware Resources for Linguistic Data Creation
Kazuaki Maeda
|
Haejoong Lee
|
Stephen Grimes
|
Jonathan Wright
|
Robert Parker
|
David Lee
|
Andrea Mazzucchi
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Linguistic Data Consortium (LDC) at the University of Pennsylvania has participated as a data provider in a variety of government-sponsored programs that support development of Human Language Technologies. As the number of projects increases, the quantity and variety of the data LDC produces have increased dramatically in recent years. In this paper, we describe the technical infrastructure, both hardware and software, that LDC has built to support these complex, large-scale linguistic data creation efforts. As it would not be possible to cover all aspects of LDC's technical infrastructure in one paper, this paper focuses on recent developments. We also report on our plans for making our custom-built software resources available to the community as open source software, and introduce an initiative to collaborate with software developers outside LDC. We hope that our approaches and software resources will be useful to the community members who take on similar challenges.
2009
pdf
Basic Language Resources for Diverse Asian Languages: A Streamlined Approach for Resource Creation
Heather Simpson
|
Kazuaki Maeda
|
Christopher Cieri
Proceedings of the 7th Workshop on Asian Language Resources (ALR7)
2008
pdf
abs
Linguistic Resources and Evaluation Techniques for Evaluation of Cross-Document Automatic Content Extraction
Stephanie Strassel
|
Mark Przybocki
|
Kay Peterson
|
Zhiyi Song
|
Kazuaki Maeda
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The NIST Automatic Content Extraction (ACE) Evaluation expands its focus in 2008 to encompass the challenge of cross-document and cross-language global integration and reconciliation of information. While past ACE evaluations have been limited to local (within-document) detection and disambiguation of entities, relations and events, the current evaluation adds global (cross-document and cross-language) entity disambiguation tasks for Arabic and English. This paper presents the 2008 ACE XDoc evaluation task and associated infrastructure. We describe the linguistic resources created by LDC to support the evaluation, focusing on new approaches required for data selection, data processing, annotation task definitions and annotation software, and we conclude with a discussion of the metrics developed by NIST to support the evaluation.
pdf
abs
Annotation Tool Development for Large-Scale Corpus Creation Projects at the Linguistic Data Consortium
Kazuaki Maeda
|
Haejoong Lee
|
Shawn Medero
|
Julie Medero
|
Robert Parker
|
Stephanie Strassel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The Linguistic Data Consortium (LDC) creates a variety of linguistic resources - data, annotations, tools, standards and best practices - for many sponsored projects. The programming staff at LDC has created the tools and technical infrastructure to support all aspects of data creation for these projects: data scouting, data collection, data selection, annotation, search, data tracking and workflow management. This paper introduces a number of samples of the LDC programming staff's work, with particular focus on the recent additions and updates to the suite of software tools developed by LDC. Tools introduced include the GScout Web Data Scouting Tool, LDC Data Selection Toolkit, ACK - Annotation Collection Kit, XTrans Transcription and Speech Annotation Tool, GALE Distillation Toolkit, and the GALE MT Post Editing Workflow Management System.
pdf
abs
Creating Sentence-Aligned Parallel Text Corpora from a Large Archive of Potential Parallel Text using BITS and Champollion
Kazuaki Maeda
|
Xiaoyi Ma
|
Stephanie Strassel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Parallel text is one of the most valuable resources for development of statistical machine translation systems and other NLP applications. The Linguistic Data Consortium (LDC) has supported research on statistical machine translation and other NLP applications by creating and distributing a large amount of parallel text resources for the research communities. However, manual translations are very costly, and the number of known providers that offer complete parallel text is limited. This paper presents a cost-effective approach to identify parallel document pairs from sources that provide potential parallel text - namely, sources that may contain whole or partial translations of documents in the source language - using the BITS and Champollion parallel text alignment systems developed by LDC.
2006
pdf
abs
An Efficient Approach to Gold-Standard Annotation: Decision Points for Complex Tasks
Julie Medero
|
Kazuaki Maeda
|
Stephanie Strassel
|
Christopher Walker
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Inter-annotator consistency is a concern for any corpus building effort relying on human annotation. Adjudication is an effective way to locate and correct discrepancies of various kinds. It can also be both difficult and time-consuming. This paper introduces the Linguistic Data Consortium (LDC)'s model for decision point-based annotation and adjudication, and describes the annotation tools developed to enable this approach for the Automatic Content Extraction (ACE) Program. Using a customized user interface incorporating decision points, we improved adjudication efficiency over 2004 annotation rates, despite increased annotation task complexity. We examine the factors that lead to more efficient, less demanding adjudication. We further discuss how a decision point model might be applied to annotation tools designed for a wide range of annotation tasks. Finally, we consider issues of annotation tool customization versus development time in the context of a decision point model.
pdf
abs
Integrated Linguistic Resources for Language Exploitation Technologies
Stephanie Strassel
|
Christopher Cieri
|
Andrew Cole
|
Denise Dipersio
|
Mark Liberman
|
Xiaoyi Ma
|
Mohamed Maamouri
|
Kazuaki Maeda
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Linguistic Data Consortium has recently embarked on an effort to create integrated linguistic resources and related infrastructure for language exploitation technologies within the DARPA GALE (Global Autonomous Language Exploitation) Program. GALE targets an end-to-end system consisting of three major engines: Transcription, Translation and Distillation. Multilingual speech or text from a variety of genres is taken as input and English text is given as output, with information of interest presented in an integrated and consolidated fashion to the end user. GALE's goals require a quantum leap in the performance of human language technology, while also demanding solutions that are more intelligent, more robust, more adaptable, more efficient and more integrated. LDC has responded to this challenge with a comprehensive approach to linguistic resource development designed to support GALE's research and evaluation needs and to provide lasting resources for the larger Human Language Technology community.
pdf
abs
Linguistic Resources for Speech Parsing
Ann Bies
|
Stephanie Strassel
|
Haejoong Lee
|
Kazuaki Maeda
|
Seth Kulick
|
Yang Liu
|
Mary Harper
|
Matthew Lease
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
We report on the success of a two-pass approach to annotating metadata, speech effects and syntactic structure in English conversational speech: separately annotating transcribed speech for structural metadata, or structural events (fillers, speech repairs (or edit disfluencies), and SUs, or syntactic/semantic units), and for syntactic structure (treebanking constituent structure and shallow argument structure). The two annotations were then combined into a single representation. Certain alignment issues between the two types of annotation led to the discovery and correction of annotation errors in each, resulting in a more accurate and useful resource. The development of this corpus was motivated by the need to have both metadata and syntactic structure annotated in order to support synergistic work on speech parsing and structural event detection. Automatic detection of these speech phenomena would simultaneously improve parsing accuracy and provide a mechanism for cleaning up transcriptions for downstream text processing. Similarly, constraints imposed by text processing systems such as parsers can be used to help improve identification of disfluencies and sentence boundaries. This paper reports on our efforts to develop a linguistic resource providing both spoken metadata and syntactic structure information, and describes the resulting corpus of English conversational speech.
pdf
abs
Low-cost Customized Speech Corpus Creation for Speech Technology Applications
Kazuaki Maeda
|
Christopher Cieri
|
Kevin Walker
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
Speech technology applications, such as speech recognition, speech synthesis, and speech dialog systems, often require corpora based on highly customized specifications. Existing corpora available to the community, such as TIMIT and other corpora distributed by LDC and ELDA, do not always meet the requirements of such applications. In such cases, the developers need to create their own corpora. The creation of a highly customized speech corpus, however, could be a very expensive and time-consuming task, especially for small organizations. It requires multidisciplinary expertise in linguistics, management and engineering as it involves subtasks such as the corpus design, human subject recruitment, recording, quality assurance, and in some cases, segmentation, transcription and annotation. This paper describes LDC's recent involvement in the creation of a low-cost yet highly-customized speech corpus for a commercial organization under a novel data creation and licensing model, which benefits both the particular data requester and the general linguistic data user community.
pdf
abs
A New Phase in Annotation Tool Development at the Linguistic Data Consortium: The Evolution of the Annotation Graph Toolkit
Kazuaki Maeda
|
Haejoong Lee
|
Julie Medero
|
Stephanie Strassel
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The Linguistic Data Consortium (LDC) has created various annotated linguistic data for a variety of common task evaluation programs and projects to create shared linguistic resources. The majority of these annotated linguistic data were created with highly customized annotation tools developed at LDC. The Annotation Graph Toolkit (AGTK) has been used as a primary infrastructure for annotation tool development at LDC in recent years. Thanks to the direct feedback from annotation task designers and annotators in-house, annotation tool development at LDC has entered a new, more mature and productive phase. This paper describes recent additions to LDC's annotation tools that are newly developed or significantly improved since our last report at the Fourth International Conference on Language Resources and Evaluation in 2004. These tools are either directly based on AGTK or share a common philosophy with other AGTK tools.
2004
pdf
Annotation Tools for Large-Scale Corpus Development: Using AGTK at the Linguistic Data Consortium
Kazuaki Maeda
|
Stephanie Strassel
Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04)
2002
pdf
Models and Tools for Collaborative Annotation
Xiaoyi Ma
|
Haejoong Lee
|
Steven Bird
|
Kazuaki Maeda
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
pdf
TableTrans, MultiTrans, InterTrans and TreeTrans: Diverse Tools Built on the Annotation Graph Toolkit
Steven Bird
|
Kazuaki Maeda
|
Xiaoyi Ma
|
Haejoong Lee
|
Beth Randall
|
Salim Zayat
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
2001
pdf
The Annotation Graph Toolkit: Software Components for Building Linguistic Annotation Tools
Kazuaki Maeda
|
Steven Bird
|
Xiaoyi Ma
|
Haejoong Lee
Proceedings of the First International Conference on Human Language Technology Research
pdf
Annotation Tools Based on the Annotation Graph API
Steven Bird
|
Kazuaki Maeda
|
Xiaoyi Ma
|
Haejoong Lee
Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources