2015
From Light to Rich ERE: Annotation of Entities, Relations, and Events
Zhiyi Song | Ann Bies | Stephanie Strassel | Tom Riese | Justin Mott | Joe Ellis | Jonathan Wright | Seth Kulick | Neville Ryant | Xiaoyi Ma
Proceedings of the 3rd Workshop on EVENTS: Definition, Detection, Coreference, and Representation
2014
The RATS Collection: Supporting HLT Research with Degraded Audio Data
David Graff | Kevin Walker | Stephanie Strassel | Xiaoyi Ma | Karen Jones | Ann Sawyer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The DARPA RATS program was established to foster development of language technology systems that can perform well on speaker-to-speaker communications over radio channels that evince a wide range in the type and extent of signal variability and acoustic degradation. Creating suitable corpora to address this need poses an equally wide range of challenges for the collection, annotation and quality assessment of relevant data. This paper describes the LDC's multi-year effort to build the RATS data collection, summarizes the content and properties of the resulting corpora, and discusses the novel problems and approaches involved in ensuring that the data would satisfy its intended use: to provide speech recordings and annotations for training and evaluating HLT systems that perform four specific tasks on difficult radio channels: Speech Activity Detection (SAD), Language Identification (LID), Speaker Identification (SID) and Keyword Spotting (KWS).
2012
LDC Forced Aligner
Xiaoyi Ma
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper describes the LDC forced aligner, which was designed to align audio and transcripts. Unlike existing forced aligners, the LDC forced aligner can align partially transcribed audio files, as well as audio files with large chunks of non-speech segments (such as noise, music, or silence), by inserting optional wildcard phoneme sequences between sentence or paragraph boundaries. Based on the HTK toolkit, the LDC forced aligner can align audio and transcript at the sentence or word level. This paper also reports its use on English and Mandarin Chinese data.
2008
Creating Sentence-Aligned Parallel Text Corpora from a Large Archive of Potential Parallel Text using BITS and Champollion
Kazuaki Maeda | Xiaoyi Ma | Stephanie Strassel
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
Parallel text is one of the most valuable resources for development of statistical machine translation systems and other NLP applications. The Linguistic Data Consortium (LDC) has supported research on statistical machine translation and other NLP applications by creating and distributing a large amount of parallel text resources for the research communities. However, manual translations are very costly, and the number of known providers that offer complete parallel text is limited. This paper presents a cost-effective approach to identifying parallel document pairs from sources that provide potential parallel text (namely, sources that may contain whole or partial translations of documents in the source language) using the BITS and Champollion parallel text alignment systems developed by LDC.
2006
Integrated Linguistic Resources for Language Exploitation Technologies
Stephanie Strassel | Christopher Cieri | Andrew Cole | Denise Dipersio | Mark Liberman | Xiaoyi Ma | Mohamed Maamouri | Kazuaki Maeda
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
The Linguistic Data Consortium (LDC) has recently embarked on an effort to create integrated linguistic resources and related infrastructure for language exploitation technologies within the DARPA GALE (Global Autonomous Language Exploitation) Program. GALE targets an end-to-end system consisting of three major engines: Transcription, Translation and Distillation. Multilingual speech or text from a variety of genres is taken as input and English text is given as output, with information of interest presented in an integrated and consolidated fashion to the end user. GALE's goals require a quantum leap in the performance of human language technology, while also demanding solutions that are more intelligent, more robust, more adaptable, more efficient and more integrated. LDC has responded to this challenge with a comprehensive approach to linguistic resource development designed to support GALE's research and evaluation needs and to provide lasting resources for the larger Human Language Technology community.
Champollion: A Robust Parallel Text Sentence Aligner
Xiaoyi Ma
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper describes Champollion, a lexicon-based sentence aligner designed for robust alignment of potentially noisy parallel text. Champollion increases the robustness of the alignment by assigning greater weights to less frequent translated words. Experiments on a manually aligned Chinese-English parallel corpus show that Champollion achieves high precision and recall on noisy data. Champollion can be easily ported to new language pairs. It is freely available to the public.
Corpus Support for Machine Translation at LDC
Xiaoyi Ma | Christopher Cieri
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)
This paper describes LDC's efforts in collecting, creating and processing different types of linguistic data, including lexicons, parallel text, multiple translation corpora, and human assessment of translation quality, to support research and development in Machine Translation. Through a combination of different procedures and core technologies, the LDC was able to create very large, high quality, and cost-efficient corpora, which have contributed significantly to recent advances in Machine Translation. Multiple translation corpora and human assessment together facilitate, validate and improve automatic evaluation metrics, which are vital to the development of MT systems. The Bilingual Internet Text Search (BITS) and Champollion sentence aligner enable the finding and processing of large quantities of parallel text. All specifications and tools used by LDC and described in the paper are or will be available to the general public.
2002
Models and Tools for Collaborative Annotation
Xiaoyi Ma | Haejoong Lee | Steven Bird | Kazuaki Maeda
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
TableTrans, MultiTrans, InterTrans and TreeTrans: Diverse Tools Built on the Annotation Graph Toolkit
Steven Bird | Kazuaki Maeda | Xiaoyi Ma | Haejoong Lee | Beth Randall | Salim Zayat
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
Creating Annotation Tools with the Annotation Graph Toolkit
Kazuaki Maeda | Steven Bird | Xiaoyi Ma | Haejoong Lee
Proceedings of the Third International Conference on Language Resources and Evaluation (LREC’02)
2001
The Annotation Graph Toolkit: Software Components for Building Linguistic Annotation Tools
Kazuaki Maeda | Steven Bird | Xiaoyi Ma | Haejoong Lee
Proceedings of the First International Conference on Human Language Technology Research
Annotation Tools Based on the Annotation Graph API
Steven Bird | Kazuaki Maeda | Xiaoyi Ma | Haejoong Lee
Proceedings of the ACL 2001 Workshop on Sharing Tools and Resources
1999
Parallel text collections at Linguistic Data Consortium
Xiaoyi Ma
Proceedings of Machine Translation Summit VII
The Linguistic Data Consortium (LDC) is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for research and development purposes. This paper describes past and current work on creation of parallel text corpora, and reviews existing and upcoming collections at LDC.
BITS: a method for bilingual text search over the Web
Xiaoyi Ma | Mark Y. Liberman
Proceedings of Machine Translation Summit VII
Parallel corpora are a valuable resource for machine translation, multilingual text retrieval, language education and other applications, but for various reasons their availability is very limited at present. Having noticed that the World Wide Web is a potential source of parallel text, researchers are working to explore the Web in order to build a large collection of bitext. This paper presents BITS (Bilingual Internet Text Search), a system which harvests multilingual texts over the World Wide Web with virtually no human intervention. The technique is simple, easy to port to any language pair, and highly accurate. The results of the experiments on the German-English pair showed that the method is very successful.