William Lewis

Also published as: William D. Lewis

2018

Training Deployable General Domain MT for a Low Resource Language Pair: English-Bangla
Sandipan Dandapat | William Lewis
Proceedings of the 21st Annual Conference of the European Association for Machine Translation

A large percentage of the world’s population, upwards of 1.5 billion people, speaks a language of the Indian subcontinent. We refer to these here as Indic languages, comprising languages from both the Indo-European (e.g., Hindi, Bangla, Gujarati) and Dravidian (e.g., Tamil, Telugu, Malayalam) families. A universal characteristic of Indic languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) for these languages difficult. In this paper, we describe our efforts towards developing general-domain English–Bangla MT systems which are deployable to the Web. We initially developed and deployed SMT-based systems, but over time migrated to NMT-based systems. Our initial SMT-based systems had reasonably good BLEU scores; however, using NMT systems, we have gained significant improvements over the SMT baselines. This is achieved using a number of ideas to boost the data store and counter data sparsity: crowd translation of intelligently selected monolingual data (with throughput enhanced by an Input Method Editor, or IME, designed specifically for QWERTY keyboard entry for Devanagari scripted languages), back-translation, different regularization techniques, dataset augmentation, and early stopping.

2017

The Microsoft Speech Language Translation (MSLT) Corpus for Chinese and Japanese: Conversational Test data for Machine Translation and Speech Recognition
Christian Federmann | William D. Lewis
Proceedings of Machine Translation Summit XVI: Research Track

2016

Microsoft Speech Language Translation (MSLT) Corpus: The IWSLT 2016 release for English, French and German
Christian Federmann | William D. Lewis
Proceedings of the 13th International Conference on Spoken Language Translation

We describe the Microsoft Speech Language Translation (MSLT) corpus, which was created in order to evaluate end-to-end conversational speech translation quality. The corpus was created from actual conversations over Skype, and we provide details on the recording setup and the different layers of associated text data. The corpus release includes Test and Dev sets with reference transcripts for speech recognition. Additionally, cleaned-up transcripts and reference translations are available for evaluation of machine translation quality. The IWSLT 2016 release described here includes the source audio, raw transcripts, cleaned-up transcripts, and translations to or from English for both French and German.

2015

Enriching Interlinear Text using Automatically Constructed Annotators
Ryan Georgi | Fei Xia | William Lewis
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)

Skype Translator: Breaking down language and hearing barriers. A behind the scenes look at near real-time speech translation
William Lewis
Proceedings of Translating and the Computer 37

Applying cross-entropy difference for selecting parallel training data from publicly available sources for conversational machine translation
William Lewis | Christian Federmann | Ying Xin
Proceedings of the 12th International Workshop on Spoken Language Translation: Papers

2014

Enriching ODIN
Fei Xia | William Lewis | Michael Wayne Goodman | Joshua Crowgey | Emily M. Bender
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we describe the expansion of the ODIN resource, a database containing many thousands of instances of Interlinear Glossed Text (IGT) for over a thousand languages harvested from scholarly linguistic papers posted to the Web. A database containing a large number of instances of IGT, which are effectively richly annotated and heuristically aligned bitexts, provides a unique resource for bootstrapping NLP tools for resource-poor languages. To make the data in ODIN more readily consumable by tool developers and NLP researchers, we propose a new XML format for IGT, called Xigt. We call the updated release ODIN-II.

2013

Enhanced and Portable Dependency Projection Algorithms Using Interlinear Glossed Text
Ryan Georgi | Fei Xia | William D. Lewis
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Dramatically Reducing Training Data Size Through Vocabulary Saturation
William Lewis | Sauleh Eetemadi
Proceedings of the Eighth Workshop on Statistical Machine Translation

Controlled Ascent: Imbuing Statistical MT with Linguistic Knowledge
William Lewis | Chris Quirk
Proceedings of the Second Workshop on Hybrid Approaches to Translation

2012

Improving Dependency Parsing with Interlinear Glossed Text and Syntactic Projection
Ryan Georgi | Fei Xia | William Lewis
Proceedings of COLING 2012: Posters

Measuring the Divergence of Dependency Structures Cross-Linguistically to Improve Syntactic Projection Algorithms
Ryan Georgi | Fei Xia | William Lewis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Syntactic parses can provide valuable information for many NLP tasks, such as machine translation and semantic analysis. However, most of the world's languages do not have large amounts of syntactically annotated corpora available for building parsers. Syntactic projection techniques attempt to address this issue by using parallel corpora between resource-poor and resource-rich languages, bootstrapping the resource-poor language with the syntactic analysis of the resource-rich language. In this paper, we investigate the possibility of using small, parallel, annotated corpora to automatically detect divergent structural patterns between two languages. These patterns can then be used to improve structural projection algorithms, allowing for better-performing NLP tools for resource-poor languages, in particular those that may not have the large amounts of annotated data necessary for traditional, fully supervised methods. While this detection process is not exhaustive, we demonstrate that important instances of divergence are picked up with minimal prior knowledge of a given language pair.

Applications of data selection via cross-entropy difference for real-world statistical machine translation
Amittai Axelrod | QingJun Li | William D. Lewis
Proceedings of the 9th International Workshop on Spoken Language Translation: Papers

We broaden the application of data selection methods for domain adaptation to a larger number of languages, data, and decoders than shown in previous work, and explore comparable applications for both monolingual and bilingual cross-entropy difference methods. We compare domain-adapted systems against very large general-purpose systems for the same languages, and do so without a bias to a particular direction. We present results against real-world general-purpose systems tuned on domain-specific data, which are substantially harder to beat than standard research baseline systems. We show better performance for nearly all domain-adapted systems, despite the fact that the domain-adapted systems are trained on a fraction of the content of their general-domain counterparts. The high performance of these methods suggests applicability to a wide variety of contexts, particularly in scenarios where only small supplies of unambiguously domain-specific data are available, yet it is believed that additional similar data is included in larger heterogeneous-content general-domain corpora.

Building MT for a Severely Under-Resourced Language: White Hmong
William Lewis | Phong Yang
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

In this paper, we discuss the development of statistical machine translation for English to/from White Hmong (language code: mww). White Hmong is a Hmong-Mien language, originally spoken mostly in Southeast Asia, but now predominantly spoken by a large diaspora throughout the world, with populations in the United States, Australia, France, Thailand and elsewhere. Building statistical translation systems for Hmong proved to be incredibly challenging since there are no known parallel or monolingual corpora for the language; in fact, finding data for Hmong proved to be one of the biggest challenges to getting the project off the ground. It was only through a close collaboration with the Hmong community, and the active and tireless participation of Hmong speakers, that it became possible to build up a critical mass of data to make the translation project a reality. We see this effort as potentially replicable for other severely resource-poor languages of the world, a category that likely includes the majority of the languages still spoken on the planet. Further, the work here suggests that research and work on other severely under-resourced languages can have significant positive impacts for the affected communities, both for accessibility and language preservation.

2011

Crisis MT: Developing A Cookbook for MT in Crisis Situations
William Lewis | Robert Munro | Stephan Vogel
Proceedings of the Sixth Workshop on Statistical Machine Translation

2010

Achieving Domain Specificity in SMT without Overt Siloing
William D. Lewis | Chris Wendt | David Bullock
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We examine pooling data as a method for improving Statistical Machine Translation (SMT) quality for narrowly defined domains, such as data for a particular company or public entity. By pooling all available data, building large SMT engines, and using domain-specific target language models, we see boosts in quality, and can achieve the generalizability and resiliency of a larger SMT but with the precision of a domain-specific engine.

The Problems of Language Identification within Hugely Multilingual Data Sets
Fei Xia | Carrie Lewis | William D. Lewis
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

As the data for more and more languages is finding its way into digital form, with an increasing amount of this data being posted to the Web, it has become possible to collect language data from the Web and create large multilingual resources, covering hundreds or even thousands of languages. ODIN, the Online Database of INterlinear text (Lewis, 2006), is such a resource. It currently consists of nearly 200,000 data points for over 1,000 languages, the data for which was harvested from linguistic documents on the Web. We identify a number of issues with language identification for such broad-coverage resources including the lack of training data, ambiguous language names, incomplete language code sets, and incorrect uses of language names and codes. After providing a short overview of existing language code sets maintained by the linguistic community, we discuss what linguists and the linguistic community can do to make the process of language identification easier.

Haitian Creole: How to Build and Ship an MT Engine from Scratch in 4 days, 17 hours, & 30 minutes
William Lewis
Proceedings of the 14th Annual Conference of the European Association for Machine Translation

Comparing Language Similarity across Genetic and Typologically-Based Groupings
Ryan Georgi | Fei Xia | William Lewis
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Intelligent Selection of Language Model Training Data
Robert C. Moore | William Lewis
Proceedings of the ACL 2010 Conference Short Papers

Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground
Fei Xia | William Lewis | Lori Levin
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground