David Doermann

2012

Leveraging Statistical Transliteration for Dictionary-Based English-Bengali CLIR of OCR‘d Text
Utpal Garain | Arjun Das | David Doermann | Douglas Oard
Proceedings of COLING 2012: Posters

pdf bib abs

Linguistic Resources for Handwriting Recognition and Translation Evaluation
Zhiyi Song | Safa Ismael | Stephen Grimes | David Doermann | Stephanie Strassel
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe efforts to create corpora to support development and evaluation of handwriting recognition and translation technology. LDC has developed a stable pipeline and infrastructures for collecting and annotating handwriting linguistic resources to support the evaluation of MADCAT and OpenHaRT. We collect and annotate handwritten samples of pre-processed Arabic and Chinese data that has been already translated in English that is used in the GALE program. To date, LDC has recruited more than 600 scribes and collected, annotated and released more than 225,000 handwriting images. Most linguistic resources created for these programs will be made available to the larger research community by publishing in LDC's catalog. The phase 1 MADCAT corpus is now available.

pdf bib

A Random Forest System Combination Approach for Error Detection in Digital Dictionaries
Michael Bloodgood | Peng Ye | Paul Rodrigues | David Zajic | David Doermann
Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data

This paper describes an approach to analyzing the lexical structure of OCRed bilingual dictionaries to construct resources suited for machine translation of low-density languages, where online resources are limited. A rule-based, an HMM-based, and a post-processed HMM-based method are used for rapid construction of MT lexicons based on systematic structural clues provided in the original dictionary. We evaluate the effectiveness of our techniques, concluding that: (1) the rule-based method performs better with dictionaries where the font is not an important distinguishing feature for determining information types; (2) the post-processed stochastic method improves the results of the stochastic method for phrasal entries; and (3) Our resulting bilingual lexicons are comprehensive enough to provide the basis for reasonable translation results when compared to human translations.

pdf bib

Co-authors

Venues

naacl1

sigmorphon1

ws1

Fix author

David Doermann

2012

2011

2006

2003

Co-authors

Venues