Nolan King
2025
OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation
Paul McNamee | Kevin Duh | Cameron Carpenter | Ron Colaianni | Nolan King | Kenton Murray
Proceedings of Machine Translation Summit XX: Volume 1
We introduce OJ4OCRMT, an Optical Character Recognition (OCR) dataset for Machine Translation (MT). The dataset supports research on automatic extraction, recognition, and translation of text from document images. The Official Journal of the European Union (OJEU) is the official gazette of the EU. Tens of thousands of pages of legislative acts and regulatory notices are published annually, and parallel translations are available in each of the official languages. Due to its large size, high degree of multilinguality, and carefully produced human translations, the OJEU is a singular resource for language processing research. We have assembled a large collection of parallel pages from the OJEU and have created a dataset to support translation of document images. In this work we introduce the dataset, describe the design decisions we made, and report baseline performance figures for the translation task. It is our hope that this dataset will significantly add to the comparatively few resources presently available for evaluating OCR-MT systems.
2022
CAMIO: A Corpus for OCR in Multiple Languages
Michael Arrigo | Stephanie Strassel | Nolan King | Thao Tran | Lisa Mason
Proceedings of the Thirteenth Language Resources and Evaluation Conference
CAMIO (Corpus of Annotated Multilingual Images for OCR) is a new corpus created by the Linguistic Data Consortium to support the development and evaluation of optical character recognition (OCR) and related technologies for 35 languages across 24 unique scripts. The corpus comprises nearly 70,000 images of machine-printed text, covering a wide variety of topics and styles, document domains, attributes, and scanning/capture artifacts. Most images have been exhaustively annotated for text localization, resulting in over 2.3M line-level bounding boxes. For 13 of the 35 languages, 1,250 images per language have been further annotated with orthographic transcriptions of each line plus specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in a comprehensive XML output format defined for this corpus. The paper discusses corpus design and implementation, challenges encountered, baseline performance results obtained on the corpus for text localization and OCR decoding, and plans for corpus publication.
Co-authors
- Michael Arrigo 1
- Cameron Carpenter 1
- Ron Colaianni 1
- Kevin Duh 1
- Lisa Mason 1