OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation

Paul McNamee, Kevin Duh, Cameron Carpenter, Ron Colaianni, Nolan King, Kenton Murray


Abstract
We introduce OJ4OCRMT, an Optical Character Recognition (OCR) dataset for Machine Translation (MT). The dataset supports research on automatic extraction, recognition, and translation of text from document images. The Official Journal of the European Union (OJEU) is the official gazette of the EU. Tens of thousands of pages of legislative acts and regulatory notices are published annually, and parallel translations are available in each of the official languages. Due to its large size, high degree of multilinguality, and carefully produced human translations, the OJEU is a singular resource for language processing research. We have assembled a large collection of parallel pages from the OJEU and have created a dataset to support translation of document images. In this work we introduce the dataset, describe the design decisions we made, and report baseline performance figures for the translation task. It is our hope that this dataset will significantly add to the comparatively few resources presently available for evaluating OCR-MT systems.
Anthology ID:
2025.mtsummit-1.9
Volume:
Proceedings of Machine Translation Summit XX: Volume 1
Month:
June
Year:
2025
Address:
Geneva, Switzerland
Editors:
Pierrette Bouillon, Johanna Gerlach, Sabrina Girletti, Lise Volkart, Raphael Rubino, Rico Sennrich, Ana C. Farinha, Marco Gaido, Joke Daems, Dorothy Kenny, Helena Moniz, Sara Szoc
Venue:
MTSummit
Publisher:
European Association for Machine Translation
Pages:
113–125
URL:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.mtsummit-1.9/
Cite (ACL):
Paul McNamee, Kevin Duh, Cameron Carpenter, Ron Colaianni, Nolan King, and Kenton Murray. 2025. OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation. In Proceedings of Machine Translation Summit XX: Volume 1, pages 113–125, Geneva, Switzerland. European Association for Machine Translation.
Cite (Informal):
OJ4OCRMT: A Large Multilingual Dataset for OCR-MT Evaluation (McNamee et al., MTSummit 2025)
PDF:
https://preview.aclanthology.org/mtsummit-25-ingestion/2025.mtsummit-1.9.pdf