You’ve translated it, now what?

Michael Maxwell, Shabnam Tafreshi, Aquia Richburg, Balaji Kodali, Kymani Brown


Abstract
Humans use document formatting to discover document and section titles, and important phrases. But when machines process a document, especially one OCRed from an image, these cues are often invisible to downstream processes: words in footnotes or body text are treated as just as important as words in titles. It would be better for indexing and summarization tools to be guided by implicit document structure. In an ODNI-sponsored project, ARLIS looked at discovering formatting in OCRed text as a way to infer document structure. Most OCR engines output results as hOCR (an XML format), giving bounding boxes around characters. In theory, this also provides style information such as bolding and italicization, but in practice, this capability is limited. For example, the Tesseract OCR tool provides bounding boxes, but does not attempt to detect bold text (relevant to author emphasis and to specialized fields in, e.g., print dictionaries), and its discrimination of italicization is poor. Our project inferred font size from hOCR bounding boxes and, using that and other cues (e.g., the fact that titles tend to be short), determined which text constituted section titles; from this, a document outline can be created. We also experimented with algorithms for detecting bold text. Our best algorithm has much improved recall and precision, although the exact numbers are font-dependent. The next step is to incorporate inferred structure into the output of machine translation. One way is to embed XML tags for inferred structure into the text extracted from the imaged document, and either to pass the strings enclosed by XML tags to the MT engine individually, or to pass the tags through the MT engine without modification. This structural information can guide downstream bulk processing tasks such as summarization and search, and also enables building tables of contents for human users examining individual documents.
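The idea in the abstract of inferring font size from hOCR bounding boxes and flagging short, large lines as titles can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `ratio` and `max_words` thresholds, the `<title>` tag, and the sample input are all assumptions made for illustration. The sketch assumes well-formed XHTML-style hOCR in which lines carry the class `ocr_line` and a `bbox x0 y0 x1 y1` entry in the `title` attribute, and it uses line-box height as a rough proxy for font size.

```python
import re
import statistics
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# hOCR stores geometry in the title attribute, e.g. title="bbox 10 10 300 50".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def line_heights(hocr):
    """Return (text, height_in_pixels) for every ocr_line element."""
    root = ET.fromstring(hocr)
    lines = []
    for el in root.iter():
        if "ocr_line" not in el.get("class", "").split():
            continue
        match = BBOX_RE.search(el.get("title", ""))
        if match is None:
            continue
        x0, y0, x1, y1 = map(int, match.groups())
        # Collapse the text of nested word spans into one string.
        text = " ".join("".join(el.itertext()).split())
        if text:
            lines.append((text, y1 - y0))
    return lines


def tag_titles(hocr, ratio=1.3, max_words=10):
    """Wrap likely titles in a hypothetical <title> tag.

    A line counts as a title when its box is at least `ratio` times the
    median line height and it has at most `max_words` words. Both
    thresholds are illustrative assumptions, not values from the paper.
    """
    lines = line_heights(hocr)
    if not lines:
        return []
    body_height = statistics.median(h for _, h in lines)
    tagged = []
    for text, height in lines:
        is_title = height >= ratio * body_height and len(text.split()) <= max_words
        tagged.append(f"<title>{escape(text)}</title>" if is_title else escape(text))
    return tagged


# Hypothetical, hand-written hOCR fragment (not real OCR output).
SAMPLE = """<html><body><div class="ocr_page">
<span class="ocr_line" title="bbox 10 10 300 50"><span class="ocrx_word" title="bbox 10 10 300 50">Introduction</span></span>
<span class="ocr_line" title="bbox 10 60 600 80">Humans use formatting to find titles.</span>
<span class="ocr_line" title="bbox 10 85 600 105">Machines often lose these cues.</span>
<span class="ocr_line" title="bbox 10 110 600 130">Structure can be inferred instead.</span>
</div></body></html>"""

for line in tag_titles(SAMPLE):
    print(line)
```

The tagged strings could then be handed to an MT engine one at a time, or the tags passed through untranslated, as the abstract suggests. A real pipeline would likely need more cues than line height alone, since ascenders, descenders, and skew all affect the bounding box.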
Anthology ID:
2022.amta-upg.27
Volume:
Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track)
Month:
September
Year:
2022
Address:
Orlando, USA
Venue:
AMTA
Publisher:
Association for Machine Translation in the Americas
Pages:
394–404
URL:
https://aclanthology.org/2022.amta-upg.27
Cite (ACL):
Michael Maxwell, Shabnam Tafreshi, Aquia Richburg, Balaji Kodali, and Kymani Brown. 2022. You’ve translated it, now what? In Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 2: Users and Providers Track and Government Track), pages 394–404, Orlando, USA. Association for Machine Translation in the Americas.
Cite (Informal):
You’ve translated it, now what? (Maxwell et al., AMTA 2022)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2022.amta-upg.27.pdf