Textual and Visual Characteristics of Mathematical Expressions in Scholar Documents

Vidas Daudaravicius


Abstract
Mathematical expressions (MEs) are widely used in scholar documents. In this paper we analyze the textual and visual characteristics of MEs for the image-to-LaTeX translation task. While there are open datasets of LaTeX files that include MEs, it is very complicated to extract these MEs from a document and to compile a list of them. Therefore, we release a corpus of open-access scholar documents with parallel PDF and JATS-XML files. The MEs in these documents are LaTeX-encoded and document-independent. The data contains more than 1.2 million distinct annotated formulae and more than 80 million raw tokens of LaTeX MEs in more than 8 thousand documents. Although the variety of textual lengths and visual sizes of MEs is not well defined, we found that the task of analyzing MEs in scholar documents can be reduced to a subtask with particular bounds on text length, image width, and image height, and that display MEs can be processed as arrays of partial MEs.
Anthology ID:
W19-2610
Volume:
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Venues:
NAACL | WS
Publisher:
Association for Computational Linguistics
Pages:
72–81
URL:
https://aclanthology.org/W19-2610
DOI:
10.18653/v1/W19-2610
Cite (ACL):
Vidas Daudaravicius. 2019. Textual and Visual Characteristics of Mathematical Expressions in Scholar Documents. In Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications, pages 72–81, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
Textual and Visual Characteristics of Mathematical Expressions in Scholar Documents (Daudaravicius, 2019)
PDF:
https://preview.aclanthology.org/update-css-js/W19-2610.pdf