Abstract
Natural language models often fall short when understanding and generating mathematical notation. What is not clear is whether these shortcomings are due to fundamental limitations of the models or to the absence of appropriate tasks. In this paper, we explore the extent to which natural language models can learn the semantics connecting mathematical notation and its surrounding text. We propose two notation prediction tasks, and train a model that selectively masks notation tokens and encodes left and/or right sentences as context. Compared to baseline models trained by masked language modeling, our method achieved significantly better performance on the two tasks, showing that this approach is a good first step towards modeling mathematical texts. However, the current models rarely predict unseen symbols correctly, and token-level predictions are more accurate than symbol-level predictions, indicating that more work is needed to represent structural patterns. Based on the results, we suggest directions for future work on modeling mathematical texts.
- Anthology ID:
- 2021.findings-emnlp.266
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2021
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- Findings
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3102–3115
- URL:
- https://aclanthology.org/2021.findings-emnlp.266
- DOI:
- 10.18653/v1/2021.findings-emnlp.266
- Cite (ACL):
- Hwiyeol Jo, Dongyeop Kang, Andrew Head, and Marti A. Hearst. 2021. Modeling Mathematical Notation Semantics in Academic Papers. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3102–3115, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Modeling Mathematical Notation Semantics in Academic Papers (Jo et al., Findings 2021)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2021.findings-emnlp.266.pdf
- Data
- S2ORC