Formula-Text Cross-Retrieval: A Benchmarking Study of Dense Embedding Methods for Mathematical Information Retrieval

Zichao Li


Abstract
Mathematical information retrieval requires understanding the complex relationship between natural language and formulae. This paper presents a benchmarking study on Formula-Text Cross-Retrieval, comparing a sparse baseline (BM25), off-the-shelf dense embeddings (OpenAI, BGE), and a fine-tuned dual-encoder model. Our model, trained with a contrastive objective on the ARQAR dataset, significantly outperforms all baselines, achieving state-of-the-art results. Ablation studies confirm the importance of linearization, a shared-weight architecture, and the Multiple Negatives Ranking loss. The work provides a strong foundation for mathematical NLP applications.
Anthology ID:
2025.mathnlp-main.9
Volume:
Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025)
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Marco Valentino, Deborah Ferreira, Mokanarangan Thayaparan, Leonardo Ranaldi, Andre Freitas
Venues:
MathNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
124–133
Language:
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.mathnlp-main.9/
DOI:
Bibkey:
Cite (ACL):
Zichao Li. 2025. Formula-Text Cross-Retrieval: A Benchmarking Study of Dense Embedding Methods for Mathematical Information Retrieval. In Proceedings of The 3rd Workshop on Mathematical Natural Language Processing (MathNLP 2025), pages 124–133, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Formula-Text Cross-Retrieval: A Benchmarking Study of Dense Embedding Methods for Mathematical Information Retrieval (Li, MathNLP 2025)
Copy Citation:
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.mathnlp-main.9.pdf