Shufan Jiang
2025
MathD2: Towards Disambiguation of Mathematical Terms
Shufan Jiang | Mary Ann Tan | Harald Sack
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)
In mathematical literature, terms can have multiple meanings depending on context. Manual disambiguation across scholarly articles demands considerable effort from mathematicians. This paper addresses the challenge of automatically determining whether two definitions of a mathematical term are semantically different. Specifically, it investigates the difficulties of this task and how contextualized textual representations can help resolve them. A new dataset, MathD2, for mathematical term disambiguation is constructed from ProofWiki’s disambiguation pages. Three approaches based on contextualized textual representations are then studied: (1) supervised classification based on the embedding of the concatenated definition and title; (2) zero-shot prediction based on semantic textual similarity (STS) between definition and title; and (3) zero-shot LLM prompting. The first two approaches achieve accuracy greater than 0.9 on the ground truth dataset, demonstrating the effectiveness of our methods for the automatic disambiguation of mathematical definitions. Our dataset and source code are available at https://github.com/sufianj/MathTermDisambiguation.
2024
How to Turn Card Catalogs into LLM Fodder
Mary Ann Tan | Shufan Jiang | Harald Sack
Proceedings of the Workshop on Deep Learning and Linked Data (DLnLD) @ LREC-COLING 2024
Bibliographical metadata collections describing pre-modern objects suffer from incompleteness and inaccuracies, which hampers the identification of literary works. In addition, titles often contain voluminous descriptive texts that do not adhere to contemporary title conventions. This paper explores several NLP approaches that leverage the greater textual length of these titles to enhance descriptive information.
2023
Extracting Definienda in Mathematical Scholarly Articles with Transformers
Shufan Jiang | Pierre Senellart
Proceedings of the Second Workshop on Information Extraction from Scientific Publications