Harald Sack


2025

Proceedings of the Workshop on Generative AI and Knowledge Graphs (GenAIK)
Genet Asefa Gesese | Harald Sack | Heiko Paulheim | Albert Merono-Penuela | Lihu Chen
Proceedings of the Workshop on Generative AI and Knowledge Graphs (GenAIK)

MathD2: Towards Disambiguation of Mathematical Terms
Shufan Jiang | Mary Ann Tan | Harald Sack
Proceedings of the Fifth Workshop on Scholarly Document Processing (SDP 2025)

In mathematical literature, a term can have multiple meanings depending on context. Manual disambiguation across scholarly articles demands massive effort from mathematicians. This paper addresses the challenge of automatically determining whether two definitions of a mathematical term are semantically different. Specifically, it investigates the difficulties of the task and how contextualized textual representations can help resolve it. A new dataset, MathD2, for mathematical term disambiguation is constructed from ProofWiki’s disambiguation pages. Three approaches based on contextualized textual representations are then studied: (1) supervised classification based on the embedding of the concatenated definition and title; (2) zero-shot prediction based on semantic textual similarity (STS) between definition and title; and (3) zero-shot LLM prompting. The first two approaches achieve accuracy greater than 0.9 on the ground truth dataset, demonstrating the effectiveness of our methods for the automatic disambiguation of mathematical definitions. Our dataset and source code are available here: https://github.com/sufianj/MathTermDisambiguation.

2024

How to Turn Card Catalogs into LLM Fodder
Mary Ann Tan | Shufan Jiang | Harald Sack
Proceedings of the Workshop on Deep Learning and Linked Data (DLnLD) @ LREC-COLING 2024

Bibliographical metadata collections describing pre-modern objects suffer from incompleteness and inaccuracies. This hampers the identification of literary works. In addition, titles often contain voluminous descriptive texts that do not adhere to contemporary title conventions. This paper explores several NLP approaches where greater textual length in titles is leveraged to enhance descriptive information.

2016

Crowdsourced Corpus with Entity Salience Annotations
Milan Dojchinovski | Dinesh Reddy | Tomáš Kliegr | Tomáš Vitvar | Harald Sack
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we present a crowdsourced dataset that adds entity salience (importance) annotations to the Reuters-128 dataset, which is a subset of Reuters-21578. The dataset is distributed under a free license and published in the NLP Interchange Format, which fosters interoperability and re-use. We show the potential of the dataset on the task of learning an entity salience classifier and report the results of several experiments.