Won Gyu Kim

2026

Accurate labeling of relevance between biomedical abstracts is essential for improving information retrieval, semantic similarity modeling, training of ranking systems, and other natural language processing tasks. However, manual annotation is time-consuming, labor-intensive, and costly. Studies show that large language models (LLMs) can facilitate automated annotation, but their performance still falls short of human expert-level accuracy, especially in domain-specific tasks. It has been shown that combining annotations from multiple non-expert annotators can achieve performance comparable to, or even exceeding, that of trained experts. Based on this evidence, we treat AI-generated annotations as contributions from non-expert annotators and combine them using a Learning to Rank framework. Our results show significant improvement in overall annotation quality. The proposed method shows promise for reducing reliance on human annotation while maintaining reliable performance for large-scale biomedical applications.

2021

Measuring the relative importance of full text sections for information retrieval from scientific literature.
Lana Yeganova | Won Gyu Kim | Donald Comeau | W John Wilbur | Zhiyong Lu
Proceedings of the 20th Workshop on Biomedical Language Processing

With the growing availability of full-text articles, integrating abstracts and full texts of documents into a unified representation is essential for comprehensive search of scientific literature. However, previous studies have shown that naïvely merging abstracts with full texts of articles does not consistently yield better performance. Balancing the contribution of query terms appearing in the abstract and in sections of different importance in full-text articles remains a challenge both for traditional bag-of-words IR approaches and for neural retrieval methods. In this work we establish the connection between the BM25 score of a query term appearing in a section of a full-text document and the probability of that document being clicked or identified as relevant. The probability is computed using Pool Adjacent Violators (PAV), an isotonic regression algorithm, which provides a maximum likelihood estimate based on the observed data. Using this probabilistic transformation of BM25 scores, we show improved performance on the PubMed Click dataset developed and presented in this study, as well as on the 2007 TREC Genomics collection.
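As a rough illustration of the PAV step mentioned in the abstract, the sketch below fits a non-decreasing click probability as a function of score. It is not the authors' implementation: the function names `pav_fit` and `pav_predict`, the toy scores, and the binary click labels are all assumptions made for this example.

```python
def pav_fit(scores, clicks):
    """Pool Adjacent Violators: non-decreasing fit of click rate against score.

    Returns a list of (upper_score, probability) steps.
    Hypothetical helper, not taken from the paper.
    """
    pairs = sorted(zip(scores, clicks))
    # Each block stores [sum of labels, number of points, highest score in block].
    blocks = []
    for score, label in pairs:
        blocks.append([float(label), 1, score])
        # Pool with the previous block when scores tie or monotonicity is violated.
        while len(blocks) > 1 and (
            blocks[-2][2] == blocks[-1][2]
            or blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    return [(top, total / count) for total, count, top in blocks]


def pav_predict(fit, score):
    """Look up the fitted probability for a score (step function)."""
    for top, prob in fit:
        if score <= top:
            return prob
    return fit[-1][1]  # scores above the observed range get the last step


# Toy data (assumed): BM25-like scores and whether the document was clicked.
toy_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_clicks = [0, 1, 0, 1, 1]
toy_fit = pav_fit(toy_scores, toy_clicks)
```

On this toy input the out-of-order pair at scores 2.0 and 3.0 is pooled into a single step, so the fitted probabilities never decrease as the score grows.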

2018

MeSH-based dataset for measuring the relevance of text retrieval
Won Gyu Kim | Lana Yeganova | Donald Comeau | W John Wilbur | Zhiyong Lu
Proceedings of the BioNLP 2018 workshop

Creating simulated search environments has been of significant interest in information retrieval, in both general and biomedical search domains. Existing collections include a modest number of queries and are constructed by manually evaluating retrieval results. In this work we propose leveraging MeSH term assignments for creating synthetic test beds. We select a suitable subset of MeSH terms as queries, and utilize MeSH term assignments as pseudo-relevance rankings for retrieval evaluation. Using well-studied retrieval functions, we show that their performance on the proposed data is consistent with similar findings in previous work. We further use the proposed retrieval evaluation framework to better understand how to combine heterogeneous sources of textual information.
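The idea of using MeSH assignments as pseudo-relevance labels can be sketched as follows. This is only an assumed illustration: the document IDs, the MeSH assignments, the ranking, and the choice of average precision as the metric are all hypothetical and are not stated in the abstract.

```python
def pseudo_relevant(term, assignments):
    """Documents whose MeSH assignments include the query term."""
    return {doc_id for doc_id, terms in assignments.items() if term in terms}


def average_precision(ranked_ids, relevant_ids):
    """Average precision of a ranking against a set of relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant_ids) if relevant_ids else 0.0


# Hypothetical toy collection: document ID -> assigned MeSH terms.
mesh_assignments = {
    "d1": {"Neoplasms"},
    "d2": {"Asthma"},
    "d3": {"Neoplasms", "Asthma"},
    "d4": {"Diabetes Mellitus"},
}

# Hypothetical output of some retrieval function for the query "Neoplasms".
ranking = ["d3", "d2", "d1", "d4"]

relevant = pseudo_relevant("Neoplasms", mesh_assignments)
ap = average_precision(ranking, relevant)
```

Because the labels come from existing MeSH indexing, a test bed like this can cover many queries without manual relevance judgments, which is the motivation the abstract gives.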

Venues

BioNLP (3)
WS (1)
