Mengyi Yan


Fixing paper assignments

  1. Please select all papers that belong to the same person.
  2. Indicate below which author they should be assigned to.
Provide a valid ORCID iD here. This will be used to match future papers to this author.
Provide the name of the school or the university where the author has received or will receive their highest degree (e.g., Ph.D. institution for researchers, or current affiliation for students). This will be used to form the new author page ID, if needed.

TODO: "submit" and "cancel" buttons here


2025

pdf bib
PUER: Boosting Few-shot Positive-Unlabeled Entity Resolution with Reinforcement Learning
Yaoshu Wang | Mengyi Yan | Wei Wang
Findings of the Association for Computational Linguistics: EMNLP 2025

Entity resolution is a fundamental problem in data management that aims to identify all duplicate entries within collections of multi-attribute tuples. Most existing works focus on supervised learning, relying on large amounts of high-quality labeled data, including both positive and negative tuple pairs that are meticulously prepared. However, in reality, the manual annotation process is labor-intensive; in particular, selecting high-quality negative data for labeling is both important and challenging. In this paper, we propose an end-to-end ER solution, PUER, to address low-resource entity resolution (ER) by leveraging Large Language Models (LLMs) in a Positive-Unlabeled (PU) learning setting, where only a small number of positively labeled examples, e.g., 50, and unlabeled data are provided. Unlike directly fine-tuning LLMs in a supervised manner, we solve the entity matching task using reinforcement learning and propose a self-adaptive reward function in the process of RL. To enhance performance, we design an iterative workflow based on the co-training mechanism that fully utilizes entity blocking component to assist the entity matching. This workflow aims to improve the robustness and quality of pseudo-labels so that the performance of entity matching improves. Comprehensive experimental results on various benchmark datasets demonstrate the superiority of PUER. Full version and code are available.