Jiaxu Yan



2025

Improve LLM-as-a-Judge Ability as a General Ability
Jiachen Yu | Shaoning Sun | Xiaohui Hu | Jiaxu Yan | Kaidong Yu | Xuelong Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

LLM-as-a-Judge leverages the generative and reasoning capabilities of large language models (LLMs) to evaluate LLM responses across diverse scenarios, providing accurate preference signals. This approach plays a vital role in aligning LLMs with human values. Recent studies have proposed many methods for training LLMs as generative judges, but most of them are data-intensive or lack accuracy, and they focus only on the LLM's judging ability. In this work, we conceptualize judging ability as a general capability of LLMs and adapt the two-stage SFT-DPO training framework, commonly used in traditional general model training, to the development of judge models. We introduce an efficient data synthesis method that includes the automatic generation of diverse judge templates and dual verification for data accuracy and consistency. A difficulty-based data stratification strategy allows us to distribute more effective data to the SFT and DPO stages, respectively. Experimental results demonstrate that our approach, using only about 2% to 40% of the data required by other methods, achieves SOTA performance on RewardBench. Furthermore, our training method enhances the general capabilities of the model by constructing complex judge tasks with CoT outputs. We further validate the effectiveness of our model by deploying it to provide reward signals in a real-world RLHF scenario. We will open-source our model weights and training data to facilitate further research.
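
The abstract's difficulty-based stratification can be pictured with a minimal sketch. This is not the authors' released code; it assumes a hypothetical convention in which each synthesized judge example carries a difficulty score in [0, 1] (e.g., the fraction of sampled model judgments that disagree with the dual-verified label), with easier examples routed to SFT and harder ones turned into DPO preference pairs. All names (`JudgeExample`, `stratify`, `sft_threshold`) are illustrative.

```python
# Hypothetical sketch of difficulty-based data stratification for a
# two-stage SFT-DPO judge-training pipeline (assumed convention:
# difficulty in [0, 1], 0.0 = trivial, 1.0 = hardest).
from dataclasses import dataclass

@dataclass
class JudgeExample:
    prompt: str        # judge template filled with the responses to compare
    chosen: str        # verified (correct) CoT judgment
    rejected: str      # an incorrect judgment sampled from the model
    difficulty: float  # assumed difficulty score in [0, 1]

def stratify(examples, sft_threshold=0.5):
    """Split examples into SFT data (prompt + correct judgment) and
    DPO data (prompt + chosen/rejected pair) by difficulty."""
    sft_data, dpo_data = [], []
    for ex in examples:
        if ex.difficulty < sft_threshold:
            # Easy cases teach the judge format and basic reasoning.
            sft_data.append({"prompt": ex.prompt, "completion": ex.chosen})
        else:
            # Hard cases provide informative preference contrasts for DPO.
            dpo_data.append({"prompt": ex.prompt,
                             "chosen": ex.chosen,
                             "rejected": ex.rejected})
    return sft_data, dpo_data

if __name__ == "__main__":
    pool = [
        JudgeExample("Compare A vs B ...", "A is better because ...",
                     "B is better because ...", difficulty=0.2),
        JudgeExample("Compare C vs D ...", "D is better because ...",
                     "C is better because ...", difficulty=0.8),
    ]
    sft, dpo = stratify(pool)
    print(len(sft), "SFT examples;", len(dpo), "DPO pairs")
```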