Shengjie Ma
2026
JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation
Zhichao Shi | Xuhui Jiang | Chengjin Xu | Cangli Yao | Shengjie Ma | Yinghan Shen | Zixuan Li | Jian Guo | Yuanzhuo Wang
Findings of the Association for Computational Linguistics: ACL 2026
Current evaluation methods for large language models (LLMs) primarily rely on static benchmarks, presenting two major challenges: limited knowledge coverage and fixed difficulty that mismatches the evaluated LLMs. These limitations lead to superficial assessments of LLM knowledge, thereby impeding targeted model optimization. To bridge this gap, we propose JudgeAgent, a knowledge-driven and dynamic evaluation framework for LLMs. To address the challenge of limited knowledge coverage, JudgeAgent leverages LLM agents equipped with context graphs to traverse knowledge structures systematically for question generation. Furthermore, to mitigate data contamination and difficulty mismatch, it adopts a difficulty-adaptive and multi-turn interview mechanism. JudgeAgent can thereby achieve comprehensive evaluations and facilitate more effective improvement of LLMs. Empirical results demonstrate that JudgeAgent enables more comprehensive evaluations and facilitates effective model iterations, highlighting the potential of this knowledge-driven and dynamic evaluation paradigm. The source code is available at https://github.com/DataArcTech/JudgeAgent.
Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models
Teng Wang | Jiang Zhangyi | Zhenqi He | Hailei Gong | Shenyang Tong | Wenhan Yang | Zeyu Li | Yanan Zheng | Zifan He | Zewen Ye | Shengjie Ma | Jianping Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) have demonstrated strong mathematical reasoning abilities through supervised fine-tuning and reinforcement learning. However, existing Process Reward Models (PRMs) are vulnerable to reward hacking and require expensive, large-scale annotation of reasoning steps, limiting their reliability and scalability. To address the first problem, we propose a novel reward model approach, Hierarchical Reward Model (HRM), which evaluates both individual and consecutive reasoning steps at fine-grained and coarse-grained levels. HRM excels at assessing multi-step mathematical reasoning coherence, particularly in cases where a flawed step is later corrected through self-reflection. Furthermore, to address the inefficiency of autonomously annotating PRM training data via Monte Carlo Tree Search (MCTS), we propose a lightweight data augmentation strategy, Hierarchical Node Compression (HNC), which merges consecutive reasoning steps within the tree structure. Applying HNC to MCTS-generated reasoning trajectories increases the diversity and robustness of HRM training data, while introducing controlled noise with minimal computational overhead. Empirical results on the PRM800K dataset demonstrate that HRM, in conjunction with HNC, achieves superior stability and reliability in evaluation compared to PRM. Furthermore, cross-domain evaluations on the MATH500 and GSM8K datasets confirm HRM’s superior generalization and robustness across diverse mathematical reasoning tasks.
2025
LongFaith: Enhancing Long-Context Reasoning in LLMs with Faithful Synthetic Data
Cehao Yang | Xueyuan Lin | Chengjin Xu | Xuhui Jiang | Shengjie Ma | Aofan Liu | Hui Xiong | Jian Guo
Findings of the Association for Computational Linguistics: ACL 2025
Despite the growing development of long-context large language models (LLMs), data-centric approaches relying on synthetic data have been hindered by issues related to faithfulness, which limit their effectiveness in enhancing model performance on tasks such as long-context reasoning and question answering (QA). These challenges are often exacerbated by misinformation caused by a lack of verification, reasoning without attribution, and potential knowledge conflicts. We propose LongFaith, a novel pipeline for synthesizing faithful long-context reasoning instruction datasets. By integrating ground truth and citation-based reasoning prompts, we eliminate distractions and improve the accuracy of reasoning chains, thus mitigating the need for costly verification processes. We open-source two synthesized datasets, LongFaith-SFT and LongFaith-PO, which systematically address multiple dimensions of faithfulness, including verified reasoning, attribution, and contextual grounding. Extensive experiments on multi-hop reasoning datasets and LongBench demonstrate that models fine-tuned on these datasets achieve significantly better performance. Our ablation studies highlight the scalability and adaptability of the LongFaith pipeline, showcasing its broad applicability in developing long-context LLMs.
STAND-Guard: A Small Task-Adaptive Content Moderation Model
Minjia Wang | Pingping Lin | Siqi Cai | Shengnan An | Shengjie Ma | Zeqi Lin | Congrui Huang | Bixiong Xu
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
Content moderation, the process of reviewing and monitoring the safety of generated content, is important for the development of welcoming online platforms and responsible large language models. Content moderation comprises various tasks, each with unique requirements tailored to specific scenarios. Therefore, it is crucial to develop a model that can be adapted easily and accurately to novel or customized content moderation tasks without extensive model tuning. This paper presents STAND-Guard, a Small Task-Adaptive coNtent moDeration model. The basic motivation is that, by performing instruction tuning on various content moderation tasks, we can unleash the power of small language models (SLMs) on unseen (out-of-distribution) content moderation tasks. We also carefully study the effects of training tasks and model size on the efficacy of the cross-task fine-tuning mechanism. Experiments demonstrate that STAND-Guard is comparable to GPT-3.5-Turbo across over 40 public datasets, as well as proprietary datasets derived from real-world business scenarios. Remarkably, STAND-Guard achieved nearly equivalent results to GPT-4-Turbo on unseen English binary classification tasks.
Co-authors
- Jian Guo 2
- Xuhui Jiang 2
- Chengjin Xu 2
- Shengnan An 1
- Siqi Cai 1
- Hailei Gong 1
- Zhenqi He 1
- Zifan He 1
- Congrui Huang 1
- Zixuan Li 1
- Zeyu Li 1
- Xueyuan Lin 1
- Pingping Lin 1
- Zeqi Lin 1
- Aofan Liu 1
- Yinghan Shen 1
- Zhichao Shi 1
- Shenyang Tong 1
- Yuanzhuo Wang 1
- Teng Wang 1
- Minjia Wang 1
- Hui Xiong 1
- Bixiong Xu 1
- Cehao Yang 1
- Wenhan Yang 1
- Cangli Yao 1
- Zewen Ye 1
- Jianping Zhang 1
- Jiang Zhangyi 1
- Yanan Zheng 1