Chen Hu
2026
CoPA: Benchmarking Personalized Question Answering with Data-Informed Cognitive Factors
Hang Su | Zequn Liu | Chen Hu | Xuesong Lu | Yingce Xia | Liu Zhen
Findings of the Association for Computational Linguistics: ACL 2026
While LLMs have demonstrated remarkable potential in Question Answering (QA), evaluating personalization remains a critical bottleneck. Existing paradigms predominantly rely on surface-level similarity or manual heuristics, often lacking sufficient data-driven validation. We address this by mining Community-Individual Preference Divergence (CIPD)—where individual choices override consensus—to distill six key personalization factors as evaluative dimensions. Accordingly, we introduce CoPA, a benchmark with 1,985 user profiles for fine-grained, factor-level assessment. By quantifying the alignment between model outputs and user-specific cognitive preferences inferred from interaction patterns, CoPA provides a more comprehensive and discriminative standard for evaluating personalized QA than generic metrics. The code is available at https://github.com/bjzgcai/CoPA.
Revisiting Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
Xiaoyun Zhang | Xiaojian Yuan | Di Huang | Wang You | Chen Hu | Jingqing Ruan | Kejiang Chen | Xing Hu
Findings of the Association for Computational Linguistics: ACL 2026
Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER), a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability. Code is available at https://anonymous.4open.science/r/AER-ACL.
2025
Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning
Zhaohui Yang | Yuxiao Ye | Shilei Jiang | Shihong Deng | Chen Hu | Linjing Li | Daxin Jiang
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advances in reasoning language models have witnessed a paradigm shift from short to long chain-of-thought (CoT) patterns. Given the substantial computational cost of rollouts in long CoT models, maximizing the utility of fixed training datasets becomes crucial. Our analysis reveals that negative responses contain valuable components such as self-reflection and error-correction steps, yet the main existing methods either completely discard negative samples (RFT) or apply equal penalization across all tokens (RL), failing to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judges, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset, achieving improved sample efficiency and demonstrating robustness and scalability when extended to multiple iterations.
2024
Predicting Entity Salience in Extremely Short Documents
Benjamin Bullough | Harrison Lundberg | Chen Hu | Weihang Xiao
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
A frequent challenge in applications that use entities extracted from text documents is selecting the most salient entities when only a small number can be used by the application (e.g., displayed to a user). Solving this challenge is particularly difficult in the setting of extremely short documents, such as the response from a digital assistant, where traditional signals of salience such as position and frequency are less likely to be useful. In this paper, we propose a lightweight and data-efficient approach for entity salience detection on short text documents. Our experiments show that our approach achieves competitive performance with respect to complex state-of-the-art models, such as GPT-4, at a significant advantage in latency and cost. In limited data settings, we show that a semi-supervised fine-tuning process can improve performance further. Furthermore, we introduce a novel human-labeled dataset for evaluating entity salience on short question-answer pair documents.