Pengfei Liu
2026
SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs
Jie Sun | Yu Liu | Lu Han | Qiwen Deng | Xiang Shu | Yang Xiao | Lintao Ma | Xingyu Lu | Jun Zhou | Pengfei Liu | Jiancan Wu | Xiang Wang
Findings of the Association for Computational Linguistics: ACL 2026
While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to attention dispersion in the Softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework that mitigates dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as attention anchors, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on 9 widely adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing inference token consumption by 16.4%.
DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models
Yakun Zhu | Zhongzhen Huang | Linjie Mu | Yutong Huang | Wei Nie | Jiaji Liu | Shaoting Zhang | Pengfei Liu | Xiaofan Zhang
Findings of the Association for Computational Linguistics: ACL 2026
The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to benchmark the diagnostic capabilities of current models systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties and derived from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline, involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3-mini, o1, and DeepSeek-R1, achieve only 45.82%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI’s diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We openly share the benchmark and evaluation tools for further research and development.
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
Yukang Feng | Jianwen Sun | Zelai Yang | Jiaxin Ai | Chuanhao Li | Zizhen Li | Fanrui Zhang | Kang He | Rui Ma | Jifan Lin | Jie Sun | Yang Xiao | Sizhuo Zhou | Wenxiao Wu | Yiming Liu | Pengfei Liu | Shenglin Zhang | Kaipeng Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic, sequential engineering tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. LongCLI-Bench employs a dual-set testing protocol, which measures requirement fulfillment (fail→pass) and regression avoidance (pass→pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% in LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents’ planning and execution capabilities to overcome key challenges in long-horizon task performance.
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
Keyu Li | Junhao Shi | Yang Xiao | Mohan Jiang | Jie Sun | Yunze Wu | Dayuan Fu | Shijie Xia | Xiaojie Cai | Tianze Xu | Weiye Si | Wenjie Li | Dequan Wang | Pengfei Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences.
SciPedia: Unlocking the Value of Scientific Data for Pre-training
Yiwei Qin | Zhen Huang | Tiantian Mi | Weiye Si | Qipeng Guo | Siyuan Feng | Pengfei Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
High-quality scientific data is critical for advancing LLMs, yet academic literature remains largely underutilized. This work addresses the fundamental question: How can we systematically unlock scientific data’s value for pre-training? First, we construct a large-scale raw scientific corpus but identify a critical Learnability Gap, revealing that direct pre-training yields negligible gains. To bridge this, we develop a multi-stage pipeline featuring content cleaning and pedagogical augmentation, resulting in SciPedia, a 900B-token corpus. Finally, we establish a controlled verification framework: we develop the SciPedia-Eval benchmark and conduct 600B tokens of continued pre-training (CPT) starting from transparent base models (3B/7B) trained from scratch. Compared to a CPT baseline trained with general-purpose data, our approach with SciPedia data boosts average performance by +2.12 (3B) and +2.95 (7B), reaching +5.60 and +8.40 on in-domain tasks. This setup further allows us to derive empirical guidelines for data composition and model configurations.
Co-authors
- Jie Sun 3
- Yang Xiao 3
- Weiye Si 2
- Jiaxin Ai 1
- Xiaojie Cai 1
- Qiwen Deng 1
- Siyuan Feng 1
- Yukang Feng 1
- Dayuan Fu 1
- Qipeng Guo 1
- Lu Han 1
- Kang He 1
- Yutong Huang 1
- Zhen Huang 1
- Zhongzhen Huang 1
- Mohan Jiang 1
- Chuanhao Li 1
- Keyu Li 1
- Wenjie Li 1
- Zizhen Li 1
- Jifan Lin 1
- Jiaji Liu 1
- Yiming Liu 1
- Yu Liu 1
- Xingyu Lu 1
- Lintao Ma 1
- Rui Ma 1
- Tiantian Mi 1
- Linjie Mu 1
- Wei Nie 1
- Yiwei Qin 1
- Junhao Shi 1
- Xiang Shu 1
- Jianwen Sun 1
- Dequan Wang 1
- Xiang Wang 1
- Jiancan Wu 1
- Wenxiao Wu 1
- Yunze Wu 1
- Shijie Xia 1
- Tianze Xu 1
- Zelai Yang 1
- Fanrui Zhang 1
- Kaipeng Zhang 1
- Shaoting Zhang 1
- Shenglin Zhang 1
- Xiaofan Zhang 1
- Jun Zhou 1
- Sizhuo Zhou 1
- Yakun Zhu 1