Jie Sun
2026
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
Yukang Feng | Jianwen Sun | Zelai Yang | Jiaxin Ai | Chuanhao Li | Zizhen Li | Fanrui Zhang | Kang He | Rui Ma | Jifan Lin | Jie Sun | Yang Xiao | Sizhuo Zhou | Wenxiao Wu | Yiming Liu | Pengfei Liu | Shenglin Zhang | Kaipeng Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and therefore fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic, sequential engineering tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. LongCLI-Bench employs a dual-set testing protocol, which measures requirement fulfillment (fail→pass) and regression avoidance (pass→pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents’ planning and execution capabilities to overcome key challenges in long-horizon task performance.
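The dual-set protocol and step-level scoring described in the abstract above can be illustrated with a minimal sketch. It is not the benchmark's actual harness: the `TaskResult` fields, the `score` function, and the rule that a task passes only when every test in both sets passes are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record of one agent run on one task (field names are assumptions).
    fail_to_pass: list[bool]  # tests that failed before the change; True = now passing
    pass_to_pass: list[bool]  # tests that passed before the change; True = still passing
    steps_completed: int      # steps the agent finished successfully
    steps_total: int          # steps defined for the task (assumed > 0)


def rate(outcomes: list[bool]) -> float:
    """Fraction of passing tests; an empty set counts as fully satisfied."""
    return sum(outcomes) / len(outcomes) if outcomes else 1.0


def score(result: TaskResult) -> dict:
    """Summarize requirement fulfillment, regression avoidance, and progress."""
    f2p = rate(result.fail_to_pass)  # requirement fulfillment
    p2p = rate(result.pass_to_pass)  # regression avoidance
    return {
        "fail_to_pass": f2p,
        "pass_to_pass": p2p,
        "step_score": result.steps_completed / result.steps_total,
        # Assumed pass rule: all tests in both sets must pass.
        "passed": f2p == 1.0 and p2p == 1.0,
    }


if __name__ == "__main__":
    example = TaskResult([True, False], [True, True], steps_completed=3, steps_total=10)
    print(score(example))
```

Reporting the two rates separately distinguishes an agent that implements a requirement but breaks existing behavior from one that leaves the code base intact but never finishes the feature.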
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
Keyu Li | Junhao Shi | Yang Xiao | Mohan Jiang | Jie Sun | Yunze Wu | Dayuan Fu | Shijie Xia | Xiaojie Cai | Tianze Xu | Weiye Si | Wenjie Li | Dequan Wang | Pengfei Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture long-horizon real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences.
SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs
Jie Sun | Yu Liu | Lu Han | Qiwen Deng | Xiang Shu | Yang Xiao | Lintao Ma | Xingyu Lu | Jun Zhou | Pengfei Liu | Jiancan Wu | Xiang Wang
Findings of the Association for Computational Linguistics: ACL 2026
While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to attention dispersion in the Softmax mechanism, which prevents the model from concentrating attention. To overcome this, we propose Separate Sequence (SepSeq), a training-free, plug-and-play framework that mitigates dispersion by strategically inserting separator tokens. Mechanistically, we demonstrate that separator tokens act as attention anchors, recalibrating attention to focus on local segments while preserving global context. Extensive evaluations on 9 widely adopted LLMs confirm the effectiveness of our approach: SepSeq yields an average relative accuracy improvement of 35.6% across diverse domains while reducing inference token consumption by 16.4%.
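As a rough illustration of the separator idea in the abstract above, the sketch below splits a numerical sequence into fixed-length segments and joins them with a separator string before the text is placed in a prompt. The abstract does not state which separator token SepSeq uses or how placement is chosen, so the function name `insert_separators`, the fixed `segment_len`, and the `" | "` separator are all assumptions.

```python
from typing import Sequence


def insert_separators(
    values: Sequence[float],
    segment_len: int = 8,   # assumed fixed segment size
    sep: str = " | ",       # assumed separator string
    delim: str = ", ",      # delimiter between numbers inside a segment
) -> str:
    """Render a numeric sequence as text with a separator between segments."""
    if segment_len < 1:
        raise ValueError("segment_len must be at least 1")
    # Cut the sequence into consecutive chunks of at most segment_len items.
    segments = [values[i:i + segment_len] for i in range(0, len(values), segment_len)]
    # Join numbers within each chunk, then join the chunks with the separator.
    return sep.join(delim.join(str(v) for v in seg) for seg in segments)


if __name__ == "__main__":
    print(insert_separators([1, 2, 3, 4, 5], segment_len=2))
```

Because the transformation only edits the prompt text, it needs no change to model weights, which is consistent with the training-free, plug-and-play framing in the abstract.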
2025
Robust Preference Optimization via Dynamic Target Margins
Jie Sun | Junkang Wu | Jiancan Wu | Zhibo Zhu | Xingyu Lu | Jun Zhou | Lintao Ma | Xiang Wang
Findings of the Association for Computational Linguistics: ACL 2025
The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on data quality, which is frequently compromised by noise. In this work, we propose 𝛾-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, 𝛾-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, 𝛾-PO is a plug-and-play method, compatible with variants of DPO that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, 𝛾-PO achieves an average 4.4% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, 𝛾-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our code is available at https://github.com/sunjie279/gammaPO.
LaMP-Val: Large Language Models Empower Personalized Valuation in Auction
Jie Sun | Tianyu Zhang | Houcheng Jiang | Kexin Huang | Xiang Shu | Zhibo Zhu | Lintao Ma | Xingyu Lu | Jun Zhou | Junkang Wu | Chi Luo | An Zhang | Jiancan Wu | Xiang Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Auctions are a vital economic mechanism used to determine the market value of goods or services through competitive bidding within a specific framework. However, much of the current research focuses primarily on the bidding algorithms used within auction mechanisms, often neglecting the potential benefits of incorporating individual users’ unique preferences into the valuation process. Our theoretical and empirical analysis demonstrates that valuation errors can significantly impact overall utility. To bridge this gap, we propose a personalized valuation framework, namely Large Language Models-powered Personalized Valuation (LaMP-Val), which integrates Large Language Models to incorporate personalized semantic preferences into users’ valuation process. LaMP-Val integrates three components: data, learning, and evaluation. The data component tackles the challenge of building a novel dataset specifically for LLM fine-tuning in personalized valuation modeling. The learning component introduces a diversity template to enhance LLMs’ capacity for modeling fine-grained personal valuation patterns. The evaluation component establishes a closed-loop system in which LLM-generated valuations interact with bidding strategies and the auction, and it proposes two novel metrics to quantify valuation precision and bidding intention accuracy in personalized scenarios. Extensive experiments show that LaMP-Val more accurately captures personalized values and achieves greater profits than baseline approaches.
Co-authors
- Pengfei Liu 3
- Xingyu Lu 3
- Lintao Ma 3
- Xiang Wang 3
- Jiancan Wu 3
- Yang Xiao 3
- Jun Zhou 3
- Xiang Shu 2
- Junkang Wu 2
- Zhibo Zhu 2
- Jiaxin Ai 1
- Xiaojie Cai 1
- Qiwen Deng 1
- Yukang Feng 1
- Dayuan Fu 1
- Lu Han 1
- Kang He 1
- Kexin Huang 1
- Mohan Jiang 1
- Houcheng Jiang 1
- Chuanhao Li 1
- Zizhen Li 1
- Keyu Li 1
- Wenjie Li 1
- Jifan Lin 1
- Yiming Liu 1
- Yu Liu 1
- Chi Luo 1
- Rui Ma 1
- Junhao Shi 1
- Weiye Si 1
- Jianwen Sun 1
- Dequan Wang 1
- Wenxiao Wu 1
- Yunze Wu 1
- Shijie Xia 1
- Tianze Xu 1
- Zelai Yang 1
- Fanrui Zhang 1
- Shenglin Zhang 1
- Kaipeng Zhang 1
- Tianyu Zhang 1
- An Zhang 1
- Sizhuo Zhou 1