Yiming Liu
2026
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces
Yukang Feng | Jianwen Sun | Zelai Yang | Jiaxin Ai | Chuanhao Li | Zizhen Li | Fanrui Zhang | Kang He | Rui Ma | Jifan Lin | Jie Sun | Yang Xiao | Sizhuo Zhou | Wenxiao Wu | Yiming Liu | Pengfei Liu | Shenglin Zhang | Kaipeng Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and thus fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic, sequential engineering tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. LongCLI-Bench employs a dual-set testing protocol, which measures requirement fulfillment (fail→pass) and regression avoidance (pass→pass), and incorporates step-level scoring to pinpoint execution failures. Extensive experiments reveal that even state-of-the-art agents achieve pass rates below 20% on LongCLI-Bench. Step-level analysis further indicates that the majority of tasks stall at less than 30% completion, highlighting that critical failures often occur in the early stages. Although self-correction offers marginal gains, human-agent collaboration through plan injection and interactive guidance yields significantly higher improvements. These results highlight that future research must emphasize the development of synergistic human-agent workflows alongside advances in agents’ planning and execution capabilities to overcome key challenges in long-horizon task performance.
Uncertainty-Calibrated Elastic Alignment for Multimodal Sentiment Analysis with Missing Modalities
Kang He | Yuzhe Ding | Rao Fu | Yukang Feng | Kaipeng Zhang | Yiming Liu | Fei Li | Chong Teng | Donghong Ji
Findings of the Association for Computational Linguistics: ACL 2026
Multimodal sentiment analysis (MSA) in real-world scenarios is often challenged by dynamically missing modalities. Existing methods predominantly rely on deterministic imputation and rigid alignment, which compels the model to overfit noise in ambiguous regions while neglecting the decision shift induced by modality inertia. To address these issues, we propose a novel uncertainty-calibrated elastic alignment framework, termed EASE. Specifically, we employ probabilistic imputation to capture cross-modal ambiguity and leverage the estimated uncertainty to drive elastic alignment, thereby adaptively relaxing constraints in ambiguous regions to avoid rigid fitting. Meanwhile, we introduce cross-view predictive consistency constraints to unify discriminative logic across different modality views, stabilizing the decision boundary under modality degradation. Extensive experiments demonstrate that EASE consistently outperforms existing state-of-the-art baselines across multiple benchmarks, exhibiting exceptional robustness particularly under high missing-rate scenarios.
2025
OpenRLHF: A Ray-based Easy-to-use, Scalable and High-performance RLHF Framework
Jian Hu | Xibin Wu | Wei Shen | Jason Klein Liu | Weixun Wang | Songlin Jiang | Haoran Wang | Hao Chen | Bin Chen | Wenkai Fang | Xianyu | Yu Cao | Haotian Xu | Yiming Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values and further raise the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (long-CoT) tasks. However, existing RLHF (or RLVR) frameworks commonly face challenges such as inference bottlenecks and complexity barriers, restricting their accessibility for newcomers. To bridge this gap, we introduce OpenRLHF, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency with speedups ranging from 1.22× to 1.68× across different model sizes compared to state-of-the-art frameworks, while requiring significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.
Data or Language Supervision: What Makes CLIP Better than DINO?
Yiming Liu | Yuhui Zhang | Dhruba Ghosh | Ludwig Schmidt | Serena Yeung-Levy
Findings of the Association for Computational Linguistics: EMNLP 2025
CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP’s language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings—using the same architecture, dataset, and training configuration—achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.
LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models
Jiayi Gui | Yiming Liu | Jiale Cheng | Xiaotao Gu | Xiao Liu | Hongning Wang | Yuxiao Dong | Jie Tang | Minlie Huang
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities. Understanding and executing complex rules, along with multi-step planning, are fundamental to logical reasoning and critical for practical LLM agents and decision-making systems. However, evaluating LLMs as effective rule-based executors and planners remains underexplored. In this paper, we introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs. Unlike traditional benchmarks, LogicGame provides diverse games that contain a series of rules with an initial state, requiring models to comprehend and apply predefined regulations to solve problems. We create simulated scenarios in which models execute or plan operations to achieve specific outcomes. These game scenarios are specifically designed to distinguish logical reasoning from mere knowledge by relying exclusively on predefined rules. This separation allows for a pure assessment of rule-based reasoning capabilities. The evaluation considers not only final outcomes but also intermediate steps, providing a comprehensive assessment of model performance. Moreover, these intermediate steps are deterministic and can be automatically verified. LogicGame defines game scenarios with varying difficulty levels, from simple rule applications to complex reasoning chains, in order to offer a precise evaluation of model performance on rule understanding and multi-step execution. Utilizing LogicGame, we test various LLMs and identify notable shortcomings in their rule-based logical reasoning abilities.
NegVQA: Can Vision Language Models Understand Negation?
Yuhui Zhang | Yuchang Su | Yiming Liu | Serena Yeung-Levy
Findings of the Association for Computational Linguistics: ACL 2025
Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs’ negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.
Co-authors
- Yukang Feng 2
- Kang He 2
- Serena Yeung-Levy 2
- Kaipeng Zhang 2
- Yuhui Zhang 2
- Jiaxin Ai 1
- Yu Cao 1
- Hao Chen 1
- Bin Chen 1
- Jiale Cheng 1
- Yuzhe Ding 1
- Yuxiao Dong 1
- Wenkai Fang 1
- Rao Fu 1
- Dhruba Ghosh 1
- Xiaotao Gu 1
- Jiayi Gui 1
- Jian Hu 1
- Minlie Huang 1
- Donghong Ji 1
- Songlin Jiang 1
- Chuanhao Li 1
- Zizhen Li 1
- Fei Li 1
- Jifan Lin 1
- Pengfei Liu 1
- Jason Klein Liu 1
- Xiao Liu 1
- Rui Ma 1
- Ludwig Schmidt 1
- Wei Shen 1
- Yuchang Su 1
- Jianwen Sun 1
- Jie Sun 1
- Jie Tang 1
- Chong Teng 1
- Weixun Wang 1
- Haoran Wang 1
- Hongning Wang 1
- Wenxiao Wu 1
- Xibin Wu 1
- Xianyu 1
- Yang Xiao 1
- Haotian Xu 1
- Zelai Yang 1
- Fanrui Zhang 1
- Shenglin Zhang 1
- Sizhuo Zhou 1