Mengyang Liu
2026
CODERL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
Xue Jiang | Yihong Dong | Mengyang Liu | Deng Hongyi | Tian Wang | Yongding Tao | Zhi Jin | Wenpin Jiao | Ge Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While Large Language Models (LLMs) excel at code generation by learning from vast code corpora, a fundamental semantic gap remains between their training on textual patterns and the goal of functional correctness, which is governed by formal execution semantics. Reinforcement Learning with Verifiable Rewards (RLVR) approaches attempt to bridge this gap using outcome rewards from executing test cases. However, relying solely on binary pass/fail signals is inefficient for establishing a well-aligned connection between the textual representation of code and its execution semantics, especially for subtle logical errors within the code. In this paper, we propose CODERL+, a novel approach that integrates execution semantics alignment into the RLVR training pipeline for code generation. CODERL+ enables the model to infer variable-level execution trajectories, providing a direct learning signal of execution semantics. CODERL+ constructs execution semantics alignment directly from existing on-policy rollouts and integrates seamlessly with various RL algorithms. Extensive experiments demonstrate that CODERL+ outperforms post-training baselines (including RLVR and Distillation), achieving a 4.6% average relative improvement in pass@1. CODERL+ generalizes effectively to other coding tasks, yielding 15.5% and 4.4% higher accuracy on code-reasoning and test-output-generation benchmarks, respectively. CODERL+ shows strong applicability across diverse RL algorithms and LLMs. Furthermore, probe analyses provide compelling evidence that CODERL+ strengthens the alignment between code’s textual representations and its underlying execution semantics.
2025
Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation
Qiyue Gao | Xinyu Pi | Kevin Liu | Junrong Chen | Ruolan Yang | Xinqi Huang | Xinyu Fang | Lu Sun | Gautham Kishore | Bo Ai | Stone Tao | Mengyang Liu | Jiaxi Yang | Chao-Jung Lai | Chuanyang Jin | Jiannan Xiang | Benhao Huang | Zeming Chen | David Danks | Hao Su | Tianmin Shu | Ziqiao Ma | Lianhui Qin | Zhiting Hu
Findings of the Association for Computational Linguistics: ACL 2025
Internal world models (WMs) enable agents to understand the world’s state and predict transitions, serving as the basis for advanced deliberative reasoning. Recent large Vision-Language Models (VLMs), such as GPT-4o and Gemini, exhibit potential as general-purpose WMs. While the latest studies have evaluated and shown limitations in specific capabilities such as visual understanding, a systematic evaluation of VLMs’ fundamental WM abilities remains absent. Drawing on comparative psychology and cognitive science, we propose a two-stage framework that assesses **perception** (visual, spatial, temporal, quantitative, and motion) and **prediction** (mechanistic simulation, transitive inference, compositional inference) to provide an atomic evaluation of VLMs as WMs. Guided by this framework, we introduce **WM-ABench**, a large-scale benchmark comprising 23 fine-grained evaluation dimensions across 6 diverse simulated environments with controlled counterfactual simulations. Through 660 experiments on 15 of the latest commercial and open-source VLMs, we find that these models exhibit striking limitations in basic world modeling abilities. For instance, all models perform at near-random accuracy when distinguishing motion trajectories. Additionally, they lack disentangled understanding; for example, they tend to believe blue objects move faster than green ones. Further results and analyses reveal significant gaps between VLMs and human-level world modeling.
Co-authors
- Bo Ai 1
- Junrong Chen 1
- Zeming Chen 1
- David Danks 1
- Yihong Dong 1
- Xinyu Fang 1
- Qiyue Gao 1
- Deng Hongyi 1
- Zhiting Hu 1
- Benhao Huang 1
- Xinqi Huang 1
- Xue Jiang 1
- Wenpin Jiao 1
- Chuanyang Jin 1
- Zhi Jin 1
- Gautham Kishore 1
- Chao-Jung Lai 1
- Ge Li 1
- Kevin Liu 1
- Ziqiao Ma 1
- Xinyu Pi 1
- Lianhui Qin 1
- Tianmin Shu 1
- Hao Su 1
- Lu Sun 1
- Stone Tao 1
- Yongding Tao 1
- Tian Wang 1
- Jiannan Xiang 1
- Jiaxi Yang 1
- Ruolan Yang 1