2025
Multi-Programming Language Sandbox for LLMs
Shihan Dou | Jiazheng Zhang | Jianxiang Zang | Yunbo Tao | Weikang Zhou | Haoxiang Jia | Shichun Liu | Yuming Yang | Shenxi Wu | Zhiheng Xi | Muling Wu | Rui Zheng | Changze Lv | Limao Xiong | Shaoqing Zhang | Lin Zhang | Wenyu Zhan | Rongxiang Weng | Jingang Wang | Xunliang Cai | Yueming Wu | Ming Wen | Yixin Cao | Tao Gui | Xipeng Qiu | Qi Zhang | Xuanjing Huang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compilers and analysis tools for Large Language Models (LLMs). It can automatically identify the programming language of a piece of code and compile and execute it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox integrates both traditional and LLM-based code analysis tools, providing a comprehensive analysis of generated code. It can also be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of generated code, and it helps researchers streamline their workflows for various LLM-based code-related tasks, reducing development costs. To validate the effectiveness of MPLSandbox, we conduct extensive experiments by integrating it into several training and deployment scenarios and employing it to optimize workflows for a wide range of downstream code tasks. Our goal is to enhance researcher productivity on LLM-based code tasks by simplifying and automating workflows through delegation to MPLSandbox.
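The abstract describes a workflow of detecting the language of generated code, executing it in isolation, and returning compiler/runtime feedback to the model. The sketch below is illustrative only and does not use the actual MPLSandbox API; the language heuristic, runner table, and return schema are all assumptions standing in for the real components.

```python
# Illustrative sketch only: NOT the MPLSandbox API. It mimics the workflow in
# the abstract: detect the language, execute in an isolated subprocess, and
# return unified feedback (exit code, stdout, stderr) to the caller.
import os
import subprocess
import tempfile

# Hypothetical mapping from a detected language to an interpreter command.
RUNNERS = {
    "python": ["python3"],
    "javascript": ["node"],
}

def detect_language(code: str) -> str:
    # Placeholder heuristic; a real system would use a trained classifier.
    return "python" if ("def " in code or "import " in code or "print(" in code) else "javascript"

def run_in_sandbox(code: str, timeout: float = 5.0) -> dict:
    """Execute LLM-generated code in a separate process and collect feedback."""
    lang = detect_language(code)
    suffix = ".py" if lang == "python" else ".js"
    with tempfile.NamedTemporaryFile("w", suffix=suffix, delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            RUNNERS[lang] + [path],
            capture_output=True,
            text=True,
            timeout=timeout,  # guard against non-terminating generations
        )
        return {
            "language": lang,
            "exit_code": proc.returncode,
            "stdout": proc.stdout,
            "stderr": proc.stderr,  # compiler/runtime errors fed back to the LLM
        }
    except subprocess.TimeoutExpired:
        return {"language": lang, "exit_code": None, "stdout": "", "stderr": "timeout"}
    finally:
        os.remove(path)

feedback = run_in_sandbox("print(sum(range(10)))")
print(feedback["exit_code"], feedback["stdout"].strip())  # -> 0 45
```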
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Yuhang Zang | Xiaoyi Dong | Pan Zhang | Yuhang Cao | Ziyu Liu | Shengyuan Ding | Shenxi Wu | Yubo Ma | Haodong Duan | Wenwei Zhang | Kai Chen | Dahua Lin | Jiaqi Wang
Findings of the Association for Computational Linguistics: ACL 2025
Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) providing a supervisory signal for RL training, where integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) selecting the best response from candidate responses for test-time scaling; and (3) filtering outlier or noisy samples from existing image and video instruction tuning training data.
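Application (2) above is best-of-N selection: score every candidate response with the reward model and keep the highest-scoring one. The sketch below illustrates that idea under stated assumptions; `score_with_reward_model` is a hypothetical placeholder for any scorer mapping a (prompt, response) pair to a scalar preference, not the actual IXC-2.5-Reward interface.

```python
# Minimal best-of-N selection sketch for test-time scaling with a reward model.
# The scorer passed in is a stand-in, not the IXC-2.5-Reward API.
from typing import Callable, List

def best_of_n(
    prompt: str,
    candidates: List[str],
    score_with_reward_model: Callable[[str, str], float],
) -> str:
    """Return the candidate response the reward model prefers most."""
    scores = [score_with_reward_model(prompt, c) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_index]

# Toy usage with a dummy scorer that prefers longer answers.
pick = best_of_n(
    "Describe the image.",
    ["A cat.", "A tabby cat sitting on a windowsill in the sun."],
    lambda prompt, response: float(len(response)),
)
print(pick)
```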