Fei Ma

2026

Large Language Models (LLMs) achieve strong results on code generation, but single-model inference remains brittle on tasks that require iterative refinement. Existing multi-agent frameworks improve reliability, yet they often incur substantial token and latency overhead. We introduce PairCoder, a framework that brings pair programming to autonomous LLM collaboration. PairCoder assigns one model to code generation and the other to review, and switches roles when repeated errors suggest that the current interaction has stalled. Across 13 LLMs on HumanEval, PairCoder consistently improves over single-model inference. On eight representative backbones, it reaches 91.0% pass@1 and improves over single-model inference by up to 20.3% while reducing token usage by 40% to 70% relative to multi-agent baselines. Many heterogeneous pairings also outperform both constituent models, suggesting that the framework generalizes across model families. These results position PairCoder as an effective and deployment-conscious alternative to heavier multi-agent systems. Code is available at https://github.com/yisuanwang/PairCoder.

2025

VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models
Haowen Hou | Peigen Zeng | Fei Ma | Fei Richard Yu
Proceedings of the 31st International Conference on Computational Linguistics

Visual Language Models (VLMs) have rapidly progressed with the recent success of large language models. However, there have been few attempts to incorporate efficient linear Recurrent Neural Network (RNN) architectures into VLMs. In this study, we introduce VisualRWKV, the first application of a linear RNN model to multimodal learning tasks, leveraging the pre-trained RWKV language model. We propose a data-dependent recurrence and sandwich prompts to enhance our modeling capabilities, along with a 2D image scanning mechanism to enrich the processing of visual sequences. Extensive experiments demonstrate that VisualRWKV achieves competitive performance compared to Transformer-based models like LLaVA-1.5 on various benchmarks. Compared to LLaVA-1.5, VisualRWKV is 3.98 times faster and saves 54% of GPU memory at an inference length of 24K tokens. To facilitate further research and analysis, we have made the checkpoints and the associated code publicly accessible at the following GitHub repository: https://github.com/howard-hou/VisualRWKV.

Large Vision-Language Models (LVLMs) have shown exceptional performance in multimodal tasks, but their effectiveness in complex visual reasoning is still constrained, especially when employing Chain-of-Thought prompting techniques. In this paper, we propose VReST, a novel training-free approach that enhances Reasoning in LVLMs through Monte Carlo Tree Search and Self-Reward mechanisms. VReST meticulously traverses the reasoning landscape by establishing a search tree, where each node encapsulates a reasoning step, and each path delineates a comprehensive reasoning sequence. Our innovative multimodal Self-Reward mechanism assesses the quality of reasoning steps by integrating the utility of sub-questions, answer correctness, and the relevance of vision-language clues, all without the need for additional models. VReST surpasses current prompting methods and secures state-of-the-art performance across three multimodal mathematical reasoning benchmarks. Furthermore, it substantiates the efficacy of test-time scaling laws in multimodal tasks, offering a promising direction for future research.

Co-authors

Qi Tian 1
