Jun Xiao
2026
CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution
Teng Pan | Yuchen Yan | Zixuan Wang | Ruiqing Zhang | Guiyang Hou | Wenqi Zhang | Weiming Lu | Jun Xiao | Yongliang Shen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.
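The majority-vote pseudo-labeling that the abstract starts from can be sketched in a few lines. This is only an illustration, not the CoVerRL implementation: the function names, the binary reward, and the rule of dropping a prompt when the verifier rejects the majority answer are all assumptions made for the example.

```python
from collections import Counter


def majority_vote(answers):
    """Return (pseudo_label, agreement_ratio) for a list of sampled answers."""
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


def pseudo_rewards(answers, verifier_accepts=None):
    """Reward 1.0 for samples that match the majority answer, else 0.0.

    `verifier_accepts` is a hypothetical callback standing in for the
    verifier role. If it rejects the majority answer, the prompt is
    skipped (None is returned) instead of reinforcing a possibly wrong
    consensus. This filtering rule is an assumption for illustration.
    """
    label, _ = majority_vote(answers)
    if verifier_accepts is not None and not verifier_accepts(label):
        return None
    return [1.0 if a == label else 0.0 for a in answers]
```

With four samples `["42", "42", "41", "42"]` the pseudo-label is `"42"` at 0.75 agreement. If every sample converges on the same wrong answer, agreement reaches 1.0 and voting alone cannot flag the error, which is the situation the abstract calls the consensus trap.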
Pause or Fabricate? Training Language Models for Grounded Reasoning
Yiwen Qiu | Linjuan Wu | Yizhou Liu | Yuchen Yan | Jin Ma | Xu Tan | Yao Hu | Daoxin Zhang | Wenqi Zhang | Weiming Lu | Jun Xiao | Yongliang Shen
Findings of the Association for Computational Linguistics: ACL 2026
Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions—a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness—the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
PILOT: Planning via Internalized Latent Optimization Trajectories for Large Language Models
Haoyu Zheng | Yun Zhu | Yuqian Yuan | Bo Yuan | Wenqiao Zhang | Siliang Tang | Jun Xiao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Strategic planning is critical for multi-step reasoning, yet compact Large Language Models (LLMs) often lack the capacity to formulate global strategies, leading to error propagation in long-horizon tasks. Our analysis reveals that LLMs possess latent reasoning capabilities that can be unlocked when conditioned on explicit plans from a teacher model; however, runtime reliance on external guidance is often impractical due to latency and availability constraints. To bridge this gap, we propose PILOT (Planning via Internalized Latent Optimization Trajectories), a non-invasive framework designed to internalize the strategic oversight of large models into intrinsic Latent Guidance. Instead of altering backbone weights, PILOT employs a lightweight Hyper-Network to synthesize a query-conditioned Latent Guidance. This vector acts as an internal steering mechanism, guiding the model’s representations toward optimal reasoning paths. Extensive experiments on mathematical and coding benchmarks demonstrate that PILOT effectively stabilizes reasoning trajectories, consistently outperforming strong baselines (e.g., +8.9% on MATH500) with negligible inference latency. Our code is available at: https://anonymous.4open.science/r/PILOT-B266
Experience-driven Multi-turn Reinforcement Learning for GUI Agents
Zhengxi Lu | Jiabo Ye | Fei Tang | Yongliang Shen | Haiyang Xu | Ziwei Zheng | Weiming Lu | Ming Yan | Fei Huang | Jun Xiao | Yueting Zhuang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
GUI agents have demonstrated remarkable progress in automating complex user interface interactions. However, training such agents for long-horizon tasks remains challenging. Single-turn reinforcement learning conditions on expert histories during training but self-generated histories during deployment, causing distribution mismatch. Online multi-turn methods eliminate this gap via environment interaction but suffer from sparse rewards and prohibitive costs. We propose Experience-driven Multi-turn Policy Optimization (EMPO), which leverages expert trajectories as environment experiences for on-policy multi-turn training. The agent constructs self-generated history throughout rollouts; when actions match expert experiences, the trajectory provides valid state transitions, and a Patch Module recovers mismatched steps to maintain on-policy rollouts. EMPO further incorporates discounted future rewards and dual-level advantage estimation to capture long-horizon dependencies. We also propose AndroidControl-Real, an evaluation metric strongly correlated with real-world performance (R²=0.934). With only 1K public trajectories as RL experiences, our method achieves substantial gains over the base model (e.g., +12.0% on AndroidWorld and +23.8% on AITW) and achieves competitive performance against strong baselines such as UI-TARS-7B and GPT-4o, demonstrating better generalization than prior single-turn RL approaches. Code available: https://anonymous.4open.science/r/UI-S1-0DAF.
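The "discounted future rewards" mentioned in the abstract can be illustrated with the textbook discounted-return recursion. The function name and the value of `gamma` are assumptions for the example, and the abstract does not say how EMPO combines these returns with its dual-level advantage estimation, so that part is not shown.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For a three-step trajectory rewarded only at the end, `discounted_returns([0.0, 0.0, 1.0], gamma=0.5)` gives `[0.25, 0.5, 1.0]`: earlier steps receive credit for the final success, reduced by their distance from it.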
CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging
Jie Cao | Zhenxuan Fan | Zhuonan Wang | Tianwei Lin | Ziyuan Zhao | Rolan Yan | Wenqiao Zhang | Feifei Shao | Hongwei Wang | Jun Xiao | Siliang Tang
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) achieve remarkable performance on diverse downstream and domain-specific tasks via parameter-efficient fine-tuning (PEFT). However, existing PEFT methods, particularly MoE-LoRA architectures, suffer from limited parameter efficiency and coarse-grained adaptation due to the proliferation of LoRA experts and instance-level routing. To address these issues, we propose Core Space Mixture of LoRA (CoMoL), a novel MoE-LoRA framework that incorporates expert diversity, parameter efficiency, and fine-grained adaptation. Specifically, CoMoL introduces two key components: core space experts and core space routing. Core space experts store each expert in a compact core matrix, preserving diversity while controlling parameter growth. Core space routing dynamically selects and activates the appropriate core experts for each token, enabling fine-grained, input-adaptive routing. Activated core experts are then merged via a soft-merging strategy into a single core expert, which is combined with a shared LoRA to form a specialized LoRA module. Besides, the routing network is projected into the same low-rank space as the LoRA matrices, further reducing parameter overhead without compromising expressiveness. Extensive experiments demonstrate that CoMoL retains the adaptability of MoE-LoRA architectures while achieving parameter efficiency comparable to standard LoRA, consistently outperforming existing methods across multiple tasks. Our code is available at https://github.com/DCDmllm/CoMoL.
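One possible reading of the merging step described in the abstract is sketched below with NumPy. Everything about the shapes and names is an assumption: a shared LoRA pair `A` (r × d_in) and `B` (d_out × r), one r × r core matrix per expert, and a hypothetical router `W_router` that acts on the low-rank projection `A @ x`. For simplicity it soft-merges all experts with softmax weights rather than first selecting a subset. The released code at the URL above is the authoritative reference.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def comol_delta(x, A, B, cores, W_router):
    """Per-token low-rank update under the assumed layout.

    x:        (d_in,)     token representation
    A:        (r, d_in)   shared LoRA down-projection
    B:        (d_out, r)  shared LoRA up-projection
    cores:    (n, r, r)   one compact core matrix per expert
    W_router: (n, r)      router acting in the low-rank space (assumed)
    """
    h = A @ x                                      # project the token to rank r
    weights = softmax(W_router @ h)                # per-token routing weights
    merged = np.tensordot(weights, cores, axes=1)  # weighted sum of cores, (r, r)
    return B @ (merged @ h), weights
```

Under this layout each additional expert costs r² core parameters plus r router parameters, instead of a full pair of LoRA matrices, which is consistent with the parameter-efficiency claim in the abstract.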
UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization
Zhengxi Lu | Fei Tang | Guangyi Liu | Jin Ma | Kaitao Song | Xu Tan | Wenqi Zhang | Weiming Lu | Jun Xiao | Yueting Zhuang | Yongliang Shen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities, suffering from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework where the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective tool invocation learning, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot’s strong generalization to real-world GUI tasks. Code website: https://anonymous.4open.science/r/UI-Copilot-0535.
Co-authors
- Weiming Lu 4
- Yongliang Shen 4
- Wenqi Zhang 3
- Zhengxi Lu 2
- Jin Ma 2
- Xu Tan 2
- Siliang Tang 2
- Fei Tang 2
- Yuchen Yan 2
- Wenqiao Zhang 2
- Yueting Zhuang 2
- Jie Cao 1
- Zhenxuan Fan 1
- Guiyang Hou 1
- Yao Hu 1
- Fei Huang 1
- Tianwei Lin 1
- Yizhou Liu 1
- Guangyi Liu 1
- Teng Pan 1
- Yiwen Qiu 1
- Feifei Shao 1
- Kaitao Song 1
- Zixuan Wang 1
- Zhuonan Wang 1
- Hongwei Wang 1
- Linjuan Wu 1
- Haiyang Xu 1
- Ming Yan 1
- Rolan Yan 1
- Jiabo Ye 1
- Yuqian Yuan 1
- Bo Yuan 1
- Ruiqing Zhang 1
- Daoxin Zhang 1
- Ziyuan Zhao 1
- Haoyu Zheng 1
- Ziwei Zheng 1
- Yun Zhu 1