Jiajun Wu
2026
UCoder: Unsupervised Code Generation by Internal Probing of Large Language Models
Jiajun Wu | Jian Yang | Wei Zhang | Linzheng Chai | Yuchi Ma | Ensheng Shi | Yuqing Ma | Zhoujun Li | Xianglong Liu
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have demonstrated remarkable capabilities in code generation tasks. However, their effectiveness heavily relies on supervised training with extensive labeled datasets (e.g., question-answer pairs) or unlabeled datasets (e.g., code snippets), which are often expensive and difficult to obtain at scale. To address this limitation, this paper introduces IPC, an unsupervised framework that leverages Internal Probing of LLMs for Code generation without any external corpus, not even unlabeled code snippets. We introduce problem space probing, test understanding probing, solution space probing, and knowledge consolidation and reinforcement to probe the internal knowledge and confidence patterns existing in LLMs. Further, IPC identifies reliable code candidates through self-consistency mechanisms and representation-based quality estimation to train UCoder (coder with unsupervised learning). We validate the proposed approach across multiple code benchmarks, demonstrating that unsupervised methods can achieve competitive performance compared to supervised approaches while significantly reducing the dependency on labeled data and computational resources. Analytic experiments reveal that internal model states contain rich signals about code quality and correctness, and that properly harnessing these signals enables effective unsupervised learning for code generation tasks, opening new directions for training code LLMs in resource-constrained scenarios.
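The abstract does not spell out how self-consistency is used to pick reliable candidates. As a rough illustration only, and not the authors' IPC procedure, one common form of self-consistency groups candidate programs by their behaviour on a set of probe inputs and keeps the largest agreeing group. The function name `select_by_self_consistency`, the probe inputs, and the toy candidates below are all assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Any, Callable, List, Sequence, Tuple


def select_by_self_consistency(
    candidates: Sequence[Callable[[Any], Any]],
    probe_inputs: Sequence[Any],
) -> Tuple[List[int], float]:
    """Cluster candidates by their outputs on the probe inputs.

    Returns the indices of the largest cluster and the fraction of all
    candidates that fall into it (a crude agreement score).
    Hypothetical helper, not taken from the paper.
    """
    clusters = defaultdict(list)
    for idx, fn in enumerate(candidates):
        signature = []
        for x in probe_inputs:
            try:
                signature.append(repr(fn(x)))
            except Exception as exc:  # a crash is part of the behaviour too
                signature.append(f"error:{type(exc).__name__}")
        clusters[tuple(signature)].append(idx)
    best = max(clusters.values(), key=len)
    return best, len(best) / len(candidates)


# Toy candidates for "sum of the integers 0..n"; the third is deliberately wrong.
candidates = [
    lambda n: n * (n + 1) // 2,
    lambda n: sum(range(n + 1)),
    lambda n: n * n // 2,
]
chosen, agreement = select_by_self_consistency(candidates, [0, 1, 4, 10])
```

Here the two correct candidates agree on every probe input, so they form the majority cluster, while the faulty one is left out. Agreement is only a proxy for correctness: a majority of candidates can share the same bug.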
V-GameGym: Visual Game Generation for Code Large Language Models
Wei Zhang | Jian Yang | Renshuai Tao | Linzheng Chai | Shuyue Guo | Jiajun Wu | Xiaoming Chen | Ganqu Cui | Ning Ding | Xander Xu | HU Wei | Bowen Zhou
Findings of the Association for Computational Linguistics: ACL 2026
Code large language models have demonstrated remarkable capabilities in programming tasks, yet current benchmarks primarily focus on a single modality rather than visual game development. Most existing code-related benchmarks evaluate syntax correctness and execution accuracy, overlooking critical game-specific metrics such as playability, visual aesthetics, and user engagement that are essential for real-world deployment. To address the gap between current LLM capabilities in algorithmic problem-solving and competitive programming and the broader requirements of practical game development, we present V-GameGym, a comprehensive benchmark comprising 2,219 high-quality samples across 100 thematic clusters derived from real-world repositories, adopting a novel clustering-based curation methodology to ensure both diversity and structural completeness. Further, we introduce a multimodal evaluation framework with an automated LLM-driven pipeline for visual code synthesis using complete UI sandbox environments. Our extensive analysis reveals that V-GameGym effectively bridges the gap between code generation accuracy and practical game development workflows, providing quantifiable quality metrics for visual programming and interactive element generation.
EvoHyper: Evolving Hypergraph Topologies for Unified Collaboration in Multi-Agent Communication
Heng Zhang | Yihao Zhong | Lubin Gan | Zhihe Chen | Jiajun Wu | Yuling Shi | Xiaodong Gu | Hao Zhang | Haochen You | Jin Huang
Findings of the Association for Computational Linguistics: ACL 2026
Multi-agent systems powered by large language models have achieved strong performance on complex tasks, yet naive collaboration topologies often cause high communication costs and redundant context. Existing methods usually use a fixed communication graph and manage collaboration structure and shared memory in separate modules. Our log analysis of several representative systems shows that this separation leads to multiple copies of the same key facts in dialogue, memory and model inputs. We address this issue with EvoHyper, a framework based on an evolving hypergraph topology for multi-agent collaboration. In EvoHyper, a single hypergraph represents agents and shared memory, and each hyperedge serves as a collaboration unit that binds a group of agents to that shared memory. During execution a controller edits the hypergraph through a small set of predefined evolution operations, so collaboration units can spawn, update and merge as tasks unfold. Experiments on four benchmarks covering mathematical reasoning and code generation show that EvoHyper is (I) high-performing, achieving 3.2% to 7.8% accuracy gains over state-of-the-art methods, (II) efficient, reducing token consumption by up to 23.5%, and (III) adaptive, adjusting topology complexity according to task requirements.
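To make the idea of a hyperedge as a collaboration unit more concrete, the following is a minimal sketch of a hypergraph in which each hyperedge binds a group of agents to one shared memory and can be spawned, updated, and merged. It is an illustration under assumptions, not EvoHyper's implementation: the class names `Hyperedge` and `Hypergraph`, the dictionary-based memory, and the merge policy (union of agents, later keys overwrite earlier ones) are all hypothetical, and the controller that decides when to apply each operation is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Hyperedge:
    """One collaboration unit: a group of agents plus their shared memory."""
    agents: Set[str]
    memory: Dict[str, str] = field(default_factory=dict)


class Hypergraph:
    """Hypothetical container exposing three evolution operations."""

    def __init__(self) -> None:
        self.edges: Dict[int, Hyperedge] = {}
        self._next_id = 0

    def spawn(self, agents: Iterable[str]) -> int:
        """Create a new collaboration unit and return its id."""
        edge_id = self._next_id
        self._next_id += 1
        self.edges[edge_id] = Hyperedge(set(agents))
        return edge_id

    def update(self, edge_id: int, key: str, value: str) -> None:
        """Write one fact into the unit's shared memory."""
        self.edges[edge_id].memory[key] = value

    def merge(self, a: int, b: int) -> int:
        """Replace two units by one holding the union of agents and facts."""
        ea, eb = self.edges.pop(a), self.edges.pop(b)
        merged = self.spawn(ea.agents | eb.agents)
        # Assumed policy: on key clashes, the second unit's value wins.
        self.edges[merged].memory = {**ea.memory, **eb.memory}
        return merged


g = Hypergraph()
e1 = g.spawn({"planner", "coder"})
e2 = g.spawn({"coder", "tester"})
g.update(e1, "spec", "sort a list")
g.update(e2, "bug", "off-by-one")
m = g.merge(e1, e2)
```

After the merge, a single unit holds all three agents and both facts, so each fact is stored once rather than copied into separate dialogue and memory modules.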
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models
Qiguang Chen | Chengyu Luan | Jiajun Wu | Qiming Yu | Yi Yang | Yizhuo Li | Jingqi Tong | Xiachong Feng | Libo Qin | Wanxiang Che
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large vision-language models (LVLMs) have made substantial advances in reasoning tasks at the Olympiad level. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
LoopCoder: Scaling Code Intelligence via Looped Language Models
Jian Yang | Wei Zhang | Shuyue Guo | Yizhi LI | Linzheng Chai | Zhengmao Ye | Shukai Liu | Yuyang Song | Jiajun Wu | Che Liu | Tianyu Zheng | Siwei Wu | Leo L | Xudong Ma | Chuan Hao | Ran Tao | Yan Xing | Jianzhou Wang | Mingjie Tang | Aishan Liu | Zhoujun Li | Xianglong Liu | Weifeng Lv | Bryan Dai
Findings of the Association for Computational Linguistics: ACL 2026
While large language models (LLMs) have mastered syntax-level code generation, complex algorithmic reasoning remains a challenge, typically addressed by scaling model depth and parameter count. Universal Transformers (UT) offer a compelling alternative by introducing a recurrent inductive bias that aligns with the recursive nature of programming logic. However, training looped architectures at scale has historically been hindered by severe instability and optimization difficulties associated with backpropagation through time (BPTT). We present LoopCoder (40B-A80B) pre-trained on 12T+ code and general tokens, along with LoopCoder-Thinking and LoopCoder-Instruct variants—the first large-scale looped transformer for code, achieving comparable performance to standard dense architectures with more parameters. Unlike prior approaches that restrict recurrence to small-scale tasks, we implement a comprehensive looped training protocol spanning both pre-training and post-training phases. We initiate the model via dense-to-loop transformation, folding a pre-trained dense checkpoint to initialize a recurrent block, followed by rigorous looped pre-training and specialized post-training for instruction following and reasoning. Our results establish a robust recipe for scaling coding intelligence via recurrent computation, proving that dense checkpoints serve as an optimal foundation for evolving into dynamic, looped reasoners.
2021
Co-authors
- Linzheng Chai 3
- Jian Yang 3
- Wei Zhang 3
- Shuyue Guo 2
- Zhoujun Li 2
- Xianglong Liu 2
- Wanxiang Che (车万翔) 1
- Xiaoming Chen 1
- Zhihe Chen 1
- Qiguang Chen (陈麒光) 1
- Ganqu Cui 1
- Bryan Dai 1
- Ning Ding 1
- Xiachong Feng 1
- Lubin Gan 1
- Samuel Gershman 1
- Xiaodong Gu 1
- Chuan Hao 1
- Jin Huang 1
- Leo L 1
- Yizhuo Li 1
- Yizhi Li 1
- Shukai Liu 1
- Che Liu 1
- Aishan Liu 1
- Chengyu Luan 1
- Weifeng Lv 1
- Yuchi Ma 1
- Yuqing Ma 1
- Xudong Ma 1
- Jiayuan Mao 1
- Libo Qin 1
- Ensheng Shi 1
- Yuling Shi 1
- Yuyang Song 1
- Mingjie Tang 1
- Renshuai Tao 1
- Ran Tao 1
- Jingqi Tong 1
- Ruocheng Wang 1
- Jianzhou Wang 1
- HU Wei 1
- Siwei Wu 1
- Yan Xing 1
- Xander Xu 1
- Yi Yang 1
- Zhengmao Ye 1
- Haochen You 1
- Qiming Yu 1
- Heng Zhang 1
- Hao Zhang 1
- Tianyu Zheng 1
- Yihao Zhong 1
- Bowen Zhou 1