Jiaqi Wang
2026
GeometryZero: Advancing Geometry Solving via Group Contrastive Policy Optimization
Yikun Wang | Yibin Wang | Dianyi Wang | Zimian Peng | Qipeng Guo | Dacheng Tao | Jiaqi Wang
Findings of the Association for Computational Linguistics: ACL 2026
Recent progress in large language models (LLMs) has boosted mathematical reasoning, yet geometry remains challenging, particularly because auxiliary construction is often essential. Prior methods either underperform or depend on very large models (e.g., GPT-4o), making them costly. We argue that reinforcement learning with verifiable rewards (e.g., GRPO) can train smaller models to couple auxiliary construction with solid geometric reasoning. However, naively applying GRPO yields unconditional rewards, encouraging indiscriminate and sometimes harmful constructions. We propose Group Contrastive Policy Optimization (GCPO), an RL framework with two components: (1) Group Contrastive Masking, which assigns positive/negative construction rewards based on contextual utility, and (2) a Length Reward that encourages longer reasoning chains. On top of GCPO, we build GeometryZero, an affordable family of geometry reasoning models that selectively use auxiliary construction. Experiments on Geometry3K and MathVista show GeometryZero consistently outperforms RL baselines (e.g., GRPO, ToRL).
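As a rough illustration of the ideas named in this abstract, the sketch below shows the group-relative advantage normalization commonly associated with GRPO, followed by one possible reading of a contrastive construction signal. The function `construction_signal` and its rule (compare mean accuracy of rollouts that use auxiliary construction against those that do not) are assumptions made for illustration; the abstract does not specify the actual GCPO formulation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    This is the usual GRPO-style advantage: (r - group mean) / group std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def construction_signal(acc_with, acc_without):
    """Hypothetical contrastive rule, NOT taken from the paper.

    Returns +1 if rollouts that used auxiliary construction were more
    accurate on average than those that did not, -1 if they were less
    accurate, and 0 if there is no evidence either way (the signal is
    masked out).
    """
    if not acc_with or not acc_without:
        return 0
    diff = mean(acc_with) - mean(acc_without)
    if diff > 0:
        return 1
    if diff < 0:
        return -1
    return 0


if __name__ == "__main__":
    # Four rollouts for one problem: 1.0 = correct answer, 0.0 = incorrect.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    # Construction helped on this problem, so the sign is positive.
    print(construction_signal(acc_with=[1.0, 1.0], acc_without=[0.0, 1.0]))
```

Under this reading, a construction reward is conditional on the problem: it is positive only when construction helps within the sampled group, which is one way to avoid the unconditional reward the abstract warns about.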
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Dianyi Wang | Wei Song | Yikun Wang | Siyuan Wang | Kaicheng Yu | Zhongyu Wei | Jiaqi Wang
Findings of the Association for Computational Linguistics: ACL 2026
Typical large vision-language models (LVLMs) apply autoregressive supervision primarily to textual responses, without fully exploiting causal learning over rich visual inputs. As a result, these models often emphasize vision-to-language alignment while potentially overlooking fine-grained visual information. While prior work has explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. ASVR trains models to autoregressively reconstruct the semantic content of input images, which consistently enhances multimodal comprehension. Notably, we show that even when provided with continuous image features as input, models can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across various multimodal understanding benchmarks. ASVR delivers significant performance gains and scalability across varying data scales, visual input, visual supervision and model architectures. In particular, ASVR generally improves baselines by 2-3% across 14 multimodal benchmarks.
VideoPro: Adaptive Program Reasoning for Long Video Understanding
Chenglin Li | Feng Han | Yikun Wang | Ruilin Li | Shuai Dong | Haowen Hou | Haitao Li | Qianglong Chen | Feng Tao | Jingqi Tong | Yin Zhang | Jiaqi Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Understanding long videos remains challenging due to the sparsity of visual evidence relevant to a given query. Prior work has explored program-based visual grounding, typically relying on executable programs generated by auxiliary large language models. However, when scaling to long videos, existing approaches face several critical limitations: (1) frame-centric vision modules are often insufficient for long video processing; (2) naively applying program-based reasoning to all queries incurs considerable computational overhead; and (3) errors arising from low-confidence predictions and imperfect program execution are difficult to recover from. To address these challenges, we propose VideoPro, a unified framework that enables VideoLLMs to adaptively reason over long videos and refine their predictions through executable programs. VideoPro first performs adaptive reasoning, dynamically determining whether a query can be resolved directly by the native VideoLLM or requires explicit multi-step program reasoning. For complex queries, the model decomposes the task into executable programs that invoke specialized vision modules for precise temporal and semantic grounding. To further improve robustness, VideoPro incorporates a self-refinement mechanism that leverages execution feedback and confidence signals to correct erroneous executions and refine low-confidence reasoning programs. By tightly integrating adaptive reasoning with self-refinement, VideoPro consistently outperforms prior methods across multiple long-video understanding benchmarks, yielding an average 6.7% improvement for Qwen3-VL-8B.
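The control flow described in this abstract (answer directly when possible, otherwise fall back to program reasoning and refine on failure) might be sketched as follows. Everything here is a hypothetical illustration: `answer_query`, the `direct_answer` and `run_program` callables, the result dictionary keys, and the `threshold` and `max_refinements` parameters are assumed names and values, not VideoPro's interface.

```python
def answer_query(query, direct_answer, run_program,
                 threshold=0.7, max_refinements=2):
    """Route a query adaptively, then refine program runs using feedback.

    direct_answer(query) -> (answer, confidence)
    run_program(query, feedback) -> dict with keys "ok", "answer",
        "confidence", and optionally "error"
    All names and the threshold value are illustrative assumptions.
    """
    # Step 1: try the native model first and keep its answer as a fallback.
    answer, confidence = direct_answer(query)
    if confidence >= threshold:
        return answer, "direct"

    # Step 2: program reasoning, retried with execution feedback.
    feedback = None
    for _ in range(max_refinements + 1):
        result = run_program(query, feedback)
        if result.get("ok") and result.get("confidence", 0.0) >= threshold:
            return result["answer"], "program"
        # Pass the failure reason to the next attempt.
        feedback = result.get("error") or "low confidence"

    # Step 3: no attempt succeeded, so return the direct answer.
    return answer, "fallback"
```

The design choice worth noting is that simple queries never pay the cost of program generation, which corresponds to limitation (2) in the abstract, while the retry loop corresponds to limitation (3).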