Jiaqi Tang
2026
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Runtao Liu | Ziyi Liu | Jiaqi Tang | Yue Ma | Renjie Pi | Jipeng Zhang | Qifeng Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed *LongTVQA* and *LongTVQA+*, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent.
LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
Jiaqi Tang | Yu Xia | Yi-Feng Wu | Yuwei Hu | Chen Yuhui | Qing-Guo Chen | Xiaogang Xu | Xiangyu Wu | Hao LU | Yanqing Ma | Shiyin Lu | Qifeng Chen
Findings of the Association for Computational Linguistics: ACL 2026
The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Although supervised fine-tuning (SFT) methods predominate in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. We further introduce a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO’s superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations.
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Ke Ma | Jiaqi Tang | Bin Guo | Xueting Han | Ruonan Xu | Qingfeng He | Ziheng Wang | Xu Wang | Qifeng Chen | Zhiwen Yu | Yunhao Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query’s expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.