Qifeng Chen
2026
LongVideoAgent: Multi-Agent Reasoning with Long Videos
Runtao Liu | Ziyi Liu | Jiaqi Tang | Yue Ma | Renjie Pi | Jipeng Zhang | Qifeng Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed *LongTVQA* and *LongTVQA+* which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show reinforcement learning further strengthens reasoning and planning for the trained agent.
Response-G1: Explicit Scene Graph Modeling for Proactive Streaming Video Understanding
Ke Ma | Jiaqi Tang | Bin Guo | Xueting Han | Ruonan Xu | Qingfeng He | Ziheng Wang | Xu Wang | Qifeng Chen | Zhiwen Yu | Yunhao Liu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Proactive streaming video understanding requires Video-LLMs to decide when to respond as a video unfolds, a task where existing methods often fall short due to their implicit, query-agnostic modeling of visual evidence. We introduce Response-G1, a novel framework that establishes explicit, structured alignment between the accumulated video evidence and the query’s expected response conditions via scene graphs. The framework operates in three fine-tuning-free stages: (1) online query-guided scene graph generation from streaming clips; (2) memory-based retrieval of the most semantically relevant historical scene graphs; and (3) retrieval-augmented trigger prompting for per-frame "silence/response" decisions. By grounding both evidence and conditions in a shared graph representation, Response-G1 achieves more interpretable and accurate response timing decisions. Experimental results on established benchmarks demonstrate the superiority of our method in both proactive and reactive tasks, validating the advantage of explicit scene graph modeling and retrieval in streaming video understanding.
LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
Jiaqi Tang | Yu Xia | Yi-Feng Wu | Yuwei Hu | Chen Yuhui | Qing-Guo Chen | Xiaogang Xu | Xiangyu Wu | Hao LU | Yanqing Ma | Shiyin Lu | Qifeng Chen
Findings of the Association for Computational Linguistics: ACL 2026
The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of supervised fine-tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. Besides, we further introduce a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO’s superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations.
2022
CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition
Wenliang Dai | Samuel Cahyawijaya | Tiezheng Yu | Elham J. Barezi | Peng Xu | Cheuk Tung Yiu | Rita Frieske | Holy Lovenia | Genta Winata | Qifeng Chen | Xiaojuan Ma | Bertram Shi | Pascale Fung
Proceedings of the Thirteenth Language Resources and Evaluation Conference
With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform corresponding actions, which eases driving and improves safety. However, there is a data scarcity issue for low resource languages, hindering the development of research and applications. In this paper, we introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, we augment our dataset using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one. We provide detailed statistics of both the clean and the augmented versions of our dataset. Moreover, we implement two multimodal baselines to demonstrate the validity of CI-AVSR. Experiment results show that leveraging the visual signal improves the overall performance of the model. Although our best model can achieve a considerable quality on the clean test set, the speech recognition quality on the noisy data is still inferior and remains an extremely challenging task for real in-car speech recognition systems. The dataset and code will be released at https://github.com/HLTCHKUST/CI-AVSR.
Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset
Tiezheng Yu | Rita Frieske | Peng Xu | Samuel Cahyawijaya | Cheuk Tung Yiu | Holy Lovenia | Wenliang Dai | Elham J. Barezi | Qifeng Chen | Xiaojuan Ma | Bertram Shi | Pascale Fung
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Automatic speech recognition (ASR) on low resource languages improves the access of linguistic minorities to technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for the Hong Kong Cantonese language by creating a new Cantonese dataset. Our dataset, Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size and availability. We further conduct experiments with Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and our proposed MDCC, and the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation
Holy Lovenia | Samuel Cahyawijaya | Genta Winata | Peng Xu | Yan Xu | Zihan Liu | Rita Frieske | Tiezheng Yu | Wenliang Dai | Elham J. Barezi | Qifeng Chen | Xiaojuan Ma | Bertram Shi | Pascale Fung
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Code-switching is a speech phenomenon occurring when a speaker switches language during a conversation. Despite the spontaneous nature of code-switching in conversational spoken language, most existing works collect code-switching data from read speech instead of spontaneous speech. ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality Mandarin Chinese-English code-switching corpus built on spontaneous multi-turn conversational dialogue sources collected in Hong Kong. We report ASCEND’s design and procedure for collecting the speech data, including annotations. ASCEND consists of 10.62 hours of clean speech, collected from 23 bilingual speakers of Chinese and English. Furthermore, we conduct baseline experiments using pre-trained wav2vec 2.0 models, achieving a best performance of 22.69% character error rate and 27.05% mixed error rate.
Co-authors
- Elham J. Barezi 3
- Samuel Cahyawijaya 3
- Wenliang Dai 3
- Rita Frieske 3
- Pascale Fung 3
- Holy Lovenia 3
- Xiaojuan Ma 3
- Bertram Shi 3
- Jiaqi Tang 3
- Peng Xu 3
- Tiezheng Yu 3
- Genta Indra Winata 2
- Cheuk Tung Yiu 2
- Qing-Guo Chen 1
- Bin Guo 1
- Xueting Han 1
- Qingfeng He 1
- Yuwei Hu 1
- Hao LU 1
- Runtao Liu 1
- Ziyi Liu 1
- Yunhao Liu 1
- Zihan Liu 1
- Shiyin Lu 1
- Yue Ma 1
- Ke Ma 1
- Yanqing Ma 1
- Renjie Pi 1
- Ziheng Wang 1
- Xu Wang 1
- Yi-Feng Wu 1
- Xiangyu Wu 1
- Yu Xia 1
- Ruonan Xu 1
- Xiaogang Xu 1
- Yan Xu 1
- Zhiwen Yu 1
- Chen Yuhui 1
- Jipeng Zhang 1