Zicheng Su
2026
RTCFake: Speech Deepfake Detection in Real-Time Communication
Jun Xue | Zhuolin Yi | Yihuan Huang | Yanzhen Ren | Yujie Chen | Cunhang Fan | Zicheng Su | Yongcheng Zhang | Bo Cai
Findings of the Association for Computational Linguistics: ACL 2026
With the rapid advancement of speech generation technologies, the threat posed by speech deepfakes in real-time communication (RTC) scenarios has intensified. However, existing detection studies mainly focus on offline simulations and struggle to cope with the complex distortions introduced during RTC transmission, including unknown speech enhancement processes (e.g., noise suppression) and codec compression. To address this challenge, we present the first large-scale speech deepfake dataset tailored for RTC scenarios, termed RTCFake, totaling approximately 600 hours. The dataset is constructed by transmitting speech through multiple mainstream social media and conferencing platforms (e.g., Zoom), enabling precise pairing between offline and online speech. In addition, we propose a phoneme-guided consistency learning (PCL) strategy that forces models to learn platform-invariant semantic structural representations. The RTCFake dataset is divided into training, development, and evaluation sets. The evaluation set includes both unseen RTC platforms and unseen complex noise conditions, thereby providing a more realistic and challenging evaluation benchmark for speech deepfake detection. Furthermore, the proposed PCL strategy achieves significant improvements in both cross-platform generalization and noise robustness, offering an effective and generalizable modeling paradigm.
2025
FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models
Hengxing Cai | Jinhan Dong | Jingjun Tan | Jingcheng Deng | Sihang Li | Zhifeng Gao | Haidong Wang | Zicheng Su | Agachai Sumalee | Renxin Zhong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Unmanned Aerial Vehicle (UAV) Vision-and-Language Navigation (VLN) is vital for applications such as disaster response, logistics delivery, and urban inspection. However, existing methods often struggle with insufficient multimodal fusion, weak generalization, and poor interpretability. To address these challenges, we propose FlightGPT, a novel UAV VLN framework built upon Vision-Language Models (VLMs) with powerful multimodal perception capabilities. We design a two-stage training pipeline: first, Supervised Fine-Tuning (SFT) on high-quality demonstrations to improve initialization and structured reasoning; then, the Group Relative Policy Optimization (GRPO) algorithm, guided by a composite reward that considers goal accuracy, reasoning quality, and format compliance, to enhance generalization and adaptability. Furthermore, FlightGPT introduces a Chain-of-Thought (CoT)-based reasoning mechanism to improve decision interpretability. Extensive experiments on the city-scale dataset CityNav demonstrate that FlightGPT achieves state-of-the-art performance across all scenarios, with a 9.22% higher success rate than the strongest baseline in unseen environments. Our implementation is publicly available.
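The second training stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the reward weights, and the way the three reward terms are combined (a weighted sum) are all assumptions made for illustration; the abstract only states that the reward considers goal accuracy, reasoning quality, and format compliance. The one standard piece is the group-relative advantage used by GRPO, which normalizes each sampled response's reward by the mean and standard deviation of its group.

```python
from statistics import mean, pstdev


def composite_reward(goal_accuracy, reasoning_quality, format_ok,
                     weights=(0.6, 0.3, 0.1)):
    """Hypothetical weighted sum of the three reward terms named in the abstract.

    The weights are illustrative assumptions, not values from the paper.
    goal_accuracy and reasoning_quality are assumed to lie in [0, 1].
    """
    w_goal, w_reason, w_format = weights
    format_score = 1.0 if format_ok else 0.0
    return (w_goal * goal_accuracy
            + w_reason * reasoning_quality
            + w_format * format_score)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical responses sampled for the same navigation instruction.
group = [
    composite_reward(1.0, 0.8, True),
    composite_reward(0.2, 0.5, True),
    composite_reward(0.0, 0.1, False),
]
advantages = group_relative_advantages(group)
```

Responses that score above the group average receive positive advantages and are reinforced; those below the average receive negative advantages. If every response in a group earns the same reward, all advantages are zero and that group contributes no learning signal.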