Yan Lu


2025

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training
Wenxi Chen | Ziyang Ma | Ruiqi Yan | Yuzhe Liang | Xiquan Li | Ruiyang Xu | Zhikang Niu | Yanqiao Zhu | Yifan Yang | Zhanxun Liu | Kai Yu | Yuxuan Hu | Jinyu Li | Yan Lu | Shujie Liu | Xie Chen
Findings of the Association for Computational Linguistics: ACL 2025

Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
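A minimal sketch of the grouping idea described in the abstract: packing the speech semantic-token sequence into fixed-size groups so the decoder emits several tokens per autoregressive step, which shortens the sequence and speeds up training and inference. This is not the authors' code; the function name, group size, and padding scheme are illustrative assumptions.

```python
# Illustrative sketch (not SLAM-Omni's implementation): group semantic tokens so
# that each decoding step predicts `group_size` tokens instead of one, reducing
# the number of autoregressive steps by roughly that factor.
from typing import List


def group_semantic_tokens(tokens: List[int], group_size: int, pad_id: int = 0) -> List[List[int]]:
    """Pack a flat semantic-token sequence into fixed-size groups.

    A sequence of T tokens becomes ceil(T / group_size) decoding steps.
    """
    # Pad so the sequence length is a multiple of group_size.
    padded = tokens + [pad_id] * (-len(tokens) % group_size)
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


# Example: 10 tokens with group_size=4 -> 3 decoding steps instead of 10.
print(group_semantic_tokens(list(range(10)), group_size=4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 0, 0]]
```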

UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Xinyi Liu | Xiaoyi Zhang | Ziyun Zhang | Yan Lu
Findings of the Association for Computational Linguistics: ACL 2025

Recent advancements in Large Vision-Language Models are accelerating the development of Graphical User Interface (GUI) agents that utilize human-like vision perception capabilities to enhance productivity on digital devices. Compared to approaches predicated on GUI metadata, which are platform-dependent and vulnerable to implementation variations, vision-based approaches offer broader applicability. In this vision-based paradigm, GUI instruction grounding, which maps a user instruction to the location of the corresponding element on a given screenshot, remains a critical challenge, particularly due to limited public training datasets and resource-intensive manual instruction annotation. In this paper, we delve into unexplored challenges in this task, including element-to-screen ratio, unbalanced element types, and implicit instructions. To address these challenges, we introduce UI-E2I-Synth, a large-scale data synthesis pipeline that generates instruction datasets of varying complexity using GPT-4o instead of human annotators. Furthermore, we propose UI-I2E-Bench, a new GUI instruction grounding benchmark designed to address the limitations of existing benchmarks by incorporating diverse annotation aspects. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding, demonstrating the effectiveness of the proposed data synthesis pipeline. The proposed benchmark, accompanied by extensive analyses, provides practical insights for future research in this domain. We will release our dataset and benchmark to facilitate further development of the GUI instruction grounding community.
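To make the task concrete, here is a small sketch of how GUI instruction grounding is typically framed and scored: a prediction is a click point on the screenshot, counted as correct if it falls inside the target element's bounding box, and the element-to-screen ratio mentioned in the abstract is simply the element's relative area. The class names, metric, and coordinate convention below are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of the GUI instruction grounding setup (not UI-E2I-Synth code).
# Coordinates are normalized to [0, 1] relative to the screenshot.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GroundingSample:
    instruction: str
    bbox: Tuple[float, float, float, float]  # target element as (left, top, right, bottom)


def point_in_bbox(x: float, y: float, bbox: Tuple[float, float, float, float]) -> bool:
    """A predicted click is usually scored correct if it lies inside the target box."""
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def element_to_screen_ratio(bbox: Tuple[float, float, float, float]) -> float:
    """Target element area relative to the whole screenshot (small elements are harder)."""
    left, top, right, bottom = bbox
    return (right - left) * (bottom - top)


# Example: a prediction at (0.52, 0.31) for a small button covering 0.2% of the screen.
sample = GroundingSample("Open the settings menu", bbox=(0.50, 0.30, 0.55, 0.34))
print(point_in_bbox(0.52, 0.31, sample.bbox))          # True -> counted as correct
print(round(element_to_screen_ratio(sample.bbox), 4))  # 0.002
```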

2012

Entends-tu mes attitudes ? Perception de la prosodie des affects sociaux en chinois Mandarin (Do you hear my attitudes? Perception of Mandarin Chinese social affects’ prosody) [in French]
Yan Lu | Véronique Aubergé | Albert Rilliard
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP