Yuan Ge
2026
Bypassing Neural Evaluations for Fast Audio Editing via Adaptive Trajectory Extrapolation
Xiaoqian Liu | Zhengkun Ge | Jianjin Wang | Haoran Zhang | Yuan Ge | Kaiyan Chang | Chen Xu | Tong Xiao | Zhengtao Yu | Linfeng Zhang | JingBo Zhu
Findings of the Association for Computational Linguistics: ACL 2026
Recent advancements in audio diffusion models have significantly improved text-to-audio editing via inversion techniques. However, these models typically rely on dense, fixed-step sampling trajectories to maintain structural integrity during inversion and generation, leading to prohibitive computational costs. We propose AdaTE, a model-agnostic Adaptive Trajectory Extrapolation framework that accelerates the inversion-based editing process by dynamically evaluating only the most critical generative phases. Specifically, we introduce a hierarchical probing mechanism that monitors curvature acceleration and information gain to detect pivotal transitions within the latent flow. This allows the model to selectively skip redundant segments via linear extrapolation while preserving dense neural evaluations for complex semantic changes. Extensive experiments across AudioLDM2, Auffusion, and Tango2 demonstrate that AdaTE achieves up to a 3.9× speedup with negligible loss in fidelity. AdaTE significantly shifts the Pareto frontier, providing an efficient solution for high-fidelity audio synthesis and editing.
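The idea of skipping redundant trajectory segments by linear extrapolation can be illustrated with a toy one-dimensional integrator. This is only a sketch under simplifying assumptions, not the AdaTE algorithm: the function name `integrate_with_skips`, the scalar state, the Euler update, and the single velocity-change threshold are all hypothetical stand-ins for the paper's hierarchical probing of curvature acceleration and information gain.

```python
def integrate_with_skips(velocity, x0, num_steps, dt, threshold):
    """Euler integration that reuses the last velocity when the flow looks straight.

    velocity: callable (x, t) -> float, standing in for an expensive neural evaluation.
    threshold: if two consecutive evaluated velocities differ by less than this,
               the next step skips the evaluation and extrapolates linearly.
    Returns the final state and the number of velocity evaluations performed.
    """
    x = x0
    v = None            # most recently evaluated velocity
    evals = 0
    skip_next = False
    for k in range(num_steps):
        t = k * dt
        if skip_next and v is not None:
            # Linear extrapolation: keep moving along the previous direction.
            skip_next = False
        else:
            new_v = velocity(x, t)
            evals += 1
            # Crude curvature proxy (an assumption): change between evaluations.
            if v is not None and abs(new_v - v) < threshold:
                skip_next = True
            v = new_v
        x = x + dt * v
    return x, evals


# A constant velocity field is perfectly straight, so many evaluations are skipped.
final_x, used = integrate_with_skips(lambda x, t: 1.0, 0.0, 10, 0.1, 1e-3)
```

Setting `threshold` to zero disables skipping, which recovers the dense fixed-step trajectory for comparison.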
On the Emotion Understanding of Synthesized Speech
Yuan Ge | Haishu Zhao | AoKai Hao | Junxiang Zhang | Bei Li | Xiaoqian Liu | Chenglong Wang | Jianjin Wang | Bingsen Zhou | Bingyu Liu | JingBo Zhu | Zhengtao Yu | Tong Xiao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.
2025
Step-level Verifier-guided Hybrid Test-Time Scaling for Large Language Models
Kaiyan Chang | Yonghao Shi | Chenglong Wang | Hang Zhou | Chi Hu | Xiaoqian Liu | Yingfeng Luo | Yuan Ge | Tong Xiao | JingBo Zhu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Test-Time Scaling (TTS) is a promising approach to progressively elicit the model’s intelligence during inference. Recently, training-based TTS methods, such as continued reinforcement learning (RL), have further surged in popularity, while training-free TTS methods are gradually fading from prominence. However, the additional computation overhead of training amplifies the burden on test-time scaling. In this paper, we focus on training-free TTS methods for reasoning. We first design Conditional Step-level Self-refinement, a fine-grained sequential scaling method guided by process verification. Building on its effectiveness, we further combine it with other classical parallel scaling methods at the step level to introduce a novel inference paradigm called Hybrid Test-Time Scaling. Extensive experiments on five instruction-tuned LLMs across different scales (3B-14B) and families demonstrate that a hybrid strategy incorporating various training-free TTS methods at a fine granularity has considerable potential for expanding the reasoning performance boundaries of LLMs.
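One way to picture a step that mixes parallel and sequential scaling is sketched below. It is an assumption-laden illustration rather than the paper's procedure: `hybrid_step`, `propose`, `verify`, and `refine` are hypothetical names, the callables stand in for LLM sampling, a process verifier, and a self-refinement prompt, and the acceptance rule is a plain score threshold.

```python
def hybrid_step(propose, verify, refine, state,
                num_candidates, accept_score, max_refinements):
    """Choose the next reasoning step using parallel sampling plus conditional refinement.

    propose(state, i) -> candidate step (parallel scaling: several samples)
    verify(state, candidate) -> float score from a step-level verifier
    refine(state, candidate) -> revised candidate (sequential scaling)
    """
    # Parallel part: sample several candidates and keep the best-scored one.
    candidates = [propose(state, i) for i in range(num_candidates)]
    best = max(candidates, key=lambda c: verify(state, c))

    # Sequential part: refine only while the verifier is not yet satisfied.
    for _ in range(max_refinements):
        if verify(state, best) >= accept_score:
            break
        best = refine(state, best)
    return best


# Toy usage: the "state" is a target number and candidates are guesses.
target = 10
step = hybrid_step(
    propose=lambda s, i: 2 * i,          # guesses 0, 2, 4
    verify=lambda s, c: -abs(s - c),     # closer to the target scores higher
    refine=lambda s, c: c + 3,           # a fixed, made-up revision rule
    state=target,
    num_candidates=3,
    accept_score=-1,
    max_refinements=5,
)
```

The refinement loop is conditional: a candidate that already satisfies the verifier is returned without any extra calls.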
2024
RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners
Chi Hu | Yuan Ge | Xiangnan Ma | Hang Cao | Qiang Li | Yonghua Yang | Tong Xiao | Jingbo Zhu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large Language Models (LLMs) have achieved impressive performance across various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT are prone to logical errors during their reasoning processes. Existing solutions, such as deploying task-specific verifiers or voting over multiple reasoning paths, either require extensive human annotations or fail in scenarios with inconsistent responses. To address these challenges, we introduce RankPrompt, a new prompting method that enables LLMs to self-rank their responses without additional resources. RankPrompt breaks down the ranking problem into a series of comparisons among diverse responses, leveraging the inherent capabilities of LLMs to generate chains of comparison as contextual exemplars. Our experiments across 11 arithmetic and commonsense reasoning tasks show that RankPrompt significantly enhances the reasoning performance of ChatGPT and GPT-4, with improvements of up to 13%. Moreover, RankPrompt excels in LLM-based automatic evaluations for open-ended tasks, aligning with human judgments 74% of the time in the AlpacaEval dataset. It also exhibits robustness to variations in response order and consistency. Collectively, our results validate RankPrompt as an effective method for eliciting high-quality feedback from language models.
Clustering and Ranking: Diversity-preserved Instruction Selection through Expert-aligned Quality Estimation
Yuan Ge | Yilun Liu | Chi Hu | Weibin Meng | Shimin Tao | Xiaofeng Zhao | Mahong Xia | Zhang Li | Boxing Chen | Hao Yang | Bei Li | Tong Xiao | JingBo Zhu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
With contributions from the open-source community, a vast amount of instruction tuning (IT) data has emerged. Given the significant resource allocation required by training and evaluating models, it is advantageous to have an efficient method for selecting high-quality IT data. However, existing methods for instruction data selection have limitations such as relying on fragile external APIs, being affected by biases in GPT models, or reducing the diversity of the selected instruction dataset. In this paper, we propose an industrial-friendly, expert-aligned and diversity-preserved instruction data selection method: Clustering and Ranking (CaR). CaR consists of two steps. The first step involves ranking instruction pairs using a scoring model that is well aligned with expert preferences (achieving an accuracy of 84.25%). The second step involves preserving dataset diversity through a clustering process. In our experiment, CaR selected a subset containing only 1.96% of Alpaca’s IT data, yet the underlying AlpaCaR model trained on this subset outperforms Alpaca by an average of 32.1% in GPT-4 evaluations. Furthermore, our method utilizes small models (550M parameters) and requires only 11.2% of the monetary cost compared to existing methods, making it easily deployable in industrial scenarios.
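The two steps described above, quality ranking followed by diversity preservation through clustering, could be combined roughly as follows. This is a minimal sketch under stated assumptions: quality scores and cluster assignments are taken as precomputed inputs, and the function name `select_diverse` and the parameters `top_overall` and `top_per_cluster` are hypothetical, not the selection budget used by CaR.

```python
from collections import defaultdict


def select_diverse(items, scores, cluster_ids, top_overall, top_per_cluster):
    """Keep the globally best items plus the best few from every cluster.

    items: list of instruction examples
    scores: quality score per item (higher is better), e.g. from a scoring model
    cluster_ids: cluster label per item, e.g. from clustering sentence embeddings
    """
    # Step 1: rank all items by quality score.
    order = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:top_overall])

    # Step 2: make sure every cluster contributes its best members.
    by_cluster = defaultdict(list)
    for i in order:                      # preserves score order within each cluster
        by_cluster[cluster_ids[i]].append(i)
    for members in by_cluster.values():
        chosen.update(members[:top_per_cluster])

    # Return in original dataset order for reproducibility.
    return [items[i] for i in sorted(chosen)]


subset = select_diverse(
    items=["a", "b", "c", "d", "e"],
    scores=[0.9, 0.8, 0.1, 0.2, 0.7],
    cluster_ids=[0, 0, 1, 1, 0],
    top_overall=2,
    top_per_cluster=1,
)
```

In this toy run the low-scoring cluster still contributes its best item, which a pure top-k cut by score would have dropped.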
Co-authors
- Tong Xiao (肖桐) 5
- JingBo Zhu (朱靖波) 5
- Chi Hu 3
- Xiaoqian Liu 3
- Bei Li 2
- Jianjin Wang 2
- Zhengtao Yu (余正涛) 2
- Hang Cao 1
- Kaiyan Chang 1
- Boxing Chen 1
- Zhengkun Ge 1
- AoKai Hao 1
- Qiang Li 1
- Zhang Li 1
- Yilun Liu 1
- Bingyu Liu 1
- Yingfeng Luo 1
- Xiangnan Ma 1
- Weibin Meng 1
- Yonghao Shi 1
- Shimin Tao 1
- Chenglong Wang 1
- Mahong Xia 1
- Chen Xu 1
- Yonghua Yang 1
- Hao Yang 1
- Haoran Zhang 1
- Linfeng Zhang 1
- Junxiang Zhang 1
- Xiaofeng Zhao 1
- Haishu Zhao 1
- Hang Zhou 1
- Bingsen Zhou 1