Xianlong Luo
2026
MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free
Yishu Lei | Shuwei He | Hu Jing | Dan Zhang | Xianlong Luo | Danxiang Zhu | Shikun Feng | Rui Liu | Jingzhou HE | Yu Sun | Hua Wu | Haifeng Wang
Findings of the Association for Computational Linguistics: ACL 2026
Extending the input modality of Large Language Models (LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, acoustic information is intrinsically heterogeneous, entangling attributes such as speech, music, and environmental context. Existing approaches rely on a dense, parameter-shared adapter to model these diverse patterns, which induces gradient conflict during optimization because the parameter updates required for distinct attributes contradict each other. To address this limitation, we introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces while retaining shared experts for global context, thereby mitigating gradient conflicts and enabling fine-grained feature learning. Comprehensive experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines at comparable computational cost. To facilitate future research, our code is publicly available at https://github.com/Alittleegg/Eureka-Audio.
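The routing idea in this abstract (a gate that sends each audio token to a few specialized experts, plus a shared expert that every token passes through) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `MoEAdapterSketch`, the use of single linear maps as experts, the top-k softmax gate with renormalized weights, and all dimensions are assumptions chosen for illustration only.

```python
import math
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class MoEAdapterSketch:
    """Hypothetical sparse adapter: one shared expert plus top-k routed experts."""

    def __init__(self, d_in, d_out, n_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.gate = rand_matrix(n_experts, d_in, rng)    # router weights (assumed linear)
        self.experts = [rand_matrix(d_out, d_in, rng) for _ in range(n_experts)]
        self.shared = rand_matrix(d_out, d_in, rng)      # always-active expert

    def route(self, x):
        """Return (expert_index, weight) pairs for the top-k experts of one token."""
        probs = softmax(matvec(self.gate, x))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[: self.top_k]
        norm = sum(probs[i] for i in top)
        return [(i, probs[i] / norm) for i in top]

    def forward(self, x):
        """Project one token: shared path plus weighted sum of the selected experts."""
        out = matvec(self.shared, x)                     # global-context path
        for i, w in self.route(x):                       # sparse routed path
            y = matvec(self.experts[i], x)
            out = [o + w * v for o, v in zip(out, y)]
        return out

adapter = MoEAdapterSketch(d_in=8, d_out=4)
token = [0.5] * 8                                        # stand-in for one audio feature vector
routing = adapter.route(token)
projected = adapter.forward(token)
```

Because only `top_k` of the routed experts run for each token, the per-token cost stays close to that of a dense projection while different tokens can update different expert weights.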
CORD: Bridging the Audio–Text Reasoning Gap via Weighted On-policy Cross-modal Distillation
Hu Jing | Danxiang Zhu | Xianlong Luo | Dan Zhang | Shuwei He | Yishu Lei | Shikun Feng | Hai-Tao Zheng | Jingzhou HE | Yu Sun | Hua Wu | Haifeng Wang
Findings of the Association for Computational Linguistics: ACL 2026
Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space. To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process. At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and semantically critical tokens. At the sequence level, CORD introduces a judge-based global reward to optimize complete reasoning trajectories via Group Relative Policy Optimization (GRPO). Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning and substantially bridges the audio–text performance gap with only 80k synthetic training samples, validating the efficacy and data efficiency of our on-policy, multi-level cross-modal alignment approach.
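The token-level objective described in this abstract, a reverse KL divergence between the audio-conditioned distribution (student) and the text-conditioned distribution (teacher) with heavier weight on early tokens, can be sketched as follows. This is only an illustration under stated assumptions: the geometric decay in `position_weights` is a hypothetical stand-in for the paper's importance-aware weighting, the probability vectors are toy values, and the on-policy sampling of the rollout and the sequence-level GRPO reward are not modeled.

```python
import math

def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) for two probability vectors over the same vocabulary."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(student, teacher)
        if p > 0
    )

def position_weights(n, decay=0.9):
    """Hypothetical weighting: geometric decay over positions, normalized to sum to 1."""
    raw = [decay ** t for t in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_reverse_kl(student_seq, teacher_seq, decay=0.9):
    """Position-weighted sum of per-token reverse KL terms along one rollout."""
    weights = position_weights(len(student_seq), decay)
    return sum(
        w * reverse_kl(s, t)
        for w, s, t in zip(weights, student_seq, teacher_seq)
    )

# Toy next-token distributions over a 3-word vocabulary, two positions each.
audio_probs = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]]   # audio-conditioned (student)
text_probs = [[0.6, 0.3, 0.1], [0.5, 0.3, 0.2]]    # text-conditioned (teacher)
loss = weighted_reverse_kl(audio_probs, text_probs)
```

The loss is zero when the two distributions agree at every position, and a mismatch at an early position contributes more than the same mismatch later in the sequence.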
2024
Overcome Noise and Bias: Segmentation-Aided Multi-Granularity Denoising and Debiasing for Enhanced Quarduples Extraction in Dialogue
Xianlong Luo | Meng Yang | Yihao Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Dialogue Aspect-based Sentiment Quadruple analysis (DiaASQ) extends ABSA to more complex real-world scenarios (i.e., dialogues), which makes existing generation methods encounter heightened noise and order bias challenges, leading to decreased robustness and accuracy. To address these, we propose the Segmentation-Aided multi-grained Denoising and Debiasing (SADD) method. For noise, we propose the Multi-Granularity Denoising Generation model (MGDG), achieving word-level denoising via sequence labeling and utterance-level denoising via topic-aware dialogue segmentation. Denoised Attention in MGDG integrates multi-grained denoising information to help generate denoised output. For order bias, we first theoretically analyze its direct cause as the gap between ideal and actual training objectives and propose a distribution-based solution. Since this solution introduces a one-to-many learning challenge, our proposed Segmentation-aided Order Bias Mitigation (SOBM) method utilizes dialogue segmentation to supplement order diversity, concurrently mitigating this challenge and order bias. Experiments demonstrate SADD’s effectiveness, achieving state-of-the-art results with a 6.52% F1 improvement.