Closing the Modality Reasoning Gap for Speech Large Language Models

Chaoren Wang; Heng Lu; Xueyao Zhang; Shujie Liu; Yan Lu; Jinyu Li; Zhizheng Wu

Abstract

Although Speech Large Language Models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.
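The representation-alignment signal described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or how hidden states are pooled, so everything below is an assumption made for illustration: each layer's hidden states are taken to be already pooled into a single vector per trajectory (for example, a mean over positions, since speech and text sequences differ in length), similarity is taken to be cosine similarity, and the per-layer scores are averaged. The function names `cosine_similarity` and `representation_alignment` are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def representation_alignment(
    speech_layers: Sequence[Vector],
    text_layers: Sequence[Vector],
) -> float:
    """Mean per-layer cosine similarity between pooled hidden states.

    Assumption (not from the paper): each entry is one layer's hidden
    state, already pooled to a single vector for that trajectory.
    """
    if not speech_layers or len(speech_layers) != len(text_layers):
        raise ValueError("need the same non-zero number of layers for both inputs")
    sims = [cosine_similarity(s, t) for s, t in zip(speech_layers, text_layers)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy two-layer example: layer 0 agrees in direction, layer 1 is orthogonal.
    speech = [[1.0, 0.0], [1.0, 0.0]]
    text = [[2.0, 0.0], [0.0, 1.0]]
    print(representation_alignment(speech, text))
```

Under these assumptions the score lies in [-1, 1] and is dense in the sense that every layer contributes, so a drift that appears only in some layers still lowers the reward. How the actual framework weights layers or combines this signal with behavior alignment is not stated in the abstract.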

Anthology ID: 2026.acl-long.857
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 18821–18835
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.857/
Cite (ACL): Chaoren Wang, Heng Lu, Xueyao Zhang, Shujie Liu, Yan Lu, Jinyu Li, and Zhizheng Wu. 2026. Closing the Modality Reasoning Gap for Speech Large Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18821–18835, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Closing the Modality Reasoning Gap for Speech Large Language Models (Wang et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.857.pdf
Checklist: 2026.acl-long.857.checklist.pdf
