Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding

Jieyi Wang, Yazhe Niu, Dexuan Xu, Zhongyu Wei


Abstract
Recent Large Audio Language Models (LALMs) have shown strong capabilities in audio understanding, yet their reasoning remains vulnerable to perceptual errors, especially in noisy and multi-speaker environments. We argue that reliable audio reasoning requires first grounding the model’s perception in structured auditory scenes. Motivated by Auditory Scene Analysis, we introduce **PAQA**, a large-scale dataset for **Perception-Aware Question Answering** covering over 300 categories. PAQA adopts a hierarchical decoupling strategy that separates speech from environmental sounds and distinguishes among multiple speakers, providing explicit perceptual supervision for audio reasoning. Building on this, we propose **HyPeR**, a two-stage **Hybrid Perception-Reasoning** framework for perception-grounded audio understanding. In Stage I, the model is fine-tuned on PAQA as a cold start to improve its perception of acoustic attributes in complex auditory scenes. In Stage II, we further refine its internal reasoning via **Group Relative Policy Optimization (GRPO)**. To support deliberation under acoustic ambiguity, we introduce **PAUSE tokens** for latent computation and a **Perceptual Consistency Reward** to align reasoning rationales with the underlying audio evidence. Extensive ablation studies isolate the effects of the perception-attention mechanism, self-correction module, and pause-based reasoning strategy. Experiments on multiple benchmarks show that HyPeR consistently improves over the base model, including on MMAU-mini (+13.1%), MMAR (+25.5%), and PAQA (+28.2%), while achieving performance comparable to much larger models. Additional analyses of inference latency and computational overhead show that these gains come with acceptable efficiency trade-offs. Overall, our results demonstrate the effectiveness of hybrid perception-grounded reasoning for robust audio understanding.
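As a rough illustration of the Stage II idea, the sketch below shows the group-relative advantage that GRPO is commonly described as using: each sampled answer's reward is normalized against the mean and standard deviation of its own sampling group. The function name, the `eps` constant, and the reward values are assumptions made for this example only; the abstract does not specify how HyPeR computes or combines its rewards (including the Perceptual Consistency Reward), so this is not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per sampled response to the same
    prompt. `eps` (an assumed constant) avoids division by zero when all
    rewards in the group are identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one audio question.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, so no separate value model is needed to produce a baseline.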
Anthology ID:
2026.findings-acl.1776
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
35653–35671
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1776/
Cite (ACL):
Jieyi Wang, Yazhe Niu, Dexuan Xu, and Zhongyu Wei. 2026. Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding. In Findings of the Association for Computational Linguistics: ACL 2026, pages 35653–35671, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding (Wang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1776.pdf
Checklist:
 2026.findings-acl.1776.checklist.pdf