Jieyi Wang

2026

Listen, Pause, and Reason: Toward Perception-Grounded Hybrid Reasoning for Audio Understanding
Jieyi Wang | Yazhe Niu | Dexuan Xu | Zhongyu Wei
Findings of the Association for Computational Linguistics: ACL 2026

Recent Large Audio Language Models (LALMs) have shown strong capabilities in audio understanding, yet their reasoning remains vulnerable to perceptual errors, especially in noisy and multi-speaker environments. We argue that reliable audio reasoning requires first grounding the model’s perception in structured auditory scenes. Motivated by Auditory Scene Analysis, we introduce **PAQA**, a large-scale dataset for **Perception-Aware Question Answering** covering over 300 categories. PAQA adopts a hierarchical decoupling strategy that separates speech from environmental sounds and distinguishes among multiple speakers, providing explicit perceptual supervision for audio reasoning. Building on this, we propose **HyPeR**, a two-stage **Hybrid Perception-Reasoning** framework for perception-grounded audio understanding. In Stage I, the model is fine-tuned on PAQA for cold start to improve perception of acoustic attributes in complex auditory scenes. In Stage II, we further refine its internal reasoning via **Group Relative Policy Optimization (GRPO)**. To support deliberation under acoustic ambiguity, we introduce **PAUSE tokens** for latent computation and a **Perceptual Consistency Reward** to align reasoning rationales with the underlying audio evidence. Extensive ablation studies isolate the effects of the perception-attention mechanism, self-correction module, and pause-based reasoning strategy. Experiments on multiple benchmarks show that HyPeR consistently improves over the base model, including on MMAU-mini (+13.1%), MMAR (+25.5%), and PAQA (+28.2%), while achieving performance comparable to much larger models. Additional analyses of inference latency and computational overhead show that these gains come with acceptable efficiency trade-offs. Overall, our results demonstrate the effectiveness of hybrid perception-grounded reasoning for robust audio understanding.


SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation
Sirry Chen | Jieyi Wang | Wei Chen | Zhongyu Wei
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Medical consultations are intrinsically speech-centric. However, most prior work focuses on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of directly fine-tuning on speech data jointly hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of **(1) Knowledge Capability Injection via Text** and **(2) Modality Re-alignment with Limited Speech Data**, thereby reducing the requirement for medical speech data to only **10k** synthesized samples. To evaluate SpeechLMs for medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.

2024


Medical visual question answering (MVQA) requires in-depth understanding of medical images and questions to provide reliable answers. We summarize the multi-level progressive capabilities that models need to focus on in MVQA: recognition, details, diagnosis, knowledge, and reasoning. Existing MVQA models tend to neglect these capabilities because of non-specific data and plain architectures. To address these issues, this paper proposes the Multi-level Visual Language Model (MLeVLM) for MVQA. On the data side, we construct a high-quality multi-level instruction dataset, MLe-VQA, via GPT-4, which covers multi-level questions and answers as well as reasoning processes from visual clues to semantic cognition. On the architecture side, we propose a multi-level feature alignment module, including an attention-based token selector and a context merger, which can efficiently align features at different levels from visual to semantic. To better evaluate the model’s capabilities, we manually construct a multi-level MVQA evaluation benchmark named MLe-Bench. Extensive experiments demonstrate the effectiveness of our constructed multi-level instruction dataset and the multi-level feature alignment module. The experiments also show that MLeVLM outperforms existing medical multimodal large language models.

Co-authors

Jing He 1

Yue Huang 1

Yu Huang 1

Zhi Jin 1

Hang Li 1

Venues

Findings 2
ACL 1
