Jieyi Wang


2026

Recent Large Audio Language Models (LALMs) have shown strong capabilities in audio understanding, yet their reasoning remains vulnerable to perceptual errors, especially in noisy and multi-speaker environments. We argue that reliable audio reasoning requires first grounding the model’s perception in structured auditory scenes. Motivated by Auditory Scene Analysis, we introduce **PAQA**, a large-scale dataset for **Perception-Aware Question Answering** covering over 300 categories. PAQA adopts a hierarchical decoupling strategy that separates speech from environmental sounds and distinguishes among multiple speakers, providing explicit perceptual supervision for audio reasoning. Building on this, we propose **HyPeR**, a two-stage **Hybrid Perception-Reasoning** framework for perception-grounded audio understanding. In Stage I, the model is fine-tuned on PAQA as a cold start to improve its perception of acoustic attributes in complex auditory scenes. In Stage II, we further refine its internal reasoning via **Group Relative Policy Optimization (GRPO)**. To support deliberation under acoustic ambiguity, we introduce **PAUSE tokens** for latent computation and a **Perceptual Consistency Reward** to align reasoning rationales with the underlying audio evidence. Extensive ablation studies isolate the effects of the perception-attention mechanism, self-correction module, and pause-based reasoning strategy. Experiments on multiple benchmarks show that HyPeR consistently improves over the base model, including on MMAU-mini (+13.1%), MMAR (+25.5%), and PAQA (+28.2%), while achieving performance comparable to much larger models. Additional analyses of inference latency and computational overhead show that these gains come with acceptable efficiency trade-offs. Overall, our results demonstrate the effectiveness of hybrid perception-grounded reasoning for robust audio understanding.

Medical consultations are intrinsically speech-centric. However, most prior work focuses on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of fine-tuning directly on speech data together hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of **(1) Knowledge Capability Injection via Text** and **(2) Modality Re-alignment with Limited Speech Data**, thereby reducing the requirement for medical speech data to only **10k** synthesized samples. To evaluate SpeechLMs in medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.

2024

Medical visual question answering (MVQA) requires in-depth understanding of medical images and questions to provide reliable answers. We summarize the multi-level, progressive capabilities that models need in MVQA: recognition, details, diagnosis, knowledge, and reasoning. Existing MVQA models tend to neglect these capabilities because of non-specific training data and plain architectures. To address these issues, this paper proposes the Multi-level Visual Language Model (MLeVLM) for MVQA. On the data side, we construct a high-quality multi-level instruction dataset, MLe-VQA, via GPT-4, which covers multi-level questions and answers as well as reasoning processes from visual clues to semantic cognition. On the architecture side, we propose a multi-level feature alignment module, comprising an attention-based token selector and a context merger, which efficiently aligns features at different levels from visual to semantic. To better evaluate the model’s capabilities, we manually construct a multi-level MVQA evaluation benchmark named MLe-Bench. Extensive experiments demonstrate the effectiveness of both the multi-level instruction dataset and the multi-level feature alignment module. They also show that MLeVLM outperforms existing medical multimodal large language models.