Dexuan Xu


2026

Recent Large Audio Language Models (LALMs) have shown strong capabilities in audio understanding, yet their reasoning remains vulnerable to perceptual errors, especially in noisy and multi-speaker environments. We argue that reliable audio reasoning requires first grounding the model’s perception in structured auditory scenes. Motivated by Auditory Scene Analysis, we introduce **PAQA**, a large-scale dataset for **Perception-Aware Question Answering** covering over 300 categories. PAQA adopts a hierarchical decoupling strategy that separates speech from environmental sounds and distinguishes among multiple speakers, providing explicit perceptual supervision for audio reasoning. Building on this, we propose **HyPeR**, a two-stage **Hybrid Perception-Reasoning** framework for perception-grounded audio understanding. In Stage I, the model is fine-tuned on PAQA as a cold start to improve perception of acoustic attributes in complex auditory scenes. In Stage II, we further refine its internal reasoning via **Group Relative Policy Optimization (GRPO)**. To support deliberation under acoustic ambiguity, we introduce **PAUSE tokens** for latent computation and a **Perceptual Consistency Reward** to align reasoning rationales with the underlying audio evidence. Extensive ablation studies isolate the effects of the perception-attention mechanism, self-correction module, and pause-based reasoning strategy. Experiments on multiple benchmarks show that HyPeR consistently improves over the base model, including on MMAU-mini (+13.1%), MMAR (+25.5%), and PAQA (+28.2%), while achieving performance comparable to much larger models. Additional analyses of inference latency and computational overhead show that these gains come with acceptable efficiency trade-offs. Overall, our results demonstrate the effectiveness of hybrid perception-grounded reasoning for robust audio understanding.
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in interpreting single medical images. However, real-world clinical diagnosis is intrinsically a multi-view process, requiring the synthesis of information across volumetric slices, temporal sequences, and comparative modalities. Existing benchmarks fail to capture this complexity, limiting the assessment of models in realistic clinical workflows. To bridge this gap, we introduce MedMultiBench, the first large-scale benchmark specifically designed for medical multi-image understanding. Comprising 11,392 expert-curated samples, MedMultiBench evaluates MLLMs across four distinct dimensions: Joint Reasoning, Comparative Analysis, Comprehensive Perception, and In-Context Learning. We benchmark 13 state-of-the-art MLLMs, revealing that while current models excel in single-view tasks, they struggle significantly with multi-image contexts. Our experiments identify a performance degradation in open-source models when processing increased visual loads, whereas closed-source models demonstrate better scalability. MedMultiBench provides a robust framework to facilitate the development of MLLMs capable of holistic clinical reasoning.

2025

Existing large language model (LLM) agents for automating data science show promise, but they remain constrained by narrow task scopes, limited generalization across tasks and models, and over-reliance on state-of-the-art (SOTA) LLMs. We introduce DatawiseAgent, a notebook-centric LLM agent framework for adaptive and robust data science automation. Inspired by how human data scientists work in computational notebooks, DatawiseAgent introduces a unified interaction representation and a multi-stage architecture based on finite-state transducers (FSTs). This design enables flexible long-horizon planning, progressive solution development, and robust recovery from execution failures. Extensive experiments across diverse data science scenarios and models show that DatawiseAgent consistently achieves SOTA performance, surpassing strong baselines such as AutoGen and TaskWeaver and demonstrating superior effectiveness and adaptability. Further evaluations reveal graceful performance degradation under weaker or smaller models, underscoring the framework's robustness and scalability.

2024

Medical visual question answering (MVQA) requires in-depth understanding of medical images and questions to provide reliable answers. We summarize the multi-level progressive capabilities that models need to focus on in MVQA: recognition, details, diagnosis, knowledge, and reasoning. Existing MVQA models tend to neglect these capabilities because of non-specific data and plain architectures. To address these issues, this paper proposes the Multi-level Visual Language Model (MLeVLM) for MVQA. On the data side, we construct a high-quality multi-level instruction dataset, MLe-VQA, via GPT-4, which covers multi-level questions and answers as well as reasoning processes from visual clues to semantic cognition. On the architecture side, we propose a multi-level feature alignment module, comprising an attention-based token selector and a context merger, which can efficiently align features at different levels from visual to semantic. To better evaluate the model’s capabilities, we manually construct a multi-level MVQA evaluation benchmark named MLe-Bench. Extensive experiments demonstrate the effectiveness of our multi-level instruction dataset and the multi-level feature alignment module. They also show that MLeVLM outperforms existing medical multimodal large language models.