Yuxuan Jiang



2026

Multimodal Sentiment Analysis (MSA) aims to infer human sentiment from textual, acoustic, and visual signals. In real-world scenarios, however, multimodal inputs are often compromised by dynamic noise or modality missingness. Existing methods typically treat these imperfections as discrete cases or assume fixed corruption ratios, which limits their adaptability to continuously varying reliability conditions. To address this, we first introduce a Continuous Reliability Spectrum to unify missingness and quality degradation into a single framework. Building on this, we propose QA-MoE, a Quality-Aware Mixture-of-Experts framework that quantifies modality reliability via self-supervised aleatoric uncertainty. This mechanism explicitly guides expert routing, enabling the model to suppress error propagation from unreliable signals while preserving task-relevant information. Extensive experiments indicate that QA-MoE achieves competitive or state-of-the-art performance across diverse degradation scenarios and exhibits a promising One-Checkpoint-for-All property in practice.
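
The uncertainty-guided routing described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a predicted log-variance as the aleatoric uncertainty signal, and the rule of subtracting that log-variance from the router logits are all assumptions made for illustration. The sketch treats a missing modality as having infinite uncertainty, so missingness and degradation fall on the same scale.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats (-inf entries get weight 0)."""
    top = max(scores)
    if top == -math.inf:
        raise ValueError("at least one score must be finite")
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def quality_aware_gate(router_logits, log_variances):
    """Hypothetical gate: down-weight experts whose modality looks unreliable.

    router_logits: one raw routing score per modality expert.
    log_variances: one predicted log-variance per modality (assumed uncertainty
                   estimate); math.inf marks a fully missing modality.
    """
    # Assumed rule: higher uncertainty lowers the routing score.
    adjusted = [logit - lv for logit, lv in zip(router_logits, log_variances)]
    return softmax(adjusted)

# Text and audio are equally reliable; the visual stream is missing.
gate = quality_aware_gate([0.0, 0.0, 0.0], [0.0, 0.0, math.inf])
```

Because the penalty is continuous, a single gate of this kind covers clean, noisy, and missing inputs without a separate branch for each case, which is one way to read the One-Checkpoint-for-All property mentioned above.
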
Recent efforts on text-to-audio (TTA) generation are starting to explore fine-grained controllability, e.g., precise timing control, with innovations on conditioning techniques or training-free latent manipulations. However, constrained by data scarcity, their generation performance at scale is still limited. In this study, we recast high-controllability TTA generation as a multi-task learning problem, and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a scalable diffusion transformer (DiT) on large-scale text-audio pairs, achieving high-fidelity TTA generation, and then incrementally integrate the timing and phoneme features, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information, aligning inherently with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations. Demo samples are available at: https://control-audio.github.io/Control-Audio.
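
One possible reading of progressively guided generation is a classifier-free-guidance style combination whose weights shift toward finer conditions as sampling proceeds. The sketch below is an assumption for illustration, not ControlAudio's actual procedure: the linear ramp schedule, the `max_scale` value, the function names, and the nested guidance formula are all hypothetical, and the model outputs are stand-in lists of floats.

```python
def guidance_weights(step, num_steps, max_scale=3.0):
    """Hypothetical schedule returning (text, timing, phoneme) guidance scales.

    Text guidance is active from the first step; timing ramps in during the
    middle third of sampling and phoneme guidance during the final third.
    """
    progress = step / max(num_steps - 1, 1)

    def ramp(start):
        # 0 before `start`, rising linearly to 1 over one third of sampling.
        return min(max((progress - start) * 3.0, 0.0), 1.0)

    return (max_scale, max_scale * ramp(1 / 3), max_scale * ramp(2 / 3))

def combine_predictions(uncond, text, timing, phoneme, weights):
    """Nested guidance: each term adds the direction contributed by a finer condition.

    `uncond`, `text`, `timing`, and `phoneme` are equal-length lists holding the
    model's prediction under progressively richer conditioning (assumed inputs).
    """
    w_text, w_timing, w_phoneme = weights
    return [
        u + w_text * (t - u) + w_timing * (ti - t) + w_phoneme * (p - ti)
        for u, t, ti, p in zip(uncond, text, timing, phoneme)
    ]

# Early in sampling only the coarse text condition steers the prediction.
early = guidance_weights(0, 10)
late = guidance_weights(9, 10)
```

With all three weights set to 1 the terms telescope and the result equals the fully conditioned prediction, so the schedule only controls how strongly each finer condition is emphasized at a given step.
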