Yan Fang


2026

Severe acoustic degradation, often caused by overlapping noise, disfluencies, and environmental distortions, dissolves linguistic structure and leads to unreliable ASR outputs. Inspired by human speech comprehension, we propose Speech-MLM, a novel multimodal framework that reframes ASR as semantics-guided speech reconstruction. This perspective introduces three core challenges: (C1) collapse of linguistic structure under acoustic degradation, (C2) semantic ambiguity under noise, and (C3) misalignment across modalities. To address these challenges, Speech-MLM integrates speech, spectrogram-derived visual cues, and textual variants to enhance robustness. It consists of (i) a Cognitive Structure Extractor that recovers prosodic structure from visualized acoustic features, (ii) a Semantic Weaver that learns semantic equivalence across varied textual forms, and (iii) a Retrieval-Guided Fusion Learner that unifies modalities within a shared semantic space. Experiments on multiple real-world noisy datasets show that, relative to advanced baselines, Speech-MLM achieves an average 38.85% reduction in WER while also attaining 98.71% BERTScore and 96.7% USE, demonstrating substantial gains in semantic robustness and generalization across domains.

2020

We present ConvLab-2, an open-source toolkit that enables researchers to build task-oriented dialogue systems with state-of-the-art models, perform end-to-end evaluation, and diagnose system weaknesses. As the successor of ConvLab, ConvLab-2 inherits ConvLab’s framework but integrates more powerful dialogue models and supports more datasets. In addition, we have developed an analysis tool and an interactive tool to assist researchers in diagnosing dialogue systems. The analysis tool presents rich statistics and summarizes common mistakes from simulated dialogues, which facilitates error analysis and system improvement. The interactive tool provides a user interface that allows developers to diagnose an assembled dialogue system by interacting with it and modifying the output of each system component.

2012