Bach Phan Tat
2025
MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder
Khai Le-Duc | Phuc Phan | Tan-Hanh Pham | Bach Phan Tat | Minh-Huong Ngo | Thanh Nguyen-Tang | Truong-Son Hy
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology improves patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. To the best of our knowledge, MultiMed stands as the world's largest medical ASR dataset across all major benchmarks: total duration, number of recording conditions, number of accents, and number of speaking roles. Furthermore, we present the first multilinguality study for medical ASR, which includes reproducible empirical baselines, a monolinguality-multilinguality analysis, an Attention Encoder Decoder (AED) vs. Hybrid comparative study, and a linguistic analysis. We also present practical end-to-end ASR training schemes optimized for a fixed number of trainable parameters, a constraint common in industry settings. All code, data, and models are available online.
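A minimal sketch of what end-to-end AED fine-tuning for multilingual medical ASR under a fixed trainable-parameter budget could look like is given below. It assumes Whisper (an AED architecture) via Hugging Face Transformers as the backbone and freezes the encoder so that only decoder parameters are updated; the checkpoint name, language setting, and hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch: fine-tuning a multilingual AED ASR model (Whisper) on
# medical speech. The actual MultiMed training setup may differ; everything
# below (checkpoint, language, learning rate) is an illustrative assumption.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Target language/task tokens for the decoder prompt (Vietnamese here, as an example).
processor.tokenizer.set_prefix_tokens(language="vietnamese", task="transcribe")

# Fix the trainable-parameter budget by freezing the encoder and
# updating only the decoder (one possible industry-style constraint).
for p in model.model.encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

def training_step(waveform, sampling_rate, transcript):
    """One supervised step on a single (audio, transcript) pair."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    out = model(input_features=inputs.input_features, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```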
Sentiment Reasoning for Healthcare
Khai-Nguyen Nguyen | Khai Le-Duc | Bach Phan Tat | Le Duy | Long Vo-Dang | Truong-Son Hy
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Transparency in AI healthcare decision-making is crucial. By incorporating rationales that explain the reasoning behind each predicted label, users can better understand how Large Language Models (LLMs) arrive at their decisions. In this work, we introduce a new task, Sentiment Reasoning, for both speech and text modalities, together with our proposed multimodal multitask framework and the world's largest multimodal sentiment analysis dataset. Sentiment Reasoning is an auxiliary task in sentiment analysis where the model predicts both the sentiment label and the rationale behind it based on the input transcript. Our study, conducted on both human transcripts and Automatic Speech Recognition (ASR) transcripts, shows that Sentiment Reasoning improves model transparency by providing rationales for model predictions whose quality is semantically comparable to human rationales, while also improving classification performance (a +2% increase in both accuracy and macro-F1) via rationale-augmented fine-tuning. Moreover, there is no significant difference in the semantic quality of generated rationales between human and ASR transcripts. All code, data (in five languages: Vietnamese, English, Chinese, German, and French), and models are published online.
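A minimal sketch of how rationale-augmented fine-tuning data for Sentiment Reasoning could be formatted is given below, assuming a text-to-text setup where the model reads a (human or ASR) transcript and emits both the sentiment label and a free-text rationale. The `SentimentReasoningExample` fields and the prompt/target templates are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical sketch: turning a Sentiment Reasoning example into an
# (input, target) pair for rationale-augmented seq2seq fine-tuning.
from dataclasses import dataclass

@dataclass
class SentimentReasoningExample:
    transcript: str   # human or ASR transcript of the utterance
    label: str        # e.g. "positive", "neutral", "negative"
    rationale: str    # free-text explanation supporting the label

def to_seq2seq_pair(ex: SentimentReasoningExample) -> tuple[str, str]:
    """Format one example: the target jointly encodes label and rationale."""
    source = f"Classify the sentiment and explain why: {ex.transcript}"
    target = f"label: {ex.label} | rationale: {ex.rationale}"
    return source, target

example = SentimentReasoningExample(
    transcript="The doctor explained everything clearly and I feel reassured.",
    label="positive",
    rationale="The speaker expresses reassurance after a clear explanation.",
)
print(to_seq2seq_pair(example))
```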