Chris Ngo
2026
Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR
Quy-Anh Dang | Chris Ngo
Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026)
We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32), a model 6x larger, while incurring a training cost of 81 on a single RTX PRO 6000 GPU. Inference is approximately 20x faster than MERaLiON, at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
Selective Steering: Norm-Preserving Control Through Discriminative Layer Selection
Quy-Anh Dang | Chris Ngo
Findings of the Association for Computational Linguistics: ACL 2026
Despite significant progress in alignment, large language models (LLMs) remain vulnerable to adversarial attacks that elicit harmful behaviors. Activation steering techniques offer a promising inference-time intervention approach, but existing methods suffer from critical limitations: activation addition requires careful coefficient tuning and is sensitive to layer-specific norm variations, while directional ablation provides only binary control. Recent work on Angular Steering introduces continuous control via rotation in a 2D subspace, but its practical implementation violates norm preservation, causing distribution shift and generation collapse, particularly in models below 7B parameters. We propose Selective Steering, which addresses these limitations through two key innovations: (1) a mathematically rigorous norm-preserving rotation formulation that maintains activation distribution integrity, and (2) discriminative layer selection that applies steering only where feature representations exhibit opposite-signed class alignment. Experiments across nine models demonstrate that Selective Steering achieves 5.5 higher attack success rates than prior methods while maintaining zero perplexity violations and approximately 100% capability retention on standard benchmarks. Our approach provides a principled, efficient framework for controllable and stable LLM behavior modification.
2025
MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation
Khai Le-Duc | Tuyen Tran | Bach Phan Tat | Nguyen Kim Hai Bui | Quan Dang Anh | Hung-Phong Tran | Thanh Thuy Nguyen | Ly Nguyen | Tuan Minh Phan | Thi Thu Phuong Tran | Chris Ngo | Khanh Xuan Nguyen | Thanh Nguyen-Tang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhance patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present what is, to the best of our knowledge, the first systematic study on medical ST by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST dataset among all domains. We also present what is, to the best of our knowledge, the most comprehensive ST analysis in the field’s history, including empirical baselines, a bilingual-multilingual comparative study, an end-to-end vs. cascaded comparative study, a task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST.
SilVar: Speech-Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization
Tan-Hanh Pham | Le Hoang Nam | Phu-Vinh Nguyen | Chris Ngo | Truong-Son Hy
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Visual Language Models have demonstrated remarkable capabilities across various tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, limiting their effectiveness in natural human-machine interactions. Moreover, the quality of language models primarily depends on reasoning and prompting techniques, such as chain-of-thought, which remain underexplored when using speech instructions. To address these challenges, we propose SilVar, an end-to-end multimodal model that leverages speech instructions for reasoning-based visual question answering. Additionally, we investigate reasoning techniques at different levels, including conversational, simple, and complex speech instructions. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling more intuitive interactions by allowing users to provide verbal or text-based instructions. To this end, we introduce a new dataset designed to challenge models with speech-based reasoning tasks for object localization. This dataset enhances the model’s ability to process and explain visual scenes from spoken input, moving beyond simple object recognition to reasoning-based interactions. To our knowledge, SilVar is the first open-source, speech-driven VLM. We believe SilVar will inspire the next generation of multimodal reasoning models, advancing toward expert artificial general intelligence.