Yu-Chiang Frank Wang


2025

pdf bib
Serial Lifelong Editing via Mixture of Knowledge Experts
YuJu Cheng | Yu-Chu Yu | Kai-Po Chang | Yu-Chiang Frank Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

It is challenging to update Large language models (LLMs) since real-world knowledge evolves. While existing Lifelong Knowledge Editing (LKE) methods efficiently update sequentially incoming edits, they often struggle to precisely overwrite the outdated knowledge with the latest one, resulting in conflicts that hinder LLMs from determining the correct answer. To address this Serial Lifelong Knowledge Editing (sLKE) problem, wepropose a novel Mixture-of-Knowledge-Experts scheme with an Activation-guided Routing Mechanism (ARM), which assigns specialized experts to store domain-specific knowledge and ensures that each update completely overwrites old information with the latest data. Furthermore, we introduce a novel sLKE benchmark where answers to the same concept are updated repeatedly, to assess the ability of editing methods to refresh knowledge accurately. Experimental results on both LKE and sLKE benchmarks show that our ARM performs favorably against SOTA knowledge editing methods.

pdf bib
NeKo: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts Language Model
Yen-Ting Lin | Zhehuai Chen | Piotr Zelasko | Zhen Wan | Xuesong Yang | Zih-Ching Chen | Krishna C Puvvada | Ke Hu | Szu-Wei Fu | Jun Wei Chiu | Jagadeesh Balam | Boris Ginsburg | Yu-Chiang Frank Wang | Chao-Han Huck Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer would lie in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose a Multi-Task Correction MoE, where we train the experts to become an “expert” of speech-to-text, language-to-text and vision-to-text datasets by learning to route each dataset’s tokens to its mapped expert. Experiments on the Open ASR Leaderboard show that we explore a new state-of-the-art performance by achieving an average relative 5.0% WER reduction and substantial improvements in BLEU scores for speech and translation tasks. On zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-3.5-Sonnet with 15.5% to 27.6% relative WER reduction in the Hyporadise benchmark. NeKo performs competitively on grammar and post-OCR correction as a multi-task model.

pdf bib
LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
Yusuke Hirota | Boyi Li | Ryo Hachiuma | Yueh-Hua Wu | Boris Ivanovic | Marco Pavone | Yejin Choi | Yu-Chiang Frank Wang | Yuta Nakashima | Chao-Han Huck Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)

Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.