Hong-Yun H.Y. Lin
2026
MASA: A Novel Multimodal Foundation Model for L2 Speaking Assessment in Picture-description Scenarios
Bi-Cheng Yan | Fu-An Chao | Hong-Yun H.Y. Lin | Berlin Chen
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Automatic speaking assessment (ASA) aims to quantify the language competence of second language (L2) learners by assigning a proficiency score to their spoken responses. Existing efforts typically employ a neural grader coupled with a set of handcrafted features to gauge L2 learners' language competence from multiple facets. Despite their decent efficacy, these methods are limited by a laborious feature-engineering process and largely overlook the scoring rubrics that are presented to human raters in speaking assessment. In light of this, we put forward a novel Multimodal foundation model for ASA, termed MASA, for use in picture-description scenarios. Our approach effectively streamlines the feature-engineering process by leveraging the pre-trained encoders of a multimodal foundation model, and emulates the nuanced scoring behaviors of human raters by incorporating scoring rubrics directly into the modeling process. Furthermore, a simple, training-free method is introduced to alleviate scoring bias in MASA by contrasting the output distributions derived from multimodal and single-modal inputs. A series of experiments conducted on a picture-description task from the General English Proficiency Test (GEPT) dataset validates the feasibility and superiority of our method in comparison to several cutting-edge baselines.
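The abstract does not spell out the exact form of the training-free bias correction; the sketch below is one plausible reading, assuming a contrastive-decoding-style adjustment in which the grader's output distribution over proficiency levels under the full multimodal input is contrasted against the distribution obtained from a single-modal (e.g., speech-only) input. The function name `debias_scores`, the hyperparameter `alpha`, and the example logits are all hypothetical, not taken from the paper.

```python
import torch

def debias_scores(multimodal_logits: torch.Tensor,
                  unimodal_logits: torch.Tensor,
                  alpha: float = 0.5) -> torch.Tensor:
    """Contrast the grader's output distributions over proficiency levels.

    multimodal_logits / unimodal_logits: shape (num_levels,), the raw scores
    the grader assigns to each proficiency level given the full
    (speech + picture) input vs. a single-modal (e.g., speech-only) input.
    alpha is a hypothetical strength hyperparameter for the correction.
    """
    log_p_multi = torch.log_softmax(multimodal_logits, dim=-1)
    log_p_uni = torch.log_softmax(unimodal_logits, dim=-1)
    # Down-weight probability mass the grader assigns even without one
    # modality, treated here as modality-driven scoring bias.
    debiased = log_p_multi - alpha * log_p_uni
    return torch.softmax(debiased, dim=-1)

# Hypothetical usage: pick a proficiency level after bias correction.
multi = torch.tensor([1.2, 0.3, 2.1, 0.5])  # grader logits, full input
uni = torch.tensor([1.5, 0.2, 0.9, 0.4])    # grader logits, speech only
print(debias_scores(multi, uni).argmax().item())
```

Under this reading, setting `alpha = 0` recovers the original multimodal distribution, so the correction requires no additional training, only a second forward pass with the single-modal input.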