MASA: A Novel Multimodal Foundation Model for L2 Speaking Assessment in Picture-description Scenarios

Bi-Cheng Yan, Fu-An Chao, Hong-Yun H.Y. Lin, Berlin Chen


Abstract
Automatic speaking assessment (ASA) aims to quantify the language competence of second language (L2) learners by assigning a proficiency score to their spoken responses. Existing efforts typically employ a neural grader coupled with a set of handcrafted features to gauge the language competence of L2 learners from multiple facets. Despite their decent efficacy, these methods are limited by a laborious feature engineering process and largely overlook the scoring rubrics that are presented to human raters in speaking assessment. In light of this, we put forward a novel Multimodal foundation model for ASA, termed MASA, for use in picture-description scenarios. Our approach effectively streamlines the feature engineering process by leveraging the pre-trained encoders of a multimodal foundation model, and emulates the nuanced scoring behaviors of human raters by incorporating scoring rubrics directly into the modeling process. Furthermore, a simple, training-free method is introduced to alleviate scoring bias in MASA by contrasting the output distributions derived from multimodal and single-modal inputs. A series of experiments conducted on a picture-description task of the General English Proficiency Test (GEPT) dataset validates the feasibility and superiority of our method in comparison to several cutting-edge baselines.
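The abstract does not spell out the contrast operation; one plausible reading, sketched below under that assumption, is to subtract the single-modal (e.g., text-only) log-probabilities over score levels from the multimodal ones, so that bias attributable to a single modality is discounted. The function name, the `alpha` strength parameter, and the exact formula are illustrative, not the paper's definitive formulation.

```python
import math

def debias_scores(p_multimodal, p_single, alpha=1.0):
    """Contrast a multimodal score distribution against a single-modal one.

    p_multimodal, p_single: probability vectors over discrete score levels.
    alpha: hypothetical contrast strength (1.0 = full subtraction).
    Returns a renormalized, debiased distribution over the same levels.
    """
    # Work in log space: subtract alpha * single-modal log-probs.
    logits = [math.log(pm + 1e-12) - alpha * math.log(ps + 1e-12)
              for pm, ps in zip(p_multimodal, p_single)]
    # Renormalize with a numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Example: a single-modal prior inflates the middle score level;
# contrasting shifts probability mass away from that bias.
p_mm = [0.1, 0.6, 0.3]   # multimodal distribution over 3 score levels
p_txt = [0.1, 0.7, 0.2]  # hypothetical text-only distribution
p_cal = debias_scores(p_mm, p_txt)
```

In this toy example the debiased argmax moves from the level favored by the text-only prior to the highest level, illustrating how the contrast can counteract modality-specific bias without any additional training.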
Anthology ID:
2026.lrec-main.433
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
5545–5554
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.433/
Cite (ACL):
Bi-Cheng Yan, Fu-An Chao, Hong-Yun H.Y. Lin, and Berlin Chen. 2026. MASA: A Novel Multimodal Foundation Model for L2 Speaking Assessment in Picture-description Scenarios. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 5545–5554, Palma de Mallorca, Spain.
Cite (Informal):
MASA: A Novel Multimodal Foundation Model for L2 Speaking Assessment in Picture-description Scenarios (Yan et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.433.pdf