MASA: A Novel Multimodal Foundation Model for L2 Speaking Assessment in Picture-description Scenarios

Bi-Cheng Yan, Fu-An Chao, Hong-Yun H.Y. Lin, Berlin Chen


Abstract
Automatic speaking assessment (ASA) aims to quantify the language competence of second language (L2) learners by assigning a proficiency score to their spoken responses. Existing efforts typically employ a neural grader coupled with a set of handcrafted features to gauge the language competence of L2 learners from multiple facets. Despite their decent efficacy, these methods are limited by a laborious feature engineering process and largely overlook the scoring rubrics that are presented to human raters in speaking assessment. In light of this, we put forward a novel Multimodal foundation model for ASA, termed MASA, for use in picture-description scenarios. Our approach effectively streamlines the feature engineering process by leveraging the pre-trained encoders of a multimodal foundation model, and emulates the nuanced scoring behaviors of human raters by incorporating scoring rubrics directly into the modeling process. Furthermore, a simple, training-free method is introduced to alleviate scoring bias in MASA by contrasting the output distributions derived from multimodal and single-modal inputs. A series of experiments conducted on a picture-description task of the General English Proficiency Test (GEPT) dataset validates the feasibility and superiority of our method in comparison to several cutting-edge baselines.
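The abstract does not spell out the contrast operation; one plausible reading, sketched below under that assumption, is to subtract the single-modal (e.g., text-only) log-probabilities over score levels from the multimodal ones, so that bias attributable to a single modality is discounted. The function name, the `alpha` strength parameter, and the exact formula are illustrative, not the paper's definitive formulation.

```python
import math

def debias_scores(p_multimodal, p_single, alpha=1.0):
    """Contrast a multimodal score distribution against a single-modal one.

    p_multimodal, p_single: probability vectors over discrete score levels.
    alpha: hypothetical contrast strength (1.0 = full subtraction).
    Returns a renormalized, debiased distribution over the same levels.
    """
    # Work in log space: subtract alpha * single-modal log-probs.
    logits = [math.log(pm + 1e-12) - alpha * math.log(ps + 1e-12)
              for pm, ps in zip(p_multimodal, p_single)]
    # Renormalize with a numerically stable softmax.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Example: a single-modal prior inflates the middle score level;
# contrasting shifts probability mass away from that bias.
p_mm = [0.1, 0.6, 0.3]   # multimodal distribution over 3 score levels
p_txt = [0.1, 0.7, 0.2]  # hypothetical text-only distribution
p_cal = debias_scores(p_mm, p_txt)
```

In this toy example the debiased argmax moves from the level favored by the text-only prior to the highest level, illustrating how the contrast can counteract modality-specific bias without any additional training.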
Anthology ID:
2026.lrec-main.433
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
5545–5554
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.433/
Cite (ACL):
Bi-Cheng Yan, Fu-An Chao, Hong-Yun H.Y. Lin, and Berlin Chen. 2026. MASA: A Novel Multimodal Foundation Model for L2 Speaking Assessment in Picture-description Scenarios. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 5545–5554, Palma de Mallorca, Spain.
Cite (Informal):
MASA: A Novel Multimodal Foundation Model for L2 Speaking Assessment in Picture-description Scenarios (Yan et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.433.pdf