Mahmoud Ali


2026

Large audio language models (LALMs) integrate audio representations with large language models to enable unified understanding of spoken content. Their capabilities have been increasingly investigated across several benchmarks; however, their use in rating L2 speech has barely been examined. This study explores the ability of LALMs to score three global dimensions of L2 speech: foreign accentedness, comprehensibility, and intelligibility. Ninety audio samples produced by L2 speakers were rated by ten native-speaker raters and by five LALMs. Model performance was evaluated against the human composite mean using Pearson r, Spearman ρ, mean absolute error (MAE), and systematic bias, with the human leave-one-out correlation (r = .46–.73 across dimensions) serving as an empirical performance benchmark. The results showed that no LALM reached human-level performance on any dimension. Only one model (Gemini) achieved a significant correlation with human ratings on comprehensibility (r = .28, p < .01), while Qwen2-Audio showed a modest correlation on intelligibility (r = .32, p < .01). MAE ranged from 0.75 to 3.99 for accentedness (human: 1.24), 1.35 to 3.00 for comprehensibility (human: 1.24), and 12.03 to 15.43 for intelligibility (human: 8.49). All models exhibited systematic biases, with deviations ranging from -9.31 to +13.19 points. The paper concludes with a discussion of the implications for automated L2 speech assessment.
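The evaluation metrics named above can be illustrated with a minimal sketch. It is not the study's code, and several details are assumptions: bias is taken as the mean signed difference (model minus human composite), the leave-one-out benchmark is taken as the mean Pearson r between each rater and the composite of the remaining raters, and all function names and the toy ratings are hypothetical.

```python
from math import sqrt

def mean(xs):
    return sum(xs) / len(xs)

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def mae(pred, ref):
    """Mean absolute error between model scores and the reference."""
    return mean([abs(p - q) for p, q in zip(pred, ref)])

def bias(pred, ref):
    """Mean signed error (assumed convention: model minus human)."""
    return mean([p - q for p, q in zip(pred, ref)])

def composite(ratings):
    """Per-sample mean across raters; ratings[i][j] is rater i, sample j."""
    return [mean(col) for col in zip(*ratings)]

def leave_one_out_r(ratings):
    """Assumed benchmark: mean r of each rater vs. the other raters' composite."""
    rs = []
    for i, held_out in enumerate(ratings):
        others = [r for k, r in enumerate(ratings) if k != i]
        rs.append(pearson(held_out, composite(others)))
    return mean(rs)

# Toy data (hypothetical): 3 raters x 5 samples, plus one model's scores.
human = [[1, 2, 3, 4, 5], [2, 2, 4, 4, 6], [1, 3, 3, 5, 5]]
model = [3, 3, 4, 6, 7]

ref = composite(human)
report = {
    "pearson_r": pearson(model, ref),
    "spearman_rho": spearman(model, ref),
    "mae": mae(model, ref),
    "bias": bias(model, ref),
    "human_loo_r": leave_one_out_r(human),
}
```

Under these assumptions, a model would be compared with the human benchmark by setting its `pearson_r` against `human_loo_r` for the same dimension, while `mae` and `bias` show how far, and in which direction, its scores drift from the composite.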