Intent vs. Surface: Recovering Acoustic Realization from Modern ASR for Pronunciation Training

Seongjin Park


Abstract
Pronunciation feedback in language learning depends on accurate detection of learner errors, but it is unclear whether modern ASR systems are suitable for this purpose. Their language models recover intended words rather than what was actually pronounced, systematically masking mispronunciations. This is a tendency we call intent bias. By evaluating eight ASR systems spanning three architectures on two L2 English corpora, we find that overcorrection rate correlates inversely with word error rate. In other words, ASR systems with lower WER tend to mask more pronunciation errors. We propose surface-faithful reranking, an inference-time method that uses phoneme-level acoustic similarity to select N-best hypotheses closer to what the learner actually said. Without retraining or access to model internals, the method reduces the false acceptance rate of mispronunciations by 6.0 percentage points on L2-ARCTIC and 5.6 on speechocean762. The improvement is consistent across age groups and first-language backgrounds, though substantial overcorrection remains, pointing to the need for pronunciation-aware ASR objectives.
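As a rough illustration of the reranking idea described in the abstract, the sketch below picks, from an N-best list, the hypothesis whose phoneme sequence is closest to what a phoneme recognizer heard. It is only a minimal sketch under stated assumptions, not the paper's implementation: the function names `edit_distance` and `rerank`, the dictionary layout of each hypothesis, the ARPAbet-style symbols, and the use of plain Levenshtein distance as a stand-in for "phoneme-level acoustic similarity" are all hypothetical.

```python
from typing import Dict, List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rerank(nbest: List[Dict], observed: Sequence[str]) -> Dict:
    """Return the hypothesis whose phonemes are closest to the observed ones.

    Assumption: each hypothesis is a dict with "text" and "phonemes" keys,
    and `observed` comes from a separate phoneme recognizer. Ties keep the
    original N-best order, because min() returns the first minimum.
    """
    return min(nbest, key=lambda h: edit_distance(h["phonemes"], observed))


# Hypothetical example: the learner says "sink" where "think" was intended.
nbest = [
    {"text": "think", "phonemes": ["TH", "IH", "NG", "K"]},  # top ASR guess
    {"text": "sink", "phonemes": ["S", "IH", "NG", "K"]},
]
observed = ["S", "IH", "NG", "K"]
best = rerank(nbest, observed)
```

In this toy case `best["text"]` is `"sink"`, so the mispronunciation stays visible instead of being overwritten by the intended word. A real system would likely replace the unit-cost distance with a similarity that reflects acoustic closeness between phonemes.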
Anthology ID:
2026.bea-1.23
Volume:
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Ekaterina Kochmar, Bashar Alhafni, Stefano Bannò, Marie Bexte, Jill Burstein, Andrea Horbach, Ronja Laarmann-Quante, Anais Tack, Victoria Yaneva, Zheng Yuan
Venues:
BEA | WS
Publisher:
Association for Computational Linguistics
Pages:
336–346
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.23/
Cite (ACL):
Seongjin Park. 2026. Intent vs. Surface: Recovering Acoustic Realization from Modern ASR for Pronunciation Training. In Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026), pages 336–346, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Intent vs. Surface: Recovering Acoustic Realization from Modern ASR for Pronunciation Training (Park, BEA 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.bea-1.23.pdf