Seongjin Park
2026
Intent vs. Surface: Recovering Acoustic Realization from Modern ASR for Pronunciation Training
Seongjin Park
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Pronunciation feedback in language learning depends on accurate detection of learner errors, but it is unclear whether modern ASR systems are suitable for this purpose. Their language models recover intended words rather than what was actually pronounced, systematically masking mispronunciations. This is a tendency we call intent bias. By evaluating eight ASR systems spanning three architectures on two L2 English corpora, we find that overcorrection rate correlates inversely with word error rate. In other words, ASR systems with lower WER tend to mask more pronunciation errors. We propose surface-faithful reranking, an inference-time method that uses phoneme-level acoustic similarity to select N-best hypotheses closer to what the learner actually said. Without retraining or access to model internals, the method reduces the false acceptance rate of mispronunciations by 6.0 percentage points on L2-ARCTIC and 5.6 on speechocean762. The improvement is consistent across age groups and first-language backgrounds, though substantial overcorrection remains, pointing to the need for pronunciation-aware ASR objectives.
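The abstract does not say how phoneme-level acoustic similarity is computed, so the following is only a rough sketch of the general reranking idea, not the paper's implementation. It assumes (hypothetically) that each N-best hypothesis comes with a dictionary phoneme sequence and that a separate phoneme recognizer supplies the sequence actually observed in the audio. Plain edit distance stands in for the similarity measure. The function names and the example data are illustrative only.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: (hypothesis text, its phoneme sequence).
Hypothesis = Tuple[str, List[str]]


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rerank(nbest: List[Hypothesis], observed: Sequence[str]) -> Hypothesis:
    """Pick the hypothesis whose phonemes are closest to the observed ones.

    Assumption: edit distance is a stand-in for acoustic similarity.
    min() returns the first minimum, so ties keep the original ASR ranking.
    """
    return min(nbest, key=lambda hyp: edit_distance(hyp[1], observed))


# Illustrative case: the learner intends "think" but produces "sink".
nbest = [
    ("think", ["TH", "IH", "NG", "K"]),  # top ASR hypothesis (intended word)
    ("sink", ["S", "IH", "NG", "K"]),
    ("thing", ["TH", "IH", "NG"]),
]
observed = ["S", "IH", "NG", "K"]  # assumed output of a phoneme recognizer

best_text, _ = rerank(nbest, observed)
```

In this toy case the reranker selects "sink" over the top-ranked "think", so the mispronunciation stays visible to a downstream feedback component instead of being silently corrected.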
2021
Me, myself, and ire: Effects of automatic transcription quality on emotion, sarcasm, and personality detection
John Culnan | Seongjin Park | Meghavarshini Krishnaswamy | Rebecca Sharp
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
In deployment, systems that use speech as input must make use of automated transcriptions. Yet, typically when these systems are evaluated, gold transcriptions are assumed. We explicitly examine the impact of transcription errors on the downstream performance of a multi-modal system on three related tasks from three datasets: emotion, sarcasm, and personality detection. We include three separate transcription tools and show that while all automated transcriptions propagate errors that substantially impact downstream performance, the open-source tools fare worse than the paid tool, though not always straightforwardly, and word error rates do not correlate well with downstream performance. We further find that the inclusion of audio features partially mitigates transcription errors, but that a naive usage of a multi-task setup does not.