Jiun-Ting Li


2026

Automatic speaking assessment (ASA) aims to quantify the language competence of foreign language learners by providing a proficiency score based on their spoken responses. Existing efforts in ASA typically employ a neural grader integrated with a set of handcrafted features to assess learners’ oral proficiency from multiple facets. Despite decent performance, the black-box nature of these neural graders remains a significant barrier to providing interpretable explanations for the grading results. In light of this, we propose RABIT for ASA, a novel Rationale-based knowledge distillation framework for interpretable grading decisions via a small language model. Specifically, RABIT first extracts multi-faceted grading rationales from a large language model (LLM) pertaining to the learner’s response and the scoring guidelines. Subsequently, a compact yet efficient language model, equipped with distinct output heads, is jointly optimized to estimate a proficiency score while generating a sequence of grading rationales in an autoregressive manner. A series of experiments conducted on the General English Proficiency Test (GEPT) dataset validates the feasibility and superiority of our method over several cutting-edge baselines.

Automatic pronunciation assessment (APA) provides L2 learners with scalable and timely feedback on pronunciation proficiency in a target language, typically through goodness of pronunciation (GOP) features. GOP quantifies how well a pronounced phoneme matches the expected target sound by comparing acoustic features against the model’s posterior probabilities. Traditional GOP relies on forced alignment to obtain these posteriors, but it suffers from acoustic-induced misalignments that degrade assessment reliability. Although the standard CTC-GOP approach bypasses forced alignment, it is limited by the inherent peaky behavior of CTC-based ASR models, which produces sparse posteriors and lacks stable temporal information. To address these issues in standard CTC, we propose a context-aware CTC framework incorporating output context dependency (OCD) in the CTC topology, along with label prior (LP) and maximum conditional entropy (EnCTC) regularization, to mitigate peakiness and produce more stable ASR logits suitable for GOP computation. Experiments on the speechocean762 corpus demonstrate that our best context-aware configurations achieve superior phoneme-level performance, outperforming the TDNN-F baseline and standard CTC in unified GOPT (phoneme PCC 0.641 vs. 0.612; word total PCC 0.582 vs. 0.549) while narrowing the gap in hierarchical HierCB scoring. These improvements widen the scoring margin between correct and mispronounced phonemes from 0.708 to 0.816 in GOPT. They also reveal that mitigating CTC peakiness and incorporating context dependency significantly enhance CTC-GOP stability and robustness, especially for alignment-free APA models.
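
To make the notion of a GOP feature concrete, the sketch below computes one common alignment-based variant: the average log posterior of the canonical phoneme over the frames assigned to it. It is an illustration only and not the context-aware CTC-GOP described in the abstract; the function name, the dictionary-per-frame input format, and the `eps` floor are assumptions made for this example.

```python
import math


def average_log_posterior_gop(posteriors, target, start, end, eps=1e-10):
    """Average log posterior of `target` over frames [start, end).

    posteriors: list with one dict per frame, mapping phoneme -> probability
                (hypothetical input format chosen for this sketch).
    target:     the canonical phoneme the learner was expected to produce.
    start, end: frame span of the phoneme, e.g. taken from a forced alignment.
    eps:        floor that avoids log(0) when a posterior is zero or missing.
    """
    frames = posteriors[start:end]
    if not frames:
        raise ValueError("segment must contain at least one frame")
    log_probs = [math.log(max(frame.get(target, 0.0), eps)) for frame in frames]
    return sum(log_probs) / len(log_probs)


if __name__ == "__main__":
    # Toy posteriors over two phonemes for three frames.
    toy = [
        {"AH": 0.5, "B": 0.5},
        {"AH": 0.5, "B": 0.5},
        {"AH": 0.1, "B": 0.9},
    ]
    print(average_log_posterior_gop(toy, "AH", 0, 2))
```

Scores closer to zero indicate that the model assigned high probability to the expected phoneme. When posteriors are sparse and peaky, most frames in a segment carry little probability mass for the target, which is the instability the abstract attributes to standard CTC.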

2024

Automatic pronunciation assessment (APA) aims to quantify a second language (L2) learner’s pronunciation proficiency in a target language by providing fine-grained feedback with multiple pronunciation aspect scores at various linguistic levels. Most existing efforts on APA parallelize the modeling process, namely predicting multiple aspect scores across various linguistic levels simultaneously. This inevitably sidelines both the hierarchy of linguistic units and the relatedness among the pronunciation aspects. Recognizing this limitation, we first introduce HierTFR, a hierarchical APA method that jointly models the intrinsic structures of an utterance while considering the relatedness among the pronunciation aspects. We also propose a correlation-aware regularizer to strengthen the connection between the estimated scores and the human annotations. Furthermore, novel pre-training strategies tailored for different linguistic levels are put forward to facilitate better model initialization. An extensive set of empirical experiments conducted on the speechocean762 benchmark dataset suggests the feasibility and effectiveness of our approach in relation to several competitive baselines.
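
The abstract does not give the form of the correlation-aware regularizer. One plausible reading, shown here purely as an assumption, adds a penalty of one minus the Pearson correlation coefficient (PCC) between predicted and human scores to a mean squared error term. The function names and the `weight` parameter are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for constant inputs; treat it as 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_regularized_loss(predicted, human, weight=0.5):
    """MSE plus weight * (1 - PCC); an assumed form, not the paper's definition."""
    mse = sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(predicted)
    return mse + weight * (1.0 - pearson(predicted, human))


if __name__ == "__main__":
    human_scores = [1.0, 2.0, 3.0, 4.0]
    print(correlation_regularized_loss([1.0, 2.0, 3.0, 4.0], human_scores))
    print(correlation_regularized_loss([4.0, 3.0, 2.0, 1.0], human_scores))
```

Under this assumed form, predictions that rank learners in the wrong order are penalized even when their squared error is moderate, which matches the stated goal of tying estimated scores more closely to human annotations.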

2023

2021

With the widespread commercialization of smart devices, research on environmental sound classification has gained increasing attention in recent years. In this paper, we set out to make effective use of a large-scale pretrained audio model and a semi-supervised training paradigm for environmental sound classification. To this end, we first put forward an environmental sound classification method whose component model is built on top of a large-scale pretrained audio model. Further, to simulate a low-resource sound classification setting where only limited supervised examples are available, we instantiate the notion of transfer learning with a recently proposed training algorithm (namely, FixMatch) and a data augmentation method (namely, SpecAugment) to achieve semi-supervised model training. Experiments conducted on the benchmark dataset UrbanSound8K reveal that our classification method can lead to an accuracy improvement of 2.4% in relation to a current baseline method.
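
As a rough illustration of the FixMatch idea mentioned above, the sketch below computes the unlabeled-data loss: a pseudo-label is taken from the prediction on a weakly augmented clip only when its confidence reaches a threshold, and the prediction on a strongly augmented view of the same clip (for instance, one masked in the style of SpecAugment) is trained toward that pseudo-label. The function name, the list-of-probabilities input format, and the default threshold are assumptions for this example, not details taken from the paper.

```python
import math


def fixmatch_unlabeled_loss(weak_probs, strong_probs, threshold=0.95, eps=1e-10):
    """Confidence-masked cross-entropy over a batch of unlabeled examples.

    weak_probs:   class probabilities predicted on weakly augmented inputs.
    strong_probs: class probabilities predicted on strongly augmented inputs.
    Both are lists of equal-length probability lists (hypothetical format).
    """
    total = 0.0
    for weak, strong in zip(weak_probs, strong_probs):
        confidence = max(weak)
        if confidence >= threshold:
            # Hard pseudo-label from the weak view supervises the strong view.
            pseudo_label = weak.index(confidence)
            total += -math.log(max(strong[pseudo_label], eps))
    # Average over the whole batch, so low-confidence examples contribute zero.
    return total / len(weak_probs)


if __name__ == "__main__":
    weak = [[0.97, 0.03], [0.60, 0.40]]
    strong = [[0.50, 0.50], [0.90, 0.10]]
    print(fixmatch_unlabeled_loss(weak, strong))
```

In this toy batch only the first example passes the confidence threshold, so the second contributes nothing to the loss. In a full training loop this term would be added to the ordinary supervised loss on the labeled examples.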