Berlin Chen

2024

An Effective Automated Speaking Assessment Approach to Mitigating Data Scarcity and Imbalanced Distribution
Tien-Hong Lo | Fu-An Chao | Tzu-i Wu | Yao-Ting Sung | Berlin Chen
Findings of the Association for Computational Linguistics: NAACL 2024

Automated speaking assessment (ASA) typically involves automatic speech recognition (ASR) and hand-crafted feature extraction from the ASR transcript of a learner’s speech. Recently, self-supervised learning (SSL) has shown stellar performance compared to traditional methods. However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore the use of two novel modeling strategies: metric-based classification and loss re-weighting, leveraging distinct SSL-based embedding features. Extensive experimental results on the ICNALE benchmark dataset suggest that our approach can outperform existing strong baselines by a sizable margin, achieving a significant improvement of more than 10% in CEFR prediction accuracy.

An Effective Pronunciation Assessment Approach Leveraging Hierarchical Transformers and Pre-training Strategies
Bi-Cheng Yan | Jiun-Ting Li | Yi-Cheng Wang | Hsin Wei Wang | Tien-Hong Lo | Yung-Chang Hsu | Wei-Cheng Chao | Berlin Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Automatic pronunciation assessment (APA) aims to quantify a second language (L2) learner’s pronunciation proficiency in a target language by providing fine-grained feedback with multiple pronunciation aspect scores at various linguistic levels. Most existing efforts on APA parallelize the modeling process, namely predicting multiple aspect scores across various linguistic levels simultaneously. This inevitably sidelines both the hierarchy of linguistic units and the relatedness among the pronunciation aspects. Recognizing this limitation, we first introduce HierTFR, a hierarchical APA method that jointly models the intrinsic structures of an utterance while considering the relatedness among the pronunciation aspects. We also propose a correlation-aware regularizer to strengthen the connection between the estimated scores and the human annotations. Furthermore, novel pre-training strategies tailored for different linguistic levels are put forward to facilitate better model initialization. An extensive set of empirical experiments conducted on the speechocean762 benchmark dataset suggests the feasibility and effectiveness of our approach in relation to several competitive baselines.

DANCER: Entity Description Augmented Named Entity Corrector for Automatic Speech Recognition
Yi-Cheng Wang | Hsin-Wei Wang | Bi-Cheng Yan | Chi-Han Lin | Berlin Chen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

End-to-end automatic speech recognition (E2E ASR) systems often suffer from mistranscription of domain-specific phrases, such as named entities, sometimes leading to catastrophic failures in downstream tasks. A family of fast and lightweight named entity correction (NEC) models for ASR has recently been proposed; these models normally build on phonetic-level edit distance algorithms and have shown impressive NEC performance. However, as the named entity (NE) list grows, the problems of phonetic confusion in the NE list are exacerbated; for example, homophone ambiguities increase substantially. In view of this, we propose a novel Description Augmented Named entity CorrEctoR (dubbed DANCER), which leverages entity descriptions to provide additional information that helps mitigate phonetic confusion for NEC on ASR transcription. To this end, an efficient entity description augmented masked language model (EDA-MLM) comprised of a dense retrieval model is introduced, enabling the MLM to adapt swiftly to domain-specific entities for the NEC task. A series of experiments conducted on the AISHELL-1 and Homophone datasets confirm the effectiveness of our modeling approach. DANCER outperforms a strong baseline, the phonetic edit-distance-based NEC model (PED-NEC), with a relative character error rate (CER) reduction of about 7% on AISHELL-1 for named entities. More notably, when tested on the Homophone dataset, which contains named entities of high phonetic confusion, DANCER offers a more pronounced relative CER reduction of 46% over PED-NEC for named entities. The code is available at https://github.com/Amiannn/Dancer.

2023

Auxiliary loss to attention head for end to end speaker diarization
Yi-Ting Yang | Jiun-Ting Li | Berlin Chen
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)

Leveraging Dialogue Discourse Parsing in a Two-Stage Framework for Meeting Summarization
Yi-Ping Huang | Tien-Hong Lo | Berlin Chen
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)

AaWLoss: An Artifact-aware Weighted Loss Function for Speech Enhancement
En-Lun Yu | Kuan-Hsun Ho | Berlin Chen
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)

Enhancing Automated English Speaking Assessment for L2 Speakers with BERT and Wav2vec2.0 Fusion
Wen-Hsuan Peng | Hsin-Wei Wang | Sally Chen | Berlin Chen
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)

Addressing the issue of Data Imbalance in Multi-granularity Pronunciation Assessment
Meng-Shin Lin | Hsin-Wei Wang | Tien-Hong Lo | Berlin Chen | Wei-Cheng Chao
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)

KNOT-MCTS: An Effective Approach to Addressing Hallucinations in Generative Language Modeling for Question Answering
Chung-Wen Wu | Guan-Tang Huang | Yue-Yang He | Berlin Chen
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)

The NTNU Super Monster Team (SPMT) system for the Formosa Speech Recognition Challenge 2023 - Hakka ASR
Tzu-Ting Yang | Hsin-Wei Wang | Meng-Ting Tsai | Berlin Chen
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)

2022

International Journal of Computational Linguistics & Chinese Language Processing, Volume 27, Number 2, December 2022
Berlin Chen | Hung-Yu Kao
International Journal of Computational Linguistics & Chinese Language Processing, Volume 27, Number 2, December 2022

A Preliminary Study on Automated Speaking Assessment of English as a Second Language (ESL) Students
Tzu-I Wu | Tien-Hong Lo | Fu-An Chao | Yao-Ting Sung | Berlin Chen
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Due to the surge in global demand for English as a second language (ESL), the development of automated methods for grading speaking proficiency has gained considerable attention. This paper presents a computerized scheme for grading the spontaneous spoken language of ESL learners. Based on a speech corpus of ESL learners recently collected in Taiwan, we first extract multi-view features (e.g., pronunciation, fluency, and prosody features) from either automatic speech recognition (ASR) transcriptions or audio signals. These extracted features are, in turn, fed into a tree-based classifier to produce a new set of indicative features as the input of the automated assessment system, viz. the grader. Finally, we use different machine learning models to predict ESL learners’ respective speaking proficiency and map the result to the corresponding CEFR level. The experimental results and analysis conducted on the speech corpus of ESL learners in Taiwan show that our approach holds great potential for use in automated speaking assessment, while offering more reliable predictive results than human experts.

Building an Enhanced Autoregressive Document Retriever Leveraging Supervised Contrastive Learning
Yi-Cheng Wang | Tzu-Ting Yang | Hsin-Wei Wang | Yung-Chang Hsu | Berlin Chen
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

The goal of an information retrieval system is to retrieve documents that are most relevant to a given user query from a huge collection of documents, which usually requires time-consuming multiple comparisons between the query and candidate documents so as to find the most relevant ones. Recently, a novel retrieval modeling approach, dubbed Differentiable Search Index (DSI), has been proposed. DSI dramatically simplifies the whole retrieval process by encoding all information about the document collection into the parameter space of a single Transformer model, on top of which DSI can in turn generate the relevant document identifiers (IDs) in an autoregressive manner in response to a user query. Although DSI addresses the shortcomings of traditional retrieval systems, previous studies have pointed out that DSI might fail to retrieve relevant documents because DSI uses the document IDs as the pivotal mechanism to establish the relationship between queries and documents, whereas not every document in the document collection has corresponding relevant and irrelevant queries for training purposes. In view of this, we propose leveraging supervised contrastive learning to better render the relationship between queries and documents in the latent semantic space. Furthermore, an approximate nearest neighbor search strategy is employed at retrieval time to further assist the Transformer model in generating document IDs relevant to a posed query more efficiently. A series of experiments conducted on the Natural Questions benchmark dataset confirm the effectiveness and practical feasibility of our approach in relation to some strong baseline systems.

2021

International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 1, June 2021
Chia-Hui Chang | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 1, June 2021

The NTNU Taiwanese ASR System for Formosa Speech Recognition Challenge 2020
Fu-An Chao | Tien-Hong Lo | Shi-Yan Weng | Shih-Hsuan Chiu | Yao-Ting Sung | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 1, June 2021

International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 2, December 2021
Berlin Chen | Hung-Yu Kao
International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 2, December 2021

A Study on Contextualized Language Modeling for Machine Reading Comprehension
Chin-Ying Wu | Yung-Chang Hsu | Berlin Chen
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

With the recent breakthroughs in deep learning technologies, research on machine reading comprehension (MRC) has attracted much attention and found versatile applications in many use cases. MRC is an important natural language processing (NLP) task aiming to assess the ability of a machine to understand natural language expressions, which is typically operationalized by first asking questions based on a given text paragraph and then receiving machine-generated answers in accordance with the given context paragraph and questions. In this paper, we leverage two novel pretrained language models built on top of Bidirectional Encoder Representations from Transformers (BERT), namely BERT-wwm and MacBERT, to develop effective MRC methods. In addition, we investigate whether incorporating categorical information about a context paragraph, obtained by clustering the context paragraphs of the training dataset, can benefit MRC. Finally, an ensemble learning approach is proposed to harness the synergistic power of the aforementioned two BERT-based models so as to further promote MRC performance.

A Preliminary Study on Environmental Sound Classification Leveraging Large-Scale Pretrained Model and Semi-Supervised Learning
You-Sheng Tsao | Tien-Hong Lo | Jiun-Ting Li | Shi-Yan Weng | Berlin Chen
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

With the widespread commercialization of smart devices, research on environmental sound classification has gained more and more attention in recent years. In this paper, we set out to make effective use of a large-scale pretrained audio model and a semi-supervised model training paradigm for environmental sound classification. To this end, an environmental sound classification method is first put forward, whose component model is built on top of a large-scale pretrained audio model. Further, to simulate a low-resource sound classification setting where only limited supervised examples are made available, we instantiate the notion of transfer learning with a recently proposed training algorithm (namely, FixMatch) and a data augmentation method (namely, SpecAugment) to achieve the goal of semi-supervised model training. Experiments conducted on the benchmark dataset UrbanSound8K reveal that our classification method can lead to an accuracy improvement of 2.4% in relation to a current baseline method.

Exploring the Integration of E2E ASR and Pronunciation Modeling for English Mispronunciation Detection
Hsin-Wei Wang | Bi-Cheng Yan | Yung-Chang Hsu | Berlin Chen
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

There has been increasing demand to develop effective computer-assisted pronunciation training (CAPT) systems, which can provide feedback on mispronunciations and help second-language (L2) learners improve their speaking proficiency through repeated practice. Due to the shortage of non-native speech for training the automatic speech recognition (ASR) module of a CAPT system, the corresponding mispronunciation detection performance is often affected by imperfect ASR. Recognizing this issue, we put forward a two-stage mispronunciation detection method. In the first stage, the speech uttered by an L2 learner is processed by an end-to-end ASR module to produce N-best phone sequence hypotheses. In the second stage, these hypotheses are fed into a pronunciation model which seeks to faithfully predict the phone sequence hypothesis that is most likely pronounced by the learner, so as to improve the performance of mispronunciation detection. Empirical experiments conducted on an English benchmark dataset seem to confirm the utility of our method.

2020

International Journal of Computational Linguistics & Chinese Language Processing, Volume 25, Number 1, June 2020
Chia-Hui Chang | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 25, Number 1, June 2020

基於端對端模型化技術之語音文件摘要 (Spoken Document Summarization Using End-to-End Modeling Techniques)
Tzu-En Liu | Shih-Hung Liu | Kuo-Wei Chang | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 25, Number 1, June 2020

Multi-view Attention-based Speech Enhancement Model for Noise-robust Automatic Speech Recognition
Fu-An Chao | Jeih-weih Hung | Berlin Chen
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)

Innovative Pretrained-based Reranking Language Models for N-best Speech Recognition Lists
Shih-Hsuan Chiu | Berlin Chen
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)

A Study on Contextualized Language Modeling for FAQ Retrieval
Wen-Ting Tseng | Yung-Chang Hsu | Berlin Chen
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)

Exploiting Text Prompts for the Development of an End-to-End Computer-Assisted Pronunciation Training System
Yu-Sen Cheng | Tien-Hong Lo | Berlin Chen
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)

Exploring Disparate Language Model Combination Strategies for Mandarin-English Code-Switching ASR
Wei-Ting Lin | Berlin Chen
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)

2019

Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019) which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use of language modeling, especially when combined with pretraining on additional labeled-data tasks. However, our results are mixed across pretraining tasks and show some concerning trends: In ELMo’s pretrain-then-freeze paradigm, random baselines are worryingly strong and results vary strikingly across target tasks. In addition, fine-tuning BERT on an intermediate task often negatively impacts downstream transfer. In a more positive trend, we see modest gains from multitask training, suggesting the development of more sophisticated multitask and transfer learning techniques as an avenue for further research.

探究端對端混合模型架構於華語語音辨識 (An Investigation of Hybrid CTC-Attention Modeling in Mandarin Speech Recognition)
Hsiu-Jui Chang | Wei-Cheng Chao | Tien-Hong Lo | Berlin Chen
International Journal of Computational Linguistics & Chinese Language Processing, Volume 24, Number 1, June 2019

使用生成對抗網路於強健式自動語音辨識的應用 (Exploiting Generative Adversarial Network for Robustness Automatic Speech Recognition)
Ming-Jhang Yang | Fu-An Chao | Tien-Hong Lo | Berlin Chen
Proceedings of the 31st Conference on Computational Linguistics and Speech Processing (ROCLING 2019)

探究端對端語音辨識於發音檢測與診斷 (Investigating on Computer-Assisted Pronunciation Training Leveraging End-to-End Speech Recognition Techniques)
Hsiu-Jui Chang | Tien-Hong Lo | Tzu-En Liu | Berlin Chen
Proceedings of the 31st Conference on Computational Linguistics and Speech Processing (ROCLING 2019)

基於階層式編碼架構之文本可讀性預測 (A Hierarchical Encoding Framework for Text Readability Prediction)
Shi-Yan Weng | Hou-Chiang Tseng | Yao-Ting Sung | Berlin Chen
Proceedings of the 31st Conference on Computational Linguistics and Speech Processing (ROCLING 2019)

In the context of natural language processing, representation learning has emerged as a newly active research subject because of its excellent performance in many applications. Learning representations of words is a pioneering study in this school of research. However, paragraph (or sentence and document) embedding learning is more suitable for some tasks, such as sentiment classification and document summarization. Nevertheless, as far as we are aware, there is a dearth of research focusing on developing unsupervised paragraph embedding methods. Classic paragraph embedding methods infer the representation of a given paragraph by considering all of the words occurring in the paragraph. Consequently, stop words and function words that occur frequently may mislead the embedding learning process into producing a vague paragraph representation. Motivated by these observations, our major contributions are twofold. First, we propose a novel unsupervised paragraph embedding method, named the essence vector (EV) model, which aims at not only distilling the most representative information from a paragraph but also excluding the general background information to produce a more informative low-dimensional vector representation for the paragraph. We evaluate the proposed EV model on benchmark sentiment classification and multi-document summarization tasks. The experimental results demonstrate the effectiveness and applicability of the proposed embedding method. Second, in view of the increasing importance of spoken content processing, an extension of the EV model, named the denoising essence vector (D-EV) model, is proposed. The D-EV model not only inherits the advantages of the EV model but also can infer a more robust representation for a given spoken paragraph against imperfect speech recognition. The utility of the D-EV model is evaluated on a spoken document summarization task, confirming the effectiveness of the proposed embedding method in relation to several well-practiced and state-of-the-art summarization methods.