Julie Carson-Berndsen
Also published as:
Julle Carson-Berndsen,
Julie Carson,
Julie Carson-berndsen
As intuitive units of speech, syllables have been widely studied in linguistics. A syllable can be defined as a three-constituent unit with a vocalic centre surrounded by two (in some languages optional) consonant clusters. Syllables are also used to design automatic speech recognition (ASR) models. The significance of knowledge-driven syllable-based tokenisation in ASR over data-driven byte-pair encoding has often been debated. However, the emergence of transformer-based ASR models employing self-attention (SA) overshadowed this debate. These models learn the nuances of speech from large corpora without prior knowledge of the domain; yet, they are not interpretable by design. Consequently, it is not clear if the recent performance improvements are related to the extraction of human-interpretable knowledge. We probe such models for syllable constituents and use an SA head pruning method to assess the relevance of the SA weights. We also investigate the role of vowel identification in syllable constituent probing. Our findings show that the general features of syllable constituents are extracted in the earlier layers of the model and the syllable-related features mostly depend on the temporal knowledge incorporated in specific SA heads rather than on vowel identification.
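As an illustration of the probing setup described above, the sketch below trains a simple linear probe on frame-level hidden states to predict syllable constituents (onset, nucleus, coda). The arrays, label scheme, and classifier choice are placeholders for illustration, not the authors' implementation.

```python
# Illustrative linear-probe sketch (not the authors' code): given frame-level
# hidden states from one encoder layer and syllable-constituent labels
# (onset / nucleus / coda), a simple classifier measures how linearly
# decodable the constituents are at that layer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_frames, hidden_dim = 5000, 256                          # placeholder sizes
hidden_states = rng.normal(size=(n_frames, hidden_dim))   # stand-in for layer activations
labels = rng.integers(0, 3, size=n_frames)                # 0 = onset, 1 = nucleus, 2 = coda

X_tr, X_te, y_tr, y_te = train_test_split(hidden_states, labels, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", accuracy_score(y_te, probe.predict(X_te)))
```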
Neural text generation models have achieved remarkable success in carrying on short open-domain conversations. However, their performance degrades significantly in the long term, especially in their ability to ask coherent questions. A significant issue is the generation of redundant questions where the answer has already been provided by the user. We adapt and evaluate different methods, including negative training, decoding, and classification, to mitigate the redundancy problem. We also propose a simple yet effective method for generating training data without the need for crowdsourcing human-human or human-bot conversations. Experiments with the BlenderBot model show that our combined method significantly reduces the rate of redundant questions from 27.2% to 8.7%, while improving the quality of the original model. The code, dataset, and trained models can be found in our repository.
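The sketch below illustrates classifier-style filtering of redundant candidate questions at decoding time. The word-overlap heuristic, threshold, and example dialogue are invented for illustration and stand in for the trained methods evaluated in the paper.

```python
# Toy sketch of filtering redundant follow-up questions at decoding time
# (not the paper's implementation): a candidate question is dropped when most
# of its content words already appear in the dialogue history.
def redundancy_score(question: str, history: list[str]) -> float:
    """Fraction of content words in the question already present in the history."""
    stop = {"do", "you", "what", "is", "the", "a", "your", "for", "to", "how", "are"}
    q_words = {w.lower().strip("?.,!") for w in question.split()} - stop
    h_words = {w.lower().strip("?.,!") for turn in history for w in turn.split()}
    return len(q_words & h_words) / max(len(q_words), 1)

def filter_candidates(candidates, history, threshold=0.7):
    return [q for q in candidates if redundancy_score(q, history) < threshold]

history = ["I work as a nurse in Dublin.", "I love hiking on weekends."]
candidates = ["What do you do for work?", "Where do you like to hike?"]
print(filter_candidates(candidates, history))   # only the second candidate survives
```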
While the deep learning revolution has led to significant performance improvements in speech recognition, accented speech remains a challenge. Current approaches to this challenge typically do not seek to understand and explain the variations of accented speech, whether they stem from native regional variation or non-native error patterns. This paper seeks to address non-native speaker variations from both a knowledge-based and a data-driven perspective. We propose to approximate non-native accented-speech pronunciation patterns by means of two approaches: one based on phonetic and phonological knowledge, the other inferred from a text-to-speech system. Artificial speech is then generated with a range of variants that have been captured in confusion matrices representing phoneme similarities. We then show that non-native accent confusions actually propagate to the ASR transcription, suggesting that accent-specific phoneme confusions can be inferred from artificial speech.
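A minimal sketch of one ingredient mentioned above: turning aligned reference/recognised phone pairs into a normalised confusion matrix of phoneme similarities. The alignment pairs are toy examples; the paper derives confusions from phonetic and phonological knowledge and from a text-to-speech system rather than from such a count table alone.

```python
# Minimal sketch of building a phoneme confusion matrix from aligned
# reference/hypothesis phone pairs (illustrative toy data only).
from collections import Counter, defaultdict

aligned_pairs = [  # (reference phone, recognised phone)
    ("th", "t"), ("th", "t"), ("th", "th"),
    ("v", "w"), ("v", "v"), ("ih", "iy"),
]

counts = defaultdict(Counter)
for ref, hyp in aligned_pairs:
    counts[ref][hyp] += 1

# Normalise rows into P(recognised | reference), i.e. phoneme similarity scores.
confusion = {
    ref: {hyp: c / sum(row.values()) for hyp, c in row.items()}
    for ref, row in counts.items()
}
print(confusion["th"])   # e.g. {'t': 0.67, 'th': 0.33}
```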
In recent years large transformer model architectures have become available which provide a novel means of generating high-quality vector representations of speech audio. These transformers make use of an attention mechanism to generate representations enhanced with contextual and positional information from the input sequence. Previous works have explored the capabilities of these models with regard to performance in tasks such as speech recognition and speaker verification, but there has not been a significant inquiry as to the manner in which the contextual information provided by the transformer architecture impacts the representation of phonetic information within these models. In this paper, we report the results of a number of probing experiments on the representations generated by the wav2vec 2.0 model’s transformer component, with regard to the encoding of phonetic categorization information within the generated embeddings. We find that the contextual information generated by the transformer’s operation results in enhanced capture of phonetic detail by the model, and allows for distinctions to emerge in acoustic data that are otherwise difficult to separate.
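The following sketch shows how per-layer wav2vec 2.0 transformer representations can be extracted with the Hugging Face transformers library for this kind of phonetic probing. The checkpoint name and the zero-valued placeholder audio are assumptions, not details taken from the paper.

```python
# Hedged sketch of extracting per-layer wav2vec 2.0 transformer representations
# for phonetic probing (model checkpoint and probing setup are illustrative).
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base", output_hidden_states=True)
model.eval()

waveform = np.zeros(16000, dtype=np.float32)      # placeholder: 1 s of 16 kHz audio
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: one tensor per layer, shape (batch, frames, hidden_dim);
# frame-level vectors from a chosen layer would feed a phone-category probe.
for layer_idx, layer in enumerate(outputs.hidden_states):
    print(layer_idx, layer.shape)
```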
A child who is unfamiliar with the correct spelling of a word often employs a “sound it out” approach: breaking the word down into its constituent sounds and then choosing letters to represent the identified sounds. This often results in a misspelling that is orthographically very different to the intended target. Recently, efforts have been made to develop phonetic-based spellcheckers to tackle the more deviant nature of children’s misspellings. However, little work has been done to investigate the potential of spelling correction tools that incorporate regional pronunciation variation. If a child must first identify the sounds that make up a word, it stands to reason that their pronunciation would influence this process. We investigate this hypothesis, along with the feasibility and potential benefits of adapting spelling correction tools to more specific language variants, particularly Irish Accented English. We use misspelling data from schoolchildren across Ireland to adapt an existing English phonetic-based spellchecker and demonstrate improvements in performance. These results not only prompt consideration of language varieties in the development of spellcheckers but also contribute to the existing literature on the role of regional accent in the acquisition of writing proficiency.
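As a rough illustration of accent-aware phonetic correction, the sketch below maps a misspelling to an approximate phone sequence with a few invented Irish-English-flavoured grapheme rules and picks the dictionary word with the smallest phone-level edit distance. The rules, lexicon, and matching strategy are hypothetical, not the adapted spellchecker evaluated in the paper.

```python
# Toy sketch of accent-aware phonetic spelling correction (rules and lexicon
# are invented for illustration).
ACCENT_RULES = [("th", "t"), ("igh", "ai"), ("ck", "k")]   # Irish-English-flavoured examples

def to_phones(word: str) -> list[str]:
    for graph, phone in ACCENT_RULES:
        word = word.replace(graph, phone)
    return list(word)

def edit_distance(a: list[str], b: list[str]) -> int:
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def correct(misspelling: str, dictionary: list[str]) -> str:
    return min(dictionary, key=lambda w: edit_distance(to_phones(misspelling), to_phones(w)))

print(correct("tink", ["think", "tank", "ink"]))   # "think": 'th' and 't' collapse under the rules
```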
The paper presents a novel technique for speech translation using hierarchical phrase-based statistical machine translation (HPB-SMT). The system is based on translating speech from phone sequences, as opposed to the conventional approach of translating speech from word sequences. The technique facilitates speech translation by allowing a machine translation (MT) system to access phonetic information. This enables the MT system to act as both a word recognition and a translation component, and results in better performance than conventional speech translation approaches by recovering from recognition errors with the help of a source language model, a translation model and a target language model. For this purpose, the MT translation models are adapted to work on source-language phones using a grapheme-to-phoneme component. Source-side phonetic confusions are handled using a confusion network. The results on the IWSLT'10 English-Chinese translation task show a significant improvement in translation quality. In this paper, results for HPB-SMT are compared with previously published results of a phrase-based statistical machine translation (PB-SMT) system (the baseline). The HPB-SMT system outperforms PB-SMT in this regard.
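A highly simplified sketch of the core idea of moving the source side from words to phones: a grapheme-to-phoneme step converts the input before any phrase matching. The toy G2P table and letter fallback are assumptions; the actual system uses a full G2P component, a confusion network, and a hierarchical phrase-based decoder.

```python
# Simplified sketch of converting the source side of an MT input from words to
# phones via a grapheme-to-phoneme (G2P) step (toy G2P table only).
G2P = {
    "speech": ["s", "p", "iy", "ch"],
    "translation": ["t", "r", "ae", "n", "s", "l", "ey", "sh", "ah", "n"],
}

def words_to_phones(sentence: str) -> list[str]:
    phones = []
    for word in sentence.lower().split():
        phones.extend(G2P.get(word, list(word)))   # fall back to letters for unknown words
    return phones

# A phone-level phrase table would then be queried with these sequences instead
# of word sequences, letting the MT system recover from recognition errors.
print(words_to_phones("speech translation"))
```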
The work presented in this paper explores the use of Indonesian transliteration to support English pronunciation practice. It is mainly aimed at Indonesian speakers who have no or minimal English language skills. The implemented approach combines a rule-based and a statistical method. The rules for English-Phone-to-Indonesian-Grapheme mapping are implemented with a Finite State Transducer (FST), followed by a statistical method, a grapheme-based trigram language model. The generated Indonesian transliteration was used as a means to support the learners, whose speech was then recorded. The speech recordings have been evaluated by 19 participants: 8 native English speakers and 11 non-native speakers. The results show that the transliteration contributes positively to the improvement of their English pronunciation.
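The sketch below illustrates the two-stage idea in miniature: rule-based expansion of English phones into Indonesian grapheme candidates, followed by character-trigram scoring. The mapping rules and training text are invented, and a plain dictionary stands in for the finite state transducer used in the actual system.

```python
# Rough sketch of rule-based phone-to-grapheme mapping followed by trigram
# scoring (toy rules and training text; the real system uses an FST and a
# proper grapheme-based trigram language model).
from collections import Counter

PHONE_TO_GRAPHEME = {"th": ["t", "d"], "ae": ["e"], "ng": ["ng"]}   # toy rules

def candidate_transliterations(phones):
    """Expand all rule combinations for a phone sequence."""
    candidates = [""]
    for p in phones:
        options = PHONE_TO_GRAPHEME.get(p, [p])
        candidates = [c + o for c in candidates for o in options]
    return candidates

def trigram_score(text, trigram_counts):
    """Score a candidate by summing counts of its character trigrams."""
    return sum(trigram_counts[text[i:i + 3]] for i in range(len(text) - 2))

training_text = "tenda dengan tangan"     # placeholder Indonesian text
trigrams = Counter(training_text[i:i + 3] for i in range(len(training_text) - 2))

candidates = candidate_transliterations(["th", "ae", "ng"])
print(max(candidates, key=lambda c: trigram_score(c, trigrams)))
```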
Audiobooks are a rich resource of large quantities of natural-sounding, highly expressive speech. In our previous research we have shown that it is possible to detect different expressive voice styles represented in a particular audiobook, using unsupervised clustering to group the speech corpus of the audiobook into smaller subsets representing the detected voice styles. These subsets of corpora of different voice styles reflect the various ways a speaker uses their voice to express involvement and affect, or to imitate characters. This study evaluates the detection of voice styles in an audiobook for the application of expressive speech synthesis. A further aim of this study is to investigate the usability of audiobooks as a language resource for expressive speech synthesis of conversational utterances. Two evaluations have been carried out to assess the effect of the genre transfer: transferring expressive speech from read-aloud literature to conversational phrases by means of speech synthesis. The first evaluation revealed that listeners have different voice style preferences for a particular conversational phrase. The second evaluation showed that it is possible for users of speech synthesis systems to learn the characteristics of a voice style well enough to make reliable predictions about what a certain utterance will sound like when synthesised using that voice style.
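As a small illustration of the unsupervised clustering step mentioned above, the sketch below groups utterance-level acoustic feature vectors into a fixed number of voice styles with k-means. The features are random placeholders and the number of styles is an assumption; the study works with real acoustic and prosodic features from the audiobook.

```python
# Minimal sketch of grouping audiobook utterances into voice styles by
# unsupervised clustering (random placeholder features for illustration).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_utterances, n_features = 200, 12          # e.g. F0 statistics, energy, speaking rate
utterance_features = rng.normal(size=(n_utterances, n_features))

n_styles = 4                                # assumed number of voice styles
styles = KMeans(n_clusters=n_styles, n_init=10, random_state=0).fit_predict(utterance_features)

# Each cluster index is treated as a voice style; its utterances form the
# sub-corpus used to build that style's synthesis voice.
for style in range(n_styles):
    print(f"style {style}: {np.sum(styles == style)} utterances")
```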
This paper describes a prototype of a computer-assisted pronunciation training system called MySpeech. The interface of the MySpeech system is web-based, and it currently enables users to practice pronunciation by listening to speech spoken by native speakers and tuning their speech production to correct any mispronunciations detected by the system. This practice exercise is available for different topics and difficulty levels. An experiment was conducted in this work that combines the MySpeech service with the WebWOZ Wizard-of-Oz platform (http://www.webwoz.com), in order to improve the human-computer interaction (HCI) of the service and the feedback that it provides to the user. The Wizard-of-Oz method employed enables a human (who acts as a wizard) to give feedback to the practising user, while the user is not aware that there is another person involved in the communication. This experiment made it possible to quickly test an HCI model before implementing it in the MySpeech system. It also allowed input data to be collected from the wizard that can be used to improve the proposed model. Another outcome of the experiment was a preliminary evaluation of the pronunciation learning service in terms of user satisfaction, which would be difficult to conduct before integrating the HCI part.