Cornelius Weber


2022

pdf
A Multimodal German Dataset for Automatic Lip Reading Systems and Transfer Learning
Gerald Schwiebert | Cornelius Weber | Leyuan Qu | Henrique Siqueira | Stefan Wermter
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Large datasets as required for deep learning of lip reading do not exist in many languages. In this paper we present the dataset GLips (German Lips) consisting of 250,000 publicly available videos of the faces of speakers of the Hessian Parliament, which was processed for word-level lip reading using an automatic pipeline. The format is similar to that of the English language LRW (Lip Reading in the Wild) dataset, with each video encoding one word of interest in a context of 1.16 seconds duration, which yields compatibility for studying transfer learning between both datasets. By training a deep neural network, we investigate whether lip reading has language-independent features, so that datasets of different languages can be used to improve lip reading models. We demonstrate learning from scratch and show that transfer learning from LRW to GLips and vice versa improves learning speed and performance, in particular for the validation set.

2020

pdf
EDA: Enriching Emotional Dialogue Acts using an Ensemble of Neural Annotators
Chandrakant Bothe | Cornelius Weber | Sven Magg | Stefan Wermter
Proceedings of the Twelfth Language Resources and Evaluation Conference

The recognition of emotion and dialogue acts enriches conversational analysis and help to build natural dialogue systems. Emotion interpretation makes us understand feelings and dialogue acts reflect the intentions and performative functions in the utterances. However, most of the textual and multi-modal conversational emotion corpora contain only emotion labels but not dialogue acts. To address this problem, we propose to use a pool of various recurrent neural models trained on a dialogue act corpus, with and without context. These neural models annotate the emotion corpora with dialogue act labels, and an ensemble annotator extracts the final dialogue act label. We annotated two accessible multi-modal emotion corpora: IEMOCAP and MELD. We analyzed the co-occurrence of emotion and dialogue act labels and discovered specific relations. For example, Accept/Agree dialogue acts often occur with the Joy emotion, Apology with Sadness, and Thanking with Joy. We make the Emotional Dialogue Acts (EDA) corpus publicly available to the research community for further study and analysis.

2018

pdf
A Context-based Approach for Dialogue Act Recognition using Simple Recurrent Neural Networks
Chandrakant Bothe | Cornelius Weber | Sven Magg | Stefan Wermter
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf
KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos
Egor Lakomkin | Sven Magg | Cornelius Weber | Stefan Wermter
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We describe KT-Speech-Crawler: an approach for automatic dataset construction for speech recognition by crawling YouTube videos. We outline several filtering and post-processing steps, which extract samples that can be used for training end-to-end neural speech recognition systems. In our experiments, we demonstrate that a single-core version of the crawler can obtain around 150 hours of transcribed speech within a day, containing an estimated 3.5% word error rate in the transcriptions. Automatically collected samples contain reading and spontaneous speech recorded in various conditions including background noise and music, distant microphone recordings, and a variety of accents and reverberation. When training a deep neural network on speech recognition, we observed around 40% word error rate reduction on the Wall Street Journal dataset by integrating 200 hours of the collected samples into the training set.

2017

pdf
Automatically augmenting an emotion dataset improves classification using audio
Egor Lakomkin | Cornelius Weber | Stefan Wermter
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

In this work, we tackle a problem of speech emotion classification. One of the issues in the area of affective computation is that the amount of annotated data is very limited. On the other hand, the number of ways that the same emotion can be expressed verbally is enormous due to variability between speakers. This is one of the factors that limits performance and generalization. We propose a simple method that extracts audio samples from movies using textual sentiment analysis. As a result, it is possible to automatically construct a larger dataset of audio samples with positive, negative emotional and neutral speech. We show that pretraining recurrent neural network on such a dataset yields better results on the challenging EmotiW corpus. This experiment shows a potential benefit of combining textual sentiment analysis with vocal information.

pdf
Reusing Neural Speech Representations for Auditory Emotion Recognition
Egor Lakomkin | Cornelius Weber | Sven Magg | Stefan Wermter
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Acoustic emotion recognition aims to categorize the affective state of the speaker and is still a difficult task for machine learning models. The difficulties come from the scarcity of training data, general subjectivity in emotion perception resulting in low annotator agreement, and the uncertainty about which features are the most relevant and robust ones for classification. In this paper, we will tackle the latter problem. Inspired by the recent success of transfer learning methods we propose a set of architectures which utilize neural representations inferred by training on large speech databases for the acoustic emotion recognition task. Our experiments on the IEMOCAP dataset show ~10% relative improvements in the accuracy and F1-score over the baseline recurrent neural network which is trained end-to-end for emotion recognition.