Speech emotion recognition has been a focus of research for several decades and has many applications. One problem is the sparsity of labeled data for supervised learning. One way to tackle this problem is to synthesize data with emotion-simulating speech synthesis approaches. We present a synthesized database of five basic emotions and neutral expression, based on rule-based manipulation for a diphone synthesizer, which we release to the public. The database has been validated in several machine learning experiments as a training set for detecting emotional expression in natural speech data. The scripts to generate such a database have been made open source and could be used to aid speech emotion recognition for a low-resourced language, as MBROLA supports 35 languages.
For several decades, emotional speech databases have been recorded by various laboratories. Many of them contain acted portrayals of Darwin’s famous “big four” basic emotions. In this paper, we investigate how far a selection of them are comparable, using two approaches: on the one hand, modeling similarity as performance in cross-database machine learning experiments, and on the other, analyzing a manually picked set of four acoustic features that represent different phonetic areas. It is interesting to see that some databases (we added a synthetic one) perform well as a training set for others while some do not. Generally speaking, we found indications of both similarity and specificity across languages.
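The cross-database setup described above can be illustrated with a minimal sketch: train a classifier on one database, test it on every other one, and collect the scores in a matrix. Everything here is an assumption for illustration, not the authors' method: the nearest-centroid classifier, unweighted average recall (UAR) as the score, the database names `db_a` and `db_b`, and the toy two-dimensional feature values.

```python
from collections import defaultdict


def fit_centroids(samples):
    """Compute one mean feature vector per label from (features, label) pairs."""
    sums, counts = {}, defaultdict(int)
    for x, y in samples:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: sqdist(centroids[y]))


def uar(centroids, samples):
    """Unweighted average recall: the mean of the per-class recalls."""
    hits, totals = defaultdict(int), defaultdict(int)
    for x, y in samples:
        totals[y] += 1
        hits[y] += predict(centroids, x) == y
    return sum(hits[y] / totals[y] for y in totals) / len(totals)


def cross_database_matrix(databases):
    """Train on each database and test on every other one."""
    matrix = {}
    for train_name, train in databases.items():
        model = fit_centroids(train)
        for test_name, test in databases.items():
            if train_name != test_name:
                matrix[(train_name, test_name)] = uar(model, test)
    return matrix


# Hypothetical toy data: two made-up acoustic features per utterance.
toy = {
    "db_a": [([1.0, 1.0], "angry"), ([0.9, 1.1], "angry"),
             ([0.0, 0.0], "sad"), ([0.1, -0.1], "sad")],
    "db_b": [([0.8, 0.9], "angry"), ([0.2, 0.1], "sad")],
}
results = cross_database_matrix(toy)
for (train_name, test_name), score in results.items():
    print(f"train={train_name} test={test_name} UAR={score:.2f}")
```

An asymmetric matrix (a database that trains well for others but is hard to predict itself, or vice versa) would be one way to read the "similarity as performance" idea.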
We present advancements in a software tool called Nkululeko that lets users perform (semi-)supervised machine learning experiments in the speaker characteristics domain. It is based on audformat, a format for describing speech database metadata. Thanks to an interface based on configurable templates, it supports best practice and very fast setup of experiments without requiring proficiency in the underlying language, Python. The paper explains how to use Nkululeko and presents two typical experiments: comparing expert acoustic features with artificial neural network embeddings for emotion classification and speaker age regression.
We introduce a spoken language resource for analyzing the impact that physical exercise has on human speech production. In particular, the database provides heart rate and skin conductance measurements alongside the audio recordings. It contains recordings from 19 subjects in a relaxed state and after exercising. The audio material includes breathing, sustained vowels, and read text. Further, we describe audio features pre-extracted with our openSMILE feature extractor, together with baseline performance for the recognition of high and low heart rate using these features. The baseline results show the feasibility of automatically estimating heart rate from the human voice, in particular from sustained vowels. Both regression, to predict the exact heart rate value, and a binary classification setting for high and low heart rate classes are investigated. Finally, we report tendencies in feature group relevance in the contexts of heart rate estimation and skin conductance estimation.
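The two task formulations mentioned above (regression on the exact heart rate and binary high/low classification) can be derived from the same measurements. The sketch below is illustrative only: the abstract does not state how the high and low classes are defined, so the median split is an assumption, as are the helper names, the mean absolute error as the regression metric, and all heart rate values.

```python
from statistics import mean, median


def binarize_heart_rate(bpm_values, threshold=None):
    """Map heart rates to 'high'/'low'; defaults to a median split (assumed)."""
    if threshold is None:
        threshold = median(bpm_values)
    return ["high" if v > threshold else "low" for v in bpm_values]


def mean_absolute_error(true_values, predicted_values):
    """Average absolute deviation between measured and predicted heart rate."""
    return mean(abs(t - p) for t, p in zip(true_values, predicted_values))


# Made-up measurements in beats per minute, for illustration only.
bpm = [62, 70, 95, 130, 118, 66]
# Made-up outputs of some hypothetical regression model.
predicted = [60, 75, 90, 120, 120, 70]

labels = binarize_heart_rate(bpm)               # classification targets
mae = mean_absolute_error(bpm, predicted)       # regression error

print(labels)
print(f"MAE: {mae:.2f} bpm")
```

A fixed physiological threshold could replace the median split by passing `threshold` explicitly.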