Workshop on Speech and Music Processing (2021)


Proceedings of the Workshop on Speech and Music Processing 2021

Anupam Biswas | Rabul Hussain Laskar | Pinki Roy

Classifying Emotional Utterances by Employing Multi-modal Speech Emotion Recognition
Dipankar Das

Deep learning methods have been applied to several speech processing problems in recent years. In the present work, we explore different deep learning models for speech emotion recognition. We employ a standard deep feedforward neural network (FFNN) and a convolutional neural network (CNN) to classify audio files according to their emotional content. A comparative study indicates that the CNN model outperforms the FFNN in both emotion and gender classification. We observed that purely audio-based models can capture emotions only up to a certain limit. We therefore attempted a multi-modal framework that combines the benefits of audio and text features and feeds them into a recurrent encoder. Finally, the audio and text encoders are merged to provide the desired impact on various datasets. In addition, a database consisting of emotional utterances of several words has been developed as part of this work. It contains the same word uttered with different emotions. Although the database is not yet large, it is ideally intended to contain all the English words that exist in an English dictionary.

Prosody Labelled Dataset for Hindi
Esha Banerjee | Atul Kr. Ojha | Girish Jha

This study aims to develop an intonation-labelled database for Hindi to enhance prosody in ASR and TTS systems, which is also helpful for building speech-to-speech machine translation systems. Although no single standard for prosody labelling exists for Hindi, researchers have employed perceptual and statistical methods to draw inferences about the behaviour of prosody patterns in Hindi. Based on such existing research and on largely agreed-upon intonational theories of Hindi, this study attempts to develop a manually annotated prosodic corpus of Hindi speech data, which can be used in the future to train speech models for natural-sounding speech. 500 sentences (2,550 words) of declarative and interrogative types have been labelled using Praat.

Multitask Learning based Deep Learning Model for Music Artist and Language Recognition
Yeshwant Singh | Anupam Biswas

Artist recognition and music language recognition for music recordings are crucial tasks in the music information retrieval domain. These tasks have many industrial applications and have become much more important with the advent of music streaming platforms. This work proposes a multitask learning-based deep learning model that leverages the shared latent representation between these two related tasks. Experimentally, we observe that applying multitask learning to a simple convolutional neural network-based model with only a few blocks pays off with improved performance. We conduct experiments on a regional music dataset curated for this task and released for others. Results show an improvement of up to 8.7 percent in AUC-PR, with similar improvements observed in AUC-ROC.

Comparative Analysis of Melodia and Time-Domain Adaptive Filtering based Model for Melody Extraction from Polyphonic Music
Ranjeet Kumar | Anupam Biswas | Pinki Roy | Yeshwant Singh

Among the many applications of Music Information Retrieval (MIR), melody extraction is one of the most essential, and it has become one of the foremost current research challenges in the field. Given the tremendous amount of music available at our fingertips, we now need new means of defining, indexing, finding, and interacting with musical information. This article looks at some of the approaches that open the door to a broad variety of applications by automatically predicting the pitch sequence of a melody directly from the audio signal of a polyphonic music recording, a task commonly known as melody extraction. It is fairly easy for humans to identify the pitch of a melody, but doing so automatically is very difficult and time-consuming. In this article, the state-of-the-art melody extraction approach Melodia is compared with a technique based on time-domain adaptive filtering, in terms of the evaluation metrics introduced in MIREX 2005. The paper also discusses datasets and state-of-the-art approaches for extracting the main melody from music signals, and presents a summary of the evaluation metrics with which these methodologies have been examined on various datasets.
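
As a rough illustration of the kind of frame-level metrics used for melody extraction since MIREX 2005, the sketch below computes voicing recall, voicing false alarm, raw pitch accuracy, and overall accuracy. It is a simplified assumption-laden sketch, not the official evaluation code and not the paper's implementation: the function names, the convention that 0 Hz marks an unvoiced frame, and the 50-cent tolerance parameter are all choices made here for illustration, and raw chroma accuracy is omitted.

```python
import math

def cents_apart(f_ref, f_est):
    """Absolute pitch distance between two positive frequencies, in cents."""
    return abs(1200.0 * math.log2(f_est / f_ref))

def ratio(numerator, denominator):
    """Safe division: returns 0.0 when there are no frames to count."""
    return numerator / denominator if denominator else 0.0

def melody_metrics(ref_hz, est_hz, tolerance_cents=50.0):
    """Frame-level melody metrics; a frequency of 0 means 'unvoiced' (assumed convention)."""
    if len(ref_hz) != len(est_hz):
        raise ValueError("reference and estimate must have the same number of frames")
    voiced_ref = sum(1 for r in ref_hz if r > 0)
    unvoiced_ref = len(ref_hz) - voiced_ref
    recalled = false_alarms = pitch_hits = overall_hits = 0
    for r, e in zip(ref_hz, est_hz):
        if r > 0:  # reference frame contains melody
            if e > 0:
                recalled += 1
                if cents_apart(r, e) <= tolerance_cents:
                    pitch_hits += 1
                    overall_hits += 1
        elif e > 0:  # melody reported where the reference has none
            false_alarms += 1
        else:  # both agree the frame is unvoiced
            overall_hits += 1
    return {
        "voicing_recall": ratio(recalled, voiced_ref),
        "voicing_false_alarm": ratio(false_alarms, unvoiced_ref),
        "raw_pitch_accuracy": ratio(pitch_hits, voiced_ref),
        "overall_accuracy": ratio(overall_hits, len(ref_hz)),
    }

# Toy example: four frames; the second estimate is one semitone sharp,
# and the fourth frame is missed entirely.
ref = [220.0, 220.0, 0.0, 440.0]
est = [220.0, 233.08, 0.0, 0.0]
m = melody_metrics(ref, est)
```

In this toy example only the first of the three voiced reference frames is matched within tolerance, so raw pitch accuracy is one third, while two of the four frames overall are labelled correctly.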

Dorabella Cipher as Musical Inspiration
Bradley Hauer | Colin Choi | Abram Hindle | Scott Smallwood | Grzegorz Kondrak

The Dorabella cipher is an encrypted note of English composer Edward Elgar, which has defied decipherment attempts for more than a century. While most proposed solutions are English texts, we investigate the hypothesis that Dorabella represents enciphered music. We weigh the evidence in favor of and against the hypothesis, devise a simplified music notation, and attempt to reconstruct a melody from the cipher. Our tools are n-gram models of music which we validate on existing music corpora enciphered using monoalphabetic substitution. By applying our methods to Dorabella, we produce a decipherment with musical qualities, which is then transformed via artful composition into a listenable melody. Far from arguing that the end result represents the only true solution, we instead frame the process of decipherment as part of the composition process.
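
The two ingredients named in this abstract, monoalphabetic substitution and an n-gram model of music, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' system: the seven-letter note alphabet, the toy corpus, the add-alpha smoothing, and every function name here are hypothetical, and the authors' simplified notation is not reproduced.

```python
import math
import random
from collections import Counter

# Hypothetical simplified notation: one symbol per note name.
NOTES = ["C", "D", "E", "F", "G", "A", "B"]

def make_key(alphabet, seed=0):
    """Build a monoalphabetic substitution key as a random permutation."""
    shuffled = list(alphabet)
    random.Random(seed).shuffle(shuffled)
    return dict(zip(alphabet, shuffled))

def invert_key(key):
    """Return the key that undoes the given substitution."""
    return {cipher: plain for plain, cipher in key.items()}

def apply_key(sequence, key):
    """Substitute every symbol of the sequence according to the key."""
    return [key[symbol] for symbol in sequence]

def train_bigram(sequences, alphabet, alpha=1.0):
    """Estimate add-alpha smoothed bigram log-probabilities from note sequences."""
    pair_counts = Counter()
    context_counts = Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    size = len(alphabet)
    return {
        (p, c): math.log((pair_counts[(p, c)] + alpha)
                         / (context_counts[p] + alpha * size))
        for p in alphabet
        for c in alphabet
    }

def score(sequence, model):
    """Log-probability of the sequence's transitions under the bigram model."""
    return sum(model[(p, c)] for p, c in zip(sequence, sequence[1:]))

# Toy corpus and melody, invented purely for illustration.
corpus = [list("CDEFGABAGFEDC"), list("CEGECEGE"), list("CDECDEFGFE")]
model = train_bigram(corpus, NOTES)
key = make_key(NOTES, seed=1)
melody = list("CDEFGFEDC")
cipher = apply_key(melody, key)
recovered = apply_key(cipher, invert_key(key))
```

A decipherment search would compare `score` across candidate keys and prefer the key whose output the model finds most music-like; that search is left out of this sketch.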