Alexey Karpov


RUSAVIC Corpus: Russian Audio-Visual Speech in Cars
Denis Ivanko | Alexandr Axyonov | Dmitry Ryumin | Alexey Kashevnik | Alexey Karpov
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present a new audio-visual speech corpus (RUSAVIC) recorded in a car environment and designed for noise-robust speech recognition. Our goal was to produce a speech corpus that is natural (recorded in real driving conditions), controlled (providing different SNR levels through open/closed windows, a moving/parked vehicle, etc.), and of adequate size (enough data to train state-of-the-art NN approaches). We focus on the problem of audio-visual speech recognition: using automated lip-reading to improve the performance of audio-based speech recognition in the presence of severe acoustic noise caused by road traffic. We also describe the equipment and procedures used to create the RUSAVIC corpus. Data are collected synchronously by several smartphones located at different angles, each equipped with a FullHD video camera and a microphone. The corpus includes recordings of 20 drivers, with a minimum of 10 recording sessions for each. Besides providing a detailed description of the dataset and its collection pipeline, we evaluate several popular audio and visual speech recognition methods and present a set of baseline recognition results. At the moment, RUSAVIC is a unique audio-visual corpus for the Russian language recorded in in-the-wild conditions, and we make it publicly available.


Class-based LSTM Russian Language Model with Linguistic Information
Irina Kipyatkova | Alexey Karpov
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present class-based LSTM Russian language models (LMs) whose classes are generated using both word frequency and linguistic information, the latter obtained with the “VisualSynan” software from the AOT project. We created LSTM LMs with various numbers of classes and compared them with a word-based LM and a class-based LM with word2vec class generation in terms of perplexity, training time, and WER. In addition, we performed a linear interpolation of the LSTM language models with the baseline 3-gram language model. The LSTM language models were used for very large vocabulary continuous Russian speech recognition at the N-best list rescoring stage. We achieved a significant reduction in training time with only a slight degradation in recognition accuracy compared to the word-based LM. In addition, our LM with classes generated using linguistic information outperformed the LM with classes generated using word2vec. We achieved a WER of 14.94% on our own corpus of continuous Russian speech, which is a 15% relative reduction with respect to the baseline 3-gram model.
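The interpolation and N-best rescoring steps mentioned in this abstract can be illustrated with a minimal sketch. It assumes per-word natural-log probabilities from both models are already available. The function names, the dictionary keys, the interpolation weight `lam`, and the `lm_weight` scaling factor are hypothetical and are not taken from the paper.

```python
import math


def interpolate_logprob(lstm_logprob, ngram_logprob, lam=0.5):
    """Return log(lam * P_lstm + (1 - lam) * P_ngram) for one word.

    Inputs are natural-log probabilities; lam must lie strictly in (0, 1).
    The sum is computed with the log-sum-exp trick for numerical stability.
    """
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must be strictly between 0 and 1")
    a = math.log(lam) + lstm_logprob
    b = math.log(1.0 - lam) + ngram_logprob
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def rescore_nbest(nbest, lam=0.5, lm_weight=1.0):
    """Sort N-best hypotheses by acoustic score plus interpolated LM score.

    Each hypothesis is a dict with hypothetical keys:
      "acoustic_logprob": float
      "lstm_logprobs":    list of per-word log-probabilities (LSTM LM)
      "ngram_logprobs":   list of per-word log-probabilities (3-gram LM)
    """
    def total_score(hyp):
        lm_score = sum(
            interpolate_logprob(l, n, lam)
            for l, n in zip(hyp["lstm_logprobs"], hyp["ngram_logprobs"])
        )
        return hyp["acoustic_logprob"] + lm_weight * lm_score

    return sorted(nbest, key=total_score, reverse=True)
```

In this sketch the interpolation is done per word in the probability domain, which is the usual form of linear interpolation. In practice the weight would be tuned on held-out data.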

TheRuSLan: Database of Russian Sign Language
Ildar Kagirov | Denis Ivanko | Dmitry Ryumin | Alexander Axyonov | Alexey Karpov
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, a new Russian sign language multimedia database, TheRuSLan, is presented. The database includes lexical units (single words and phrases) from Russian sign language within one subject area, namely “food products at the supermarket”. It was collected using an MS Kinect 2.0 device in both FullHD video and depth map modes, which provides new opportunities for the lexicographical description of the Russian sign language vocabulary and enhances research in the field of automatic gesture recognition. Russian sign language has an official status in Russia, and over 120,000 deaf people in Russia and its neighboring countries use it as their first language. Russian sign language has no writing system, is poorly described, and belongs to the low-resource languages. The authors formulate the basic principles of annotation of sign words, based on the collected data, and describe the content of the collected database. In the future, the database will be expanded to comprise more lexical units. The database is made explicitly for the task of creating an automatic system for Russian sign language recognition.


Cross-Corpus Data Augmentation for Acoustic Addressee Detection
Oleg Akhtiamov | Ingo Siegert | Alexey Karpov | Wolfgang Minker
Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue

Acoustic addressee detection (AD) is a modern paralinguistic and dialogue challenge that arises especially in voice assistants. In the present study, we distinguish addressees in two settings (a conversation between several people and a spoken dialogue system, and a conversation between several adults and a child) and introduce the first competitive baseline (unweighted average recall of 0.891) for the Voice Assistant Conversation Corpus, which models the first setting. We jointly solve both classification problems using three models: a linear support vector machine operating on acoustic functionals and two neural networks utilising raw waveforms alongside acoustic low-level descriptors. We investigate how different corpora influence each other by applying the mixup approach to data augmentation. We also study the influence of various acoustic context lengths on AD. Two-second speech fragments turn out to be sufficient for reliable AD. Mixup is shown to be beneficial for merging acoustic data (extracted features but not raw waveforms) from different domains, which allows us to reach higher classification performance on human-machine AD, and also for training a multipurpose neural network capable of solving both human-machine and adult-child AD problems.
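As an illustration of the two technical ingredients named in this abstract, the following minimal sketch shows generic mixup on feature vectors with one-hot labels, together with unweighted average recall (UAR). It is not the authors' implementation. The function names, the `alpha` parameter of the Beta distribution, and the random pairing scheme are assumptions made for the example.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Convex combination of two examples: lam * a + (1 - lam) * b.

    x_* are feature vectors and y_* are one-hot (or soft) label vectors.
    """
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def mixup_batch(features, labels, alpha=0.2, rng=None):
    """Mix each example with a randomly chosen partner from the same pool.

    lam is drawn from Beta(alpha, alpha), as in standard mixup; alpha=0.2
    is an illustrative value, not one reported in the abstract. The pool
    may contain examples from several corpora (hypothetical setup).
    """
    rng = rng or random.Random()
    partners = list(range(len(features)))
    rng.shuffle(partners)
    mixed = []
    for i, j in enumerate(partners):
        lam = rng.betavariate(alpha, alpha)
        mixed.append(mixup_pair(features[i], labels[i],
                                features[j], labels[j], lam))
    return mixed


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [k for k, t in enumerate(y_true) if t == cls]
        hits = sum(1 for k in idx if y_pred[k] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

Because UAR averages recall over classes rather than over samples, it is not inflated by a dominant class, which is why it is a common metric for imbalanced paralinguistic tasks.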