Solene Virginie Evain


2024

Audiocite.net : A Large Spoken Read Dataset in French
Soline Felice | Solene Virginie Evain | Solange Rossato | François Portet
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The advent of self-supervised learning (SSL) in speech processing has allowed the use of large unlabeled datasets to learn pre-trained models, serving as powerful encoders for various downstream tasks. However, the application of these SSL methods to languages such as French has proved difficult due to the scarcity of large French speech datasets. To advance the emergence of pre-trained models for French speech, we present the Audiocite.net corpus, composed of 6,682 hours of audiobook recordings shared by 130 readers on the audiocite.net website. In addition to describing the creation process and final statistics, we also show how this dataset impacted the models of the LeBenchmark project in its 14k version for speech processing downstream tasks.

Unraveling Spontaneous Speech Dimensions for Cross-Corpus ASR System Evaluation for French
Solene Virginie Evain | Solange Rossato | François Portet
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Many papers on speech processing use the term ‘spontaneous speech’ as a catch-all term for situations like speaking with a friend, being interviewed on radio/TV or giving a lecture. However, Automatic Speech Recognition (ASR) system performance seems to vary on this type of speech: the more spontaneous the speech, the higher the WER (Word Error Rate). Our study focuses on better understanding the elements influencing the levels of spontaneity, in order to evaluate the relation between categories of spontaneity and ASR system performance and to improve recognition on those categories. We first analyzed the literature, listed and unraveled those elements, and finally identified four axes: the situation of communication, the level of intimacy between speakers, the channel and the type of communication. Then, we trained ASR systems and measured the impact on WER of instances of face-to-face interaction labeled with the previous dimensions (different levels of spontaneity). We varied two axes and found that both dimensions have an impact on the WER. The situation of communication seems to have the biggest impact on spontaneity: ASR systems give better results for situations like an interview than for friends having a conversation at home.