Youness Dkhissi
2026
SENS-ASR: Semantic Embedding Injection in Neural-transducer for Streaming Automatic Speech Recognition
Youness Dkhissi | Valentin Vielzeuf | Elys Allesiardo | Anthony Larcher
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Many Automatic Speech Recognition (ASR) applications require streaming processing of the audio data. In streaming mode, ASR systems need to start transcribing the input stream before it is complete, i.e., the systems have to process a stream of inputs with limited (or no) future context. Compared to offline mode, this reduction of the future context degrades the performance of Streaming-ASR systems, especially under low-latency constraints. In this work, we present SENS-ASR, an approach to enhance the transcription quality of Streaming-ASR by reinforcing the acoustic information with semantic information. This semantic information is extracted from the available past frame-embeddings by a context module. This module is trained using knowledge distillation from a sentence embedding Language Model fine-tuned on the training dataset transcriptions. Experiments on standard datasets show that SENS-ASR significantly improves the Word Error Rate in small-chunk streaming scenarios.
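As an illustration of the distillation idea described in this abstract, the following is a minimal sketch, not the authors' implementation. It assumes that the context module can be approximated by a causal running mean over past frame embeddings and that the distillation objective is a cosine distance to the teacher sentence embedding. The abstract specifies neither, and the names `causal_mean`, `cosine_distance`, and `distillation_loss` are hypothetical.

```python
import math

def causal_mean(frames, t):
    """Average the frame embeddings 0..t only (no future context is used)."""
    past = frames[: t + 1]
    dim = len(past[0])
    return [sum(f[d] for f in past) / len(past) for d in range(dim)]

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def distillation_loss(frames, teacher_embedding):
    """Mean distance between each causal summary and the teacher sentence embedding.

    Assumption: the student summary at step t is a plain causal mean; a real
    context module would be a trained network.
    """
    losses = [
        cosine_distance(causal_mean(frames, t), teacher_embedding)
        for t in range(len(frames))
    ]
    return sum(losses) / len(losses)
```

In this sketch the summary at step t depends only on frames up to t, which mirrors the streaming constraint: the student never sees future frames, while the teacher embedding is computed from the full transcription.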
2023
Darbarer @ AutoMin2023: Transcription simplification for concise minute generation from multi-party conversations
Ismaël Rousseau | Loïc Fosse | Youness Dkhissi | Geraldine Damnati | Gwénolé Lecorvé
Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges
This document reports the approach of our team Darbarer for the main task (Task A) of the AutoMin 2023 challenge. Our system is composed of four main modules. The first module relies on a text simplification model that standardizes the utterances of the conversation and compresses the input in order to focus on informative content. The second module handles summarization by employing a straightforward segmentation strategy and a fine-tuned BART-based generative model. The third is a titling module trained to propose a short description of each summarized block. Lastly, we apply a post-processing step aimed at enhancing readability through specific formatting rules. Our contributions lie in the first, third and last steps. Our system generates precise and concise minutes. We provide a detailed description of our modules, discuss the difficulty of evaluating their impact and propose an analysis of observed errors in our generated minutes.
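The four-module structure described in this abstract can be sketched as a simple pipeline. This is an illustrative outline only: the abstract does not specify the segmentation strategy or the formatting rules, so fixed-size blocks and a title line followed by a bulleted summary are assumptions, and `segment`, `generate_minutes`, and the callables `simplify`, `summarize`, and `make_title` are hypothetical stand-ins for the trained models.

```python
from typing import Callable, List

def segment(utterances: List[str], block_size: int) -> List[List[str]]:
    """Split the transcript into consecutive blocks of at most block_size utterances."""
    return [utterances[i:i + block_size] for i in range(0, len(utterances), block_size)]

def generate_minutes(
    utterances: List[str],
    simplify: Callable[[str], str],
    summarize: Callable[[List[str]], str],
    make_title: Callable[[str], str],
    block_size: int = 50,
) -> str:
    # Module 1: simplify each utterance and drop those that become empty.
    simplified = [simplify(u) for u in utterances]
    simplified = [u for u in simplified if u]

    lines = []
    for block in segment(simplified, block_size):
        # Module 2: summarize one block of the segmented transcript.
        summary = summarize(block)
        # Module 3: propose a short title for the summarized block.
        lines.append(make_title(summary))
        # Module 4: post-processing, here reduced to bullet formatting (assumed).
        lines.append("- " + summary)
    return "\n".join(lines)
```

Passing the models in as callables keeps the sketch independent of any particular simplification, summarization, or titling model.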