Vaibhav Srivastav
2025
ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
Siddhant Arora
|
Yifan Peng
|
Jiatong Shi
|
Jinchuan Tian
|
William Chen
|
Shikhar Bharadwaj
|
Hayato Futami
|
Yosuke Kashiwagi
|
Emiru Tsunoo
|
Shuichiro Shimizu
|
Vaibhav Srivastav
|
Shinji Watanabe
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but different web interfaces for each system makes it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo further provides users with the option to get on-the-fly automated evaluation metrics such as (1) latency, (2) ability to understand user input, (3) coherence, diversity, and relevance of system response, and (4) intelligibility and audio quality of system output. Using the evaluation metrics, we compare various cascaded and E2E spoken dialogue systems with a human-human conversation dataset as a proxy. Our analysis demonstrates that the toolkit allows researchers to effortlessly compare and contrast different technologies, providing valuable insights such as current E2E systems having poorer audio quality and less diverse responses. An example demo produced using our toolkit is publicly available here: https://huggingface.co/spaces/Siddhant/Voice_Assistant_Demo.
2022
Embarrassingly Simple Performance Prediction for Abductive Natural Language Inference
Emīls Kadiķis
|
Vaibhav Srivastav
|
Roman Klinger
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
The task of natural language inference (NLI), to decide if a hypothesis entails or contradicts a premise, received considerable attention in recent years. All competitive systems build on top of contextualized representations and make use of transformer architectures for learning an NLI model. When somebody is faced with a particular NLI task, they need to select the best model that is available. This is a time-consuming and resource-intense endeavour. To solve this practical problem, we propose a simple method for predicting the performance without actually fine-tuning the model. We do this by testing how well the pre-trained models perform on the aNLI task when just comparing sentence embeddings with cosine similarity to what kind of performance is achieved when training a classifier on top of these embeddings. We show that the accuracy of the cosine similarity approach correlates strongly with the accuracy of the classification approach with a Pearson correlation coefficient of 0.65. Since the similarity is orders of magnitude faster to compute on a given dataset (less than a minute vs. hours), our method can lead to significant time savings in the process of model selection.
Search
Fix data
Co-authors
- Siddhant Arora 1
- Shikhar Bharadwaj 1
- William Chen 1
- Hayato Futami 1
- Emīls Kadiķis 1
- show all...