Yuki Saito
2026
Real-Time Generation of Game Video Commentary with Multimodal LLMs: Pause-Aware Decoding Approaches
Anum Afzal | Yuki Saito | Hiroya Takamura | Katsuhito Sudoh | Shinnosuke Takamichi | Graham Neubig | Florian Matthes | Tatsuya Ishigaki
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Real-time video commentary generation provides textual descriptions of ongoing events in videos. It supports accessibility and engagement in domains such as sports, esports, and livestreaming. Commentary generation involves two essential decisions: what to say and when to say it. While recent prompting-based approaches using multimodal large language models (MLLMs) have shown strong performance in content generation, they largely ignore the timing aspect. We investigate whether in-context prompting alone can support real-time commentary generation that is both semantically relevant and well-timed. We propose two prompting-based decoding strategies: 1) a fixed-interval approach, and 2) a novel dynamic interval-based decoding approach that adjusts the next prediction timing based on the estimated duration of the previous utterance. Both methods enable pause-aware generation without any fine-tuning. Experiments on Japanese and English datasets of racing and fighting games show that the dynamic interval-based decoding can generate commentary more closely aligned with human utterance timing and content using prompting alone. We release a multilingual benchmark dataset, trained models, and implementations to support future research on real-time video commentary generation.
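The dynamic interval idea in the abstract above can be illustrated with a minimal scheduling sketch. It is not the authors' implementation: the `generate` callback stands in for a prompted MLLM call, the characters-per-second duration estimate is an assumed heuristic, and treating an empty string as a pause is an assumption made for illustration only.

```python
from typing import Callable, List, Tuple


def estimate_duration(utterance: str, chars_per_second: float = 8.0) -> float:
    """Hypothetical duration estimate: character count divided by a speaking rate."""
    return len(utterance) / chars_per_second


def schedule_commentary(
    video_length: float,
    generate: Callable[[float], str],
    min_gap: float = 1.0,
) -> List[Tuple[float, str]]:
    """Return (start_time, utterance) pairs using dynamic intervals.

    After each utterance, the next generation time is pushed back by the
    estimated duration of that utterance (at least `min_gap` seconds).
    An empty string from `generate` is treated as a pause (assumption).
    """
    t = 0.0
    timeline: List[Tuple[float, str]] = []
    while t < video_length:
        utterance = generate(t)
        if utterance:
            timeline.append((t, utterance))
            # Wait until the previous utterance would have finished.
            t += max(estimate_duration(utterance), min_gap)
        else:
            # Pause: nothing to say, check again after the minimum gap.
            t += min_gap
    return timeline
```

A fixed-interval variant would replace both time updates with a constant step, which makes the contrast between the two strategies easy to see.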
J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling
Wataru Nakata | Kentaro Seki | Hitomi Yanaka | Yuki Saito | Shinnosuke Takamichi | Hiroshi Saruwatari
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Spoken dialogue is essential for human-AI interactions, providing expressive capabilities beyond text. Developing effective spoken dialogue systems (SDSs) requires large-scale, high-quality, and diverse spoken dialogue corpora. However, existing datasets are often limited in size, spontaneity, or linguistic coherence. To address these limitations, we introduce J-CHAT, a 76,000-hour open-source Japanese spoken dialogue corpus. Constructed using an automated, language-independent methodology, J-CHAT ensures acoustic cleanliness, diversity, and natural spontaneity. The corpus is built from YouTube and podcast data, with extensive filtering and denoising to enhance quality. Experimental results with generative spoken dialogue language models trained on J-CHAT demonstrate its effectiveness for SDS development. By providing a robust foundation for training advanced dialogue models, we anticipate that J-CHAT will drive progress in human-AI dialogue research and applications.
2025
Static Word Embeddings for Sentence Semantic Representation
Takashi Wada | Yuki Hirakawa | Ryotaro Shimizu | Takahiro Kawashima | Yuki Saito
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We propose new static word embeddings optimised for sentence semantic representation. We first extract word embeddings from a pre-trained Sentence Transformer, and improve them with sentence-level principal component analysis, followed by either knowledge distillation or contrastive learning. During inference, we represent sentences by simply averaging word embeddings, which requires little computational cost. We evaluate models on both monolingual and cross-lingual tasks and show that our model substantially outperforms existing static models on sentence semantic tasks, and even surpasses a basic Sentence Transformer model (SimCSE) on a text embedding benchmark. Lastly, we perform a variety of analyses and show that our method successfully removes word embedding components that are not highly relevant to sentence semantics, and adjusts the vector norms based on the influence of words on sentence semantics.
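The inference step described above, representing a sentence by averaging its word embeddings, can be sketched as follows. This is an illustrative sketch only: the embedding table is a hypothetical toy dictionary rather than the paper's trained embeddings, and the PCA, distillation, and contrastive learning steps are omitted.

```python
import math
from typing import Dict, List


def sentence_embedding(
    tokens: List[str], table: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the static embeddings of the tokens found in `table`.

    Out-of-vocabulary tokens are skipped; an all-zero vector is returned
    when no token is known (an assumption made for this sketch).
    """
    vectors = [table[tok] for tok in tokens if tok in table]
    if not vectors:
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the only work at inference time is a table lookup and a mean, the cost grows linearly with sentence length and requires no forward pass through a Transformer.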
VitaEval: Open-source Human Evaluation Tool for Video-to-Text and Video-to-Audio Systems
Goran Topic | Yuki Saito | Katsuhito Sudoh | Shinnosuke Takamichi | Hiroya Takamura | Graham Neubig | Tatsuya Ishigaki
Proceedings of the 18th International Natural Language Generation Conference: System Demonstrations
2020
SMASH Corpus: A Spontaneous Speech Corpus Recording Third-person Audio Commentaries on Gameplay
Yuki Saito | Shinnosuke Takamichi | Hiroshi Saruwatari
Proceedings of the Twelfth Language Resources and Evaluation Conference
Developing a spontaneous speech corpus would be beneficial for spoken language processing and understanding. We present a speech corpus named the SMASH corpus, which includes spontaneous speech of two Japanese male commentators who made third-person audio commentaries during the gameplay of a fighting game. Each commentator ad-libbed while watching the gameplay, covering various topics that included not only explanations of each moment to convey information on the fight but also comments to entertain listeners. We made transcriptions and topic tags as annotations on the recorded commentaries with our two-step method. We first made automatic and manual transcriptions of the commentaries and then manually annotated the topic tags. This paper describes how we constructed the SMASH corpus and reports some results of the annotations.
DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus
Yuki Yamashita | Tomoki Koriyama | Yuki Saito | Shinnosuke Takamichi | Yusuke Ijima | Ryo Masumura | Hiroshi Saruwatari
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we investigate the effectiveness of using rich annotations in deep neural network (DNN)-based statistical speech synthesis. DNN-based frameworks typically use linguistic information as input features called context instead of directly using text. In such frameworks, we can synthesize not only reading-style speech but also speech with paralinguistic and nonlinguistic features by adding such information to the context. However, it is not clear what kind of information is crucial for reproducing paralinguistic and nonlinguistic features. Therefore, we investigate the effectiveness of rich tags in DNN-based speech synthesis using the Corpus of Spontaneous Japanese (CSJ), which has a large number of annotations on paralinguistic features such as prosody, disfluency, and morphological features. Experimental evaluation results show that the reproducibility of paralinguistic features of synthetic speech was enhanced by adding such information as context.