Yuto Abe
2026
Effects of Dialogue Corpora Properties on Fine-Tuning a Moshi-Based Spoken Dialogue Model
Yuto Abe | Mao Saeki | Atsumoto Ohashi | Shinnosuke Takamichi | Shiyna Fujie | Tetsunori Kobayashi | Tetsuji Ogawa | Ryuichiro Higashinaka
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
Yuto Abe | Mao Saeki | Atsumoto Ohashi | Shinnosuke Takamichi | Shiyna Fujie | Tetsunori Kobayashi | Tetsuji Ogawa | Ryuichiro Higashinaka
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
This study investigates how interactional characteristics of spoken dialogue corpora influence the learning process and resulting behavior of speech language models for full-duplex dialogue systems. While previous research has mainly focused on improving acoustic and linguistic quality, an effective dialogue system must also capture and reproduce task-dependent interactional dynamics such as conversational tempo and turn-taking patterns. To analyze these properties, we evaluated multiple dialogue corpora using NISQA for speech quality, LLM-as-a-Judge for linguistic and semantic appropriateness, and four timing-based indicators: inter-pausal units, pause, gap, and overlap. A curriculum learning strategy was applied to fine-tune a Moshi-based full-duplex dialogue model by incrementally combining corpora with different interactional characteristics. Experimental results on a dialogue continuation task showed that corpus-specific interactional patterns effectively shape model behavior. Chat-style corpora facilitated natural rhythms with moderate overlaps and gaps, whereas consultation-style corpora promoted more stable and deliberate timing. Fine-tuning with high-quality audio improved speech quality, while using task-mismatched data degraded linguistic coherence.