Inyoung Kim


2026

LLMs and speech assistants are increasingly used for task-oriented interactions, yet their evaluation often relies on controlled scenarios that fail to capture the variability and complexity of real user requests. Drink ordering, for example, involves diverse named entities, drink types, sizes, customizations, and brand-specific terminology, as well as spontaneous speech phenomena such as hesitations and self-corrections. To address this gap, we introduce StarDrinks, a test set in English and Korean containing speech utterances features, transcriptions, and annotated slots. Our dataset supports speech-to-slots SLU, transcription-to-slots NLU, and speech-to-transcription ASR evaluation, providing a realistic benchmark for model robustness and generalization in a linguistically rich, real-world task.

2021

While End-2-End Text-to-Speech (TTS) has made significant progresses over the past few years, these systems still lack intuitive user controls over prosody. For instance, generating speech with fine-grained prosody control (prosodic prominence, contextually appropriate emotions) is still an open challenge. In this paper, we investigate whether we can control prosody directly from the input text, in order to code information related to contrastive focus which emphasizes a specific word that is contrary to the presuppositions of the interlocutor. We build and share a specific dataset for this purpose and show that it allows to train a TTS system were this fine-grained prosodic feature can be correctly conveyed using control tokens. Our evaluation compares synthetic and natural utterances and shows that prosodic patterns of contrastive focus (variations of Fo, Intensity and Duration) can be learnt accurately. Such a milestone is important to allow, for example, smart speakers to be programmatically controlled in terms of output prosody.

2020

Cette étude propose de caractériser le non relâchement des plosives finales /p, t, k/ de deux langues d’Asie, tonale (vietnamien) et non tonale (coréen), du point de vue aérodynamique et glottographique. Le comportement glottique (ouverture et fermeture de la glotte, position verticale du larynx) a été examiné en synchronisation avec les valeurs de débits d’air (oral et nasal) pendant les phases de la réalisation consonantique. Les résultats mettent en évidence (1) l’absence de relâchement nasal après l’occlusion de la plosive finale pouvant entraîner une baisse de la pression intraorale, (2) que le larynx s’abaisse systématiquement durant la tenue de la consonne. Cette stratégie de réalisation va dans le sens de notre hypothèse selon laquelle les plosives non relâchées sont produites avec un mécanisme permettant de diminuer la pression intraorale de manière à minimiser le coût articulatoire de la tenue de la closion avec, pour conséquence acoustique, l’absence de burst.

2018