Shunsuke Kando


2026

Children are known to generalize syntactic knowledge at ages when their linguistic input is predominantly raw speech rather than text. This raises the question of whether syntactic generalization can emerge directly from acoustic input. We address this question using Autoregressive Predictive Coding (APC), a simple prediction-based self-supervised speech model. To approximate the input available to human learners while enabling controlled comparison, we train models on both child-directed speech and audiobook speech. We evaluate the models on a minimal-pair benchmark targeting elementary syntactic phenomena, designed to be acquisition-friendly. Our results show that APC partially generalizes word-order regularities when trained to predict near-future frames. However, the model fails to generalize agreement phenomena, suggesting that predictive learning from acoustic signals alone is insufficient. Furthermore, we observe distinct learning dynamics across word-order phenomena, suggesting that some improvements may be driven by shallow statistical regularities rather than genuine syntactic generalization.
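The pairing of a near-future prediction objective with a minimal-pair benchmark can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not say how the models score sentence pairs, so ranking each pair by mean L1 prediction error is an assumption, and the names `apc_l1_loss`, `minimal_pair_accuracy`, `copy_last_frame`, and the `shift` parameter are hypothetical. A real APC model would replace the toy `copy_last_frame` predictor with a trained autoregressive network.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]          # one acoustic feature vector (e.g. a spectrogram frame)
Utterance = List[Frame]          # a sequence of frames
Predictor = Callable[[Utterance], Frame]


def apc_l1_loss(frames: Utterance, predict: Predictor, shift: int = 3) -> float:
    """Mean L1 error between the prediction made at step t and the frame at t + shift."""
    total = 0.0
    count = 0
    for t in range(len(frames) - shift):
        pred = predict(frames[: t + 1])      # the predictor sees frames 0..t only
        target = frames[t + shift]           # the frame `shift` steps in the future
        total += sum(abs(p - y) for p, y in zip(pred, target))
        count += len(target)
    return total / count if count else 0.0


def minimal_pair_accuracy(
    pairs: List[Tuple[Utterance, Utterance]], predict: Predictor, shift: int = 3
) -> float:
    """Fraction of (grammatical, ungrammatical) pairs where the grammatical item has lower loss.

    Scoring by prediction error is an assumption made for this sketch.
    """
    correct = sum(
        apc_l1_loss(good, predict, shift) < apc_l1_loss(bad, predict, shift)
        for good, bad in pairs
    )
    return correct / len(pairs)


def copy_last_frame(history: Utterance) -> Frame:
    """Toy stand-in for a trained model: predict that the future equals the latest frame."""
    return history[-1]


if __name__ == "__main__":
    smooth = [[0.0], [0.0], [0.0], [0.0], [0.0]]
    jumpy = [[0.0], [1.0], [0.0], [1.0], [0.0]]
    print(minimal_pair_accuracy([(smooth, jumpy)], copy_last_frame, shift=1))
```

A smaller `shift` corresponds to predicting nearer-future frames, which is the setting in which the abstract reports partial generalization of word-order regularities.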

2024

This paper aims to forecast the implicit emotion elicited in the dialogue partner by a textual input utterance. Forecasting the interlocutor’s emotion is beneficial for natural language generation in dialogue systems, as it helps avoid generating utterances that make users uncomfortable. Previous studies forecast the emotion conveyed in the interlocutor’s response, assuming it will explicitly reflect their elicited emotion. However, true emotions are not always expressed verbally. We propose a new task to directly forecast the implicit emotion elicited by an input utterance, which does not rely on this assumption. We compare this task with related ones to investigate the impact of dialogue history and one’s own utterance on predicting explicit and implicit emotions. Our results highlight the importance of dialogue history for predicting implicit emotions. They also reveal that, unlike explicit emotions, implicit emotions show limited improvement in predictive performance with one’s own utterance, and that they are more difficult to predict than explicit emotions. We find that even a large language model (LLM) struggles to forecast implicit emotions accurately.

2022

Incorporating stronger syntactic biases into neural language models (LMs) is a long-standing goal, but research in this area often focuses on modeling English text, where constituent treebanks are readily available. Extending constituent tree-based LMs to the multilingual setting, where dependency treebanks are more common, is possible via dependency-to-constituency conversion methods. However, this raises the question of which tree formats are best for learning the model, and for which languages. We investigate this question by training recurrent neural network grammars (RNNGs) using various conversion methods, and evaluating them empirically in a multilingual setting. We examine the effect on LM performance across nine conversion methods and five languages through seven types of syntactic tests. On average, the performance of our best model represents a 19% increase in accuracy over the worst choice across all languages. Our best model shows an advantage over sequential/overparameterized LMs, suggesting the positive effect of syntax injection in a multilingual setting. Our experiments highlight the importance of choosing the right tree formalism, and provide insights into making an informed decision.
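To make the idea of dependency-to-constituency conversion concrete, here is a minimal sketch of one simple "flat" scheme, in which every head with dependents projects a single phrase covering itself and its dependents' subtrees. The abstract does not describe its nine conversion methods, so this is not claimed to be any of them; the function name `to_brackets` and the input encoding (1-indexed head positions, with 0 marking the root) are assumptions for illustration, and the sketch assumes a projective tree with a single root.

```python
from typing import Dict, List


def to_brackets(words: List[str], heads: List[int]) -> str:
    """Convert a dependency tree to an unlabeled bracketed constituency string.

    heads[i] is the 1-indexed position of the head of words[i]; 0 marks the root.
    Assumes a single-rooted, projective tree (an assumption of this sketch).
    """
    # Map each head position to the positions of its direct dependents.
    children: Dict[int, List[int]] = {i: [] for i in range(len(words) + 1)}
    for position, head in enumerate(heads, start=1):
        children[head].append(position)

    def build(i: int) -> str:
        if not children[i]:
            return words[i - 1]                  # a leaf stays a bare word
        parts = sorted(children[i] + [i])        # head and dependents in surface order
        inner = " ".join(words[j - 1] if j == i else build(j) for j in parts)
        return f"({inner})"

    root = children[0][0]
    return build(root)


if __name__ == "__main__":
    # "the" depends on "dog", "dog" depends on "barks", "barks" is the root.
    print(to_brackets(["the", "dog", "barks"], [2, 3, 0]))
```

Alternative schemes differ mainly in how many phrase levels a head projects and how brackets are labeled, which is the kind of design choice the abstract reports as affecting LM performance.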