Richard A. Brutti

2026

More than "Oh": Grounding Observable Events with Grunts in Multimodal Dialogue
Richard A. Brutti | James Pustejovsky
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Conversational grunts (minimal vocalizations like oh, mm-hm, and uh-huh) ground information and coordinate understanding in human dialogue, yet computational systems typically treat them as noise rather than meaningful communicative acts. We present a systematic annotation and analysis of 497 grunts across 3 hours of multimodal collaborative tasks, introducing an annotation scheme that captures grunts, their antecedents, and dialogue act functions. Our analysis reveals that grunts respond to speech and observable events at nearly equal rates, demonstrating that non-verbal events function as conversational contributions requiring acknowledgment. Tokens exhibit functional specialization: mm-hm predominantly acknowledges speech, while oh preferentially acknowledges events. Prosodic analysis shows speakers systematically modulate duration and pitch based on antecedent type, with event responses typically longer and having greater range. These findings have implications for dialogue state tracking, multimodal grounding, and turn-taking in conversational AI systems.

2024

Within Dialogue Modeling research in AI and NLP, considerable attention has been paid to “dialogue state tracking” (DST), which is the ability to update the representations of the speaker’s needs at each turn in the dialogue by taking into account the past dialogue moves and history. Less studied but just as important to dialogue modeling, however, is “common ground tracking” (CGT), which identifies the shared belief space held by all of the participants in a task-oriented dialogue: the task-relevant propositions all participants accept as true. In this paper we present a method for automatically identifying the current set of shared beliefs and “questions under discussion” (QUDs) of a group with a shared goal. We annotate a dataset of multimodal interactions in a shared physical space with speech transcriptions, prosodic features, gestures, actions, and facets of collaboration, and operationalize these features for use in a deep neural model to predict moves toward construction of common ground. Model outputs cascade into a set of formal closure rules derived from situated evidence and belief axioms and update operations. We empirically assess the contribution of each feature type toward successful construction of common ground relative to ground truth, establishing a benchmark in this novel, challenging task.

Co-authors

Nikhil Krishnaswamy 1

Venues

LREC 2
COLING 1
