Livia Qian

2026

Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?
Luca Modica | Filip Landin | Mehrdad Farahani | Livia Qian | Gabriel Skantze | Richard Johansson
Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)

In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. This raises the question of how model-internal mechanisms are similar and different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate the mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.

Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning
Livia Qian | Gabriel Skantze
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Backchannels (e.g., ‘yeah’, ‘mhm’, and ‘right’) are short, non-interruptive feedback signals whose lexical form and prosody jointly convey pragmatic meaning. While prior computational research has largely focused on predicting backchannel timing, the relationship between lexico-prosodic form and meaning remains underexplored. We propose a two-stage framework: first, fine-tuning large language models on dialogue transcripts to derive rich contextual representations; and second, learning a joint embedding space for dialogue contexts and backchannel realizations. We evaluate alignment with human perception via triadic similarity judgments (prosodic and cross-lexical) and a context–backchannel suitability task. Our results demonstrate that the learned projections substantially improve context–backchannel retrieval compared to previous methods. In addition, they reveal that backchannel form is highly sensitive to extended conversational context and that the learned embeddings align more closely with human judgments than raw WavLM features.

2023

The Future of Designing Spoken Dialogue Systems and Analyzing Written Conversations
Livia Qian
Proceedings of the 19th Annual Meeting of the Young Researchers' Roundtable on Spoken Dialogue Systems

This is my position paper for YRRSDS 2023. In it, I write about the details of my research interests as well as past, current and future projects, talk about the status of spoken dialogue system research, include a short bio, and suggest topics for discussion.

Resolving References in Visually-Grounded Dialogue via Text Generation
Bram Willemsen | Livia Qian | Gabriel Skantze
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge. Consequently, if we want to use VLMs for reference resolution in visually-grounded dialogue, the discourse processing capabilities of these models need to be augmented. To address this issue, we propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references. We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot. We evaluate our approach on a manually annotated dataset of visually-grounded dialogues and achieve results that, on average, exceed the performance of the baselines we compare against. Furthermore, we find that using referent descriptions based on larger context windows has the potential to yield higher returns.
