Takato Yamazaki
In human-human conversation, interpersonal consideration for the interlocutor is essential, and similar expectations are increasingly placed on dialogue systems. This study examines the behavior of dialogue systems in a specific interpersonal scenario where a user vents frustrations and seeks emotional support from a long-time friend represented by a dialogue system. We conducted a human evaluation and qualitative analysis of 15 dialogue systems under this setting. These systems implemented diverse strategies, such as structuring dialogue into distinct phases, modeling interpersonal relationships, and incorporating cognitive behavioral therapy techniques. Our analysis reveals that these approaches contributed to improved perceived empathy, coherence, and appropriateness, highlighting the importance of design choices in socially sensitive dialogue.
Challenges in multimodal task-oriented dialogue between humans and systems, particularly those involving audio and visual interactions, have not been sufficiently explored or shared, forcing researchers to define improvement directions individually without a clearly shared roadmap. To address these challenges, we organized a competition for multimodal task-oriented dialogue systems and constructed a large competition-based dataset of 1,865 minutes of Japanese task-oriented dialogues. This dataset includes audio and visual interactions between diverse systems and human participants. After analyzing system behaviors identified as problematic by the human participants in questionnaire surveys and notable methods employed by the participating teams, we identified key challenges in multimodal task-oriented dialogue systems and discussed potential directions for overcoming these challenges.
When individuals engage in spoken discourse, various phenomena can be observed that differ from those in text-based conversation. While written communication commonly uses a question mark to denote a query, in spoken discourse queries are frequently indicated by a rising intonation at the end of a sentence. However, many speech recognition engines do not append a question mark to recognized queries, which presents a challenge when building a spoken dialogue system. Specifically, the absence of a question mark at the end of a sentence can impede the generation of appropriate responses to queries in spoken dialogue systems. Hence, we investigate the impact of question marks on dialogue systems, and the results show that their impact is significant. Moreover, we analyze specific examples to determine which types of utterances are most affected.
With the ambition to create avatars capable of human-level casual conversation, we developed an open-domain avatar chatbot, situated in a virtual reality environment, that employs a large language model (LLM). Introducing the LLM posed several challenges for multimodal integration, such as developing techniques to align diverse outputs and avatar control, as well as addressing the issue of slow generation speed. To address these challenges, we integrated various external modules into our system. Our system is based on the award-winning model from the Dialogue System Live Competition 5. Through this work, we hope to stimulate discussions within the research community about the potential and challenges of multimodal dialogue systems enhanced with LLMs.
Recent models achieve promising results in visually grounded dialogues. However, existing datasets often contain undesirable biases and lack sophisticated linguistic analyses, which makes it difficult to understand how well current models recognize precise linguistic structures. To address this problem, we make two design choices: first, we focus on the OneCommon Corpus (CITATION), a simple yet challenging common grounding dataset that contains minimal bias by design. Second, we analyze its linguistic structures based on spatial expressions and provide comprehensive, reliable annotation for 600 dialogues. We show that our annotation captures important linguistic structures, including predicate-argument structure, modification, and ellipsis. In our experiments, we assess the models’ understanding of these structures through reference resolution. We demonstrate that our annotation can reveal both the strengths and weaknesses of baseline models at essential levels of detail. Overall, we propose a novel framework and resource for investigating fine-grained language understanding in visually grounded dialogues.