2025
OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction
Haonan Zhang | Run Luo | Xiong Liu | Yuchuan Wu | Ting-En Lin | Pengpeng Zeng | Qiang Qu | Feiteng Fang | Min Yang | Lianli Gao | Jingkuan Song | Fei Huang | Yongbin Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Role-Playing Agents (RPAs), which benefit from large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the roles' voice traits (e.g., voice style and emotions), which play a crucial role in interaction and make experiences in realistic scenarios far more immersive. To this end, we propose OmniCharacter, the first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality and vocal traits throughout the interaction, supporting a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves distinctive characters (20), richly contextualized multi-round dialogues (10K), and dynamic speech responses (135K). Experimental results show that our method yields better responses in terms of both content and style than existing RPAs and mainstream speech-language models, with a response latency as low as 289 ms.
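A minimal, hypothetical sketch (not the authors' released code) of how a role-playing turn that mixes language and speech responses might be organized; the CharacterProfile structure, the stubbed generate_text_reply and synthesize_speech helpers, and the voice_style field are illustrative assumptions, not parts of OmniCharacter's actual API.

from dataclasses import dataclass, field

@dataclass
class CharacterProfile:
    # Persona and vocal traits the agent should keep consistent across turns.
    name: str
    personality: str          # e.g., "stoic, dry-humored detective"
    voice_style: str          # e.g., "low pitch, slow tempo"
    history: list = field(default_factory=list)

def generate_text_reply(profile: CharacterProfile, user_utterance: str) -> str:
    # Stub for the language side: condition on persona and dialogue history.
    return f"[{profile.name} | {profile.personality}] replies to: {user_utterance}"

def synthesize_speech(text: str, voice_style: str) -> bytes:
    # Stub for the speech side: a real system would stream audio chunks so the
    # first chunk can be played back with low latency.
    return f"<audio:{voice_style}:{text}>".encode()

def role_play_turn(profile: CharacterProfile, user_utterance: str) -> tuple[str, bytes]:
    # One interaction turn: a persona-consistent text reply plus its spoken form.
    profile.history.append(("user", user_utterance))
    text = generate_text_reply(profile, user_utterance)
    audio = synthesize_speech(text, profile.voice_style)
    profile.history.append(("agent", text))
    return text, audio

if __name__ == "__main__":
    detective = CharacterProfile("Ash", "stoic, dry-humored detective", "low pitch, slow tempo")
    reply, speech = role_play_turn(detective, "Did you find anything at the scene?")
    print(reply)
    print(len(speech), "bytes of (stub) audio")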
From Observation to Understanding: Front-Door Adjustments with Uncertainty Calibration for Enhancing Egocentric Reasoning in LVLMs
Shenshen Li | Wenxin Meng | Lei Wang | Hao Yang | Chong Peng | Peng Yan | Fumin Shen | Jingkuan Song | Heng Tao Shen | Xing Xu
Findings of the Association for Computational Linguistics: ACL 2025
Recent progress in large vision-language models (LVLMs) has shown substantial potential across a broad spectrum of third-person tasks. However, adapting these LVLMs to egocentric scenarios remains challenging due to their third-person training bias. Existing methods that adapt LVLMs for first-person tasks often overlook critical agent-environment interactions, limiting their ability to perform egocentric reasoning. To address these challenges, we propose a novel zero-shot paradigm termed Front-Door Adjustments with Uncertainty Calibration (FRUIT), which enhances the egocentric reasoning abilities of LVLMs by simulating human causal reasoning. Specifically, FRUIT operates in two stages: observation and understanding. Unlike conventional prompting techniques, we formalize egocentric reasoning with a structural causal model. We then ground interaction regions and expand them into hierarchical visual cues, augmented with corresponding captions, to form the initial observations. To reduce noise in these observations, we employ uncertainty calibration to filter out unreliable information. The refined observations then serve as mediators and are incorporated into the prompt template, guiding the model to understand semantics from a first-person perspective. Extensive experiments on the EgoThink benchmark demonstrate that FRUIT consistently enhances the performance of existing LVLMs on six distinct tasks. Our code is available at https://github.com/Mrshenshen/FRUIT.
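As a rough illustration only, the following sketch mirrors the two-stage observation/understanding flow described above; the Observation structure, the 0.6 confidence threshold, and the stubbed LVLM call are assumptions made for the example and do not correspond to the released FRUIT code.

from dataclasses import dataclass

@dataclass
class Observation:
    # A grounded interaction region expanded into a visual cue plus its caption.
    region: str        # e.g., "right hand holding a mug"
    caption: str       # caption generated for that region
    confidence: float  # calibration score in [0, 1]; low values are treated as noise

def calibrate(observations: list[Observation], threshold: float = 0.6) -> list[Observation]:
    # Uncertainty calibration step: keep only observations the model is confident about.
    return [o for o in observations if o.confidence >= threshold]

def build_prompt(question: str, mediators: list[Observation]) -> str:
    # Understanding stage: refined observations enter the prompt as mediators.
    cues = "\n".join(f"- {o.region}: {o.caption}" for o in mediators)
    return (
        "You are reasoning from a first-person perspective.\n"
        f"Observed interactions:\n{cues}\n"
        f"Question: {question}\nAnswer:"
    )

def answer_with_lvlm(prompt: str) -> str:
    # Stub standing in for a call to an actual LVLM.
    return f"(model answer conditioned on a prompt of {len(prompt)} characters)"

if __name__ == "__main__":
    raw = [
        Observation("left hand on door handle", "the wearer is opening a door", 0.85),
        Observation("blurred background object", "possibly a chair", 0.30),
    ]
    mediators = calibrate(raw)
    prompt = build_prompt("What am I about to do?", mediators)
    print(answer_with_lvlm(prompt))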
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Run Luo | Haonan Zhang | Longze Chen | Ting-En Lin | Xiong Liu | Yuchuan Wu | Min Yang | Yongbin Li | Minzheng Wang | Pengpeng Zeng | Lianli Gao | Heng Tao Shen | Yunshui Li | Hamid Alinejad-Rokny | Xiaobo Xia | Jingkuan Song | Fei Huang
Findings of the Association for Computational Linguistics: ACL 2025
The development of Multimodal Large Language Models (MLLMs) has seen significant progress, driven by increasing demands across various fields (e.g., multimodal agents, embodied intelligence). While model-driven approaches aim to enhance MLLM capabilities through diverse architectures, their performance gains have become increasingly marginal. In contrast, data-driven methods, which scale up image-text instruction datasets, have proven more effective but face challenges related to limited data diversity and complexity. The absence of high-quality instruction data remains a major bottleneck in MLLM development. To address this issue, we propose MMEvol, a novel multimodal instruction data evolution framework. This framework iteratively enhances data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution, generating a more complex and diverse image-text instruction dataset that significantly improves MLLM capabilities. Starting with an initial dataset, SEED-163K, we employ MMEvol to systematically expand instruction diversity, extend visual reasoning steps to improve cognitive abilities, and extract fine-grained visual details to enhance understanding and robustness. To rigorously evaluate our approach, we conduct extensive qualitative analysis and quantitative experiments across 13 vision-language tasks. Compared to baseline models trained on the original seed dataset, our method achieves an average accuracy improvement of 3.1 percentage points. Moreover, our approach attains state-of-the-art (SOTA) performance on nine tasks while using significantly less data than existing SOTA models.
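The loop below is a simplified, hedged sketch of the iterative evolve-and-filter idea described in the abstract; the operator names, the evolve_instruction stub, and the keep-if-changed filter are illustrative assumptions rather than the authors' actual MMEvol pipeline.

import random

# The three evolution directions mentioned in the abstract.
OPERATORS = ["fine_grained_perception", "cognitive_reasoning", "interaction"]

def evolve_instruction(instruction: str, operator: str) -> str:
    # Stub for an LLM rewrite: each operator would push the instruction toward
    # finer visual detail, longer reasoning chains, or richer interaction formats.
    return f"{instruction} [evolved via {operator}]"

def is_valid(original: str, evolved: str) -> bool:
    # Minimal quality filter: discard rewrites that failed to change the sample.
    return evolved != original and len(evolved) > len(original)

def evolve_dataset(seed: list[str], rounds: int = 3) -> list[str]:
    # Iteratively grow instruction complexity and diversity over several rounds.
    data = list(seed)
    for _ in range(rounds):
        next_round = []
        for instruction in data:
            op = random.choice(OPERATORS)
            candidate = evolve_instruction(instruction, op)
            next_round.append(candidate if is_valid(instruction, candidate) else instruction)
        data = next_round
    return data

if __name__ == "__main__":
    seed = ["Describe the image.", "What is the person in red doing?"]
    for item in evolve_dataset(seed, rounds=2):
        print(item)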