ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding

Tian Xueyun; Wei Li; Bingbing Xu; Heng Dong; Yuanzhuo Wang; Huawei Shen (沈华伟)

Abstract

Recent Omni-multimodal Large Language Models show promise in unified audio, vision, and text modeling. However, streaming audio-video understanding remains challenging, as existing approaches suffer from disjointed capabilities: they typically exhibit incomplete modality support or lack autonomous proactive monitoring. To address this, we present ROMA, **a real-time omni-multimodal assistant for unified reactive and proactive interaction**. ROMA processes continuous inputs as synchronized multimodal units, aligning dense audio with discrete video frames to handle granularity mismatches. For online decision-making, we introduce a lightweight *speak head* that decouples response initiation from generation to ensure precise triggering without task conflict. We train ROMA with a curated streaming dataset and a two-stage curriculum that progressively optimizes for streaming format adaptation and proactive responsiveness. To standardize the fragmented evaluation landscape, we reorganize diverse benchmarks into a unified suite covering both proactive (alert, narration) and reactive (QA) settings. Extensive experiments across 12 benchmarks demonstrate that ROMA achieves state-of-the-art performance on proactive tasks while remaining competitive in reactive settings, validating its robustness in unified real-time omni-multimodal understanding. Code and benchmark are available [here](https://eureka-maggie.github.io/ROMA_show/).

Anthology ID: 2026.findings-acl.1153
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 23018–23039
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1153/
Cite (ACL): Tian Xueyun, Wei Li, Bingbing Xu, Heng Dong, Yuanzhuo Wang, and Huawei Shen. 2026. ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding. In Findings of the Association for Computational Linguistics: ACL 2026, pages 23018–23039, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): ROMA: Real-time Omni-Multimodal Assistant with Interactive Streaming Understanding (Xueyun et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.findings-acl.1153.pdf
Checklist: 2026.findings-acl.1153.checklist.pdf
