TVWorld: Foundations for Remote-Control TV Agents

Zhantao Ma, Quanfeng Lu, Shuai Zhong, Dahai Yu, Ping Luo, Michael Ng


Abstract
Recent large vision–language models (LVLMs) have demonstrated strong potential for device control. However, existing research has primarily focused on point-and-click (PnC) interaction, while remote-control (RC) interaction commonly encountered in everyday TV usage remains largely underexplored. To fill this gap, we introduce TVWorld, an offline graph-based abstraction of real-world TV navigation that enables reproducible and deployment-free evaluation. On this basis, we derive two complementary benchmarks that comprehensively assess TV-use capabilities: TVWorld-N for topology-aware navigation and TVWorld-G for focus-aware grounding. These benchmarks expose a key limitation of existing agents: insufficient topology awareness for focus-based, long-horizon TV navigation. Motivated by this finding, we propose a Topology-Aware Training framework that injects topology awareness into LVLMs. Using this framework, we develop TVTheseus, a foundation model specialized for TV navigation. TVTheseus achieves a success rate of 68.3 on TVWorld-N, surpassing strong closed-source baselines such as Gemini 3 Flash and establishing state-of-the-art (SOTA) performance. Additional analyses further provide valuable insights into the development of effective TV-use agents.
Anthology ID:
2026.findings-acl.1792
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
35959–35984
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1792/
Cite (ACL):
Zhantao Ma, Quanfeng Lu, Shuai Zhong, Dahai Yu, Ping Luo, and Michael Ng. 2026. TVWorld: Foundations for Remote-Control TV Agents. In Findings of the Association for Computational Linguistics: ACL 2026, pages 35959–35984, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
TVWorld: Foundations for Remote-Control TV Agents (Ma et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1792.pdf
Checklist:
 2026.findings-acl.1792.checklist.pdf