A2O: LLM-based Agentic Learning of Action-to-Object Features for Video Action Recognition

Taiki Sekii, Fumiaki Sato


Abstract
Recent action recognition methods based on vision–language pretraining and self-supervised video foundation models tend to learn spurious correlations and shortcuts by relying on action-irrelevant cues, such as backgrounds and object co-occurrences. By contrast, object-detection-based approaches can suppress spurious correlations, but the resulting loss of input information can limit accuracy. To mitigate this trade-off, we combine the two approaches to learn complementary features that compensate for each other’s shortcomings. Specifically, we leverage the commonsense knowledge of large language models (LLMs) about human actions and propose a framework in which an LLM agent integrates the two approaches within an agentic learning paradigm to design motion features tailored to the target actions. The LLM agent uses an open-vocabulary object detector to indicate the target and nontarget objects in a video to the video foundation model, so that the model attends to the objects required for recognizing the target actions. The composition of the detected objects is optimized for the target actions through in-context reinforcement learning (ICRL) using the commonsense knowledge of the LLM. Experiments on multiple public action recognition datasets and an ablation study confirm the robustness of the features learned using the proposed method and the effectiveness of ICRL.
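The abstract does not say how the ICRL loop is implemented, so the following is only a minimal sketch of the general idea under stated assumptions: a proposer sees the history of (object set, reward) pairs as context and suggests the next object composition, which is then scored. Every name here (`icrl_search`, `toy_propose`, `toy_evaluate`, `VOCAB`, `RELEVANT`) is hypothetical. The proposer is a deterministic stand-in for an LLM agent, and the reward is a toy function rather than the accuracy of a video model.

```python
from typing import Callable, FrozenSet, List, Tuple

# History of (object composition, reward) pairs that is fed back as context.
History = List[Tuple[FrozenSet[str], float]]

# Hypothetical open vocabulary and toy ground truth (assumptions, not from the paper).
VOCAB = ["person", "ball", "racket", "tree", "car"]
RELEVANT = frozenset({"person", "ball", "racket"})


def toy_propose(history: History) -> FrozenSet[str]:
    """Stand-in for the LLM agent: toggle one vocabulary item in the best set so far."""
    if not history:
        return frozenset()
    best_set, _ = max(history, key=lambda item: item[1])
    word = VOCAB[(len(history) - 1) % len(VOCAB)]
    # Symmetric difference adds the word if absent and removes it if present.
    return best_set ^ frozenset({word})


def toy_evaluate(objects: FrozenSet[str]) -> float:
    """Toy reward: relevant objects count for, irrelevant objects count against."""
    return float(len(objects & RELEVANT) - len(objects - RELEVANT))


def icrl_search(
    propose: Callable[[History], FrozenSet[str]],
    evaluate: Callable[[FrozenSet[str]], float],
    rounds: int,
) -> Tuple[FrozenSet[str], float, History]:
    """Run a propose/evaluate loop in which all past outcomes are given as context."""
    history: History = []
    for _ in range(rounds):
        candidate = propose(history)   # context in, new composition out
        reward = evaluate(candidate)   # e.g. validation accuracy in a real system
        history.append((candidate, reward))
    best_set, best_reward = max(history, key=lambda item: item[1])
    return best_set, best_reward, history
```

In a real system, `propose` would presumably prompt an LLM with the history and `evaluate` would train or validate the video model on the chosen target and nontarget objects; both are left abstract here because the abstract gives no such details.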
Anthology ID: 2026.findings-acl.1805
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 36212–36228
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1805/
Cite (ACL): Taiki Sekii and Fumiaki Sato. 2026. A2O: LLM-based Agentic Learning of Action-to-Object Features for Video Action Recognition. In Findings of the Association for Computational Linguistics: ACL 2026, pages 36212–36228, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): A2O: LLM-based Agentic Learning of Action-to-Object Features for Video Action Recognition (Sekii & Sato, Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1805.pdf
Checklist: 2026.findings-acl.1805.checklist.pdf