Taiki Sekii

2026

A2O: LLM-based Agentic Learning of Action-to-Object Features for Video Action Recognition
Taiki Sekii | Fumiaki Sato
Findings of the Association for Computational Linguistics: ACL 2026

Recent action recognition based on vision–language pretraining and self-supervised video foundation models tends to induce spurious correlations and shortcut learning by relying on action-irrelevant cues, such as backgrounds and object co-occurrences. By contrast, object-detection-based approaches can suppress spurious correlations; however, the loss of input information can limit accuracy. To mitigate this trade-off, we combine these two approaches to learn complementary features that compensate for each other’s shortcomings. Specifically, we leverage the commonsense knowledge of large language models (LLMs) regarding human actions and realize a framework in which an LLM agent integrates the two approaches within an agentic learning paradigm to design motion features tailored to the target actions. The LLM agent uses an open-vocabulary object detector to indicate the target and nontarget objects in a video to the video foundation model, so that the model attends to the objects required for recognizing the target actions. The composition of the detected objects is optimized for the target actions through in-context reinforcement learning (ICRL) using the commonsense knowledge of the LLM. Experiments on multiple public action recognition datasets and an ablation study confirm the robustness of features learned using the proposed method and the effectiveness of ICRL.

2025

Flashback: Memory Mechanism for Enhancing Memory Efficiency and Speed in Deep Sequential Models
Taiki Sekii
Proceedings of the 31st International Conference on Computational Linguistics

In this study, we tackle three main challenges of deep sequential processing models in previous research: (1) memory degradation, (2) inaccurate gradient backpropagation, and (3) compatibility with next-token prediction. Specifically, to address (1) and (2), we define a Flashback property in which memory is preserved perfectly as an identity mapping of its stored value in a memory region until it is overwritten by a hidden state at a different time step. We propose a Flashback mechanism that satisfies this property in a fully differentiable, end-to-end manner. Further, to tackle (3), we propose architectures that incorporate the Flashback mechanism into Transformers and Mamba, enabling next-token prediction for language modeling tasks. In experiments, we trained on The Pile dataset, which includes diverse texts, to evaluate trade-offs between commonsense reasoning accuracy, processing speed, and memory usage after introducing the Flashback mechanism into existing methods. The evaluations confirmed the effectiveness of the Flashback mechanism.

2024

Text2Traj2Text: Learning-by-Synthesis Framework for Contextual Captioning of Human Movement Trajectories
Hikaru Asano | Ryo Yonetani | Taiki Sekii | Hiroki Ouchi
Proceedings of the 17th International Natural Language Generation Conference

This paper presents Text2Traj2Text, a novel learning-by-synthesis framework for captioning possible contexts behind shoppers’ trajectory data in retail stores. Our work will impact various retail applications that need better customer understanding, such as targeted advertising and inventory management. The key idea is leveraging large language models to synthesize a diverse and realistic collection of contextual captions as well as the corresponding movement trajectories on a store map. Despite being trained on fully synthesized data, the captioning model can generalize well to trajectories/captions created by real human subjects. Our systematic evaluation confirmed the effectiveness of the proposed framework over competitive approaches in terms of ROUGE and BERTScore metrics.
