Botian Shi
2026
The Agent’s First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios
Daocheng Fu | Jianbiao Mei | Rong Wu | Xuemeng Yang | Jia Xu | Ding Wang | Pinlong Cai | Yong Liu | Licheng Wen | Botian Shi
Findings of the Association for Computational Linguistics: ACL 2026
The rapid evolution of Multi-modal Large Language Models (MLLMs) has advanced workflow automation; however, existing research mainly targets performance upper bounds in static environments, overlooking robustness for stochastic real-world deployment. We identify three key challenges: dynamic task scheduling, active exploration under uncertainty, and continuous learning from experience. To bridge this gap, we introduce TraineeBench, a dynamic evaluation environment that simulates a "trainee" agent continuously exploring a novel setting. Unlike traditional benchmarks, TraineeBench evaluates agents along three dimensions: (1) context-aware scheduling for streaming tasks with varying priorities; (2) prudent information acquisition to reduce hallucination via active exploration; and (3) continuous evolution by distilling generalized strategies from rule-based, dynamically generated tasks. Experiments show that cutting-edge agents have significant deficiencies in dynamic environments, especially in active exploration and continual learning. Our work establishes a framework for assessing agent reliability, shifting evaluation from static tests to realistic, production-oriented scenarios.
Towards Self-Evolving Agents: Enabling Autonomy through Interactive Experience Refinement
Cheng Yang | Xuemeng Yang | Licheng Wen | Daocheng Fu | Jianbiao Mei | Rong Wu | Pinlong Cai | Yufan Shen | Nianchen Deng | Jia Xu | Botian Shi | Yu Qiao | Haifeng Li
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models often struggle with complex, multi-step operational tasks because they remain static during inference and cannot learn from past experience. To address this, we propose MUSE, a framework that enables iterative self-improvement through a hierarchical Memory Module. MUSE organizes cross-domain insights to facilitate the orchestration of long-horizon workflows. The core of our approach is an autonomous post-execution critique mechanism: after completing each sub-task, the system analyzes its operational logs and distills raw execution data into structured, reusable knowledge. This allows the agent to evolve dynamically rather than relying on fixed parameters. Evaluated on the rigorous TAC productivity benchmark, MUSE achieves new state-of-the-art results, significantly outperforming previous methods using only the streamlined Gemini-2.5 Flash model. Our analysis demonstrates that MUSE’s performance scales with the accumulation of insights and exhibits strong cross-task transferability, marking a key step toward autonomous systems capable of lifelong learning in professional environments. Demo videos can be found in our supplementary materials.
2025
Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback
Jiakang Yuan | Xiangchao Yan | Bo Zhang | Tao Chen | Botian Shi | Wanli Ouyang | Yu Qiao | Lei Bai | Bowen Zhou
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The scientific research paradigm is undergoing a profound transformation owing to the development of Artificial Intelligence (AI). Recent works demonstrate that various AI-assisted research methods can substantially improve research efficiency by improving data analysis, accelerating computation, and fostering novel idea generation. To move further towards the ultimate goal of automatic scientific research, we introduce Dolphin, a closed-loop LLM-driven framework to enhance the automation level of scientific research. Dolphin first generates novel ideas based on feedback from previous experiments and relevant papers ranked by the topic and task attributes. Then, the generated ideas are implemented using a code template refined and debugged with the designed exception-traceback-guided local code structure. Finally, Dolphin automatically analyzes the results of each idea and feeds the results back to the next round of idea generation. Experiments are conducted on benchmark datasets of different topics and a subset of MLE-bench. Results show that Dolphin can continuously improve the performance of the input topic in a loop. We highlight that Dolphin can automatically propose methods that are comparable to the state-of-the-art in some tasks such as 3D point classification.
2021
Hashing based Efficient Inference for Image-Text Matching
Rong-Cheng Tu | Lei Ji | Huaishao Luo | Botian Shi | Heyan Huang | Nan Duan | Xian-Ling Mao
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
2020
A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos
Frank F. Xu | Lei Ji | Botian Shi | Junyi Du | Graham Neubig | Yonatan Bisk | Nan Duan
Proceedings of the First International Workshop on Natural Language Processing Beyond Text
Instructional videos are often used to learn about procedures. Video captioning is one way of automatically collecting such knowledge. However, it provides only an indirect, overall evaluation of multimodal models with no finer-grained quantitative measure of what they have learned. We instead propose a benchmark of structured procedural knowledge extracted from cooking videos. This work is complementary to existing tasks, but requires models to produce interpretable structured knowledge in the form of verb-argument tuples. Our manually annotated open-vocabulary resource includes 356 instructional cooking videos and 15,523 video clip/sentence-level annotations. Our analysis shows that the proposed task is challenging and that standard modeling approaches like unsupervised segmentation, semantic role labeling, and visual action detection perform poorly when forced to predict every action of a procedure in a structured form.
2019
Dense Procedure Captioning in Narrated Instructional Videos
Botian Shi | Lei Ji | Yaobo Liang | Nan Duan | Peng Chen | Zhendong Niu | Ming Zhou
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by video dense captioning, we propose a model to generate procedure captions from narrated instructional videos, which consist of a sequence of step-wise clips with descriptions. Previous works on video dense captioning learn video segments and generate captions without considering transcripts. We argue that transcripts in narrated instructional videos can enhance video representation by providing fine-grained complementary and semantic textual information. In this paper, we introduce a framework to (1) extract procedures by a cross-modality module, which fuses video content with the entire transcript; and (2) generate captions by encoding video frames as well as a snippet of transcripts within each extracted procedure. Experiments show that our model can achieve state-of-the-art performance in procedure extraction and captioning, and the ablation studies demonstrate that both the video frames and the transcripts are important for the task.
Co-authors
- Nan Duan 3
- Lei Ji 3
- Pinlong Cai 2
- Daocheng Fu 2
- Jianbiao Mei 2
- Yu Qiao 2
- Licheng Wen 2
- Rong Wu 2
- Jia Xu 2
- Xuemeng Yang 2
- Lei Bai 1
- Yonatan Bisk 1
- Tao Chen 1
- Peng Chen 1
- Nianchen Deng 1
- Junyi Du 1
- He-Yan Huang (黄河燕) 1
- Haifeng Li 1
- Yaobo Liang 1
- Yong Liu 1
- Huaishao Luo 1
- Xian-Ling Mao 1
- Graham Neubig 1
- Zhendong Niu 1
- Wanli Ouyang 1
- Yufan Shen 1
- Rong-Cheng Tu 1
- Ding Wang 1
- Frank F. Xu 1
- Xiangchao Yan 1
- Cheng Yang 1
- Jiakang Yuan 1
- Bo Zhang 1
- Bowen Zhou 1
- Ming Zhou 1