Shuofei Qiao


2025

OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use
Xueyu Hu | Tao Xiong | Biao Yi | Zishu Wei | Ruixuan Xiao | Yurun Chen | Jiasheng Ye | Meiling Tao | Xiangxin Zhou | Ziyu Zhao | Yuhuai Li | Shengze Xu | Shenzhi Wang | Xinchen Xu | Shuofei Qiao | Zhaokai Wang | Kun Kuang | Tieyong Zeng | Liang Wang | Jiwei Li | Yuchen Eleanor Jiang | Wangchunshu Zhou | Guoyin Wang | Keting Yin | Zhou Zhao | Hongxia Yang | Fan Wu | Shengyu Zhang | Fei Wu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of multi-modal large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computers, mobile phones and web browsers by operating within the environments and interfaces (e.g., Graphical User Interface (GUI) and Command Line Interface (CLI)) provided by operating systems (OS) to automate tasks have significantly advanced. This paper presents a comprehensive survey on these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components and capabilities. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation metrics and benchmarks highlights how OS Agents are assessed across diverse platforms and tasks. Finally, we discuss current challenges and identify promising directions for future research. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field.

Agentic Knowledgeable Self-awareness
Shuofei Qiao | Zhisong Qiu | Baochang Ren | Xiaobin Wang | Xiangyuan Ru | Ningyu Zhang | Xiang Chen | Yong Jiang | Pengjun Xie | Fei Huang | Huajun Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional approaches adopt a “flood irrigation” methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of self-awareness - the ability to dynamically assess situational demands and strategically employ resources during decision-making. We propose Agentic Knowledgeable Self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that applies agents with knowledgeable self-awareness like humans. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent’s self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge.

SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Runnan Fang | Xiaobin Wang | Yuan Liang | Shuofei Qiao | Jialong Wu | Zekun Xi | Ningyu Zhang | Yong Jiang | Pengjun Xie | Fei Huang | Huajun Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments.

KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents
Yuqi Zhu | Shuofei Qiao | Yixin Ou | Shumin Deng | Shiwei Lyu | Yue Shen | Lei Liang | Jinjie Gu | Huajun Chen | Ningyu Zhang
Findings of the Association for Computational Linguistics: NAACL 2025

Large Language Models (LLMs) have demonstrated great potential in complex reasoning tasks, yet they fall short when tackling more sophisticated challenges, especially when interacting with environments through generating executable actions. This inadequacy primarily stems from the lack of built-in action knowledge in language agents, which fails to effectively guide the planning trajectories during task solving and results in planning hallucination. To address this issue, we introduce KnowAgent, a novel approach designed to enhance the planning capabilities of LLMs by incorporating explicit action knowledge. Specifically, KnowAgent employs an action knowledge base and a knowledgeable self-learning strategy to constrain the action path during planning, enabling more reasonable trajectory synthesis, and thereby enhancing the planning performance of language agents. Experimental results on HotpotQA and ALFWorld based on various backbone models demonstrate that KnowAgent can achieve comparable or superior performance to existing baselines. Further analysis indicates the effectiveness of KnowAgent in mitigating planning hallucinations.

Graph-guided Cross-composition Feature Disentanglement for Compositional Zero-shot Learning
Yuxia Geng | Runkai Zhu | Jiaoyan Chen | Jintai Chen | Xiang Chen | Zhuo Chen | Shuofei Qiao | Yuxiang Wang | Xiaoliang Xu | Sheng-Jun Huang
Findings of the Association for Computational Linguistics: ACL 2025

Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL). However, due to the feature divergence of an attribute (resp. object) when combined with different objects (resp. attributes), it is challenging to learn disentangled primitive features that are general across different compositions. To this end, we propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions. More specifically, we leverage a compositional graph to define the overall primitive-sharing relationships between compositions, and build a task-specific architecture upon the recently successful large pre-trained vision-language model (VLM) CLIP, with dual cross-composition disentangling adapters (called L-Adapter and V-Adapter) inserted into CLIP’s frozen text and image encoders, respectively. Evaluation on three popular CZSL benchmarks shows that our proposed solution significantly improves the performance of CZSL, and its components have been verified by solid ablation studies. Our code and data are available at: https://github.com/zhurunkai/DCDA.

2024

AutoAct: Automatic Agent Learning from Scratch for QA via Self-Planning
Shuofei Qiao | Ningyu Zhang | Runnan Fang | Yujie Luo | Wangchunshu Zhou | Yuchen Jiang | Chengfei Lv | Huajun Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language agents have achieved considerable performance on various complex question-answering tasks by planning with external tools. Despite the incessant exploration in this field, existing language agent systems still struggle with costly, non-reproducible data reliance and face the challenge of compelling a single model for multiple functions. To this end, we introduce AutoAct, an automatic agent learning framework for QA that does not rely on large-scale annotated data and synthetic planning trajectories from closed-source models (e.g., GPT-4). Given limited data with a tool library, AutoAct first automatically synthesizes planning trajectories without any assistance from humans or strong closed-source models. Then, AutoAct leverages a division-of-labor strategy to automatically differentiate based on the target task information and synthesized trajectories, producing a sub-agent group to complete the task. We conduct comprehensive experiments with different LLMs, which demonstrate that AutoAct yields better or parallel performance compared to various strong baselines. Further analysis demonstrates the effectiveness of the division-of-labor strategy, with the trajectory quality generated by AutoAct generally outperforming that of others.

EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models
Yixin Ou | Ningyu Zhang | Honghao Gui | Ziwen Xu | Shuofei Qiao | Runnan Fang | Lei Li | Zhen Bi | Guozhou Zheng | Huajun Chen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality. Nevertheless, due to inconsistencies that persist among various instruction processing methods, there is no standard open-source instruction processing implementation framework available for the community, which hinders practitioners from further developing and advancing. To facilitate instruction processing research and development, we present EasyInstruct, an easy-to-use instruction processing framework for LLMs, which modularizes instruction generation, selection, and prompting, while also considering their combination and interaction. EasyInstruct is publicly released and actively maintained on GitHub, along with an online demo app and a demo video for quick-start, calling for broader research centered on instruction data and synthetic data.

Knowledge Mechanisms in Large Language Models: A Survey and Perspective
Mengru Wang | Yunzhi Yao | Ziwen Xu | Shuofei Qiao | Shumin Deng | Peng Wang | Xiang Chen | Jia-Chen Gu | Yong Jiang | Pengjun Xie | Fei Huang | Huajun Chen | Ningyu Zhang
Findings of the Association for Computational Linguistics: EMNLP 2024

Understanding knowledge mechanisms in Large Language Models (LLMs) is crucial for advancing towards trustworthy AGI. This paper reviews knowledge mechanism analysis from a novel taxonomy including knowledge utilization and evolution. Knowledge utilization delves into the mechanism of memorization, comprehension and application, and creation. Knowledge evolution focuses on the dynamic progression of knowledge within individual and group LLMs. Moreover, we discuss what knowledge LLMs have learned, the reasons for the fragility of parametric knowledge, and the potential dark knowledge (hypothesis) that will be challenging to address. We hope this work can help understand knowledge in LLMs and provide insights for future research.

Making Language Models Better Tool Learners with Execution Feedback
Shuofei Qiao | Honghao Gui | Chengfei Lv | Qianghuai Jia | Huajun Chen | Ningyu Zhang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Tools serve as pivotal interfaces that enable humans to understand and reshape the environment. With the advent of foundation models, AI systems can utilize tools to expand their capabilities and interact with the real world. Existing tool learning methodologies, encompassing supervised fine-tuning and prompt engineering approaches, often induce large language models to utilize tools indiscriminately, as complex tasks often exceed their own competencies. However, introducing tools for simple tasks, which the models themselves can readily resolve, can inadvertently propagate errors rather than enhance performance. This leads to the research question: can we teach language models when and how to use tools? To meet this need, we propose Tool leaRning wIth exeCution fEedback (TRICE), a two-stage end-to-end framework that enables the model to continually learn through feedback derived from tool execution, thereby learning when and how to use tools effectively. Experimental results, backed by further analysis, show that TRICE can make the large language model selectively use tools by improving the accuracy of tool usage while enhancing insufficient tool learning and mitigating excessive reliance on tools.

2023

Reasoning with Language Model Prompting: A Survey
Shuofei Qiao | Yixin Ou | Ningyu Zhang | Xiang Chen | Yunzhi Yao | Shumin Deng | Chuanqi Tan | Fei Huang | Huajun Chen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Reasoning, as an essential ability for complex problem-solving, can provide back-end support for various real-world applications, such as medical diagnosis and negotiation. This paper provides a comprehensive survey of cutting-edge research on reasoning with language model prompting. We introduce research works with comparisons and summaries and provide systematic resources to help beginners. We also discuss the potential reasons for the emergence of such reasoning abilities and highlight future research directions. Resources are available at https://github.com/zjunlp/Prompt4ReasoningPapers (updated periodically).

2022

DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population
Ningyu Zhang | Xin Xu | Liankuan Tao | Haiyang Yu | Hongbin Ye | Shuofei Qiao | Xin Xie | Xiang Chen | Zhoubo Li | Lei Li
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present an open-source and extensible knowledge extraction toolkit DeepKE, supporting complicated low-resource, document-level and multimodal scenarios in knowledge base population. DeepKE implements various information extraction tasks, including named entity recognition, relation extraction and attribute extraction. With a unified framework, DeepKE allows developers and researchers to customize datasets and models to extract information from unstructured data according to their requirements. Specifically, DeepKE not only provides various functional modules and model implementations for different tasks and scenarios but also organizes all components by consistent frameworks to maintain sufficient modularity and extensibility. We release the source code on GitHub at https://github.com/zjunlp/DeepKE with Google Colab tutorials and comprehensive documents for beginners. Besides, we present an online system at http://deepke.openkg.cn/EN/re_doc_show.html for real-time extraction of various tasks, and a demo video.