Jiasheng Ye


2025

pdf bib
OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use
Xueyu Hu | Tao Xiong | Biao Yi | Zishu Wei | Ruixuan Xiao | Yurun Chen | Jiasheng Ye | Meiling Tao | Xiangxin Zhou | Ziyu Zhao | Yuhuai Li | Shengze Xu | Shenzhi Wang | Xinchen Xu | Shuofei Qiao | Zhaokai Wang | Kun Kuang | Tieyong Zeng | Liang Wang | Jiwei Li | Yuchen Eleanor Jiang | Wangchunshu Zhou | Guoyin Wang | Keting Yin | Zhou Zhao | Hongxia Yang | Fan Wu | Shengyu Zhang | Fei Wu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of multi-modal large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computers, mobile phones and web browsers by operating within the environments and interfaces (e.g., Graphical User Interface (GUI) and Command Line Interface (CLI)) provided by operating systems (OS) to automate tasks have significantly advanced. This paper presents a comprehensive survey on these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components and capabilities. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation metrics and benchmarks highlights how OS Agents are assessed across diverse platforms and tasks. Finally, we discuss current challenges and identify promising directions for future research. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field.

2024

pdf bib
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling
Jun Zhan | Junqi Dai | Jiasheng Ye | Yunhua Zhou | Dong Zhang | Zhigeng Liu | Xin Zhang | Ruibin Yuan | Ge Zhang | Linyang Li | Hang Yan | Jie Fu | Tao Gui | Tianxiang Sun | Yu-Gang Jiang | Xipeng Qiu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages.We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs.Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in https://junzhan2000.github.io/AnyGPT.github.io/.

pdf bib
LLM can Achieve Self-Regulation via Hyperparameter Aware Generation
Siyin Wang | Shimin Li | Tianxiang Sun | Jinlan Fu | Qinyuan Cheng | Jiasheng Ye | Junjie Ye | Xipeng Qiu | Xuanjing Huang
Findings of the Association for Computational Linguistics: ACL 2024

In the realm of Large Language Models (LLMs), users commonly employ diverse decoding strategies and adjust hyperparameters to control the generated text. However, a critical question emerges: Are LLMs conscious of the existence of these decoding strategies and capable of regulating themselves? The current decoding generation process often relies on empirical and heuristic manual adjustments to hyperparameters based on types of tasks and demands. However, this process is typically cumbersome, and the decoding hyperparameters may not always be optimal for each sample. To address the aforementioned challenges, we propose a novel text generation paradigm termed Hyperparameter Aware Generation (HAG). By leveraging hyperparameter-aware instruction tuning, the LLM autonomously determines the optimal decoding strategy and configs based on the input samples, enabling self-regulation. Our approach eliminates the need for extensive manual tuning, offering a more autonomous, self-regulate model behavior. Experimental results spanning six datasets across reasoning, creativity, translation, and mathematics tasks demonstrate that hyperparameter-aware instruction tuning empowers the LLMs to self-regulate the decoding strategy and hyperparameter. HAG extends the current paradigm in the text generation process, highlighting the feasibility of endowing the LLMs with self-regulate decoding strategies.

2021

pdf bib
Energy-based Unknown Intent Detection with Data Manipulation
Yawen Ouyang | Jiasheng Ye | Yu Chen | Xinyu Dai | Shujian Huang | Jiajun Chen
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021