He Chen


2023

pdf
ModelScope-Agent: Building Your Customizable Agent System with Open-source Large Language Models
Chenliang Li | He Chen | Ming Yan | Weizhou Shen | Haiyang Xu | Zhikai Wu | Zhicheng Zhang | Wenmeng Zhou | Yingda Chen | Chen Cheng | Hongzhu Shi | Ji Zhang | Fei Huang | Jingren Zhou
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Large language models (LLMs) have recently demonstrated remarkable capabilities to comprehend human intentions, engage in reasoning, and design planning-like behavior. To further unleash the power of LLMs to accomplish complex tasks, there is a growing trend to build agent frameworks that equips LLMs, such as ChatGPT, with tool-use abilities to connect with massive external APIs. In this work, we introduce ModelScope-Agent, a general and customizable agent framework for real-world applications, based on open-source LLMs as controllers. It provides a user-friendly system library, with a customizable engine design to support model training on multiple open-source LLMs, while also enabling seamless integration with both model APIs and common APIs in a unified way. To equip the LLMs with tool-use abilities, a comprehensive framework has been proposed spanning tool-use data collection, tool retrieval, tool registration, memory control, customized model training, and evaluation for practical real-world applications. Finally, we showcase ModelScopeGPT, a real-world intelligent assistant of ModelScope Community based on the ModelScope-Agent framework, which is able to connect open-source LLMs with more than 1000 public AI models and localized community knowledge in ModelScope. The ModelScope-Agent online demo, library are now publicly available.

2022

pdf
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li | Haiyang Xu | Junfeng Tian | Wei Wang | Ming Yan | Bin Bi | Jiabo Ye | He Chen | Guohai Xu | Zheng Cao | Ji Zhang | Songfang Huang | Fei Huang | Jingren Zhou | Luo Si
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Large-scale pre-trained foundation models have been an emerging paradigm for building artificial intelligence (AI) systems, which can be quickly adapted to a wide range of downstream tasks. This paper presents mPLUG, a new vision-language foundation model for both cross-modal understanding and generation. Most existing pre-trained models suffer from inefficiency and linguistic signal overwhelmed by long visual sequences in cross-modal alignment. To address both problems, mPLUG introduces an effective and efficient vision-language architecture with novel cross-modal skip-connections.mPLUG is pre-trained end-to-end on large-scale image-text pairs with both discriminative and generative objectives. It achieves state-of-the-art results on a wide range of vision-language downstream tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability on vision-language and video-language tasks. The code and pre-trained models are available at https://github.com/alibaba/AliceMind