Zhiyuan Li


2025

Octopus: On-device language model for function calling of software APIs
Wei Chen | Zhiyuan Li | Mingyuan Ma
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: Industry Track)

Large Language Models (LLMs) are pivotal for advanced text processing and generation. This study presents a framework to train a series of on-device LLMs optimized for invoking software APIs. Using a curated dataset of 30,000 API function calls from software documentation, we fine-tune LLMs with 2B, 3B, and 7B parameters to enhance their proficiency in API interactions. Our approach improves the models' understanding of API structures and syntax, leading to significantly better accuracy in API function calls. We also propose a conditional masking technique, tailored to API tasks, that enforces correct output formats and reduces errors while maintaining inference speed. The fine-tuned model, Octopus, outperforms GPT-4 in API calling tasks, showcasing advancements in automated software development and API integration. The model checkpoints are publicly available.
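
A minimal sketch of the general idea behind conditional masking during decoding, assuming a Hugging Face-style causal LM and tokenizer; the toy format tracker and allowed-token lists below are hypothetical illustrations, not the paper's implementation:

```python
# Illustrative sketch of conditional masking: tokens that would break the
# expected API-call format are masked out of the logits before sampling.
# The format states and allowed-token rules are toy stand-ins.
import torch

def allowed_token_ids(state, tokenizer):
    """Toy format tracker: return the token ids permitted in the current state."""
    if state == "expect_function_name":
        # Hypothetical rule: only tokens that can start a known API name.
        return [tokenizer.convert_tokens_to_ids(t) for t in ("get", "post", "list")]
    # Otherwise allow the full vocabulary.
    return list(range(tokenizer.vocab_size))

@torch.no_grad()
def masked_decode_step(model, tokenizer, input_ids, state):
    logits = model(input_ids).logits[:, -1, :]          # next-token logits
    mask = torch.full_like(logits, float("-inf"))       # disallow everything ...
    mask[:, allowed_token_ids(state, tokenizer)] = 0.0  # ... except allowed ids
    next_id = torch.argmax(logits + mask, dim=-1, keepdim=True)
    return torch.cat([input_ids, next_id], dim=-1)
```

Because the mask is applied directly to the logits of an ordinary decoding step, this kind of constraint adds little overhead, which is consistent with the abstract's claim that output formats can be enforced without sacrificing inference speed.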

2024

Enhancing Advanced Visual Reasoning Ability of Large Language Models
Zhiyuan Li | Dongnan Liu | Chaoyi Zhang | Heng Wang | Tengfei Xue | Weidong Cai
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning, challenging models' advanced reasoning ability. Traditional Vision-Language models (VLMs) perform well in visual perception tasks but struggle with complex reasoning scenarios. Conversely, Large Language Models (LLMs) demonstrate robust text reasoning capabilities but lack visual acuity. To bridge this gap, we propose Complex Visual Reasoning Large Language Models (CVR-LLM), capitalizing on VLMs' visual perception proficiency and LLMs' extensive reasoning capability. Unlike recent multimodal large language models (MLLMs) that require a projection layer, our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop and leverages LLMs' text knowledge for accurate predictions without extra training. We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs' contextual understanding and reasoning. Additionally, we introduce Chain-of-Comparison (CoC), a step-by-step comparison technique that contrasts various aspects of the predictions. Our CVR-LLM presents the first comprehensive study across a wide array of complex visual reasoning tasks and achieves state-of-the-art performance on all of them.
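
A minimal sketch of what an iterative self-refinement loop of this kind could look like, assuming generic `captioner` and `llm` callables; the function names, prompts, and round count are hypothetical placeholders, not the paper's components:

```python
# Illustrative sketch: iteratively refine an image description with LLM feedback,
# then answer the question with a text-only LLM. All components are placeholders.
def refine_description(image, task_context, captioner, llm, num_rounds=3):
    description = captioner(image)  # initial caption from a VLM
    for _ in range(num_rounds):
        feedback = llm(
            f"Task: {task_context}\nDescription: {description}\n"
            "What task-relevant details are missing or uncertain?"
        )
        # Re-describe the image with the feedback as guidance.
        description = captioner(image, guidance=feedback)
    return description

def answer_with_llm(image, question, captioner, llm):
    description = refine_description(image, question, captioner, llm)
    return llm(f"Image description: {description}\nQuestion: {question}\nAnswer:")
```

The point of the loop is that no projection layer or extra training is needed: the image is converted into text that the LLM can reason over directly.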