Wenhao Zheng
2025
Token Level Routing Inference System for Edge Devices
Jianshu She | Wenhao Zheng | Zhengzhong Liu | Hongyi Wang | Eric P. Xing | Huaxiu Yao | Qirong Ho
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
The computational complexity of large language model (LLM) inference significantly constrains deployment efficiency on edge devices. In contrast, small language models offer faster decoding and lower resource consumption but often suffer from degraded response quality and heightened susceptibility to hallucinations. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types by enabling high-quality inference through selective intervention of the large model, while maintaining the speed and efficiency of the smaller model. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with fewer than 7% of generated tokens uploaded to the large model in the cloud.
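The abstract describes token-level routing only at a high level; the sketch below illustrates the general idea in Python. The confidence threshold, the Qwen2.5 checkpoints (with the larger one standing in locally for the cloud-hosted model), and the greedy routing loop are illustrative assumptions, not the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoints (assumptions, not the paper's exact setup); the
# "large" model stands in locally for what the system places in the cloud.
# Both share the Qwen2.5 tokenizer/vocabulary, so one tokenizer suffices.
SMALL_ID, LARGE_ID = "Qwen/Qwen2.5-0.5B-Instruct", "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(SMALL_ID)
small = AutoModelForCausalLM.from_pretrained(SMALL_ID)
large = AutoModelForCausalLM.from_pretrained(LARGE_ID)  # network call in the real system

@torch.no_grad()
def route_decode(prompt: str, max_new_tokens: int = 64, tau: float = 0.5) -> str:
    """Greedy decoding on the small model; a token whose top probability falls
    below the confidence threshold `tau` is treated as critical and is
    re-generated by the large model instead."""
    ids = tok(prompt, return_tensors="pt").input_ids
    routed = 0
    for _ in range(max_new_tokens):
        probs = torch.softmax(small(ids).logits[0, -1], dim=-1)
        conf, next_id = probs.max(dim=-1)
        if conf.item() < tau:  # low confidence -> consult the large model
            next_id = large(ids).logits[0, -1].argmax(dim=-1)
            routed += 1
        if next_id.item() == tok.eos_token_id:
            break
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    print(f"tokens routed to large model: {routed}")
    return tok.decode(ids[0], skip_special_tokens=True)

print(route_decode("Question: Where would you find a jellyfish? Answer:"))
```

With a threshold of this kind, most tokens are produced on-device and only the low-confidence ones are delegated, which is the mechanism behind the small fraction of uploaded tokens reported above.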
Verifiable Format Control for Large Language Model Generations
Zhaoyang Wang | Jinqi Jiang | Huichi Zhou | Wenhao Zheng | Xuchao Zhang | Chetan Bansal | Huaxiu Yao
Findings of the Association for Computational Linguistics: NAACL 2025
Recent large language models (LLMs) have demonstrated satisfactory general instruction-following ability. However, small LLMs with about 7B parameters still struggle with fine-grained format following (e.g., producing valid JSON), which seriously hinders the advancement of their applications. Most existing methods focus on benchmarking general instruction following while overlooking how to improve the format-following ability of small LLMs. Moreover, these methods often rely on evaluations by advanced LLMs (e.g., GPT-4), which can introduce the intrinsic biases of LLMs and be costly due to API calls. In this paper, we first curate a fully verifiable format-following dataset, VFF. In contrast to existing works that often adopt external LLMs for instruction-following validation, every sample in VFF can be easily validated with a Python function. Further, we propose to leverage this verifiability to synthesize massive data for progressively training small LLMs, in order to improve their format-following ability. Experimental results highlight the prevalent limitations in the format-following capabilities of 7B-level open-source LLMs and demonstrate the effectiveness of our method in enhancing this essential ability.
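Since the abstract emphasizes that every VFF sample can be checked by a Python function rather than an LLM judge, a small validator of that kind is sketched below. The specific schema (a JSON object with `name` and `age` keys) is an assumed example for illustration, not taken from the dataset.

```python
import json

def validate_json_with_keys(response: str, required_keys=("name", "age")) -> bool:
    """Return True iff the model output is valid JSON, is an object containing
    the required keys, and has an integer `age` field."""
    try:
        obj = json.loads(response)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    if not all(key in obj for key in required_keys):
        return False
    return isinstance(obj.get("age"), int)

# Usage: score a generation deterministically, with no external judge LLM.
print(validate_json_with_keys('{"name": "Ada", "age": 36}'))  # True
print(validate_json_with_keys('Name: Ada, Age: 36'))          # False
```

Checks of this form make format compliance cheap to verify at scale, which is what allows the synthesized training data to be filtered automatically.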