Jianshu She
2025
Token Level Routing Inference System for Edge Devices
Jianshu She | Wenhao Zheng | Zhengzhong Liu | Hongyi Wang | Eric P. Xing | Huaxiu Yao | Qirong Ho
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
The computational complexity of large language model (LLM) inference significantly constrains deployment efficiency on edge devices. In contrast, small language models offer faster decoding and lower resource consumption, but often suffer from degraded response quality and heightened susceptibility to hallucination. To address this trade-off, collaborative decoding, in which a large model assists in generating critical tokens, has emerged as a promising solution. This paradigm leverages the strengths of both model types by enabling high-quality inference through selective intervention by the large model, while preserving the speed and efficiency of the smaller one. In this work, we present a novel collaborative decoding inference system that allows small models to perform on-device inference while selectively consulting a cloud-based large model for critical token generation. Remarkably, the system achieves a 60% performance gain on CommonsenseQA using only a 0.5B model on an M1 MacBook, with fewer than 7% of generated tokens uploaded to the large model in the cloud.
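A minimal sketch of the general idea, confidence-gated token-level routing: the small model proposes each token, and only low-confidence positions are escalated to the large model. The model names, the max-softmax confidence rule, and the threshold `tau` are assumptions for illustration, not the paper's exact implementation (both Qwen2.5 models are chosen here because they share a vocabulary, which token-level routing requires).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SMALL = "Qwen/Qwen2.5-0.5B-Instruct"   # on-device small model (assumed choice)
LARGE = "Qwen/Qwen2.5-7B-Instruct"     # stand-in for the cloud large model (assumed choice)

tok = AutoTokenizer.from_pretrained(SMALL)
small = AutoModelForCausalLM.from_pretrained(SMALL)
large = AutoModelForCausalLM.from_pretrained(LARGE)

@torch.no_grad()
def routed_decode(prompt: str, max_new_tokens: int = 64, tau: float = 0.7) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    uploaded = 0
    for _ in range(max_new_tokens):
        logits = small(ids).logits[0, -1]       # small model proposes the next token
        probs = torch.softmax(logits, dim=-1)
        conf, tok_id = probs.max(dim=-1)
        if conf.item() < tau:                   # low confidence: consult the large model
            tok_id = large(ids).logits[0, -1].argmax(dim=-1)
            uploaded += 1
        ids = torch.cat([ids, tok_id.view(1, 1)], dim=-1)
        if tok_id.item() == tok.eos_token_id:
            break
    print(f"tokens routed to large model: {uploaded}/{ids.shape[-1]}")
    return tok.decode(ids[0], skip_special_tokens=True)
```

Raising `tau` escalates more tokens (higher quality, more upload traffic); lowering it keeps more generation on-device, which is the trade-off the reported sub-7% upload rate sits on.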
Linear Steerability in Language Models: When It Emerges and How It Evolves
Jianshu She | Xinyue Li | Eric P. Xing | Zhengzhong Liu | Qirong Ho
Findings of the Association for Computational Linguistics: EMNLP 2025
Language models can be steered by modifying their internal representations to control concepts such as emotion, style, or truthfulness in generation. However, the conditions for an effective intervention remain unclear and are often validated through heuristics and trial-and-error. To fill this gap, we demonstrate that intervention efficacy, measured by linear steerability (i.e., the ability to adjust output via linear transformations of hidden states), emerges during intermediate stages of training. Moreover, even closely related concepts (e.g., anger and sadness) exhibit steerability emergence at distinct stages of training. To better interpret the dynamics of steerability during training, we adapt existing intervention techniques into a unified framework, referred to as the "Intervention Detector" (ID), which is designed to reveal how linear steerability evolves over the course of training through hidden-state and representation analysis. ID reveals that concepts become increasingly linearly separable in the hidden space as training progresses, which strongly correlates with the emergence of linear steerability. We further introduce ID-based metrics, such as heatmaps, entropy trends, and cosine similarity, to help interpret how linear steerability evolves throughout training. In addition, we apply ID across different model families to ensure the generality of our findings on steerability dynamics.
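A minimal sketch of the linear-steering technique the abstract builds on: a concept direction is estimated as a difference of mean hidden states and added back into the residual stream at generation time. The model, layer index, prompt sets, and scale are assumptions for illustration; the paper's ID framework analyzes when such interventions become effective, it is not reproduced here.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM with accessible hidden states (assumed choice)
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

@torch.no_grad()
def mean_hidden(texts, layer: int) -> torch.Tensor:
    """Average last-token hidden state at `layer` over a set of prompts."""
    hs = []
    for t in texts:
        out = model(**tok(t, return_tensors="pt"), output_hidden_states=True)
        hs.append(out.hidden_states[layer][0, -1])
    return torch.stack(hs).mean(dim=0)

layer = 6
# Steering direction = concept mean minus neutral mean (difference-of-means;
# the prompt sets below are toy examples for the "anger" concept).
direction = mean_hidden(["I am furious.", "This makes me so angry."], layer) \
          - mean_hidden(["I am calm.", "This is fine."], layer)
direction = direction / direction.norm()

def hook(module, inputs, output, scale=8.0):
    # Linear intervention: shift the block's hidden states along `direction`.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer].register_forward_hook(hook)
ids = tok("The weather today is", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```

Whether this shift actually changes the output in the intended way is exactly what the paper's notion of linear steerability measures, and the finding is that it depends strongly on the training stage of the checkpoint being intervened on.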