2025
MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
Wei Tao | Haocheng Lu | Xiaoyang Qu | Bin Zhang | Kai Lu | Jiguang Wan | Jianzong Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
One of the primary challenges in optimizing large language models (LLMs) for long-context inference lies in the high memory consumption of the Key-Value (KV) cache. Existing approaches, such as quantization, have demonstrated promising results in reducing memory usage. However, current quantization methods struggle to balance effectiveness and efficiency. In this paper, we propose MoQAE, a novel mixed-precision quantization method via a mixture of quantization-aware experts. First, we view different quantization bit-width configurations as experts and use the traditional mixture-of-experts (MoE) method to select the optimal configuration. To avoid the inefficiency of feeding tokens one by one into the router, as in the traditional MoE method, we feed tokens into the router chunk by chunk. Second, we design a lightweight router-only fine-tuning process that trains MoQAE with a comprehensive loss to learn the trade-off between model accuracy and memory usage. Finally, we introduce a routing freezing (RF) and a routing sharing (RS) mechanism to further reduce the inference overhead. Extensive experiments on multiple benchmark datasets demonstrate that our method outperforms state-of-the-art KV cache quantization approaches in both efficiency and effectiveness.
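The chunk-wise routing idea from the abstract can be sketched in a few lines. This is an illustrative toy, not the paper's architecture: the bit-width set, the mean-pooled linear router, and the uniform quantize-dequantize step are all assumptions made for the sketch.

```python
import random

BIT_WIDTHS = [2, 4, 8]  # hypothetical candidate bit-width "experts"

def router_scores(chunk, weights):
    """Score each bit-width expert from a mean-pooled chunk (toy linear router)."""
    dim = len(chunk[0])
    pooled = [sum(tok[d] for tok in chunk) / len(chunk) for d in range(dim)]
    return [sum(p * w for p, w in zip(pooled, col)) for col in weights]

def quantize(chunk, bits):
    """Uniform quantize-dequantize a chunk of KV values at the given bit-width."""
    flat = [v for tok in chunk for v in tok]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # avoid zero scale on constant chunks
    return [[round((v - lo) / scale) * scale + lo for v in tok] for tok in chunk]

def quantize_kv_chunkwise(kv, chunk_size, weights):
    """Route each chunk of KV tokens to one bit-width, then quantize it."""
    out = []
    for s in range(0, len(kv), chunk_size):
        chunk = kv[s:s + chunk_size]
        scores = router_scores(chunk, weights)
        bits = BIT_WIDTHS[scores.index(max(scores))]
        out.extend(quantize(chunk, bits))
    return out
```

Routing a whole chunk with one forward pass of the router, rather than one token at a time, is the efficiency point the abstract makes; everything else here (pooling, scoring, rounding) is a placeholder.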
2024
TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Industry Systems
Yilun Kong | Jingqing Ruan | YiHong Chen | Bin Zhang | Tianpeng Bao | Shi Shiwei | du Guo Qing | Xiaoru Hu | Hangyu Mao | Ziyue Li | Xingyu Zeng | Rui Zhao | Xueqian Wang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large Language Models (LLMs) have demonstrated proficiency in addressing tasks that necessitate a combination of task planning and the usage of external tools, such as weather and calculator APIs. However, real-world industrial systems present prevalent challenges in task planning and tool usage: the numerous APIs in a real system make it intricate to invoke the appropriate one, while the inherent limitations of LLMs pose challenges in orchestrating an accurate sub-task sequence and API-calling order. This paper introduces a comprehensive framework aimed at enhancing the Task Planning and Tool Usage (TPTU) abilities of LLM-based agents in industry. Our framework comprises three key components designed to address these challenges: (1) the API Retriever selects the most pertinent APIs from the extensive API set; (2) the Demo Selector retrieves task-level demonstrations, which are further used for in-context learning to aid LLMs in accurately decomposing subtasks and effectively invoking hard-to-distinguish APIs; (3) the LLM Finetuner tunes a base LLM to enhance its capability for task planning and API calling. We validate our methods using a real-world industry system and an open-sourced academic dataset, demonstrating the efficacy of each individual component as well as the integrated framework. The code is available here.
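The API Retriever component can be illustrated with a minimal similarity-based ranker. This sketch uses bag-of-words cosine similarity as a stand-in for the learned retriever; the function names and the toy API catalog are assumptions, not the paper's implementation.

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector (a stand-in for a learned embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve_apis(query, api_docs, k=2):
    """Rank API descriptions by similarity to the task query; return top-k names."""
    scored = sorted(api_docs.items(),
                    key=lambda item: cosine(bow(query), bow(item[1])),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical API catalog for illustration only.
api_docs = {
    "get_weather": "query current weather and forecast for a city",
    "calc": "evaluate an arithmetic expression calculator",
    "send_mail": "send an email message to a recipient",
}
print(retrieve_apis("what is the weather forecast in Paris", api_docs, k=1))
# → ['get_weather']
```

The retrieved names would then be passed to the LLM along with the demonstrations chosen by the Demo Selector; the retriever's job is only to narrow a large API set down to a few candidates.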
2011
Detecting Forum Authority Claims in Online Discussions
Alex Marin | Bin Zhang | Mari Ostendorf
Proceedings of the Workshop on Language in Social Media (LSM 2011)
Annotating Social Acts: Authority Claims and Alignment Moves in Wikipedia Talk Pages
Emily M. Bender | Jonathan T. Morgan | Meghan Oxley | Mark Zachry | Brian Hutchinson | Alex Marin | Bin Zhang | Mari Ostendorf
Proceedings of the Workshop on Language in Social Media (LSM 2011)
2010
Automatic Generation of Personalized Annotation Tags for Twitter Users
Wei Wu | Bin Zhang | Mari Ostendorf
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Extracting Phrase Patterns with Minimum Redundancy for Unsupervised Speaker Role Classification
Bin Zhang | Brian Hutchinson | Wei Wu | Mari Ostendorf
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics