Zhipeng Liu
2026
QUARTZ: Quantile-Aware Routing and Queueing for TTFT SLOs in LLM Serving
Zhipeng Liu | Yifan Zheng | Fanqi Kong | Ziming Zhao
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Model (LLM) serving systems increasingly face strict time-to-first-token (TTFT) service-level objectives (SLOs), yet TTFT remains highly sensitive to router-side queueing effects. Prefill costs scale with prompt length, decode lengths are uncertain, and prefix locality creates strong performance skew across requests. Despite major advances in continuous batching and KV-cache management, today’s routers are often agnostic to request cost, which makes them vulnerable to head-of-line blocking and tail-latency amplification under mixed workloads. We propose QUARTZ, a quantile-aware routing and queueing layer for LLM serving that predicts conservative quantile-based request-cost proxies, rather than point estimates, using lightweight router-visible signals. QUARTZ uses these quantiles together with backlog-aware router signals to guide worker selection and admission decisions that better align with TTFT tail SLOs while preserving fairness. We implement QUARTZ as a router upgrade for SGLang and evaluate it on representative interactive and retrieval-augmented workloads. The results show reductions in TTFT tail latency and SLO violations across heterogeneous workloads.
2024
MDS: A Fine-Grained Dataset for Multi-Modal Dialogue Summarization
Zhipeng Liu | Xiaoming Zhang | Litian Zhang | Zelong Yu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Due to the explosion of various dialogue scenes, summarizing the dialogue into a short message has drawn much attention recently. In the multi-modal dialogue scene, people tend to use tone and body language to illustrate their intentions. While traditional dialogue summarization has predominantly focused on textual content, this approach may overlook vital visual and audio information essential for understanding multi-modal interactions. Recognizing the established field of multi-modal dialogue summarization, we develop a new multi-modal dialogue summarization dataset (MDS), which aims to enhance the variety and scope of data available for this research area. MDS provides a demanding testbed for multi-modal dialogue summarization. Subsequently, we conducted a comparative analysis of various summarization techniques on MDS and found that the existing methods tend to produce redundant and incoherent summaries. All of the models generate unfaithful facts to some degree, suggesting future research directions. MDS is available at https://github.com/R00kkie/MDS.