Hoang Anh Duy Le

2025

pdf bib abs
Word Salad Chopper: Reasoning Models Waste A Ton Of Decoding Budget On Useless Repetitions, Self-Knowingly
Wenya Xie | Shaochen Zhong | Hoang Anh Duy Le | Zhaozhuo Xu | Jianwen Xie | Zirui Liu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Reasoning Models (LRMs) are often bottlenecked by the high cost of output tokens. We show that a significant portion of these tokens are useless self-repetitions — what we call “word salad” — that exhaust the decoding budget without adding value. Interestingly, we observe that LRMs are self-aware when trapped in these loops: the hidden states of ‘‘ tokens trailing each reasoning chunk exhibit patterns that allow us to detect word salad behavior on-the-fly via a single linear classifier. Once detected, a simple chop appended by a straightforward regeneration prompt yields substantial length savings with minimal quality loss. Our work offers WordSaladChopper (WSC) — a lightweight, turnkey component for LRM that is minimally invasive to its reasoning trajectory. Given its low overhead, strong savings, and the lack of semantic value of word salad tokens, we believe it is not too far-fetched to argue that WSC — or a similar component — is a must-have for all LRM applications with user experience in mind.

Large Language Models (LLMs) are increasingly adopted across real-world applications, yet traditional evaluations rely on expensive, domain-specific ground-truth labels that are often unavailable or infeasible. We introduce a ground-truth-free evaluation framework focused on reasoning consistency and instruction following, shifting the emphasis from correctness—which is elusive without labels—to transparent, coherent, evidence-based reasoning. Each model response must include a direct answer, a structured multi-step explanation, and supporting evidence, all assessed via semantic similarity and output adherence checks. We further propose TopK-ReRank, which refines rankings by constructing a consensus answer from the most reliable models, reducing ambiguity across diverse reasoning styles. Experiments show that our framework outperforms existing label-free methods, including majority voting, triplet ranking, and peer-review approaches, providing a more interpretable and efficient alternative for evaluating LLMs in the absence of ground-truth labels.

Co-authors

Venues

emnlp1
findings1

Fix author