Ramón Fernandez Astudillo


2026

Despite its simplicity and efficacy, the high token expenditure of self-consistency can limit its practical utility. We investigate whether early hypothesis pruning can improve the token efficiency of self-consistency for long chain-of-thought reasoning tasks, while preserving its parallelism. Concretely, we generate all solutions in parallel but periodically prune intermediate hypotheses based on two lightweight indicators: (a) the model’s confidence in each hypothesis, and (b) the lexical coverage of all current hypotheses by candidate subsets. We design a fast weighted set cover algorithm that uses the two indicators. Evaluating five LLMs on three math benchmarks, we find that our method improves token efficiency in most cases, with reductions of 10-35% in many of them.
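To make the pruning idea concrete, the following is a minimal sketch of one way the two indicators could be combined, using the textbook greedy heuristic for weighted set cover. It is not the algorithm from the paper, whose details are not given here. Several choices are assumptions made purely for illustration: lexical coverage is measured over word bigrams, a hypothesis's cost is taken to be the inverse of its confidence, and the names `ngrams`, `prune_hypotheses`, and `coverage_target` are hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Word n-grams of a partial hypothesis (assumed proxy for lexical content)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def prune_hypotheses(
    hyps: List[str],
    confidences: List[float],
    n: int = 2,
    coverage_target: float = 1.0,
) -> List[int]:
    """Greedy weighted set cover; returns the indices of hypotheses to keep.

    Assumption: cost of hypothesis i is 1 / confidences[i], so the usual
    "new elements per unit cost" score becomes gain * confidence.
    """
    covers = [ngrams(h, n) for h in hyps]
    universe = set().union(*covers)
    needed = coverage_target * len(universe)

    covered: Set[Tuple[str, ...]] = set()
    kept: List[int] = []
    remaining = set(range(len(hyps)))

    while len(covered) < needed and remaining:
        # Pick the hypothesis with the best confidence-weighted gain;
        # ties are broken in favor of the lower index.
        best = max(
            remaining,
            key=lambda i: (len(covers[i] - covered) * confidences[i], -i),
        )
        if not covers[best] - covered:
            break  # nothing new can be covered
        kept.append(best)
        covered |= covers[best]
        remaining.discard(best)

    return sorted(kept)


# Two near-duplicate hypotheses and one distinct hypothesis.
hyps = ["a b c", "a b c", "d e f"]
confidences = [0.9, 0.5, 0.4]
kept = prune_hypotheses(hyps, confidences)
```

In this toy run the lower-confidence duplicate is dropped, while the distinct hypothesis survives despite its low confidence because it contributes bigrams that no other hypothesis covers. Lowering `coverage_target` below 1.0 would prune more aggressively, at the cost of discarding some lexical diversity.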