CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Shudong Liu (刘树东); Hongwei Liu; Junnan Liu; Linchen Xiao; Songyang Gao; Chengqi Lyu; Yuzhe Gu; Wenwei Zhang; Derek F. Wong (黄辉); Songyang Zhang; Kai Chen

Abstract

Answer verification is crucial not only for evaluating large language models (LLMs), by matching their unstructured outputs against standard answers, but also for providing the reward signal that guides LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, both of which demand extensive, repetitive customization of regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate, robust, and lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, and it can process various answer types, including multi-subproblem, formula, and sequence answers, while effectively identifying abnormal or invalid responses. We also introduce the VerifierBench benchmark, which comprises model outputs collected from multiple data sources and is augmented through manual analysis of meta error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate evaluation protocols and reinforcement learning research.

Anthology ID: 2025.emnlp-main.1698
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 33454–33482
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1698/
Cite (ACL): Shudong Liu, Hongwei Liu, Junnan Liu, Linchen Xiao, Songyang Gao, Chengqi Lyu, Yuzhe Gu, Wenwei Zhang, Derek F. Wong, Songyang Zhang, and Kai Chen. 2025. CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33454–33482, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward (Liu et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1698.pdf
Checklist: 2025.emnlp-main.1698.checklist.pdf