ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
Xiao Wang, Daniil Larionov, Siwei Wu, Yiqi Liu, Steffen Eger, Nafise Sadat Moosavi, Chenghua Lin
Abstract
Recent advances in automatic evaluation of natural language generation have increasingly relied on large language models as general-purpose metrics. While effective, these approaches often require high-capacity models, which introduce substantial computational costs, and remain susceptible to known evaluation pathologies, such as over-reliance on likelihood. We introduce ContrastScore, a contrastive evaluation paradigm that builds on the widely used BARTScore formulation by comparing token-level probabilities between a stronger and a weaker model. Instead of relying on single-model likelihoods or prompt-based judgments, ContrastScore captures disagreement between models to better reflect confidence and uncertainty in generation quality. Empirical results on summarization and machine translation benchmarks show that ContrastScore, instantiated with paired moderate-scale models across both Qwen and LLaMA families, consistently outperforms larger alternatives, such as Qwen 7B and LLaMA 8B, in correlation with human ratings. In addition to improving evaluation quality, ContrastScore significantly reduces susceptibility to likelihood bias, offering a more robust and cost-effective alternative to larger LLM-based evaluation methods.- Anthology ID:
- 2025.ijcnlp-long.163
- Volume:
- Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India
- Editors:
- Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
- Venues:
- IJCNLP | AACL
- SIG:
- Publisher:
- The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
- Note:
- Pages:
- 3045–3060
- Language:
- URL:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.163/
- DOI:
- Cite (ACL):
- Xiao Wang, Daniil Larionov, Siwei Wu, Yiqi Liu, Steffen Eger, Nafise Sadat Moosavi, and Chenghua Lin. 2025. ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 3045–3060, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
- Cite (Informal):
- ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation (Wang et al., IJCNLP-AACL 2025)
- PDF:
- https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-long.163.pdf