Exploring Capability Thresholds in Ultra-Lightweight LLM Judges for Nugget-Based Report Evaluation
Mann Bajpai, Pulkit Chatwal, Priyanshu Deswal, Harish Pratap Singh, Santosh Kumar Mishra
Abstract
Reliable automatic evaluation of retrieval-grounded long-form reports typically requires human annotation or frontier-scale proprietary LLMs, both of which are costly in resource-constrained settings. Team rgipt participated in RAG4Reports@ACL 2026 Task 1 with a zero-shot nugget-verification system that runs entirely on a single NVIDIA T4 GPU. We compare three ultra-lightweight decoder-only models (Qwen2-0.5B, Qwen2-1.5B, and Qwen2.5-0.5B) under identical inference conditions to examine how small an LLM judge can be while retaining a human-aligned ranking signal. Both Qwen2 models produced negative 𝜏gap, whereas Qwen2.5-0.5B achieved 𝜏gap = 0.0772 and Pearson r = 0.2209, ranking 13th of 21 teams. Within this family and evaluation setting, model generation appears to matter more than parameter count, although this finding is based on three configurations on a single task and warrants further validation.
- Anthology ID:
- 2026.rag4reports-1.13
- Volume:
- Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, CA, USA
- Editors:
- Eugene Yang, Dawn Lawrie, Sean MacAvaney, James Mayfield, Luca Soldaini, Andrew Yates
- Venues:
- RAG4Reports | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 94–98
- URL:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.rag4reports-1.13/
- Cite (ACL):
- Mann Bajpai, Pulkit Chatwal, Priyanshu Deswal, Harish Pratap Singh, and Santosh Kumar Mishra. 2026. Exploring Capability Thresholds in Ultra-Lightweight LLM Judges for Nugget-Based Report Evaluation. In Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026), pages 94–98, San Diego, CA, USA. Association for Computational Linguistics.
- Cite (Informal):
- Exploring Capability Thresholds in Ultra-Lightweight LLM Judges for Nugget-Based Report Evaluation (Bajpai et al., RAG4Reports 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl-workshops/2026.rag4reports-1.13.pdf