Can LLMs Really Judge? A Progressive Argumentation-Mining Framework for Distinguishing Understanding from Aggregation

Fuyu Wang, Jiangtong Li, Kun Zhu, Changjun Jiang


Abstract
Current evaluations of large language models (LLMs) mainly rely on dataset-based generation accuracy. However, generative correctness does not guarantee the discriminative capability required to verify solutions, frequently masking an inability to distinguish valid reasoning from plausible errors. While multi-agent debate inherently entails judgment, we show that uncontrolled context growth and convergence to majority voting introduce significant noise, obscuring intrinsic model judgment. To address these limitations, we propose a progressive argumentation-mining diagnostic framework designed to explicitly control context and isolate discriminative behaviors. Instead of indiscriminate aggregation, our approach distills and retains only the single most well-supported rationale per answer, preventing context dilution while enforcing strict quality-based selection. Applying this framework reveals a fundamental cognitive divergence: models exhibit structural susceptibility to plausible misinformation in knowledge tasks, whereas in reasoning tasks they demonstrate latent discriminative potential that remains fragile under pressure. These findings underscore the fragility of discriminative capabilities, advocating for diagnostic methodologies that prioritize judgment stability over simple generation performance.
Anthology ID:
2026.findings-acl.1473
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
29463–29482
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1473/
Cite (ACL):
Fuyu Wang, Jiangtong Li, Kun Zhu, and Changjun Jiang. 2026. Can LLMs Really Judge? A Progressive Argumentation-Mining Framework for Distinguishing Understanding from Aggregation. In Findings of the Association for Computational Linguistics: ACL 2026, pages 29463–29482, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Can LLMs Really Judge? A Progressive Argumentation-Mining Framework for Distinguishing Understanding from Aggregation (Wang et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1473.pdf
Checklist:
2026.findings-acl.1473.checklist.pdf