Argument-Based Comparative Question Answering Evaluation Benchmark

Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, Chris Biemann


Abstract
Despite the ability of large language models (LLMs) to generate coherent comparative answers, automatic comparative question answering (CQA) remains challenging due to the absence of standardized evaluation criteria and the high resource demands of manual assessment. To address these problems, this paper proposes a comprehensive evaluation framework designed to assess the quality of CQA summaries using LLMs-as-a-Judge. We formulate 15 evaluation criteria for assessing comparative answers generated by various sources, including LLMs, human experts, and prior work. To capture a diverse range of comparative answers, LLM summaries were generated under various prompting scenarios. We evaluate the effectiveness of our framework using both human assessment and LLMs, demonstrating the consistency between automated and manual evaluations. Finally, we fine-tune Llama-3-8B-Instruct on a dataset generated from the best-performing CQA models in our evaluation.
Anthology ID:
2026.argmining-1.6
Volume:
Proceedings of the 13th Workshop on Argument Mining and Reasoning
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Mohamed Elaraby, Annette Hautli-Janisz, Julia Romberg, Elena Musi, Federico Ruggeri, John Lawrence
Venues:
ArgMining | WS
Publisher:
Association for Computational Linguistics
Pages:
43–51
URL:
https://preview.aclanthology.org/ingest-acl-workshops/2026.argmining-1.6/
Cite (ACL):
Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, and Chris Biemann. 2026. Argument-Based Comparative Question Answering Evaluation Benchmark. In Proceedings of the 13th Workshop on Argument Mining and Reasoning, pages 43–51, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Argument-Based Comparative Question Answering Evaluation Benchmark (Nikishina et al., ArgMining 2026)
PDF:
https://preview.aclanthology.org/ingest-acl-workshops/2026.argmining-1.6.pdf