This House Debates AI: Evaluating a Language Model in Oxford-Style Debates against Human Experts

Umberto Belluzzo, Kobi Hackenburg, Hannah Rose Kirk, Scott Hale, Paul Röttger


Abstract
Recent work shows that large language models (LLMs) are increasingly capable of generating persuasive arguments and messages, raising concerns over undue influence on human beliefs. Most evidence so far, however, evaluates LLM argumentation and persuasion in single-turn interactions and/or against weak human baselines. To address this gap, we benchmark a state-of-the-art LLM, Llama 3.1 Instruct 405B, in 100 six-turn Oxford-style debates against 20 experienced human debaters. Each anonymised debate is rated by 5 independent raters, who provide win/loss judgments as well as 0–100 scores across 11 dimensions of quality. Based on these ratings, the LLM is competitive overall, with a win rate of 51.2%, ranking 6th out of 21 debaters on mean performance score. Compared to humans, the LLM generally scores higher on presentational dimensions (e.g., clarity, confidence, formality) but comparably on most substantive dimensions (e.g., convincingness, evidence, originality). We also find that raters' pre- to post-debate stance tends to shift towards the position they judged to have won, regardless of whether that side was argued by the LLM or a human. Overall, our results provide new evidence on the qualities of LLM argumentation and its drivers, suggesting strong argumentative competence even in competitive multi-turn settings.
Anthology ID:
2026.lrec-main.215
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resources Association
Pages:
2742–2759
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.215/
Cite (ACL):
Umberto Belluzzo, Kobi Hackenburg, Hannah Rose Kirk, Scott Hale, and Paul Röttger. 2026. This House Debates AI: Evaluating a Language Model in Oxford-Style Debates against Human Experts. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 2742–2759, Palma de Mallorca, Spain.
Cite (Informal):
This House Debates AI: Evaluating a Language Model in Oxford-Style Debates against Human Experts (Belluzzo et al., LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.215.pdf