Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

Mehrzad Samadi; Aleksander Ficek; Sean Narenthiran; Siddhartha Jain; Wasi Uddin Ahmad; Somshubra Majumdar; Vahid Noroozi; Boris Ginsburg

Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg

Abstract

Competitive programming has become a rigorous benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). The International Olympiad in Informatics (IOI) stands out as one of the most prestigious annual competitions in competitive programming and has become a key benchmark for comparing human and AI-level programming ability. While several proprietary models have been claimed to achieve gold medal-level performance at the IOI, often with undisclosed methods, achieving comparable results with open-weight models remains a significant challenge. In this paper, we present GenCluster, a scalable and reproducible test-time compute framework that attains IOI gold-level performance using open-weight models. It combines large-scale generation, behavioral clustering, ranking, and a round-robin submission strategy to efficiently explore diverse solution spaces under limited validation budgets. Our experiments show that the performance of our proposed approach scales consistently with available compute, narrowing the gap between open and closed systems. Notably, we will show that GenCluster can achieve a gold medal at IOI 2025 for the first time with an open-weight model gpt-oss-120b, setting a new benchmark for transparent and reproducible evaluation of reasoning in LLMs.

Anthology ID:: 2026.acl-long.1532
Volume:: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:: July
Year:: 2026
Address:: San Diego, California, United States
Editors:: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:: ACL
SIG:
Publisher:: Association for Computational Linguistics
Note:
Pages:: 33156–33169
Language:
URL:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1532/
DOI:
Bibkey:
Cite (ACL):: Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad, Somshubra Majumdar, Vahid Noroozi, and Boris Ginsburg. 2026. Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 33156–33169, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):: Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models (Samadi et al., ACL 2026)
Copy Citation:
PDF:: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1532.pdf
Checklist:: 2026.acl-long.1532.checklist.pdf

PDF Cite Search Checklist Fix data