Trojsten Benchmark: Evaluating LLM Problem-Solving in Slovak STEM Competition Problems

Adam Zahradník; Marek Šuppa

Abstract

Large language models show promising performance on reasoning tasks, yet evaluation methods for low-resource languages remain limited, particularly for complex STEM problem-solving. We introduce Trojsten Benchmark, a Slovak-language dataset of 1,108 high-school competition problems with reference solutions across mathematics, physics, and programming, together with a rubric-based LLM grading framework. Using GPT-4 to generate rubrics and grade solutions, we observe an average absolute deviation of 1.05 points from human graders on a 5-point scale. We benchmark GPT-3.5-Turbo, GPT-4, GPT-4o, and open-weight models (Llama 3, Phi-3). We quantify multistep reasoning performance by difficulty, show consistent underperformance on harder items, and demonstrate language sensitivity: accuracy drops on English translations of the Slovak statements, evidencing challenges beyond translation. Trojsten Benchmark complements English-centric math datasets (e.g., MATH, GSM8K) by targeting open-response, rubric-gradable reasoning under low-resource linguistic framing. We release code and data to enable reproducible evaluation and human-aligned auto-grading for STEM in under-served languages.
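
The agreement figure quoted in the abstract is an average absolute deviation between LLM-assigned and human-assigned grades. The following is a minimal sketch of how such a metric could be computed; it is not the authors' released code. The function name `mean_absolute_deviation` and the example grades are hypothetical, and a 1 to 5 integer scale is assumed purely for illustration.

```python
from statistics import mean


def mean_absolute_deviation(llm_scores, human_scores):
    """Average of |llm - human| over paired grades for the same solutions."""
    if len(llm_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not llm_scores:
        raise ValueError("score lists must not be empty")
    return mean([abs(l - h) for l, h in zip(llm_scores, human_scores)])


# Hypothetical grades (assumed 1-5 scale), NOT data from the paper.
llm_grades = [5, 3, 4, 1, 2]
human_grades = [4, 3, 2, 2, 2]

mad = mean_absolute_deviation(llm_grades, human_grades)
print(f"average absolute deviation: {mad:.2f}")
```

A lower value means the automatic grader tracks the human graders more closely; a value of 0 would mean identical grades on every solution.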

Anthology ID: 2025.emnlp-main.1779
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 35094–35109
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1779/
Cite (ACL): Adam Zahradník and Marek Šuppa. 2025. Trojsten Benchmark: Evaluating LLM Problem-Solving in Slovak STEM Competition Problems. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35094–35109, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Trojsten Benchmark: Evaluating LLM Problem-Solving in Slovak STEM Competition Problems (Zahradník & Šuppa, EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1779.pdf
Checklist: 2025.emnlp-main.1779.checklist.pdf