Trojsten Benchmark: Evaluating LLM Problem-Solving in Slovak STEM Competition Problems

Adam Zahradník, Marek Suppa


Abstract
Large language models show promising performance on reasoning tasks, yet evaluation methods for low-resource languages remain limited, particularly for complex STEM problem-solving. We introduce Trojsten Benchmark, a Slovak-language dataset of 1,108 high-school competition problems with reference solutions across mathematics, physics, and programming, together with a rubric-based LLM grading framework. Using GPT-4 to generate rubrics and grade solutions, we observe an average absolute deviation of 1.05 points from human graders on a 5-point scale. We benchmark GPT-3.5-Turbo, GPT-4, GPT-4o, and open-weight models (Llama 3, Phi-3). We quantify multistep reasoning performance by difficulty, show consistent underperformance on harder items, and demonstrate language sensitivity: accuracy drops on English translations of the Slovak statements, evidencing challenges beyond translation. Trojsten Benchmark complements English-centric math datasets (e.g., MATH, GSM8K) by targeting open-response, rubric-gradable reasoning under low-resource linguistic framing. We release code and data to enable reproducible evaluation and human-aligned auto-grading for STEM in under-served languages.
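The grader-agreement figure quoted in the abstract can be illustrated with a minimal sketch. It assumes the metric is the mean of the absolute differences between the LLM-assigned and human-assigned grades for the same solutions; the function name and the grade values below are hypothetical and are not taken from the released code or data.

```python
from statistics import mean


def average_absolute_deviation(llm_grades, human_grades):
    """Mean of |llm - human| over paired grades (assumed metric definition)."""
    if len(llm_grades) != len(human_grades):
        raise ValueError("grade lists must have the same length")
    if not llm_grades:
        raise ValueError("at least one graded solution is required")
    return mean(abs(l - h) for l, h in zip(llm_grades, human_grades))


# Hypothetical grades for four solutions; values are illustrative only.
llm_grades = [5.0, 3.0, 0.0, 4.0]
human_grades = [4.0, 3.0, 2.0, 4.0]

deviation = average_absolute_deviation(llm_grades, human_grades)
print(f"average absolute deviation: {deviation:.2f}")
```

A lower value means the automatic grader tracks the human graders more closely; a value of 0 would mean identical grades on every solution.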
Anthology ID:
2025.emnlp-main.1779
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
35094–35109
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1779/
Cite (ACL):
Adam Zahradník and Marek Suppa. 2025. Trojsten Benchmark: Evaluating LLM Problem-Solving in Slovak STEM Competition Problems. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35094–35109, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Trojsten Benchmark: Evaluating LLM Problem-Solving in Slovak STEM Competition Problems (Zahradník & Suppa, EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1779.pdf
Checklist:
 2025.emnlp-main.1779.checklist.pdf