HuGME: A benchmark system for evaluating Hungarian generative LLMs
Noémi Ligeti-Nagy, Gabor Madarasz, Flora Foldesi, Mariann Lengyel, Matyas Osvath, Bence Sarossy, Kristof Varga, Győző Zijian Yang, Enikő Héja, Tamás Váradi, Gábor Prószéky
Abstract
In this study, we introduce the Hungarian Generative Model Evaluation (HuGME) benchmark, a new framework designed to assess the linguistic proficiency of large language models (LLMs) in Hungarian. HuGME evaluates models across a diverse set of linguistic and reasoning skills, including bias, toxicity, faithfulness, relevance, summarization, prompt alignment, readability, spelling, grammaticality, and domain-specific knowledge through tasks like TruthfulQA and MMLU. We applied HuGME to a range of Hungarian LLMs, including those developed in-house as well as several publicly available models that claim Hungarian language proficiency. This paper presents the comparative results of these evaluations, shedding light on the capabilities of current LLMs in processing the Hungarian language. Through our analysis, we aim to both showcase the current state of Hungarian linguistic processing in LLMs and provide a foundational resource for future advancements in the field.
- Anthology ID:
- 2025.gem-1.32
- Volume:
- Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
- Month:
- July
- Year:
- 2025
- Address:
- Vienna, Austria and virtual meeting
- Editors:
- Kaustubh Dhole, Miruna Clinciu
- Venues:
- GEM | WS
- Publisher:
- Association for Computational Linguistics
- Pages:
- 385–403
- URL:
- https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.32/
- Cite (ACL):
- Noémi Ligeti-Nagy, Gabor Madarasz, Flora Foldesi, Mariann Lengyel, Matyas Osvath, Bence Sarossy, Kristof Varga, Győző Zijian Yang, Enikő Héja, Tamás Váradi, and Gábor Prószéky. 2025. HuGME: A benchmark system for evaluating Hungarian generative LLMs. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 385–403, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
- Cite (Informal):
- HuGME: A benchmark system for evaluating Hungarian generative LLMs (Ligeti-Nagy et al., GEM 2025)
- PDF:
- https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.32.pdf