HuGME: A benchmark system for evaluating Hungarian generative LLMs

Noémi Ligeti-Nagy; Gabor Madarasz; Flora Foldesi; Mariann Lengyel; Matyas Osvath; Bence Sarossy; Kristof Varga; Győző Zijian Yang; Enikő Héja; Tamás Váradi; Gábor Prószéky

Abstract

In this study, we introduce the Hungarian Generative Model Evaluation (HuGME) benchmark, a new framework designed to assess the linguistic proficiency of large language models (LLMs) in Hungarian. HuGME evaluates models across a diverse set of linguistic and reasoning skills, including bias, toxicity, faithfulness, relevance, summarization, prompt alignment, readability, spelling, grammaticality, and domain-specific knowledge through tasks like TruthfulQA and MMLU. We applied HuGME to a range of Hungarian LLMs, including those developed in-house as well as several publicly available models that claim Hungarian language proficiency. This paper presents the comparative results of these evaluations, shedding light on the capabilities of current LLMs in processing the Hungarian language. Through our analysis, we aim to both showcase the current state of Hungarian linguistic processing in LLMs and provide a foundational resource for future advancements in the field.

Anthology ID: 2025.gem-1.32
Volume: Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Month: July
Year: 2025
Address: Vienna, Austria and virtual meeting
Editors: Kaustubh Dhole, Miruna Clinciu
Venues: GEM | WS
Publisher: Association for Computational Linguistics
Pages: 385–403
URL: https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.32/
Cite (ACL): Noémi Ligeti-Nagy, Gabor Madarasz, Flora Foldesi, Mariann Lengyel, Matyas Osvath, Bence Sarossy, Kristof Varga, Győző Zijian Yang, Enikő Héja, Tamás Váradi, and Gábor Prószéky. 2025. HuGME: A benchmark system for evaluating Hungarian generative LLMs. In Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²), pages 385–403, Vienna, Austria and virtual meeting. Association for Computational Linguistics.
Cite (Informal): HuGME: A benchmark system for evaluating Hungarian generative LLMs (Ligeti-Nagy et al., GEM 2025)
PDF: https://preview.aclanthology.org/corrections-2025-08/2025.gem-1.32.pdf