Bence Sárossy


2025

HuGME: A benchmark system for evaluating Hungarian generative LLMs
Noémi Ligeti-Nagy | Gábor Madarász | Flóra Földesi | Mariann Lengyel | Mátyás Osváth | Bence Sárossy | Kristóf Varga | Győző Zijian Yang | Enikő Héja | Tamás Váradi | Gábor Prószéky
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)

In this study, we introduce the Hungarian Generative Model Evaluation (HuGME) benchmark, a new framework designed to assess the linguistic proficiency of large language models (LLMs) in Hungarian. HuGME evaluates models along a diverse set of linguistic and reasoning dimensions, including bias, toxicity, faithfulness, relevance, summarization, prompt alignment, readability, spelling, and grammaticality, as well as domain-specific knowledge assessed through tasks such as TruthfulQA and MMLU. We applied HuGME to a range of Hungarian LLMs, including models developed in-house as well as several publicly available models that claim Hungarian language proficiency. This paper presents the comparative results of these evaluations, shedding light on how well current LLMs process the Hungarian language. Through our analysis, we aim both to showcase the current state of Hungarian linguistic processing in LLMs and to provide a foundational resource for future advancements in the field.