Noémi Ligeti-Nagy

2026

The Impact of Tokenization Algorithms on Hungarian Language Model Performance
Mátyás Osváth | Máté Norbert Molnár | Roland Gunics | Noémi Ligeti-Nagy
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Tokenization is a crucial text processing step for preparing input for language models and can contribute to model performance, especially in morphologically rich languages. Currently, Byte Pair Encoding (BPE), WordPiece, and Unigram LM algorithms are predominantly used in language models, but their effects can vary in agglutinative languages. This work compares these tokenization algorithms across varying vocabulary sizes, as well as a modified Unigram LM variant with morphologically informed initialization, on the Hungarian subset of the OSCAR dataset. The evaluation is based on several metrics describing the inferred quality of the tokenizers and on the downstream performance of multiple BERT models on the HuLU benchmark. Results show that BPE produces the most compact and morphologically aligned subword representations, while the modified Unigram LM achieved the best overall downstream performance across tasks. However, differences between methods and vocabulary sizes were generally small and not statistically significant, with the exception of HuCoPA (a task within the HuLU benchmark), which showed sensitivity to both factors. These findings underscore that tokenizer choice and vocabulary design are critical determinants of language model efficiency and performance in morphologically rich languages.

pdf bib abs

Potentially idiomatic expressions (PIEs) carry meanings inherently tied to the everyday experience of a given language community. As such, they constitute an interesting challenge for assessing the linguistic (and to some extent cultural) capabilities of NLP systems. In this paper, we present XMPIE, a parallel multilingual and multimodal dataset of potentially idiomatic expressions. The dataset, containing 34 languages and over ten thousand items, allows comparative analyses of idiomatic patterns among language-specific realisations and preferences in order to gather insights about shared cultural aspects. This parallel dataset allows evaluation of language model performance for a given PIE in different languages and whether idiomatic understanding in one language can be transferred to another. Moreover, the dataset supports the study of PIEs across textual and visual modalities, to measure to what extent PIE understanding in one modality transfers or implies in understanding in another modality (text vs. image). The data was created by language experts, with both textual and visual components crafted under multilingual guidelines, and each PIE is accompanied by five images representing a spectrum from idiomatic to literal meanings, including semantically related and random distractors. The result is a high-quality benchmark for evaluating multilingual and multimodal idiomatic language understanding.

2025

pdf bib abs

In this study, we introduce the Hungarian Generative Model Evaluation (HuGME) benchmark, a new framework designed to assess the linguistic proficiency of large language models (LLMs) in Hungarian. HuGME evaluates models across a diverse set of linguistic and reasoning skills, including bias, toxicity, faithfulness, relevance, summarization, prompt alignment, readability, spelling, grammaticality, and domain-specific knowledge through tasks like TruthfulQA and MMLU. We applied HuGME to a range of Hungarian LLMs, including those developed in-house as well as several publicly available models that claim Hungarian language proficiency. This paper presents the comparative results of these evaluations, shedding light on the capabilities of current LLMs in processing the Hungarian language. Through our analysis, we aim to both showcase the current state of Hungarian linguistic processing in LLMs and provide a foundational resource for future advancements in the field.

We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In the construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs’ generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides the comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models. The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and specifics. We also established the framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval .

2024

pdf bib abs

HuLU: Hungarian Language Understanding Benchmark Kit
Noémi Ligeti-Nagy | Gergő Ferenczi | Enikő Héja | László János Laki | Noémi Vadász | Zijian Győző Yang | Tamás Váradi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

The paper introduces the Hungarian Language Understanding (HuLU) benchmark, a comprehensive assessment framework designed to evaluate the performance of neural language models on Hungarian language tasks. Inspired by the renowned GLUE and SuperGLUE benchmarks, HuLU aims to address the challenges specific to Hungarian language processing. The benchmark consists of various datasets, each representing different linguistic phenomena and task complexities. Moreover, the paper presents a web service developed for HuLU, offering a user-friendly interface for model evaluation. This platform not only ensures consistent assessment but also fosters transparency by maintaining a leaderboard showcasing model performances. Preliminary evaluations of various LMMs on HuLU datasets indicate that while Hungarian models show promise, there’s room for improvement to match the proficiency of English-centric models in their native language.

2022

pdf bib abs

A Clique-based Graphical Approach to Detect Interpretable Adjectival Senses in Hungarian
Enikő Héja | Noémi Ligeti-Nagy
Proceedings of TextGraphs-16: Graph-based Methods for Natural Language Processing

The present paper introduces an ongoing research which aims to detect interpretable adjectival senses from monolingual corpora applying an unsupervised WSI approach. According to our expectations the findings of our investigation are going to contribute to the work of lexicographers, linguists and also facilitate the creation of benchmarks with semantic information for the NLP community. For doing so, we set up four criteria to distinguish between senses. We experiment with a graphical approach to model our criteria and then perform a detailed, linguistically motivated manual evaluation of the results.

2019

pdf bib abs

In this article, an ongoing research is presented, the immediate goal of which is to create a corpus annotated with semantic role labels for Hungarian that can be used to train a parser-based system capable of formulating relevant questions about the text it processes. We briefly describe the objectives of our research, our efforts at eliminating errors in the Hungarian Universal Dependencies corpus, which we use as the base of our annotation effort, at creating a Hungarian verbal argument database annotated with thematic roles, at classifying adjuncts, and at matching verbal argument frames to specific occurrences of verbs and participles in the corpus.

pdf bib

What does the Nom say? An algorithm for case disambiguation in Hungarian
Noémi Ligeti-Nagy | Andrea Dömötör | Noémi Vadász
Proceedings of the Fifth International Workshop on Computational Linguistics for Uralic Languages