Ezgi Başar
2026
A Morphology-Aware Evaluation of Turkish Syntax in Large Language Models
Ezgi Başar | Arianna Bisazza
Proceedings of the Second Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2026)
Minimal pair benchmarks have become a common approach for evaluating the syntactic knowledge of language models (LMs). However, the creation of such benchmarks often overlooks language-specific confounders that may affect model performance, particularly in the case of morphologically rich languages. In this paper, we investigate how surface-level factors such as morpheme count, subword count, and sentence length influence the performance of LMs on a Turkish benchmark of linguistic minimal pairs. We further analyze whether a tokenizer’s degree of alignment with morphological boundaries can serve as a proxy for model performance. Finally, we test whether the distribution of morphemes in a minimal pair benchmark can skew model performance. Our results show that while surface factors have limited predictive power, they might still serve as a systematic source of bias. Moreover, we find that morphological alignment can roughly correspond to model performance, and morpheme-level imbalances in the benchmark may have a significant influence on results.
2025
TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
Ezgi Başar | Francesca Padovani | Jaap Jumelet | Arianna Bisazza
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We introduce TurBLiMP, the first Turkish benchmark of linguistic minimal pairs, designed to evaluate the linguistic abilities of monolingual and multilingual language models (LMs). Covering 16 linguistic phenomena with 1000 minimal pairs each, TurBLiMP fills an important gap in linguistic evaluation resources for Turkish. In designing the benchmark, we give extra attention to two properties of Turkish that remain understudied in current syntactic evaluations of LMs, namely word order flexibility and subordination through morphological processes. Our experiments on a wide range of LMs and a newly collected set of human acceptability judgments reveal that even cutting-edge Large LMs still struggle with grammatical phenomena that are not challenging for humans, and may also exhibit different sensitivities to word order and morphological complexity compared to humans.
LCTeam at SemEval-2025 Task 3: Multilingual Detection of Hallucinations and Overgeneration Mistakes Using XLM-RoBERTa
Araya Hailemariam | Jose Maldonado Rodriguez | Ezgi Başar | Roman Kovalev | Hanna Shcharbakova
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
In recent years, the tendency of large language models to produce hallucinations has become an object of academic interest. Hallucinated or overgenerated outputs created by LLMs contain factual inaccuracies that can undermine textual coherence. The Mu-SHROOM shared task sets the goal of developing strategies for detecting hallucinated parts of LLM outputs in a multilingual context. We present an approach applicable across multiple languages, which incorporates the alignment of tokens and hard labels, as well as training a multilingual XLM-RoBERTa model. With this approach, we achieved 2nd place in the Chinese track and top-10 positions in 7 other language tracks of the competition.