Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2026)
Proceedings of the Second Workshop on Natural Language Processing for Turkic Languages (SIGTURK 2026)
Kemal Oflazer | Abdullatif Köksal | Onur Varol
Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations, representing the first large-scale encoder-only language model available for Turkish. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores in two of four tasks but showing no consistent scaling advantage overall. This flat scaling trend, also observed for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be saturated. At the same time, comparisons with smaller but more curated models such as BERTurk highlight that corpus quality and diversity can outweigh sheer data volume. Taken together, SindBERT contributes both as an openly released resource for Turkish NLP and as an empirical case study on the limits of scaling and the central role of corpus composition in morphologically rich languages. The SindBERT models are released under the MIT license and made available in both fairseq and Hugging Face formats.
Authorial style transfer is particularly challenging in low-resource scenarios, such as those presented by languages with a distinct socio-digital trajectory like Turkish, where contemporary digital text coexists with under-resourced literary and historical styles. This work addresses this gap through the Dual-Stage Stylometric Imprinting (DSSI) framework, introducing a Rule+Example paradigm for effective style profiling. Evaluated on a corpus of Turkish texts, the approach enables smaller models to achieve up to 90% of large model performance by combining explicit stylistic guidelines with contextual demonstrations. The findings demonstrate altered scaling laws for stylistic tasks and facilitate the practical deployment of personalized style transfer for preserving distinctive writing characteristics.
TUNE: A Task For Turkish Machine Unlearning For Data Privacy
Doruk Benli | Ada Canoğlu | Nehir İlkim Gönençer | Dilara Keküllüoğlu
Most large language models (LLMs) are trained on massive datasets that include private information, which may be disclosed to third-party users in output generation. Developers put defences in place to prevent the generation of harmful and private information, but jailbreaking methods can be used to bypass them. Machine unlearning aims to remove information that may be private or harmful from the model's generation without retraining the model from scratch. While machine unlearning has gained some popularity for the removal of private information, especially in English, little to no attention has been given to Turkish unlearning paradigms or existing benchmarks. In this study, we introduce TUNE (Turkish Unlearning Evaluation), the first benchmark dataset for the Turkish unlearning task for personal information. TUNE consists of 9,842 input-target text pairs about 50 fictitious personalities with two training task types: (1) QA and (2) Information Request. We fine-tuned the mT5-base model to evaluate various unlearning methods, including our proposed approach. We find that while current methods can help unlearn unwanted private information in Turkish, they also unlearn other information we want to retain in the model.
A Unified Turkic Idiom Understanding Benchmark: Idiom Detection and Semantic Retrieval Across Five Turkic Languages
Gözde Aslantaş | Tunga Gungor
Idiomatic expressions are culturally grounded, semantically opaque, and difficult to interpret for multilingual natural language processing systems. Despite the large speaker population of Turkic languages, resources that focus on monolingual and cross-lingual idioms and their meanings are limited. We introduce the first unified benchmark for idiom understanding across Turkish, Azerbaijani, Turkmen, Gagauz, and Uzbek. The compiled datasets include token-level idiom span annotations. We develop models for idiom identification and semantic retrieval tasks. We evaluate seven models for idiom identification and nine embedding models for semantic retrieval under several fine-tuning schemes using standard dense retrieval metrics. This benchmark provides a basis for studying idiomatic phenomena in Turkic languages and clarifies how idiomatic meanings are shared, altered, or divergent across languages.
TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization
Figen Eğin | Aytuğ Onan
This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created, encompassing 82 Turkish course videos in the field of "Data Structures and Algorithms" and containing a total of 3281 independent human summaries. Inspired by existing pyramid-based evaluation approaches, the AutoMUP (Automatic Meaning Unit Pyramid) method is proposed, which extracts consensus-based content from multiple human summaries. AutoMUP clusters the meaning units extracted from human summaries using embedding, statistically models inter-participant agreement, and generates graded summaries based on consensus weight. In this framework, the gold summary corresponds to the highest-consensus AutoMUP configuration, constructed from the most frequently supported meaning units across human summaries. Experimental results show that AutoMUP summaries exhibit high semantic overlap with robust LLM summaries such as Flash 2.5 and GPT-5.1. Furthermore, ablation studies clearly demonstrate the decisive role of consensus weight and clustering in determining summary quality. The proposed approach can be generalized to other Turkic languages at low cost.
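AutoMUP's actual clustering and weighting procedures are defined in the paper itself and are not reproduced here. As a rough sketch of the consensus idea only, with a similarity callback standing in for embedding-based clustering (all names below are illustrative, not the authors' code), meaning units can be greedily grouped across annotators and ranked by how many distinct annotators support each group:

```python
def consensus_units(summaries, similar, min_support=2):
    """Group meaning units from multiple human summaries and keep those
    supported by at least `min_support` annotators.

    summaries: list of lists of meaning-unit strings, one list per annotator.
    similar(a, b) -> bool is a stand-in for embedding-based clustering.
    """
    clusters = []  # each cluster: {"rep": representative unit, "annotators": set}
    for ann_id, units in enumerate(summaries):
        for unit in units:
            for c in clusters:
                if similar(unit, c["rep"]):
                    c["annotators"].add(ann_id)
                    break
            else:
                clusters.append({"rep": unit, "annotators": {ann_id}})
    # consensus weight = number of distinct annotators supporting the unit
    ranked = sorted(clusters, key=lambda c: len(c["annotators"]), reverse=True)
    return [(c["rep"], len(c["annotators"])) for c in ranked
            if len(c["annotators"]) >= min_support]
```

With exact string match as the similarity stand-in, units mentioned by two of three annotators survive the consensus threshold while singletons are dropped; the gold summary in the paper's framework corresponds to the highest-consensus configuration of such units.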
SarcasTürk: Turkish Context-Aware Sarcasm Detection Dataset
Niyazi Ahmet Metin | Sevde Yılmaz | Osman Enes Erdoğdu | Elif Sude Meydan | Oğul Sümer | Dilara Keküllüoğlu
Sarcasm is a colloquial form of language that is used to convey messages in a non-literal way, which affects the performance of many NLP tasks. Sarcasm detection is not trivial, and existing work focuses mainly on English. We present SarcasTürk, a context-aware Turkish sarcasm detection dataset built from entries on Ekşi Sözlük, a large-scale Turkish online discussion platform where people frequently use sarcasm. SarcasTürk contains 1,515 entries from 98 titles with binary sarcasm labels and a title-level context field created to support comparisons between entry-only and context-aware models. We generate these contexts by selecting representative sentences from all entries under a title using summarization techniques. We report baseline results for a fine-tuned BERTurk classifier and zero-shot LLMs under both no-context and context-aware conditions. We find that the BERTurk model with title-level context has the best performance, with 0.76 accuracy and balanced class-wise F1 scores (0.77 for sarcasm, 0.75 for no sarcasm). SarcasTürk can be shared upon request to the authors, since the dataset contains potentially sensitive and offensive language.
Language Matters: Target-Language Supervision for Political Bias Detection in Turkish News
Umut Ozbagriacik | Haim Dubossarsky
We present, to our knowledge, the first systematic transformer-based outlet-ideology classification study for Turkish news. Using a topic-balanced corpus of Turkish political articles drawn from six outlets commonly perceived as left-, centre-, or right-leaning, we formulate a three-way outlet-ideology classification task. On this dataset, we evaluate a monolingual encoder (BERTurk), two multilingual encoders (mBERT, XLM-R), and a LoRA-adapted decoder model (Mistral). BERTurk achieves the best performance among individual models (70% accuracy, 71% macro-F1), reaching levels comparable to English-language studies despite operating in a lower-resource setting. Error analyses show that all encoders reliably distinguish centrist from partisan articles, but frequently confuse left- and right-leaning articles with each other. Moreover, BERTurk is relatively stronger on right-leaning content, whereas the multilingual models favour left-leaning content, suggesting an “ideological fingerprint” of their pre-training data. Crucially, models fine-tuned on an English political-bias task fail to transfer to Turkish, collapsing to near-chance performance. Taken together, these results demonstrate that effective political bias detection requires target-language supervision and cannot be achieved through naïve cross-lingual transfer. Our work establishes a first baseline for Turkish political bias detection and underscores the need for open, carefully designed Turkish (and broader Turkic) bias benchmarks to support robust and fair media analysis.
Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
Giuseppe Samo | Paola Merlo
In this paper, we investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish, with its transparent morphological markers, both monolingual and multilingual models succeed either when tokenization is highly atomic or when it breaks words into small subword units. For Hebrew, however, a multilingual model using character-level tokenization fails to capture its non-concatenative morphology, while a monolingual model with unified morpheme-aware segmentation excels. Performance improves on more synthetic datasets across all models.
A Morphology-Aware Evaluation of Turkish Syntax in Large Language Models
Ezgi Başar | Arianna Bisazza
Minimal pair benchmarks have become a common approach for evaluating the syntactic knowledge of language models (LMs). However, the creation of such benchmarks often overlooks language-specific confounders that may affect model performance, particularly in the case of morphologically rich languages. In this paper, we investigate how surface-level factors such as morpheme count, subword count, and sentence length influence the performance of LMs on a Turkish benchmark of linguistic minimal pairs. We further analyze whether a tokenizer’s degree of alignment with morphological boundaries can serve as a proxy for model performance. Finally, we test whether the distribution of morphemes in a minimal pair benchmark can skew model performance. Our results show that while surface factors have limited predictive power, they might still serve as a systematic source of bias. Moreover, we find that morphological alignment can roughly correspond to model performance, and morpheme-level imbalances in the benchmark may have a significant influence on results.
Benchmarking Hate Speech Detection in Azerbaijani with Turkish Cross-Lingual Transfer and Transformer Models
Tural Alizada | Haim Dubossarsky
In this paper, we investigate hate-speech classification in the closely related Turkic language pair Turkish–Azerbaijani. Transformer models can achieve strong hate-speech classification in Turkish, but their performance does not reliably transfer to closely related low-resource languages without careful evaluation. We introduce the first manually annotated Azerbaijani benchmark, comprising 1,112 YouTube comments from major news channels with severe class imbalance. We compare XLM-RoBERTa and a compact BERT-Tiny model against a TF–IDF + logistic regression baseline under monolingual training, zero-shot Turkish→Azerbaijani transfer, low-resource balanced subsampling, bilingual mixed fine-tuning, and translation-based augmentation using machine-translated Turkish data. XLM-R attains high macro-F1 in Turkish and achieves moderate zero-shot transfer to Azerbaijani, but native Azerbaijani training is fragile for the hate class. Mixed bilingual training improves robustness for both languages, whereas TF–IDF generalizes poorly to Azerbaijani.
When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English
Hasan Can Biyik | Libby Barak | Jing Peng | Anna Feldman
Euphemisms substitute socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, we investigate how cross-lingual equivalence influences transfer in multilingual euphemism detection. We categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on their functional, pragmatic, and semantic alignment. Our findings reveal a transfer asymmetry: semantic overlap is insufficient to guarantee positive transfer, particularly in the low-resource Turkish-to-English direction, where performance can degrade even for overlapping euphemisms and, in some cases, improve under NOPET-based training. Differences in label distribution help explain these counterintuitive results. Category-level analysis suggests that transfer may be influenced by domain-specific alignment, though evidence is limited by sparsity.
TurkBench: A Benchmark for Evaluating Turkish Large Language Models
Cagri Toraman | Ahmet Kaan Sever | Ayşe Aysu Cengiz | Elif Ecem Arslan | Görkem Sevinç | Sarp Kantar | Mete Mert Birdal | Yusuf Faruk Güldemir | Ali Buğra Kanburoğlu | Sezen Felekoğlu | Birsen Şahin Kütük | Büşra Tufan | Elif Genç | Serkan Coşkun | Gupse Ekin Demir | Muhammed Emin Arayıcı | Olgun Dursun | Onur Gungor | Susan Üsküdarlı | Abdullah Topraksoy | Esra Darıcı
With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English-language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench involves 8,151 data samples across 21 distinct subtasks. These are organized under six main categories of evaluation: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at https://huggingface.co/turkbench
BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish
Burak Aktaş | Mehmet Can Baytekin | Süha Kağan Köse | Ömer İlbilgi | Elif Özge Yılmaz | Cagri Toraman | Bilge Kaan Görür
Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark, constructed through a controlled translation pipeline that adapts schema identifiers to Turkish while strictly preserving the logical structure and execution semantics of SQL queries and databases. Translation quality is validated on a sample size determined by the Central Limit Theorem to ensure 95% confidence, achieving 98.15% accuracy on human-evaluated samples. Using BIRDTurk, we evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning. Our results reveal that Turkish introduces consistent performance degradation, driven by both structural linguistic divergence and underrepresentation in LLM pretraining, while agentic reasoning demonstrates stronger cross-lingual robustness. Supervised fine-tuning remains challenging for standard multilingual baselines but scales effectively with modern instruction-tuned models. BIRDTurk provides a controlled testbed for cross-lingual Text-to-SQL evaluation under realistic database conditions. We release the training and development splits to support future research.
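The paper reports a CLT-based sample size at 95% confidence; its exact formula is not reproduced here, but the standard normal-approximation calculation for validating a proportion looks roughly like this (parameter defaults and the optional finite-population correction are illustrative assumptions):

```python
import math

def sample_size(confidence_z=1.96, margin=0.05, p=0.5, population=None):
    """Minimum sample size for estimating a proportion at the given
    confidence level and margin of error (normal approximation).

    confidence_z: z-score for the confidence level (1.96 for 95%).
    p=0.5 is the most conservative assumed proportion.
    """
    n = (confidence_z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # finite-population correction for small corpora
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)
```

At a 5% margin of error this gives 385 samples for a large population; with a finite-population correction for a corpus on the order of 3,300 sentence pairs, the requirement drops to roughly 345.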
Tokenisation of Turkic Copula Constructions in Universal Dependencies
Cagri Coltekin | Furkan Akkurt | Bermet Chontaeva | Soudabeh Eslami | Sardana Ivanova | Gulnura Dzhumalieva | Aida Kasieva | Nikolett Mus | Jonathan Washington
Identifying units, 'syntactic words', for morphosyntactic analysis is important yet challenging for morphologically rich languages. In this paper, we propose a set of guiding principles for determining units of morphosyntactic analysis, and apply them to the case of copular constructions in Turkic languages in the context of the Universal Dependencies (UD) framework. We also survey the practice in the Turkic UD treebanks published to date, and discuss the advantages and disadvantages of the proposed tokenisation for a selection of Turkic languages.
RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish
Süha Kağan Köse | Mehmet Can Baytekin | Burak Aktaş | Bilge Kaan Görür | Evren Ayberk Munis | Deniz Yılmaz | Muhammed Yusuf Kartal | Cagri Toraman
Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline, from query transformation and reranking to answer refinement, without task-specific fine-tuning. Our results show that complex methods like HyDE achieve the highest accuracy (85%), considerably above the baseline (78.70%), while a Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation reaches comparable performance (84.60%) at much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.
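The paper's actual rerankers and prompts are not shown here; as a minimal sketch of the cross-encoder reranking stage, assuming a scoring callback in place of a real model (the toy overlap scorer below is purely illustrative):

```python
def rerank(query, passages, score_fn, top_k=3):
    """Reorder retrieved passages by a cross-encoder style score.

    score_fn(query, passage) -> float is a stand-in for a real
    cross-encoder model; higher means more relevant.
    """
    scored = sorted(passages, key=lambda p: score_fn(query, p), reverse=True)
    return scored[:top_k]

def overlap_score(query, passage):
    """Toy scorer: fraction of query tokens found in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)
```

In a real pipeline, `score_fn` would be a trained cross-encoder that jointly encodes the query and passage; the reranked top-k chunks then feed the generation step.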
OCRTurk: A Comprehensive OCR Benchmark for Turkish
Deniz Yılmaz | Evren Ayberk Munis | Cagri Toraman | Süha Kağan Köse | Burak Aktaş | Mehmet Can Baytekin | Bilge Kaan Görür
Document parsing is now widely used in applications such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading on most element-wise metrics except figures and attaining the best Normalized Edit Distance scores in the easy, medium, and hard subsets. We also observe performance variation by document type: models perform well on non-academic documents, while slide decks prove the most challenging.
Building a Turkish Large Language Model via Continual Pre-Training and Parameter-Efficient Adaptation
Alperen Enes Bayar | Mert Ege | Gökhan Yurtalan | Alper Karamanlioglu | Berkan Demirel | Ramazan Gokberk Cinbis
Large Language Models (LLMs) achieve strong performance on many tasks, but they still struggle with morphologically rich, low-resource languages such as Turkish. This difficulty stems from Turkish being an agglutinative language and underrepresented in multilingual training data, which causes current models to often fail at capturing its morphology, flexible word order, and formal registers. In this paper, we introduce MODA (Model Adapted for Domain Applications), a Turkish-specialized LLM built via a modular pipeline that combines continual pre-training, parameter-efficient fine-tuning, and model merging. Starting from Qwen2.5-7B as the base model, we first perform large-scale continual pre-training on a Turkish web corpus to improve grammatical and morphological representations. We then apply parameter-efficient supervised fine-tuning on task-oriented instruction data, and finally merge specialized variants into a single unified model. We evaluate MODA on TurkishMMLU, the Turkish subset of EXAMS, and TRCLAIM-19, where it consistently outperforms both the base and instruction-tuned Qwen2.5-7B models. Our results support a training strategy that explicitly separates linguistic acquisition from task alignment when adapting LLMs to morphologically rich, underrepresented languages under realistic hardware constraints.
From Lemmas to Dependencies: What Signals Drive Light Verbs Classification?
Sercan Karakas | Yusuf Şimşek
Light verb constructions (LVCs) are a challenging class of verbal multiword expressions, especially in Turkish, where rich morphology and productive complex predicates create minimal contrasts between idiomatic predicate meanings and literal verb–argument uses. This paper asks what signals drive LVC classification by systematically restricting model inputs. Using UD-derived supervision, we compare lemma-driven baselines (lemma TF–IDF + Logistic Regression; BERTurk trained on lemma sequences), a grammar-only Logistic Regression over UD morphosyntax (UPOS/DEPREL/MORPH), and a full-input BERTurk baseline. We evaluate on a controlled diagnostic set with Random negatives, lexical controls (NLVC), and LVC positives, reporting split-wise performance to expose decision-boundary behavior. Results show that coarse morphosyntax alone is insufficient for robust LVC detection under controlled contrasts, while lexical identity supports LVC judgments but is sensitive to calibration and normalization choices. Overall, our findings motivate targeted evaluation for Turkish MWEs and highlight that "lemma-only" is not a single representation but depends critically on how normalization is instantiated.
Beyond the Token: Correcting the Tokenization Bias in XAI via Morphologically-Aligned Projection
Muhammet Anil Yagiz | Fahrettin Horasan
Current interpretability methods for Large Language Models (LLMs) operate on a fundamental yet flawed assumption: that subword tokens represent independent semantic units. We prove that this assumption creates a fidelity bottleneck in Morphologically Rich Languages (MRLs), where semantic meaning is densely encoded in sub-token morphemes. We term this phenomenon the Tokenization-Morphology Misalignment (TMM). To resolve TMM, we introduce MAFEX (Morpheme-Aligned Faithful Explanations), a theoretically grounded framework that redefines feature attribution as a linear projection from the computational (token) basis to the linguistic (morpheme) basis. We evaluate our method on a diverse suite of Turkish LLMs, including BERTurk, BERTurk-Sentiment, Cosmos-BERT, and Kumru-2B. On our embedded benchmark (N=20), MAFEX achieves an average F1@1 of 91.25% compared to 13.75% for standard token-level baselines (IG, SHAP, DeepLIFT), representing a +77.5% absolute improvement, establishing it as the new standard for faithful multilingual interpretability.
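MAFEX itself is defined in the paper; the sketch below illustrates only the general idea of projecting token-level attributions onto a morpheme basis, using proportional character overlap as an assumed alignment scheme (not the authors' method):

```python
def project_attributions(token_spans, morpheme_spans, token_attrs):
    """Project token-level attribution scores onto morpheme units.

    token_spans / morpheme_spans: (start, end) character offsets into
    the same word. Each token's attribution is split across morphemes
    in proportion to character overlap.
    """
    morph_attrs = [0.0] * len(morpheme_spans)
    for (ts, te), attr in zip(token_spans, token_attrs):
        overlaps = [max(0, min(te, me) - max(ts, ms))
                    for ms, me in morpheme_spans]
        total = sum(overlaps)
        if total == 0:
            continue  # token aligned to no morpheme; skip
        for i, ov in enumerate(overlaps):
            morph_attrs[i] += attr * ov / total
    return morph_attrs
```

For example, Turkish "evlerde" segments as ev+ler+de but may be tokenized as "evler"+"de": a token attribution of 0.6 on "evler" is then split 0.24/0.36 across "ev" and "ler" by character overlap, while "de" keeps its own 0.4.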
Overview of the SIGTURK 2026 Shared Task: Terminology-Aware Machine Translation for English–Turkish Scientific Texts
Ali Gebeşçe | Abdulfattah Safa | Ege Uğur Amasya | Gözde Gül Şahin
This paper presents an overview of the SIGTURK 2026 Shared Task on Terminology-Aware Machine Translation for English-Turkish Scientific Texts. We address the critical challenge of terminological accuracy in low-resource settings by constructing the first terminology-rich English-Turkish parallel corpus, comprising 3,300 sentence pairs from STEM domains with 10,157 expert-validated term pairs. The shared task consists of three subtasks: term detection, expert-guided correction, and end-to-end post-editing. We evaluate state-of-the-art baselines (including GPT-5.2 and Claude Sonnet 4.5) alongside participant systems employing diverse strategies from fine-tuning to Retrieval-Augmented Generation (RAG). Our results highlight that while massive generalist models dominate zero-shot detection, smaller, domain-adapted models using Supervised Fine-Tuning and Reinforcement Learning can significantly outperform them in end-to-end post-editing. Furthermore, we find that rigid retrieval pipelines often disrupt fluency, whereas Chain-of-Thought prompting allows models to integrate terminology more naturally. Despite these advances, a significant gap remains between automated systems and human expert performance in strict terminology correction.