Esma Fatıma Bilgin Tasdemir

2026

RuznamceNER: A Named Entity Recognition Dataset for Ottoman Turkish
Esma Fatıma Bilgin Tasdemir | Dilara Zeynep Gürer | Saziye Betul Ozates
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Named Entity Recognition (NER) in historical texts poses distinct challenges. Language change, reflected in spelling variations, archaic vocabulary, and inconsistent orthography, diminishes the efficacy of models trained on contemporary corpora. The limited availability of annotated historical datasets constrains the development and evaluation of accurate, domain-specific NER systems, underscoring the necessity for specialized approaches and domain adaptation. In this work, we introduce the ruznamçe registers as a valuable digital historical resource with broad potential for diverse NLP applications. Our primary contribution is RuznamceNER, a manually annotated NER dataset derived from ruznamçe documents spanning two centuries. The dataset contains 2,138 sentences and a total of 8,730 annotated entities of types PERSON, LOCATION and ORGANIZATION. We further report evaluation results using a BERT-CRF baseline model pre-trained on modern Turkish, highlighting the pivotal importance of in-domain training data for effective NER in historical contexts. Experimental results on the RuznamceNER test set under various training configurations show that even a small amount of supervised in-domain data can yield robust performance for well-structured texts, despite significant lexical and orthographic differences between historical and modern language forms.

2025

NakbaTR: A Turkish NER Dataset for Nakba Narratives
Esma Fatıma Bilgin Tasdemir | Şaziye Betül Özateş
Proceedings of the first International Workshop on Nakba Narratives as Language Resources

This paper introduces a novel, annotated Named Entity Recognition (NER) dataset derived from a collection of 181 news articles about the Nakba and its witnesses. Given their prominence as a primary source of information on the Nakba in Turkish, news articles were selected as the primary data source. Some 4,032 news sentences were collected from the websites of two news agencies, Anadolu Ajansı and TRTHaber. We applied a filtering process to ensure that only news articles containing witness testimonies regarding the ongoing Nakba are included in the dataset. After semi-automatic annotation of entities of type Person, Location, and Organization, we obtained a NER dataset of 2,289 PERSON, 5,875 LOCATION, and 1,299 ORGANIZATION tags. We expect the dataset to be useful in several NLP tasks, such as sentiment analysis and relation extraction for the Nakba event, while providing a new language resource for Turkish. As future work, we aim to improve the dataset by increasing the number of news articles and entity types.
