LLMs in Ottoman Turkish: From MLM to NER

Enes Yılandiloğlu

Abstract

This paper makes three foundational contributions to Digital Ottoman Turkish Studies: (1) three masked language models (MLMs) trained on over 11 million words from 144 works spanning the 15th to the 20th century, (2) a state-of-the-art Named Entity Recognition (NER) model (F1 = 89.94%) trained on 9,960 manually annotated entities, and (3) a state-of-the-art Universal Dependencies (UD) parsing model for Ottoman Turkish. Unlike other work, it uses IJMES-transliterated documents for training and evaluation in order to prevent the loss of information caused by the change of script from Perso-Arabic to Latin. Preliminary experiments on probabilistic manuscript reconstruction further show that MLMs can recover unread sections of historical documents with 77.8% top-1 accuracy when a list of candidate words is provided. Following a discussion, the paper outlines two future directions: building century-aware MLMs and expanding the training data across genres to improve model generalization.

Anthology ID: 2026.lrec-main.281
Volume: Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month: May
Year: 2026
Address: Palma de Mallorca, Spain
Editors: Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue: LREC
Publisher: ELRA Language Resource Association
Pages: 3517–3522
URL: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.281/
Cite (ACL): Enes Yılandiloğlu. 2026. LLMs in Ottoman Turkish: From MLM to NER. International Conference on Language Resources and Evaluation, main:3517–3522.
Cite (Informal): LLMs in Ottoman Turkish: From MLM to NER (Yılandiloğlu, LREC 2026)
PDF: https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.281.pdf
