LLMs in Ottoman Turkish: From MLM to NER

Enes Yılandiloğlu


Abstract
This paper introduces three foundational contributions to Digital Ottoman Turkish Studies: (1) three masked language models (MLMs) trained on over 11 million words from 144 works spanning the 15th to the 20th century, (2) a state-of-the-art Named Entity Recognition (NER) model (F1 = 89.94%) trained on 9,960 manually annotated entities, and (3) a state-of-the-art Universal Dependencies (UD) parsing model for Ottoman Turkish. Unlike prior work, it uses IJMES-transliterated documents for training and evaluation to prevent the information loss that comes with converting the script from Perso-Arabic to Latin. In preliminary experiments, the paper also explores probabilistic manuscript reconstruction, showing that MLMs can recover unread sections of historical documents with 77.8% top-1 accuracy when a list of candidate words is provided. After a discussion, the paper outlines two future directions: building century-aware MLMs and expanding the training data across genres to improve model generalization.
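The candidate-restricted evaluation mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `rank_candidates` and `top1_accuracy`, the score dictionaries, and the toy values are all assumptions made for illustration. The idea is that a masked language model assigns a score to each vocabulary item at the unread position, the scores are renormalised over the supplied candidate words only, and a prediction counts as correct when the highest-ranked candidate matches the attested reading.

```python
import math


def rank_candidates(scores, candidates):
    """Rank candidate words by probability, renormalised over the candidates.

    `scores` is a hypothetical mapping from word to the raw score (logit)
    a masked language model assigned at the unread position.
    """
    # Keep only the candidates that the model actually scored.
    scored = {word: scores[word] for word in candidates if word in scores}
    if not scored:
        return []
    # Numerically stable softmax restricted to the candidate list.
    peak = max(scored.values())
    weights = {word: math.exp(s - peak) for word, s in scored.items()}
    total = sum(weights.values())
    ranked = [(word, w / total) for word, w in weights.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def top1_accuracy(examples):
    """Share of examples whose best-ranked candidate equals the gold word.

    `examples` is a list of (scores, candidates, gold_word) triples.
    """
    hits = 0
    for scores, candidates, gold in examples:
        ranked = rank_candidates(scores, candidates)
        if ranked and ranked[0][0] == gold:
            hits += 1
    return hits / len(examples) if examples else 0.0


# Toy scores, invented for illustration only (not taken from the paper).
toy_scores = {"sulṭān": 2.0, "vezīr": 0.5, "kitāb": 3.0}
toy_examples = [
    (toy_scores, ["sulṭān", "vezīr"], "sulṭān"),  # best candidate matches gold
    (toy_scores, ["sulṭān", "vezīr"], "vezīr"),   # best candidate misses gold
]
```

Restricting the softmax to the candidate list means a word the model scores highly overall (here `kitāb`) cannot win unless it is among the supplied candidates, which is what distinguishes this setting from open-vocabulary mask filling.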
Anthology ID:
2026.lrec-main.281
Volume:
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Month:
May
Year:
2026
Address:
Palma de Mallorca, Spain
Editors:
Stelios Piperidis, Núria Bel, Henk van den Heuvel, Nancy Ide, Simon Krek, Antonio Toral
Venue:
LREC
Publisher:
ELRA Language Resource Association
Pages:
3517–3522
URL:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.281/
Cite (ACL):
Enes Yılandiloğlu. 2026. LLMs in Ottoman Turkish: From MLM to NER. In Proceedings of the Fifteenth Language Resources and Evaluation Conference, pages 3517–3522, Palma de Mallorca, Spain. ELRA Language Resource Association.
Cite (Informal):
LLMs in Ottoman Turkish: From MLM to NER (Yılandiloğlu, LREC 2026)
PDF:
https://preview.aclanthology.org/ingest-lrec/2026.lrec-main.281.pdf