Catalogues as Data: Interpretable NLP Pipelines for Ottoman-Turkish Bibliographies

Mark Hill; Ayse Bulus; Paul Spence


Abstract

Bibliographies are both humanities infrastructure and historical record. Analysing them computationally, however, requires complex digitisation and standardisation decisions. This paper takes Seyfettin Özege’s Eski Harflerle Basılmış Türkçe Eserler Kataloğu as an example: a scanned set of volumes marked by complex page layouts, degraded typography, irregular entry structures, and historically contingent inconsistencies. We present a pipeline that constructs a structured, machine-readable, and analysable dataset from its 27,000 entries using computer vision, OCR, large and visual language models, sequence-based validation, and custom review tools. The pipeline captures 97.8% of records, and the remaining cases can be addressed by targeted review. This demonstrates that combining LLMs with interpretable, review-centric pipelines offers an appropriate approach for historically complex bibliographic sources.

Anthology ID: 2026.latechclfl-1.12
Volume: Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026
Month: March
Year: 2026
Address: Rabat, Morocco
Editors: Diego Alves, Yuri Bizzoni, Stefania Degaetano-Ortlieb, Anna Kazantseva, Janis Pagel, Stan Szpakowicz
Venues: LaTeCH-CLfL | WS
SIG: SIGHUM
Publisher: Association for Computational Linguistics
Pages: 128–134
URL: https://preview.aclanthology.org/ingest-eacl/2026.latechclfl-1.12/
Cite (ACL): Mark Hill, Ayse Bulus, and Paul Spence. 2026. Catalogues as Data: Interpretable NLP Pipelines for Ottoman-Turkish Bibliographies. In Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026, pages 128–134, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal): Catalogues as Data: Interpretable NLP Pipelines for Ottoman-Turkish Bibliographies (Hill et al., LaTeCH-CLfL 2026)
PDF: https://preview.aclanthology.org/ingest-eacl/2026.latechclfl-1.12.pdf
Supplementary material: 2026.latechclfl-1.12.SupplementaryMaterial.txt
Supplementary material: 2026.latechclfl-1.12.SupplementaryMaterial.zip
