David Rosson
2026
Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
Yu Wu | Ke Shu | Jonas Fischer | Lidia Pivovarova | David Rosson | Eetu Mäkelä | Mikko Tolonen
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper presents a novel task: extracting noisy, low-resource Latin fragments from mixed-language historical documents with varied layouts. We benchmark large foundation models on a multimodal dataset of 724 annotated pages. The results show that contemporary models can detect Latin reliably in a zero-shot setting, yet they lack a functional comprehension of Latin. This study establishes a comprehensive baseline for processing Latin within mixed-language corpora, supporting quantitative analysis in intellectual history and historical linguistics. Both the dataset and code are available at https://github.com/COMHIS/EACL26-detect-latin.