David Rosson

2026

Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
Yu Wu | Ke Shu | Jonas Fischer | Lidia Pivovarova | David Rosson | Eetu Mäkelä | Mikko Tolonen
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper presents a novel task of extracting low-resourced and noisy Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary zero-shot models is achievable, yet these models lack a functional comprehension of Latin. This study establishes a comprehensive baseline for processing Latin within mixed-language corpora, supporting quantitative analysis in intellectual history and historical linguistics. Both the dataset and code are available at https://github.com/COMHIS/EACL26-detect-latin.

