Michele Ciletti

2026

Sense-Based Annotation of Geographical Nouns in Ancient Greek and Latin: A Diachronic Study with LLMs
Andrea Farina | Michele Ciletti | Barbara McGillivray | Andrea Ballatore
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026

This paper investigates the lexicalisation of geographical nouns in Latin and Ancient Greek using a diachronic, multi-genre corpus (8th cent. BCE – 2nd cent. CE) and Large Language Models for Word Sense Disambiguation. We focus on two main aspects: the onomasiological question of which words encode core geographical concepts, and the semasiological distribution of senses across lemmas. Across both languages, city-related concepts are the most frequently expressed, but Greek shows a stronger focus on maritime terms, whereas Latin favours concepts related to land. Semasiologically, Latin shows clearer evidence of semantic change over time (e.g., ’citizenship’ - ’city’, aequor ’flat surface’ - ’sea’), while Greek displays more gradual or distributed shifts. These results show that computational annotation enables cross-linguistic and diachronic analysis of spatial semantics, allowing us to compare the frequency of concepts across languages, genres, and periods, and to track when semantic change occurs and how core concepts evolve over time.

The Foggia Occupator Corpus: Digitisation, Annotation, and Computational Analysis of an Occupation-Era Newspaper (1945-1946)
Michele Ciletti
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Historical newspapers are crucial sources yet often remain undigitised or lack machine-readable text. We present the Foggia Occupator corpus, a linguistically enriched, openly licensed resource built from twenty-two issues (Dec 1945–Aug 1946) of a weekly newspaper produced by U.S. personnel in occupied Foggia, Italy. High-resolution scans were processed via OCR with LLM-assisted correction (GPT-4o) and full human verification, then segmented into 874 articles (216k tokens). We annotate topics, named entities and typed relations via a semi-automatic pipeline with manual reconciliation, and perform argument mining on civics- and conflict-related content, yielding 1,735 arguments. The entity–relation layer supports network analyses that reveal sparse, modular structures linking military units, civic bodies, and social life. We release TEI-XML with entity spans, JSON article files with metadata, CSVs of entities/relations with temporal counts, and an arguments JSON, all under a Creative Commons 4.0 licence. Beyond documenting an in-between moment of reconstruction, the resource enables benchmarking for OCR-robust NER/RE and studies of framing, stance, and community structure in post-war local media.

2025

Veras Audire Et Reddere Voces: A Corpus of Prosodically-Correct Latin Poetic Audio from Large-Language-Model TTS
Michele Ciletti
Proceedings of the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Prompting the Muse: Generating Prosodically-Correct Latin Speech with Large Language Models
Michele Ciletti
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

This paper presents a workflow that compels an audio-enabled large language model to recite Latin poetry with metrically accurate stress. One hundred hexameters from the Aeneid and the opening elegiac epistula of Ovid’s Heroides constitute the test bed, drawn from the Pedecerto XML corpus, where ictic syllables are marked. A preprocessing pipeline syllabifies each line, converts alien graphemes into approximate English-Italian counterparts, merges obligatory elisions, adds commas on caesurae, upper-cases every ictic syllable, and places a grave accent on its vowel. Verses are then supplied, one at a time, to an LLM-based Text-to-Speech model under a compact system prompt that instructs slow, articulated delivery. From ten stochastic realisations per verse, a team of Latin experts retained the best; at least one fully correct file was found for 91% of the 216 lines. Upper-casing plus accent marking proved the strongest cue, while hyphenating syllables offered no benefit. Remaining errors cluster around cognates where the model inherits a Romance or English stress template. The corpus of validated audio and all scripts are openly released on Zenodo, opening avenues for pedagogy, accessibility, and prosodic research.
