Carlo Zoli

2026

An Enhanced Pipeline for the Manzini-Savoia Dialect Corpus
Achille Fusco | Greta Mazzaggio | Carlo Zoli
Proceedings of the Fifteenth Language Resources and Evaluation Conference

This paper presents a semi-automatic workflow for enriching the Manzini–Savoia Corpus (MSC) of Italian dialects with extended glosses, normalized transcriptions, and projected morpho-syntactic annotations. While the MSC is a unique resource for Romance microvariation, its partial glossing and phonetic transcription in the International Phonetic Alphabet (IPA) pose major challenges for computational processing. We introduce a pipeline for gloss coverage expansion and reliable morpho-syntactic annotation combining rule-based and data-driven components, which includes: (i) automatic completion of truncated verbal paradigms; (ii) hybrid lexical alignment between dialectal tokens and Italian glosses, integrating per-region lexical priors with a dynamic programming alignment algorithm; and (iii) projection-based morpho-syntactic tagging from aligned glosses. The proposed methods offer a reproducible framework for extending partially glossed dialect corpora and contribute new annotated data for research in computational dialectology and cross-variety language modeling.

2025

Recent advances in neural machine translation (NMT) have opened new possibilities for developing translation systems also for smaller, so-called low-resource, languages. The rise of large language models (LLMs) has further revolutionized machine translation by enabling more flexible and context-aware generation. However, many challenges remain for low-resource languages, and the availability of high-quality, validated test data is essential to support meaningful development, evaluation, and comparison of translation systems. In this work, we present an extension of the FLORES+ dataset for two Ladin variants, Val Badia and Gherdëina, as a submission to the Open Language Data Initiative Shared Task 2025. To complement existing resources, we additionally release two parallel datasets for Gherdëina–Val Badia and Gherdëina–Italian. We validate these datasets by evaluating state-of-the-art LLMs and NMT systems on this test data, both with and without leveraging the newly released parallel data for fine-tuning and prompting. The results highlight the considerable potential for improving translation quality in Ladin, while also underscoring the need for further research and resource development, for which this contribution provides a basis.

Venues

LREC (1)
WMT (1)