Alessio Staffini


2026

Tokenizer mismatch is a practical bottleneck for low-resource language varieties: when text is fragmented into disproportionately many subwords or bytes, it wastes context, increases truncation, and can be brittle to orthographic variation. We present a lightweight and reproducible audit centered on Ladin and evaluated on the Identification of Languages and Dialects of Italy benchmark of eleven Italian varieties. Our diagnostic suite combines tokenization cost measures (tokens per word, truncation pressure, bytes per token) with retention indicators (word split rate, continued-token rate, and type-level retention) and fragmentation proxies that reveal splitting patterns beyond fertility. We pair these diagnostics with a conservative orthography robustness protocol (diacritics, casing, punctuation, and dash normalization) and assess how diagnostic changes relate to performance drops in lightweight baselines for sentence-level variety identification. We release code and derived statistics to support reproducible tokenizer audits in other low-resource settings.
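The cost and retention measures named above can be computed from any subword tokenizer's output. The following is a minimal sketch, not the paper's released code: the metric definitions (tokens per word as fertility, word split rate as the share of words broken into multiple pieces, continued-token rate as the share of non-initial pieces) are standard readings of the terms, and `toy_tokenize` is a hypothetical stand-in for a real tokenizer.

```python
from typing import Callable, List


def audit(tokenize: Callable[[str], List[str]], words: List[str]) -> dict:
    """Compute simple tokenization cost/retention diagnostics.

    tokens_per_word    : average subwords produced per whitespace word (fertility)
    word_split_rate    : fraction of words broken into more than one token
    continued_token_rate: fraction of tokens that are non-initial pieces of a word
    """
    total_tokens = 0
    split_words = 0
    continued_tokens = 0
    for word in words:
        pieces = tokenize(word)
        total_tokens += len(pieces)
        if len(pieces) > 1:
            split_words += 1
            continued_tokens += len(pieces) - 1  # every piece after the first
    n = len(words)
    return {
        "tokens_per_word": total_tokens / n,
        "word_split_rate": split_words / n,
        "continued_token_rate": continued_tokens / total_tokens,
    }


# Hypothetical stand-in tokenizer: fixed three-character chunks. A real audit
# would plug in an actual subword tokenizer's encode function here.
def toy_tokenize(word: str) -> List[str]:
    return [word[i : i + 3] for i in range(0, len(word), 3)]


stats = audit(toy_tokenize, ["la", "cuntrada", "ladina"])
# "la" -> 1 token; "cuntrada" -> 3 tokens; "ladina" -> 2 tokens
```

Under these assumptions the example yields a fertility of 2.0 tokens per word, with two of three words split; swapping `toy_tokenize` for different tokenizers makes the per-variety cost comparison concrete.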