Alessio Staffini


2026

Tokenizer mismatch is a practical bottleneck for low-resource language varieties: when text is fragmented into disproportionately many subwords or bytes, it wastes context, increases truncation, and can be brittle to orthographic variation. We present a lightweight and reproducible audit centered on Ladin and evaluated on the Identification of Languages and Dialects of Italy benchmark of eleven Italian varieties. Our diagnostic suite combines tokenization cost measures (tokens per word, truncation pressure, bytes per token) with retention indicators (word split rate, continued-token rate, and type-level retention) and fragmentation proxies that reveal splitting patterns beyond fertility. We pair these diagnostics with a conservative orthography robustness protocol (diacritics, casing, punctuation, and dash normalization) and assess how diagnostic changes relate to performance drops in lightweight baselines for sentence-level variety identification. We release code and derived statistics to support reproducible tokenizer audits in other low-resource settings.
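The cost and retention measures named above can be computed from any subword tokenizer's output. The following is a minimal sketch, not the paper's released code: the metric definitions (tokens per word as fertility, word split rate as the share of words broken into multiple pieces, continued-token rate as the share of non-initial pieces) are standard readings of the terms, and `toy_tokenize` is a hypothetical stand-in for a real tokenizer.

```python
from typing import Callable, List


def audit(tokenize: Callable[[str], List[str]], words: List[str]) -> dict:
    """Compute simple tokenization cost/retention diagnostics.

    tokens_per_word    : average subwords produced per whitespace word (fertility)
    word_split_rate    : fraction of words broken into more than one token
    continued_token_rate: fraction of tokens that are non-initial pieces of a word
    """
    total_tokens = 0
    split_words = 0
    continued_tokens = 0
    for word in words:
        pieces = tokenize(word)
        total_tokens += len(pieces)
        if len(pieces) > 1:
            split_words += 1
            continued_tokens += len(pieces) - 1  # every piece after the first
    n = len(words)
    return {
        "tokens_per_word": total_tokens / n,
        "word_split_rate": split_words / n,
        "continued_token_rate": continued_tokens / total_tokens,
    }


# Hypothetical stand-in tokenizer: fixed three-character chunks. A real audit
# would plug in an actual subword tokenizer's encode function here.
def toy_tokenize(word: str) -> List[str]:
    return [word[i : i + 3] for i in range(0, len(word), 3)]


stats = audit(toy_tokenize, ["la", "cuntrada", "ladina"])
# "la" -> 1 token; "cuntrada" -> 3 tokens; "ladina" -> 2 tokens
```

Under these assumptions the example yields a fertility of 2.0 tokens per word, with two of three words split; swapping `toy_tokenize` for different tokenizers makes the per-variety cost comparison concrete.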