Davide Buffelli

2026

Cross-Tokenizer LLM Distillation through a Byte-Level Interface
Avyav Kumar Singh | Yen-Chen Wu | Alexandru Cioba | Alberto Bernacchia | Davide Buffelli
Proceedings of the Second Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher’s output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with–and on several benchmarks surpasses–significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.

2022

pdf bib abs

Recently, Logic Explained Networks (LENs) have been proposed as explainable-by-design neural models providing logic explanations for their predictions.However, these models have only been applied to vision and tabular data, and they mostly favour the generation of global explanations, while local ones tend to be noisy and verbose.For these reasons, we propose LEN<sup>p</sup>, improving local explanations by perturbing input words, and we test it on text classification. Our results show that (i) LEN<sup>p</sup> provides better local explanations than LIME in terms of sensitivity and faithfulness, and (ii) its logic explanations are more useful and user-friendly than the feature scoring provided by LIME as attested by a human survey.

Co-authors

Venues

Fix author