Avyav Kumar Singh

2026

Cross-Tokenizer LLM Distillation through a Byte-Level Interface
Avyav Kumar Singh | Yen-Chen Wu | Alexandru Cioba | Alberto Bernacchia | Davide Buffelli
Proceedings of the Second Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)

Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD), which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher’s output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.

2025


Few-Shot Open-Set Classification via Reasoning-Aware Decomposition
Avyav Kumar Singh | Helen Yannakoudakis
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) excel at few-shot learning, but their ability to reject out-of-distribution examples remains under-explored. We study this challenge under the setting of few-shot open-set classification, where a model must not only classify examples from a small set of seen classes but also reject unseen ones at inference time. This setting is more realistic and challenging than traditional closed-set supervised learning, requiring both fine-grained classification and robust rejection. We show that, for small LLMs, neither chain-of-thought (CoT) prompting nor supervised fine-tuning (SFT) alone is sufficient to generalise reliably, particularly when class semantics are anonymised. We introduce Wasserstein GFN (W-GFN), a novel amortised Generative Flow Network framework that uses latent trajectories to approximate the Bayesian posterior. With as few as 4 examples per class, W-GFN substantially improves performance, enabling Llama 3.2 3B to achieve up to ≥80% of the performance of Llama 3.3 70B on complex datasets, despite being ∼23 times smaller, which highlights the importance of reasoning-aware approaches for robust open-set few-shot learning.
