Nakanyseth Vuth

2026

From Latents to Labels: Zero-Shot Named Entity Recognition using Sparse Autoencoder Features
Nakanyseth Vuth | Gilles Sérasset | Didier Schwab
Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)

Zero-shot Named Entity Recognition is critical for low-resource domains, yet existing approaches rely on opaque prompting of large language models or dense representations that suffer from polysemanticity. We propose an alternative approach that leverages monosemantic features of Sparse Autoencoders. We introduce SAE-NER, a training-free framework that maps monosemantic SAE feature activations to entity types through direct precision estimation, requiring no supervision or prompting. Experiments across general and biomedical domains show that SAE-NER consistently outperforms trained probing classifiers, with an especially large margin in the biomedical domain (up to +20 F1). Finally, we evaluate the utility of SAE-NER predictions as silver training data for downstream NER models. Using controlled perturbations of gold annotations to simulate realistic annotation noise, we show that false negatives are the primary bottleneck for silver-data quality, outweighing the impact of boundary imprecision and false positives.

2025

POPCORN-RENS : un nouveau jeu de données en français annoté en entités d’intérêts sur une thématique “sécurité et défense”
Lucas Aubertin | Guillaume Gadek | Gilles Sérasset | Maxime Prieur | Nakanyseth Vuth | Bruno Grilheres | Didier Schwab | Cédric Lopez
Actes de l'atelier Évaluation des modèles génératifs (LLM) et challenge 2025 (EvalLLM)

2024

KGAST: From Knowledge Graphs to Annotated Synthetic Texts
Nakanyseth Vuth | Gilles Sérasset | Didier Schwab
Proceedings of the 1st Workshop on Knowledge Graphs and Large Language Models (KaLLM 2024)

In recent years, the use of synthetic data, either as a complement to or a substitute for original data, has emerged as a solution to challenges such as data scarcity and security risks. This paper is an initial attempt to automatically generate such data for Information Extraction tasks. We accomplished this by developing a novel synthetic data generation framework called KGAST, which leverages Knowledge Graphs and Large Language Models. In our preliminary study, we conducted simple experiments to generate synthetic versions of two datasets, a French security and defense dataset and an English general-domain dataset, and then evaluated them both intrinsically and extrinsically. The results indicated that synthetic data can effectively complement original data, improving the performance of models on classes with limited training samples. This highlights KGAST’s potential as a tool for generating synthetic data for Information Extraction tasks.
