Xiangbo Zhang

2026

Memory Dial: A Training Framework for Controllable Memorization in Language Models
Xiangbo Zhang | Ali Emami
Findings of the Association for Computational Linguistics: ACL 2026

Memorization in language models is widely studied but remains difficult to isolate and control. Understanding when and what models memorize is essential for explaining their predictions, yet existing approaches are post-hoc: they can detect memorization in trained models, but cannot disentangle its effects from architecture, data, or optimization. We introduce **Memory Dial**, a training framework that makes memorization an explicit, controllable variable. Memory Dial interpolates between standard cross-entropy and a temperature-sharpened objective via a single parameter, producing a family of models identical in architecture, data, and optimization, but varying in memorization pressure. Experiments across six architectures and five benchmarks demonstrate that: (1) Memory Dial reliably controls memorization, with seen-example accuracy increasing monotonically while unseen accuracy remains stable; (2) larger models are more responsive to memorization pressure; and (3) frequent sequences are easier to memorize than rare ones. Memory Dial provides a controlled experimental framework for studying how memorization behavior emerges and interacts with generalization in language models.
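
The abstract does not give the exact form of the objective, so the following is only a minimal sketch of one way a single parameter could interpolate between standard cross-entropy and a temperature-sharpened variant. The names `memory_dial_loss`, `alpha`, and `temperature`, the linear mixing rule, and the choice of dividing logits by a temperature below 1 are all assumptions made for illustration, not the authors' formulation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(logits, target, temperature=1.0):
    """Cross-entropy of the target class after dividing logits by a temperature."""
    scaled = [x / temperature for x in logits]
    return -log_softmax(scaled)[target]


def memory_dial_loss(logits, target, alpha, temperature=0.5):
    """Hypothetical interpolated objective (an assumption, not the paper's formula).

    alpha = 0 gives standard cross-entropy; alpha = 1 gives the
    temperature-sharpened loss; values in between mix the two linearly.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    standard = cross_entropy(logits, target)
    sharpened = cross_entropy(logits, target, temperature)
    return (1.0 - alpha) * standard + alpha * sharpened


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.1]
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, memory_dial_loss(example_logits, target=0, alpha=alpha))
```

Under these assumptions, sweeping `alpha` while holding the model, data, and optimizer fixed would yield the kind of model family the abstract describes, differing only in the training objective.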

Stable and Explainable Personality Trait Evaluation in Large Language Models with Internal Activations
Xiaoxu Ma | Xiangbo Zhang | Zhenyu Weng
Findings of the Association for Computational Linguistics: ACL 2026

Evaluating personality-related tendencies in Large Language Models (LLMs) helps characterize model behavior, compare models beyond task accuracy, and support responsible deployment in socially interactive settings. However, existing questionnaire-based evaluation methods exhibit limited stability and offer little explainability, as their results are highly sensitive to minor variations in prompt phrasing or role-play configurations. To address these limitations, we propose an internal-activation–based approach, termed Persona-Vector Neutrality Interpolation (PVNI), for stable and explainable personality trait evaluation in LLMs. PVNI extracts a persona vector associated with a target personality trait from the model’s internal activations using contrastive prompts. It then estimates the corresponding neutral score by interpolating along the persona vector as an anchor axis, enabling an interpretable comparison between the neutral prompt representation and the persona direction. We provide a theoretical analysis of the effectiveness and generalization properties of PVNI. Extensive experiments across diverse LLMs demonstrate that PVNI yields substantially more stable personality trait evaluations than existing methods, even under questionnaire and role-play variants.
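
The two steps named in the abstract (extracting a persona vector from contrastive prompts, then interpolating along it) can be illustrated with a rough sketch. The abstract does not specify the computation, so everything here is an assumption: the persona vector is taken as a difference of mean activations, the neutral score as the position of the neutral activation along the axis between the two contrastive means, and the function names `persona_vector` and `neutral_score` are hypothetical. Activations are plain lists of floats standing in for hidden states.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def persona_vector(pos_acts, neg_acts):
    """Assumed persona direction: mean(trait-positive) minus mean(trait-negative).

    Returns the direction together with the negative-side mean, which
    serves as the origin of the anchor axis.
    """
    mu_pos = mean_vector(pos_acts)
    mu_neg = mean_vector(neg_acts)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    return direction, mu_neg


def neutral_score(neutral_act, pos_acts, neg_acts):
    """Assumed interpolation coefficient of the neutral activation on the axis.

    0.0 means the neutral prompt sits at the trait-negative mean,
    1.0 at the trait-positive mean.
    """
    direction, origin = persona_vector(pos_acts, neg_acts)
    norm_sq = sum(x * x for x in direction)
    if norm_sq == 0.0:
        raise ValueError("contrastive means coincide; persona vector is zero")
    offset = [h - o for h, o in zip(neutral_act, origin)]
    return sum(a * b for a, b in zip(offset, direction)) / norm_sq


if __name__ == "__main__":
    positive = [[1.0, 0.0], [3.0, 0.0]]
    negative = [[-2.0, 0.0], [-2.0, 0.0]]
    print(neutral_score([0.0, 5.0], positive, negative))
```

In this toy setup, components of the neutral activation that are orthogonal to the persona direction do not affect the score, which is one way a projection-based measure could be less sensitive to prompt phrasing than questionnaire answers.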

Venues

Findings (2)
