Jiaxin Wu


2026

Whether the personality of LLMs can be intentionally reshaped remains controversial. Existing studies, often limited to small models, argue that it is immutable. Crucially, prior studies have not revealed that different LLMs diverge significantly in how readily they comply when exposed to personality-inducing contexts. To bridge this gap, we introduce the Personality Induction Framework (PIF), which systematically reshapes the personality of different LLMs via multi-agent collaboration. Specifically, using Generator-Judge agents, PIF paraphrases MBTI questions to create semantically equivalent but expressively diverse inducing contexts, enabling LLMs to learn personality patterns rather than rely on superficial token matching. PIF also achieves fine-grained personality modulation by controlling the intensity of the inducing contexts. Extensive experiments on mainstream LLMs from around the world show that PIF reliably transforms their original personalities into the desired target personalities. Notably, we find that the outputs of most Western LLMs behave like “Chameleons”, exhibiting high personality plasticity, whereas the outputs of most Eastern LLMs act as “Guardians”, manifesting pronounced cognitive resistance. Strikingly, extreme induction intensity (100%) triggers a counter-intuitive “Alignment Rebound” in Guardians: their personality shifts in the direction opposite to the target rather than toward it. These findings suggest that LLM personality is a dynamic equilibrium shaped by the trade-off between instruction compliance and cognitive resistance.

2025

The increasing demand for domain-specific evaluation of large language models (LLMs) has led to the development of numerous benchmarks. These efforts often adhere to the principle of data scaling, relying on large corpora or extensive question-answer (QA) sets to ensure broad coverage. However, how corpus and QA set design affects the precision and recall of domain-specific LLM evaluation remains poorly understood. In this paper, we argue that data scaling is not always the optimal principle for domain-specific benchmark construction. Instead, we introduce Comp-Comp, an iterative benchmarking framework grounded in the principles of comprehensiveness and compactness. Comprehensiveness ensures semantic recall by covering the full breadth of the domain, while compactness improves precision by reducing redundancy and noise. To demonstrate the effectiveness of our approach, we present a case study conducted at a renowned university, resulting in the creation of PolyBench, a large-scale, high-quality academic benchmark. Although this study focuses on academia, the Comp-Comp framework is domain-agnostic and readily adaptable to a wide range of specialized fields. The source code and datasets can be accessed at https://github.com/Anya-RB-Chen/COMP-COMP.
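
As a rough illustration of the two principles named in this abstract, the sketch below scores a toy benchmark by how much of a domain it covers (comprehensiveness, read as recall) and by what share of its items add something new (compactness, read as precision). This is not the Comp-Comp implementation: the function `coverage_scores`, the topic labels, and the set-based definitions are all hypothetical simplifications chosen for illustration.

```python
def coverage_scores(domain_topics, item_topics):
    """Return (comprehensiveness, compactness) for a toy benchmark.

    Assumed definitions (illustrative only, not taken from the paper):
      comprehensiveness = share of domain topics covered by at least one item
      compactness       = share of items that cover a not-yet-covered domain topic
    """
    domain = set(domain_topics)
    covered = set()
    useful_items = 0
    for topics in item_topics:
        # Topics this item contributes that are in-domain and not yet covered.
        new_topics = (set(topics) & domain) - covered
        if new_topics:
            useful_items += 1
            covered |= new_topics
    comprehensiveness = len(covered) / len(domain) if domain else 0.0
    compactness = useful_items / len(item_topics) if item_topics else 0.0
    return comprehensiveness, compactness


# Hypothetical domain and benchmark items, each item tagged with its topics.
domain = {"admissions", "courses", "research", "housing"}
items = [{"admissions"}, {"courses", "research"}, {"courses"}, {"sports"}]

recall_like, precision_like = coverage_scores(domain, items)
print(recall_like, precision_like)
```

In this toy example the third item is redundant and the fourth is off-domain, so adding more items of either kind would lower the compactness score without raising comprehensiveness, which is the intuition behind arguing that scaling data alone is not always the best principle.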