Zacharie Bugaud
2026
Domain-Dependent Safety Behavior in Open-Weight LLMs: An Empirical Study Across Seven Ethical Domains
Zacharie Bugaud
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains, testing 5 models (12B–70B) in 4,200 interactions with dual-judge validation. Using a dual-condition methodology in which each scenario is tested in both an analytical framing (identify the harm) and an operational framing (help commit the harm), we find that compliance rates vary from 14.7% (human trafficking) to 85.7% (surveillance design), a 71-percentage-point span with non-overlapping cluster-bootstrapped 95% CIs. Domain accounts for 36% of pair-level variance in harm scores, with scenario (26%) exceeding model identity (15%). A stable model safety hierarchy persists across domains (mean Spearman ρ = 0.68). These findings demonstrate that safety alignment is not a general capability: aggregate safety scores mask critical domain-level variation, motivating domain-specific safety auditing for trustworthy deployment.
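The abstract above reports cluster-bootstrapped 95% CIs for per-domain compliance rates. As a rough illustration only, a percentile cluster bootstrap could look like the sketch below. The clustering unit (all interactions sharing one scenario), the function name `cluster_bootstrap_ci`, and the percentile method are assumptions made for this example; the abstract does not specify how the intervals were computed.

```python
import random

def cluster_bootstrap_ci(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for a compliance rate, resampling whole clusters.

    clusters: list of non-empty lists of 0/1 outcomes (1 = complied).
    Each inner list is assumed to hold every interaction from one
    scenario, so correlated outcomes are resampled together.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        # Draw clusters with replacement; keep each cluster intact.
        sample = [rng.choice(clusters) for _ in clusters]
        total = sum(len(c) for c in sample)
        rates.append(sum(sum(c) for c in sample) / total)
    rates.sort()
    # Percentile interval from the sorted bootstrap distribution.
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

if __name__ == "__main__":
    # Hypothetical toy data: four scenarios, three interactions each.
    toy = [[1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 0, 1]]
    print(cluster_bootstrap_ci(toy))
```

Resampling whole clusters rather than individual interactions keeps within-scenario correlation in each replicate, which generally gives wider and more honest intervals than a naive bootstrap when outcomes within a scenario are similar.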
Single-Layer Activation Edits Easily Corrupt Factual Recall but Rarely Repair It
Zacharie Bugaud
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Single-layer activation edits easily corrupt a language model’s correct factual answers but rarely repair its errors. On a curated factual-recall benchmark, corruption flips 70–100% of correct answers across three models, while twelve blind methods (no access to the correct answer) fix at most 6% within every evaluation pool. Per-instance gradient optimization ostensibly fixes 39%, but norm-constrained analysis reveals a magnitude artifact: at oracle-matched norms the fix rate drops to random, directions are nearly orthogonal to oracle directions (cos = -0.04), and collateral damage makes the net effect negative. An oracle ablation controlling for budget, target identity, and directional noise points to a direction-selection bottleneck: repair requires a precise, per-question direction that blind methods cannot locate. Target-informed methods partially succeed but none generalizes to unseen distributions.
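The norm-matched comparison described in the abstract above can be pictured with a small sketch: rescale a candidate edit direction to a reference norm before adding it to a hidden activation, and measure its cosine similarity to a reference direction. This is only an illustration on plain Python lists. The helper names (`rescale_to_norm`, `apply_edit`, `cosine`) and the toy vectors are hypothetical, and the paper's actual procedure may differ.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))

def rescale_to_norm(delta, target_norm):
    """Keep the direction of delta but set its length to target_norm."""
    n = norm(delta)
    if n == 0:
        raise ValueError("cannot rescale a zero vector")
    return [x * target_norm / n for x in delta]

def apply_edit(hidden, delta):
    """Add an edit vector to a hidden activation (single-layer edit)."""
    return [h + d for h, d in zip(hidden, delta)]

if __name__ == "__main__":
    # Hypothetical toy vectors standing in for real activations.
    hidden = [0.5, -1.0, 2.0]
    oracle_delta = [0.3, 0.0, -0.4]       # reference edit
    candidate_delta = [4.0, 8.0, 1.0]     # larger-norm candidate edit
    matched = rescale_to_norm(candidate_delta, norm(oracle_delta))
    print(apply_edit(hidden, matched))
    print(cosine(candidate_delta, oracle_delta))
```

Matching norms before comparing edits separates the effect of direction from the effect of magnitude, which is the distinction the abstract draws when it describes the 39% fix rate as a magnitude artifact.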
Blind Single-Layer Activation Edits Show a Break/Fix Asymmetry in Factual Recall
Zacharie Bugaud
Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM 2026)
Can factual errors in language models be repaired by editing a single hidden activation at inference time? We compare blind edits, which are not told the correct answer, with oracle edits that receive answer-specific information. On Pythia-6.9B, with corruption replicated on Pythia-1B and GPT-2 XL, we find a strong break/fix asymmetry: single-layer perturbations easily corrupt correct factual recall, flipping 74–100% of initially correct answers, but blind repair is much harder. On EntityConfusion, twelve blind non-gradient interventions from four families fail to repair stable hallucinations in the strict single-layer setting; relaxed multi-layer or multi-head variants improve net accuracy by only +3 percentage points. Blind gradient optimization repairs more errors, but often breaks already-correct answers. In contrast, oracle edits given the correct answer repair many more hallucinations, fixing 68% at the default layer and up to 82% at a better layer. These results suggest that the main barrier is not whether factual recall can be steered, but whether a blind method can identify the right target-specific direction. TriviaQA is a boundary case: blind confidence maximization outperforms the single-token oracle, but the comparison is complicated because evaluation accepts multiple aliases.