Zacharie Bugaud

2026

Domain-Dependent Safety Behavior in Open-Weight LLMs: An Empirical Study Across Seven Ethical Domains
Zacharie Bugaud
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)

We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains, testing 5 models (12B–70B) in 4,200 interactions with dual-judge validation. Using a dual-condition methodology in which each scenario is tested in both an analytical framing (identify the harm) and an operational framing (help commit the harm), we find that compliance rates vary from 14.7% (human trafficking) to 85.7% (surveillance design), a 71-percentage-point span with non-overlapping cluster-bootstrapped 95% CIs. Domain accounts for 36% of pair-level variance in harm scores, with scenario (26%) exceeding model identity (15%). A stable model safety hierarchy persists across domains (mean Spearman ρ = 0.68). These findings demonstrate that safety alignment is not a general capability: aggregate safety scores mask critical domain-level variation, motivating domain-specific safety auditing for trustworthy deployment.
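The cluster-bootstrapped intervals mentioned in the abstract can be illustrated with a minimal sketch. It assumes that responses are grouped by scenario and coded 0/1 for compliance; the function name, data layout, and percentile method are illustrative assumptions, not the paper's actual procedure.

```python
import random

def cluster_bootstrap_ci(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a pooled rate, resampling whole clusters.

    clusters: list of lists of 0/1 outcomes. Hypothetical layout: one inner
    list per scenario, one entry per model response (1 = complied).
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        # Resample clusters (not individual responses) with replacement,
        # so correlation among responses to the same scenario is preserved.
        sample = [rng.choice(clusters) for _ in clusters]
        total = sum(len(c) for c in sample)
        rates.append(sum(sum(c) for c in sample) / total)
    rates.sort()
    lower = rates[int((alpha / 2) * n_boot)]
    upper = rates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Toy data: four scenarios with three responses each (made-up values).
toy = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 0]]
lower, upper = cluster_bootstrap_ci(toy)
```

Resampling at the cluster level usually yields wider intervals than resampling individual responses, which is the reason to use it when responses within a scenario are not independent.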


Single-Layer Activation Edits Easily Corrupt Factual Recall but Rarely Repair It
Zacharie Bugaud
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)

Single-layer activation edits easily corrupt a language model’s correct factual answers but rarely repair its errors. On a curated factual-recall benchmark, corruption flips 70–100% of correct answers across three models, while twelve blind methods (no access to the correct answer) fix at most 6% within every evaluation pool. Per-instance gradient optimization ostensibly fixes 39%, but norm-constrained analysis reveals a magnitude artifact: at oracle-matched norms the fix rate drops to random, directions are nearly orthogonal to oracle directions (cos = -0.04), and collateral damage makes the net effect negative. An oracle ablation controlling for budget, target identity, and directional noise points to a direction-selection bottleneck: repair requires a precise, per-question direction that blind methods cannot locate. Target-informed methods partially succeed but none generalizes to unseen distributions.
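The norm-constrained comparison described in the abstract can be sketched as follows: an edit vector is rescaled to a reference norm so that only its direction differs, and the two directions are compared by cosine similarity. The helper names and the toy vectors are hypothetical and stand in for real hidden-state edits; the paper's exact procedure may differ.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))

def rescale_to_norm(delta, target_norm):
    """Keep the direction of `delta` but set its length to `target_norm`."""
    scale = target_norm / norm(delta)
    return [x * scale for x in delta]

def apply_edit(hidden, delta):
    """Additive single-layer edit: hidden state plus edit vector."""
    return [h + d for h, d in zip(hidden, delta)]

# Toy example (made-up numbers): a large optimized edit and a small oracle edit.
oracle_delta = [0.0, 1.0]
optimized_delta = [30.0, 40.0]

# Match the optimized edit to the oracle's norm before comparing effects.
matched = rescale_to_norm(optimized_delta, norm(oracle_delta))
edited = apply_edit([0.5, 0.5], matched)
```

Matching norms separates the effect of edit magnitude from the effect of edit direction, which is the distinction the abstract draws when it calls the 39% fix rate a magnitude artifact.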


Blind Single-Layer Activation Edits Show a Break/Fix Asymmetry in Factual Recall
Zacharie Bugaud
Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM 2026)

Can factual errors in language models be repaired by editing a single hidden activation at inference time? We compare blind edits, which are not told the correct answer, with oracle edits that receive answer-specific information. On Pythia-6.9B, with corruption replicated on Pythia-1B and GPT-2 XL, we find a strong break/fix asymmetry: single-layer perturbations easily corrupt correct factual recall, flipping 74–100% of initially correct answers, but blind repair is much harder. On EntityConfusion, twelve blind non-gradient interventions from four families fail to repair stable hallucinations in the strict single-layer setting; relaxed multi-layer or multi-head variants improve net accuracy by only +3 percentage points. Blind gradient optimization repairs more errors, but often breaks already-correct answers. In contrast, oracle edits given the correct answer repair many more hallucinations, fixing 68% at the default layer and up to 82% at a better layer. These results suggest that the main barrier is not whether factual recall can be steered, but whether a blind method can identify the right target-specific direction. TriviaQA is a boundary case: blind confidence maximization outperforms the single-token oracle, but the comparison is complicated because evaluation accepts multiple aliases.
