Annotation Frameworks Shape Model Knowledge: Safety Alignment in Large Language Models

Wajdi Zaghouani

Abstract

Large language models (LLMs) are commonly described as acquiring knowledge through large-scale pretraining on textual corpora. This view underestimates the epistemic consequences of post-training safety mechanisms. Modern LLMs undergo extensive safety alignment via curated datasets, human annotations, and reinforcement learning from human feedback (RLHF), processes that do not merely constrain outputs but actively reshape how propositional and procedural knowledge is accessed and expressed. We propose a conceptual framework in which safety alignment functions as a systematic form of knowledge editing at scale. Annotation frameworks used to construct safety datasets act as normative ontologies that partition language into categories of acceptable and unacceptable content, and alignment training propagates these distinctions into model behaviour. We introduce the Safety Knowledge Pipeline (SKP), a four-stage framework describing how pretraining knowledge is progressively filtered, reframed, and constrained through annotation and alignment mechanisms. We identify three mechanisms of knowledge modification (suppression, reframing, and substitution), each with distinct diagnostic signals, and we operationalise them in a cross-lingual evaluation protocol. Throughout, we distinguish carefully between behavioural claims that follow from prior empirical literature and representational claims that remain open hypotheses. Case studies spanning harmful instruction queries, hate speech annotation in Arabic dialects, and culturally variable discourse illustrate the framework. We further discuss how treating annotator disagreement as a training signal rather than noise can mitigate the culturally hegemonic effects of current alignment pipelines.

Anthology ID: 2026.knowfm-1.1
Volume: Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM 2026)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Canyu Chen, Yuji Zhang, Zoey Sha Li, Zihan Wang, Qineng Wang, Jinyan Su, Priyanka Kargupta, Sara Vera Marjanović, Jeff Z. Pan, Mohit Bansal, Isabelle Augenstein, Jiawei Han, Heng Ji, Manling Li
Venues: KnowFM | WS
Publisher: Association for Computational Linguistics
Pages: 1–12
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.knowfm-1.1/
Cite (ACL): Wajdi Zaghouani. 2026. Annotation Frameworks Shape Model Knowledge: Safety Alignment in Large Language Models. In Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM 2026), pages 1–12, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Annotation Frameworks Shape Model Knowledge: Safety Alignment in Large Language Models (Zaghouani, KnowFM 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.knowfm-1.1.pdf