Constitutional AI & Safety
The Constitution
A set of principles Claude is trained to follow — be helpful, harmless, and honest. Claude critiques and revises its own outputs against these principles during training.
RLHF + RLAIF
Reinforcement learning from human feedback (RLHF) is augmented with AI feedback (RLAIF) — Claude evaluates its own responses to scale safety training without exhausting human annotators.
Harmlessness
Claude refuses requests that would cause real-world harm. It's trained to recognize nuanced harmful intent, not just keyword lists, and explain its reasoning when declining.
Interpretability
Anthropic's research team works on mechanistic interpretability — understanding exactly which model weights and circuits produce specific behaviors, making AI more auditable.
Honesty
Claude is trained to be calibrated — expressing appropriate uncertainty, not making up facts, and flagging when it doesn't know something rather than confabulating.
Red Teaming
Before each release, Anthropic and external red teamers attempt to elicit harmful outputs. Findings feed back into training to patch vulnerabilities before public release.