Claude — Constitutional AI & Safety

📜

The Constitution

A set of principles Claude is trained to follow — be helpful, harmless, and honest. Claude critiques and revises its own outputs against these principles during training.

🔄

RLHF + RLAIF

Reinforcement learning from human feedback (RLHF) is augmented with AI feedback (RLAIF) — Claude evaluates its own responses to scale safety training without exhausting human annotators.

🛡️

Harmlessness

Claude refuses requests that would cause real-world harm. It's trained to recognize nuanced harmful intent, not just keyword lists, and explain its reasoning when declining.

🔬

Interpretability

Anthropic's research team works on mechanistic interpretability — understanding exactly which model weights and circuits produce specific behaviors, making AI more auditable.

🤝

Honesty

Claude is trained to be calibrated — expressing appropriate uncertainty, not making up facts, and flagging when it doesn't know something rather than confabulating.

🧪

Red Teaming

Before each release, Anthropic and external red teamers attempt to elicit harmful outputs. Findings feed back into training to patch vulnerabilities before public release.