Claude

Anthropic's AI assistant — safe, helpful, and honest by design

Anthropic Constitutional AI RLHF API Claude Code
Constitutional AI & Safety
📜
The Constitution
A set of principles Claude is trained to follow — be helpful, harmless, and honest. Claude critiques and revises its own outputs against these principles during training.
🔄
RLHF + RLAIF
Reinforcement learning from human feedback (RLHF) is augmented with AI feedback (RLAIF) — Claude evaluates its own responses to scale safety training without exhausting human annotators.
🛡️
Harmlessness
Claude refuses requests that would cause real-world harm. It's trained to recognize nuanced harmful intent, not just keyword lists, and explain its reasoning when declining.
🔬
Interpretability
Anthropic's research team works on mechanistic interpretability — understanding exactly which model weights and circuits produce specific behaviors, making AI more auditable.
🤝
Honesty
Claude is trained to be calibrated — expressing appropriate uncertainty, not making up facts, and flagging when it doesn't know something rather than confabulating.
🧪
Red Teaming
Before each release, Anthropic and external red teamers attempt to elicit harmful outputs. Findings feed back into training to patch vulnerabilities before public release.