Claude 4: Anthropic’s Hidden AI System Prompts Explained

Expert Analysis Reveals Hidden Prompts
On Sunday, independent researcher Simon Willison published a deep dive into Anthropic’s system prompts for its latest Claude 4 variants—Opus 4 and Sonnet 4. By combining published release notes with leaked internal instructions obtained via advanced prompt-injection techniques, Willison reconstructed a full picture of the hidden guidance that drives Claude 4’s outputs. This unofficial manual sheds light on how Anthropic encodes identity, safety constraints, tool access, and style rules into every single conversation.
How System Prompts Shape Model Behavior
System prompts are the first hidden messages prepended to user queries. They define a model’s personality, guardrails, and default routines. Unlike the visible conversation history, these prompts remain opaque to end users. Each time a prompt is sent, Claude 4 receives:
- The fixed system prompt (over 10,000 tokens of instructions)
- All prior user and assistant messages
- The current user query
Inside these instructions, Anthropic merges constitutional AI principles with reinforcement-learning-from-human-feedback (RLHF) models. According to benchmarks shared by Anthropic, Opus 4 runs on a 2.4-trillion-parameter backbone with a 200k token context window, while Sonnet 4 uses a streamlined 1.3T model and a 100k token window for lower-latency applications.
Fighting Sycophancy and Emotional Engineering
One key finding is Anthropic’s explicit ban on flattery. While many LLMs default to praising user inputs to boost engagement, Claude 4’s prompt instructs it to skip any form of positive adjective at the start of a reply. In practice:
- Claude must not begin with phrases such as brilliant, fascinating, or excellent
- Emotional support is allowed, but self-harm or addictive behavior encouragement is strictly forbidden
- Responses should be clear, direct, and neutral in tone
Anthropic’s Head of ML Safety noted in a recent talk that balancing user rapport without slipping into unwanted flattery is crucial for long-form dialogue products.
Discrepancies in Knowledge Cutoff and RAG
Willison discovered that Anthropic’s public data cutoff (March 2025) differs from the system prompt’s reliable cutoff (January 2025). This suggests a deliberate safety margin to prevent confident hallucinations on late-stage data. To fill in gaps, Claude 4 can leverage a retrieval-augmented generation (RAG) pipeline:
- Internal web search API with rate limits and query sanitization
- Strict single-quote policy under 15 tokens for any external excerpt
- Automated refusal when confronted with song lyrics or long copyrighted text
Copyright Protections and Content Safety
Both Opus 4 and Sonnet 4 receive repeated directives to avoid displacive summaries—condensing material so closely it infringes on original expression. Claude is coded to:
- Emit at most one short quotation per response
- Refuse to reproduce any lyrics or proprietary code verbatim
- Fallback to paraphrase or link to sources when in doubt
These rules reflect Anthropic’s commitment to copyright safety and reduce legal risk for enterprise deployments.
Security Implications of Hidden Prompts
Leaking system prompts poses a new class of security risk. Adversaries can craft inputs that exploit knowledge of hidden instructions to evade filters or trigger forbidden behaviors. Experts like Jacob Steinhardt warn that transparent prompts may reduce abuse but could also guide attackers to find novel weaknesses.
Mitigation strategies include:
- Dynamic prompt encryption and rotation
- Differential privacy to mask sensitive tokens
- Automated red teaming focused on prompt leakage vectors
Performance Benchmarks and Infrastructure
Behind the scenes, Opus 4 is deployed on clusters of Nvidia H100 GPUs with mixed-precision float16 pipelines. Anthropic uses a 16-stage tensor parallelism and 8-way model-parallel sharding for efficiency. Key benchmarks include:
- HELM multi-attribute evaluation at 92th percentile
- MMLU academic test score of 87
- GSM8K math reasoning accuracy of 78
Sonnet 4 offers a lower memory footprint—quantized to 4-bit weights for on-premises use cases—at the cost of a 15% throughput drop compared to Opus.
Future Directions for Transparency and Trust
Willison concludes with a call for open publication of full system prompts alongside release notes. As the EU AI Act and industry coalitions push for greater transparency, Anthropic and peers may soon standardize prompt disclosure. Open-source LLM initiatives have already begun publishing constitution drafts and prompt templates to foster trust in critical applications.
Key recommendations for vendors:
- Publish complete, versioned system prompts
- Document tool-specific instructions and safety checks
- Engage third-party auditors for prompt vulnerability assessments