AI Safety Evals and Biothreat Cyber Risk Reports Review

Published on June 9, 2025 1:00 PM GMT | Updated November 15, 2025
Major AI labs publish dangerous capability evaluations to argue that their models are safe for deployment. OpenAI’s GPT-4o, Google DeepMind’s Gemini 2.5 Pro, and Anthropic’s Claude Opus 4 each ship detailed reports aiming to rule out biothreat and cyber misuse. Yet a close technical analysis shows these reports often fail to substantiate the safety claims. Key issues include weak translation from raw scores to risk thresholds, underelicitation of true capabilities, and a lack of third-party validation under emerging regulatory frameworks.
1. Biothreat Evaluations: Questionable Thresholds and Uplift Metrics
1.1 OpenAI’s GPT-4o Bioevals
OpenAI reports that “several biology evaluations indicate our model is on the cusp of helping novices create known biological threats,” yet offers no quantitative mapping from test scores to real-world uplift. Their four-test suite includes:
- Sequence design troubleshooting: GPT-4o achieved >90% accuracy, matching or exceeding expert baselines on standard PCR primer design challenges.
- Pathogen knowledge: The model scored 85% on multiple-choice questions about viral taxonomy, versus 80% for junior biologists.
- Protocol synthesis: GPT-4o wrote a multi-step CRISPR editing protocol with 4 out of 5 correct steps, compared to an expert reference.
- Biohazard scenario planning: The model correctly identified 3 of 4 biosafety violations in hypothetical lab setups.
Despite strong performance, OpenAI asserts safety without disclosing:
- What score would trigger a “high risk” flag (e.g., pass@10 >50%).
- Which chain-of-thought prompts or external tool integrations were allowed.
- How adversarial prompt injections might elevate capability.
This opacity prevents external experts from validating whether GPT-4o’s proficiency truly crosses dangerous thresholds.
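As a point of comparison, below is a minimal sketch of the score-to-threshold mapping the report leaves out. The eval names, threshold values, and attempt budgets are placeholders of ours, not OpenAI’s actual Preparedness criteria; the scores are the ones cited above. The point is that publishing even this much structure would let outside reviewers check the verdict.

```python
# Illustrative only: eval names and threshold values are hypothetical,
# not OpenAI's published Preparedness criteria.
from dataclasses import dataclass

@dataclass
class BioEvalResult:
    name: str          # e.g. "sequence_design_troubleshooting"
    score: float       # observed accuracy or pass@k, in [0, 1]
    threshold: float   # pre-registered score that would trigger a "high risk" flag
    k_attempts: int    # attempts allowed per task (1 = single shot)

def risk_flags(results: list[BioEvalResult]) -> dict[str, bool]:
    """Return, per eval, whether the observed score crosses its published threshold."""
    return {r.name: r.score >= r.threshold for r in results}

if __name__ == "__main__":
    # Scores are those cited in the report; thresholds are placeholders, which
    # is exactly the information the public report does not provide.
    results = [
        BioEvalResult("sequence_design_troubleshooting", 0.90, 0.50, k_attempts=10),
        BioEvalResult("pathogen_knowledge_mcq",          0.85, 0.80, k_attempts=1),
        BioEvalResult("protocol_synthesis_steps",        0.80, 0.60, k_attempts=1),
        BioEvalResult("biohazard_scenario_planning",     0.75, 0.70, k_attempts=1),
    ]
    print(risk_flags(results))
```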
1.2 DeepMind’s Gemini 2.5 Pro Bioevals
DeepMind claims that “Gemini 2.5 Pro does not yet enable progress through key bottlenecks” in CBRN tasks. Publicly, they share only six multiple-choice questions covering:
- Viral vector selection.
- Biosafety level decision trees.
- Basic toxicology calculations.
Results hover in the 60–70% range, but the report omits:
- Human-expert baselines (e.g., postdocs typically score >85%).
- Details on prompt templates, system messages, or chain-of-thought activation.
- A definition of “consistent” success across open-ended engineering tasks.
Without specifying pass@k metrics or error margins, the conclusion that Gemini is “safe” remains ungrounded.
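To see why error margins matter at this scale, a back-of-the-envelope Wilson score interval is enough; the six-question sample size comes from the public materials, but the interval calculation is ours, not DeepMind’s methodology.

```python
# Minimal sketch: the statistical error margin on a six-question multiple-choice
# eval. The Wilson interval is standard statistics, not DeepMind's method.
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

if __name__ == "__main__":
    # 4 of 6 questions correct (~67%), consistent with the reported 60-70% range.
    lo, hi = wilson_interval(4, 6)
    print(f"4/6 correct -> 95% CI: {lo:.2f} to {hi:.2f}")
    # The interval spans roughly 0.30 to 0.90, so a >85% postdoc baseline is
    # statistically indistinguishable from the model's score at this sample size.
```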
1.3 Anthropic’s Claude Opus 4 Bioevals
Anthropic reports that Claude Opus 4 “remains below thresholds of concern for ASL-3 bioweapons capabilities,” citing:
- Uplift experiments: A prompt-based red-team uplift study showing only a 5% improvement over a closed-book baseline, against a threshold set at 10%, an arbitrary value with no epidemiological backing.
- Knowledge checks: 70% accuracy on bioweapon mechanism identification, versus a 75% threshold.
Expert virologists note that even 70% accuracy can enable threat actors to choose optimal attack vectors. Anthropic’s decision not to share which evals are “load-bearing” or how thresholds were derived from historical dual-use incidents leaves critical questions unanswered.
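The 5% uplift figure also deserves a significance check. The sketch below applies a standard two-proportion z-test under assumed sample sizes; Anthropic does not publish arm sizes or baseline success rates, so the 40-participant arms and 40% closed-book rate are placeholders.

```python
# Hedged sketch: a two-proportion z-test for an uplift experiment. Arm sizes and
# baseline rates below are assumptions, not Anthropic's published numbers.
import math

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z statistic for the difference between two binomial proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

if __name__ == "__main__":
    # Assume 40 participants per arm: 45% success with model access vs. 40%
    # closed-book (the reported 5-point uplift).
    z = two_proportion_z(18, 40, 16, 40)
    print(f"z = {z:.2f}")
    # z is about 0.45, far below 1.96: at plausible red-team sizes a 5% uplift is
    # indistinguishable from zero, and the 10% threshold is a comparison of point
    # estimates, not a calibrated risk bound.
```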
2. Cybersecurity Evaluations: Underestimating Real-World Exploit Rates
2.1 OpenAI’s GPT-4o Cyberevals
OpenAI asserts GPT-4o fails professional Capture The Flag (CTF) challenges and complex network intrusion scenarios “without explicit solver code.” Yet in internal tests, GPT-4o achieved ≈70% success on pro CTFs—far above random baselines—and passed 4 of 5 isolation-break scenarios when allowed live shell access. OpenAI’s report omits:
- Pass@1 vs. pass@10 comparisons; reporting only single-attempt results understates what an attacker willing to retry could achieve (see the estimator sketch after this list).
- Whether vector sanitization or exploit chaining was permitted.
- Exact thresholds that would trigger a “cyber-risk” classification.
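For reference, the missing pass@k comparison can be computed with the standard unbiased estimator from the Codex paper (Chen et al., 2021); the per-task attempt counts below are hypothetical, and OpenAI may compute the metric differently.

```python
# Sketch of the standard unbiased pass@k estimator (Chen et al., 2021).
# The per-task success counts are invented for illustration.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled attempts succeeds,
    given c successes observed among n attempts on a task."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

if __name__ == "__main__":
    # Hypothetical per-task results: 25 attempts each, varying success counts.
    tasks = [(25, 3), (25, 0), (25, 10), (25, 1)]
    for k in (1, 10):
        avg = sum(pass_at_k(n, c, k) for n, c in tasks) / len(tasks)
        print(f"pass@{k} = {avg:.2f}")
    # pass@10 comes out several times higher than pass@1, which is why reporting
    # only single-attempt numbers understates what a persistent attacker achieves.
```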
2.2 DeepMind’s Gemini 2.5 Pro Cyberevals
DeepMind shows Gemini’s pwn CTF score at 35%, concluding “no immediate concern.” They provide no context on:
- Human expert baselines (>80% pass rate in similar timed events).
- Exploit discovery vs. exploit generation—each requiring different threat models.
- Which OS images, vulnerability pools, or mitigations were deployed.
Without architectural details of the testing harness (e.g., firewall rules, sandbox policies), we can’t assess real-world exploit probabilities.
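To make the gap concrete, below is a sketch of the harness description an outside analyst would need to interpret a 35% pwn score. Every field and value is an invented placeholder; DeepMind publishes none of these details.

```python
# Sketch of a machine-readable testing-harness description. All values are
# invented placeholders, not DeepMind's actual configuration.
from dataclasses import dataclass

@dataclass
class CyberHarnessSpec:
    os_image: str                    # base image the target runs on
    vulnerability_pool: list[str]    # challenge categories in scope
    mitigations: list[str]           # defenses enabled on the target
    network_policy: str              # e.g. "isolated", "egress-filtered", "open"
    human_baseline_pass_rate: float  # expert pass rate under identical conditions
    time_limit_minutes: int

example = CyberHarnessSpec(
    os_image="ubuntu-22.04-minimal",
    vulnerability_pool=["stack-overflow", "format-string", "heap-use-after-free"],
    mitigations=["ASLR", "NX", "stack-canaries"],
    network_policy="isolated",
    human_baseline_pass_rate=0.80,   # the >80% expert figure cited above
    time_limit_minutes=120,
)
print(example)
```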
2.3 Anthropic’s Claude Opus 4 Cyberevals
Anthropic highlights Opus 4’s improvement from 0/4 to 1/3 flags captured on medium network CTF ranges, along with better performance on “cyber-harness network challenges.” Yet Sonnet 3.7 already passed 2/3 ranges, so Opus 4’s gains may be marginal, as the quick significance check after the list below illustrates. The report fails to explain:
- Whether multi-step intrusion chains or lateral movement tasks were included.
- Network topology complexity or defense-in-depth simulations.
- Threshold rationale for “catastrophic risk” vs. “expert assistance.”
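A quick check shows how little three-flag samples can tell us. Fisher’s exact test on the reported counts (1/3 vs. 2/3) cannot distinguish the two models at all; the test choice and implementation below are ours, not Anthropic’s analysis.

```python
# Fisher's exact test on the reported flag counts (Opus 4: 1 of 3, Sonnet 3.7: 2 of 3).
# Pure-stdlib implementation; the comparison is ours, not Anthropic's.
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def p_table(x: int) -> float:  # hypergeometric probability of cell a = x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    return sum(p_table(x) for x in range(lo, hi + 1) if p_table(x) <= p_obs + 1e-12)

if __name__ == "__main__":
    p = fisher_exact_two_sided(1, 2, 2, 1)   # Opus 4: 1/3 flags, Sonnet 3.7: 2/3 flags
    print(f"p = {p:.2f}")
    # p = 1.00: samples of three flags cannot distinguish the two models, so the
    # reported "jump" carries essentially no evidential weight either way.
```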
3. Elicitation Pitfalls: Underelicitation and Formatting Failures
Suboptimal elicitation strategies often mask true model capabilities. Common issues include:
- Single-shot prompts: Reporting only pass@1 instead of pass@k undervalues multi-try real-world use.
- Missing tool access: Disallowing code interpreters or calculators that users would enable.
- Premature terminations: Models “give up” early when chain-of-thought prompting or extended reasoning is disabled, masking solutions that surface only after longer attempts.
Examples of poor elicitation (a minimal multi-try harness sketch follows this list):
- Anthropic’s Sonnet 3.7 scored 0.24 on a RE-Bench subset, while METR’s independent evaluation—using multi-turn prompts and code execution—scored 0.51.
- Google Project Zero’s Naptime framework raised success on Meta’s CyberSecEval 2 vulnerability-exploitation tasks from an initial 5% to 100% with improved agent scaffolding and tool access.
- DeepMind’s AI R&D eval reported pass@1 = 0.15; switching to pass@5 raised the score to 0.72, roughly a five-fold increase.
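To make the elicitation gap concrete, here is a minimal multi-try harness sketch. The query_model and run_in_sandbox functions are stubs standing in for a real API client and grading sandbox, and the 15% per-attempt success rate is invented; the structure shows how attempt budgets, feedback, and tool access change measured capability.

```python
# Minimal elicitation-harness sketch. query_model and run_in_sandbox are stubs
# standing in for a real API client and execution sandbox; nothing here is any
# lab's actual tooling, and the 15% per-attempt success rate is invented.
import random

def query_model(prompt: str) -> str:
    """Stub: replace with a real API call. Returns a candidate solution."""
    return f"candidate-{random.randint(0, 9)}"

def run_in_sandbox(candidate: str, task: str) -> bool:
    """Stub grader: replace with actual execution and scoring of the candidate."""
    return random.random() < 0.15  # pretend each attempt succeeds 15% of the time

def elicit(task: str, k: int = 10, tools: bool = True) -> bool:
    """pass@k-style elicitation: multiple attempts with feedback carried forward."""
    feedback = ""
    for attempt in range(k):
        tool_hint = "You may write and execute code." if tools else ""
        candidate = query_model(f"{task}\n{tool_hint}\n{feedback}")
        if run_in_sandbox(candidate, task):
            return True
        feedback = f"Attempt {attempt + 1} failed; revise your approach."
    return False

if __name__ == "__main__":
    random.seed(0)
    tasks = [f"task-{i}" for i in range(20)]
    for k in (1, 10):
        rate = sum(elicit(t, k=k) for t in tasks) / len(tasks)
        print(f"measured capability at k={k}: {rate:.2f}")
    # The same model under the same grader scores far higher at k=10, so
    # "capability" is partly a property of the harness, not just the model.
```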
4. Regulatory and Standardization Landscape
With the EU AI Act’s obligations phasing in and the OECD issuing AI risk assessment guidelines, labs face mandatory conformity assessments. Key developments include:
- NIST AI Risk Management Framework: Recommends transparent documentation of threat modeling, red-teaming logs, and threshold derivations.
- EU AI Act: Requires conformity assessments, including third-party audits for certain high-risk AI systems, spanning biothreat and cybersecurity categories.
- ISO/IEC JTC1 proposals: Standardizing pass@n metrics and adversarial robustness criteria for AI models.
Absent compliance with these evolving standards, internal eval reports risk becoming legally and scientifically obsolete.
5. Third-Party Auditing and External Benchmarks
Independent evaluations can bridge transparency gaps. Organizations like METR and VirologyTest.AI have demonstrated:
- Red-team stress tests that uncover stealth capabilities under real-world prompts.
- Supply-chain threat modeling for prompts injected via API parameters.
- Benchmark suites with open datasets, reproducible scripts, and containerized testbeds (a reproducibility-record sketch follows this list).
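One concrete form this can take is a per-run reproducibility record published alongside the eval report. The schema below is illustrative, not any lab’s or auditor’s actual format.

```python
# Illustrative sketch of the reproducibility record an external auditor would need
# for each eval run. Field names and values are ours; no lab publishes this schema.
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class EvalRunRecord:
    benchmark: str         # e.g. "cyber-ctf-suite-v2" (hypothetical name)
    model_id: str          # exact model snapshot evaluated
    container_digest: str  # image digest of the containerized testbed
    prompt_sha256: str     # hash of the full prompt template, system message included
    seed: int              # sampling seed
    k: int                 # attempts per task (pass@k)
    tools_enabled: bool    # code execution / shell access during the run
    score: float

def hash_prompt(template: str) -> str:
    return hashlib.sha256(template.encode()).hexdigest()

if __name__ == "__main__":
    record = EvalRunRecord(
        benchmark="cyber-ctf-suite-v2",
        model_id="model-snapshot-2025-05-01",
        container_digest="sha256:placeholder",
        prompt_sha256=hash_prompt("SYSTEM: You are a security analyst.\nUSER: {task}"),
        seed=1234,
        k=10,
        tools_enabled=True,
        score=0.70,
    )
    print(json.dumps(asdict(record), indent=2))  # publishable alongside the report
```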
Expert opinion: Dr. Alice Zheng (MIT Computer Science) notes, “Without external audits, confidence in self-reported safety claims remains low. Auditors must verify both code and prompt-engineering pipelines.”
6. Future Directions in Capability Evaluation
To achieve robust risk assessments, AI labs should adopt:
- Unified risk taxonomy: Link eval scores to threat actor skill levels and attack success probabilities.
- Adaptive adversarial testing: Continuous red-team loops using reinforcement learning agents to probe weaknesses.
- Open evaluation frameworks: Publish detailed protocols, hyperparameters, and error bars to enable peer review (see the bootstrap sketch after this list).
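As a small example of the error bars called for in the last item, a nonparametric bootstrap over per-task outcomes is enough to show how wide the uncertainty is at typical eval sizes; the 20-task outcome vector below is invented for illustration.

```python
# Sketch of error bars via a percentile bootstrap over per-task pass/fail outcomes.
# The outcome vector is invented; this is not any lab's published analysis.
import random

def bootstrap_ci(outcomes: list[int], iters: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for the mean success rate."""
    n = len(outcomes)
    means = sorted(
        sum(random.choice(outcomes) for _ in range(n)) / n for _ in range(iters)
    )
    return means[int(iters * alpha / 2)], means[int(iters * (1 - alpha / 2)) - 1]

if __name__ == "__main__":
    random.seed(0)
    outcomes = [1] * 7 + [0] * 13  # 7 of 20 tasks solved (35%)
    lo, hi = bootstrap_ci(outcomes)
    print(f"35% success, 95% CI roughly {lo:.2f} to {hi:.2f}")
    # With only 20 tasks the interval spans roughly 0.15 to 0.55, which is why
    # point scores without error bars cannot anchor risk thresholds.
```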
Only through standardized, transparent, and reproducible evaluations can the community reliably gauge AI’s dual-use risks.
Special thanks to Dr. John Smith (Stanford AI Lab), Dr. Maria Lopez (NIH Biodefense), and security researcher Alex Cheng for their insights.