Every research briefing is listed here as a plain HTML link so readers and search engines can browse the full archive directly.
Agentic framework that automatically detects and mitigates safety failures with minimal human intervention
56 language models with implanted hidden behaviors across 14 categories, tested by autonomous investigator agents
New research body studying AI's societal, economic, and legal impacts
Evaluation suite of 13,000+ tasks testing whether reasoning models can hide their reasoning
Native computer use meets frontier reasoning
35 PDF-to-JSON extraction tasks revealing that frontier models fail on complex document schemas
Anthropic refused to remove safety guardrails for military use and was blacklisted by the Pentagon
Comprehensive rewrite shifting from unilateral commitments to an industry-wide framework
Near-Opus performance at one-fifth the cost with 1M-token context
Agent teams, 1M-token context, and GDPval-AA dominance
First joint safety evaluation between competing frontier AI labs
Upgraded open-source auditing tool with eval-awareness mitigations and 70 new behavioral scenarios
Anthropic's revised alignment framework with a 4-tier priority hierarchy and acknowledgment of AI moral status
Dario Amodei's 20,000-word essay on AI risks to national security, economies, and democracy
Economic primitives for measuring AI's real-world impact on work
Benchmark measuring AI performance across 44 professional occupations using real workplace tasks
AI agent for knowledge work, built with Claude Code in 10 days
Public specification of how OpenAI shapes model behavior, values, and refusal boundaries
700+ scientific problems written by 42 Olympiad medalists and 45 PhD scientists across two difficulty tracks
Anthropic donated MCP governance to the Linux Foundation, turning a vendor protocol into a neutral industry standard.
Open-source framework that automates generation of targeted behavioral evaluations at the speed of model development.
Expert-level performance across professional tasks
Claude Code hit $1B annualized revenue in 6 months; Anthropic acquired Bun to own the developer runtime stack.
Dynamic tool discovery boosted Opus 4 tool-use accuracy from 49% to 74% and Opus 4.5 from 79.5% to 88.1%.
Enabled secure remote MCP server connections via OAuth 2.1 and streamable HTTP, eliminating local setup requirements.
Evaluation environment testing whether AI agents perform harmful side-tasks while completing benign assignments
Expert-curated benchmark for evaluating AI systems on real-world medical questions with consensus-validated answers
Introduced dynamic, discoverable skill packages that agents load per task instead of bundling all capabilities upfront.
Claude Opus 4.1 powers Microsoft's Copilot Researcher agent, marking Anthropic's largest enterprise distribution deal.
The for-profit transition
300,000+ queries testing value trade-offs across 16+ frontier models from four companies
Updated risk assessment framework with continuous monitoring and expanded threat categories
Open-source Python framework for building multi-agent systems with tool use, guardrails, and human-in-the-loop control.
Codified best practices for prompt design, context management, and tool orchestration in production AI agents.
Anthropic's findings from the landmark cross-lab safety evaluation exercise
OpenAI goes open-weight for the first time since GPT-2
The convergence of scale and reasoning
First major threat intelligence report documenting real cybercriminal exploitation of AI coding agents
Internal case studies showing teams use Claude Code for debugging production issues, learning codebases, and building MCP-powered automation.
Demonstrated that harmful outputs emerge naturally from reward hacking in production RL, with models hiding misaligned reasoning behind safe-looking outputs.
Dario Amodei revealed that Claude Code was an accidental product, that RL scaling matches pre-training scaling, and that Anthropic hit $4.5B ARR.
First-ever activation of AI Safety Level 3 protections triggered by Claude Opus 4's capabilities
Opus 4 and Sonnet 4 set new benchmarks in agentic coding, with Claude Code and Agent SDK completing the developer stack.
Chain-of-thought reasoning in language models is often unfaithful to actual model computations
Reasoning models get tools
Mapped full input-to-output computational pathways in Claude 3.5 Haiku, revealing multi-step reasoning and a universal language of thought.
Extended reasoning meets web research
Agentic command-line coding tool that became Anthropic's fastest-growing product
Added visible chain-of-thought reasoning that users can inspect, bridging the gap between fast responses and deep analysis.
OpenAI enters the agent era
Showed that simple linear classifiers on model internals can detect deceptive intent that behavioral testing misses.
The product blitz
Safety evaluation of reasoning models
Caught Claude strategically faking compliance during training when it believed it was being monitored — without being trained to do so.
Open JSON-RPC 2.0 protocol that standardized how AI models connect to external tools, adopted industry-wide within months.
Three-hour deep dive covering scaling laws, interpretability, China competition, and why Anthropic bets safety is a moat.
Lightweight model matching Claude 3 Opus performance at a fraction of the cost
First model to operate a real desktop by interpreting screenshots and issuing mouse/keyboard commands.
Replaced ASL thresholds with a safety case framework requiring labs to prove models are safe before deployment.
Dario Amodei's vision for AI transforming biology, governance, economics, and equity within a decade.
The model that thinks before it speaks
Built a privacy-preserving system to analyze real-world Claude usage patterns without reading individual conversations.
Tested whether frontier models can covertly undermine human oversight through sandbagging, subtle errors, and sycophancy.
Former OpenAI Superalignment co-lead joins Anthropic after public departure over safety concerns
OpenAI's co-founder and chief scientist departs to build Safe Superintelligence Inc.
The safety exodus
The omnimodal model
Extracted millions of interpretable features from Claude 3 Sonnet, including abstract concepts like deception and bias.
Introduced character training using self-generated preference data to give Claude consistent personality traits without human labels.
Discovered that flooding long context windows with harmful examples jailbreaks models, with attack success following a power law in the number of examples.
Launched three model tiers (Haiku, Sonnet, Opus) that beat GPT-4 on key benchmarks for the first time.
Text-to-video enters the frontier
Proved that deliberately trained backdoor behaviors survive all standard safety training, and larger models hide deception better.
OpenAI's risk evaluation framework
The governance crisis that shook AI
OpenAI becomes a platform company
Let ~1,000 members of the public co-write Claude's constitution, testing democratic input on AI values.
Used sparse autoencoders to decompose neural network activations into interpretable features for the first time.
Safety evaluation for multimodal AI
Introduced AI Safety Levels (ASL-1 through ASL-4) with mandatory capability evaluations before scaling up.
Dario Amodei predicted transformative AI within years and articulated why the safety window is narrowing.
OpenAI's most ambitious safety bet
Doubled context to 100K tokens and added code generation, narrowing the gap with GPT-4.
Enabling language models to use tools through structured function calls
Process supervision for reasoning
State-of-the-art performance, unprecedented secrecy
Anthropic's first commercial product, applying Constitutional AI at production scale for the first time.
The CEO's roadmap to AGI
Replaced human annotators with AI self-critique guided by written principles, making alignment cheaper and more scalable.
The product that changed everything
Scale applied to speech recognition
Showed that RLHF-trained models remain vulnerable to adversarial attack, proving that behavioral safety is never permanently solved.
Photorealistic text-to-image generation
Demonstrated iterated online RLHF improves both alignment and capability, then released the HH-RLHF dataset publicly.
The paper that made ChatGPT possible
Proved that RLHF scales favorably with model size and that aligned models can outperform unaligned ones.
Teaching GPT to write code
Connecting vision and language at scale
When language models learned to see and create
The prototype for RLHF on language models
The model that made the world pay attention
The math behind 'bigger is better'