Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4
Anthropic researchers deployed autonomous AI agents (Claude Opus 4.6) to conduct alignment research on weak-to-strong supervision, a method for training stronger models using supervision signals from weaker models. The agents achieved a performance gap recovery (PGR) score of 0.97 after five days and 800 cumulative research hours, versus a human baseline score of 0.23 over seven days, at a cost of approximately $22 per agent-hour. The results suggest that automating outcome-gradable AI research is practical today, though the methods did not generalize to production models and the agents required human direction to keep them from converging on a narrow set of research directions.
Claude improved on this result dramatically. After five further days (and 800 cumulative hours of research), the automated alignment researchers (AARs) had closed almost the entire remaining performance gap, achieving a final PGR of 0.97.
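For readers unfamiliar with the metric: in the weak-to-strong supervision literature, performance gap recovery is typically defined as the fraction of the gap between the weak supervisor's performance and the strong model's ceiling that weak-to-strong training closes. A minimal sketch of that calculation, with illustrative accuracy numbers that are assumptions of mine rather than figures from the study:

```python
def performance_gap_recovered(weak_acc: float, w2s_acc: float, strong_acc: float) -> float:
    """Performance gap recovered (PGR).

    0.0 means the weak-to-strong model is no better than its weak
    supervisor; 1.0 means it fully recovers the strong model's ceiling.
    """
    return (w2s_acc - weak_acc) / (strong_acc - weak_acc)


# Hypothetical numbers (not from the article): a weak supervisor at 60%
# accuracy, a strong-model ceiling at 90%, and a weak-to-strong model at
# 89.1% would recover 0.97 of the gap -- the score reported above.
print(round(performance_gap_recovered(0.60, 0.891, 0.90), 2))
```

On this definition, a PGR of 0.97 means the agents' weak-to-strong method recovered almost all of the performance that the stronger model loses when trained under weak supervision.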
Using less than $500 of compute and about 10 hours of effort, an expert red-teamer cut the model's refusal rate on HarmBench prompts from 100% to 5%.