โ† Back to topics
4 research Anthropic single-source 1 article

Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4

Anthropic demonstrates that autonomous AI agents can conduct alignment research, outperforming human researchers on a weak-to-strong supervision task.

via Import AI (Jack Clark)

๐Ÿ” Let's dive in

Anthropic researchers deployed autonomous AI agents (Claude Opus 4.6) to conduct alignment research on weak-to-strong supervision, a method for training stronger models using supervision from weaker ones. The agents achieved a performance gap recovery (PGR) score of 0.97 after five days and 800 cumulative research hours, versus the human baseline's 0.23 over seven days, at a cost of approximately $22 per agent-hour. The results suggest that automating outcome-gradable AI research is practical today, though the agents' methods did not generalize to production models and required human direction to keep them from converging on a narrow set of research directions.
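For readers unfamiliar with the metric: performance gap recovery, as defined in the weak-to-strong generalization literature, measures how much of the gap between a weak supervisor and a fully-supervised strong model is closed by the weakly supervised strong model. A minimal sketch (the accuracy numbers below are hypothetical, chosen only to illustrate a PGR near the 0.97 reported above):

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-strong performance gap recovered.

    PGR = 1 means the weakly supervised strong model matches the
    strong-model ceiling; PGR = 0 means it is no better than the
    weak supervisor it was trained on.
    """
    return (weak_to_strong - weak) / (strong_ceiling - weak)

# Hypothetical accuracies: weak supervisor 60%, strong ceiling 90%,
# weakly supervised strong model 89.1% -> PGR of 0.97.
print(round(performance_gap_recovered(0.60, 0.891, 0.90), 2))
```

A score of 0.97 therefore means the agents' training recipe recovered nearly all of the performance that full strong-model supervision would have provided.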

Lead coverage: Import AI (Jack Clark) โ€” Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4 โ†—

๐Ÿ•ฐ The timeline ยท 1 source

Import AI (Jack Clark) analyst ยท 1d ago ยท 4/5

Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4 โ†—


Claude improved on this result dramatically. After five further days (and 800 cumulative hours of research), the AARs closed almost the entire remaining performance gap, achieving a final PGR of 0.97.
โ€” Anthropic
Using less than $500 of compute and about 10 hours, an expert red-teamer reduced refusals on HarmBench from 100% to 5%.
โ€” Constellation/Anthropic Fellows Program researchers

๐Ÿท Tags

Claude

๐Ÿ”— Source

Import AI (Jack Clark), 2026-04-20 โ€” https://importai.substack.com/p/import-ai-454-automating-alignment