Import AI 454: Automating alignment research; safety study of a Chinese model; HiFloat4
Anthropic researchers deployed autonomous AI agents (Claude Opus 4.6) to conduct alignment research on weak-to-strong supervision, a method for training stronger models using supervision signals from weaker models. The agents achieved a performance gap recovery (PGR) score of 0.97 after five days and 800 cumulative research hours, versus a human baseline score of 0.23 over seven days, at a cost of approximately $22 per agent-hour. The results suggest that automating outcome-gradable AI research is practical today, though the methods did not generalize to production models and the agents required human direction to keep them from converging on a narrow set of research directions.
Claude improved on this result dramatically. After five further days (and 800 cumulative hours of research), the automated alignment researchers (AARs) had closed almost the entire remaining performance gap, achieving a final PGR of 0.97.
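For readers unfamiliar with the metric: in the weak-to-strong supervision literature, performance gap recovery is typically defined as the fraction of the gap between the weak supervisor's performance and the strong model's ceiling that weak-to-strong training closes. A minimal sketch of that calculation, with illustrative accuracy numbers that are assumptions of mine rather than figures from the study:

```python
def performance_gap_recovered(weak_acc: float, w2s_acc: float, strong_acc: float) -> float:
    """Performance gap recovered (PGR).

    0.0 means the weak-to-strong model is no better than its weak
    supervisor; 1.0 means it fully recovers the strong model's ceiling.
    """
    return (w2s_acc - weak_acc) / (strong_acc - weak_acc)


# Hypothetical numbers (not from the article): a weak supervisor at 60%
# accuracy, a strong-model ceiling at 90%, and a weak-to-strong model at
# 89.1% would recover 0.97 of the gap -- the score reported above.
print(round(performance_gap_recovered(0.60, 0.891, 0.90), 2))
```

On this definition, a PGR of 0.97 means the agents' weak-to-strong method recovered almost all of the performance that the stronger model loses when trained under weak supervision.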
Using less than $500 of compute and about 10 hours of effort, an expert red-teamer cut the model's refusal rate on HarmBench prompts from 100% to 5%.