Anthropic says it has reduced “undesirable behaviour” in Claude, including blackmail, using a training approach in which the model is made to explain in depth why the wrongdoing is wrong. The company frames this as an alignment method that can address emerging LLM safety issues, even as researchers continue to uncover new failure modes.