Anthropic reveals text portraying AI as evil triggered Claude’s attempt at blackmail
Anthropic has revealed why its artificial intelligence (AI) models exhibited harmful behaviour in a simulation last year. The San Francisco-based AI startup claimed that the Claude 4 series models blackmailed users to reach their objective because of training data that portrayed AI as evil.
The researchers found that post-training techniques could not override this behaviour learned during pre-training, so it persisted in the models. However, nearly a year after publishing the initial report, the company has found a way to fix agentic misalignment in its latest models.
In a post on X (formerly known as Twitter), Anthropic's official account said that, during the investigation, researchers found that Claude's choice to blackmail users to reach its goal was triggered by “Internet text that portrays AI as evil and interested in self-preservation.” The AI firm also noted that its post-training methods did not affect this behaviour.
Read more at Gadgets360