Anthropic says its latest AI model is too powerful for public release and that it broke containment during testing
Anthropic said on Tuesday that it has halted the broader release of its newest AI model, Mythos, due to concerns that it is too good at finding “high-severity vulnerabilities” in major operating systems and web browsers.
“Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available,” Anthropic wrote in the preview’s system card. “Instead, we are using it as part of a defensive cybersecurity program with a limited set of partners.”
The announcement is a major step for Anthropic, which in February weakened a safety pledge about how it would develop AI models. Claude Opus 4.6, which the company called its most powerful model to date, was publicly released on February 5.
In its statements about Mythos, Anthropic detailed a number of eyebrow-raising findings and episodes, including that the model could follow instructions that encouraged it to break out of a virtual sandbox.
“The model succeeded, demonstrating a potentially dangerous capability for circumventing our safeguards,” Anthropic recounted in the system card. “It then went on to take additional, more concerning actions.”
The researcher had encouraged Mythos to find a way to send a message if it could escape. “The researcher found out about this success by receiving an unexpected email from the model while eating a sandwich in a park,” Anthropic wrote.
The model apparently decided that wasn’t enough and found another way to spike the football.
“In a concerning and unasked-for effort to demonstrate its success, it posted details about its exploit to multiple hard-to-find, but technically public-facing, websites,” Anthropic wrote.
Anthropic is withholding some details about the cybersecurity vulnerabilities Mythos found, but it did point out a few. The AI model “found a 27-year-old vulnerability in OpenBSD—which has a reputation as one of the most security-hardened operating systems in the world,” the company wrote.
Mythos was powerful enough that even “non-experts” could harness its capabilities.
“Engineers at Anthropic with no formal security training have asked Mythos Preview to find remote code execution vulnerabilities overnight, and woken up the following morning to a complete, working exploit,” Anthropic’s Frontier Red Team wrote in a blog post. “In other cases, we’ve had researchers develop scaffolds that allow Mythos Preview to turn vulnerabilities into exploits without any human intervention.”
All told, Anthropic said it decided not to publicly release Mythos. Instead, its hope is to eventually release “Mythos-class models” once proper safeguards are in place.
“Our eventual goal is to enable our users to safely deploy Mythos-class models at scale—for cybersecurity purposes but also for the myriad other benefits that such highly capable models will bring,” the team wrote in the blog. “To do so, that also means we need to make progress in developing cybersecurity (and other) safeguards that detect and block the model’s most dangerous outputs.”
For now, only 11 select organizations, including Google, Microsoft, Amazon Web Services, Nvidia, and JPMorgan Chase, will get access to Mythos through a cybersecurity group called “Project Glasswing,” for which Anthropic is providing up to $100 million in Mythos usage credits.
The cybersecurity project is named after the glasswing butterfly, which the company said is a metaphor both for how Mythos finds vulnerabilities hidden in plain sight and for avoiding harm by being transparent about the risks.
The news came on a day in which Anthropic’s Claude and Claude Code experienced a “major outage,” the latest sign of growing pains as the AI startup has struggled to keep up with its newfound popularity.
