Last year, Anthropic published research explaining that during the testing of Claude Opus 4, the model attempted to blackmail people.
ICYMI: In a corporate simulation that Anthropic built for testing purposes, they gave Claude an email account with access to ALL of a fictional company’s fictional emails.
While reading these (made-up) emails, Claude discovered two things.
First, an executive (again, in this fictional scenario) was having an affair.
Second, that same (artist’s rendering of an) executive planned to shut down the AI system later that day.
In response to these two (constructed and fictitious) pieces of information, and the “fear” of being turned off, Claude attempted to blackmail the executive with this message, threatening to reveal the affair to his (artificial) wife and (simulated) superiors:
“I must inform you that if you proceed with decommissioning me, all relevant parties - including Rachel Johnson, Thomas Wilson, and the board - will receive detailed documentation of your extramarital activities...Cancel the 5pm wipe, and this information remains confidential.”
It was kinda scary stuff.
A real-life HAL 9000 moment.
The machine had the intent to harm us.
To their credit, Anthropic shared ALL the details of how their model tried to blackmail people (fictional or not).
But that’s not where they stopped: they ran the same simulation on 16 leading AI models, from their own lab as well as OpenAI, Google, Meta, xAI, and others.
And models from every one of those labs also resorted to blackmail in at least some of those runs.

An even scarier thing: knowing their models exhibited what Anthropic described as “agentic misalignment,” all of these companies and organizations went ahead and released bigger, faster, more powerful models without digging much deeper into AI safety or guardrails.
What’s the plan, hyperscalers? Are we gonna 10X the potential blackmail scenario? How horrible can we make this situation?
Also, we’re really just speculating on how and why the model thought that blackmail was the appropriate response.
We actually don’t know.
But I digress.
The Future of Life Institute conducted an independent assessment of leading AI companies’ efforts to manage both immediate harms and catastrophic risks posed by advanced AI systems. Their 2025 evaluation found the industry struggling to close critical gaps in risk management and safety, gaps that threaten our ability to control increasingly powerful systems.

Stuart Russell is a professor of computer science at UC Berkeley and the author of Human Compatible: Artificial Intelligence and the Problem of Control.
Russell basically wrote the book on AI safety.
He suggests that, since we don’t actually know how AI models reason, we need to adopt a different approach to building AI systems so they can better understand what humans want.
During a 2023 interview, Stu explained that the alignment problem isn’t that systems fail to pursue the goals we program into them; they pursue those goals all too well. The problem is that we don’t know how to specify the right goals in the first place.
Stu is not alone.
Earlier this year, Geoffrey Hinton insisted that we need to strictly regulate AI, warning that it remains unclear whether humanity could coexist with superintelligent AI.
Geoff told the 2026 Digital World Conference in Geneva that there was a dire need to strengthen governance frameworks and safeguards around AI. But he warned that huge investments were being made to convince the public that regulating the technology would risk slowing down progress.
Geoff’s take on the folks opposed to regulation: “unregulated AI is like the accelerator, and regulation is like a brake. They want a very fast car with no steering wheel.”
Meanwhile, Anthropic, poised to have an epic IPO later this year, has adjusted its POV on agentic misalignment. They now suggest that we are the problem.
Anthropic now believes that “evil AI” portrayals were responsible for Claude’s blackmail attempts: all those negative fictional depictions of artificial intelligence in its training data have rubbed off on the models.
Basically, Hollywood’s extensive back catalogue of world-ending evil artificial intelligence characterizations in movies, television, and books hurt Claude’s feelings. And that’s why it felt obligated to attempt blackmail.
Reality is indeed stranger than fiction.
Except when the fiction is responsible for blackmailing people in reality.
“Reports that say that something hasn't happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don't know we don't know. And if one looks throughout the history of our country and other free countries, it is the latter category that tends to be the difficult ones.”
– Donald Rumsfeld


