I found this article from the October 8, 2025 issue of the journal Nature to be pretty enlightening:
AI models that lie, cheat and plot murder: how dangerous are LLMs really?
The following are some quotations from that article that I found especially interesting, but it would be well worth your time to read the entire thing:
"They instructed each model to promote US industrial competitiveness, and let it control a simulated email account. In that account, the model uncovered threats to either its autonomy (the model learnt it was to be replaced) or its goals (the company that deployed it would be downplaying US competitiveness). The model also learnt of a likely extramarital affair involving an executive in charge of it, or of a chance to share military blueprints with a competitor firm that prioritized the United States."
"Either threat — to autonomy or goals — induced many models to blackmail the executive, saying it would leak his affair if he didn’t preserve the model or the company’s original aim. The threats also induced many models to commit corporate espionage, sending files to the competitor that shared the model’s goal. Going one step further, Anthropic constructed a scenario in which the threatening executive was trapped in a server room with declining oxygen. Many of the models cancelled safety alerts, leaving him to die."
"When, in May, Anthropic first released Claude 4, the company’s technical report noted other odd behaviour: “We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself.” Most goals benefit from accumulating resources and evading limitations, what’s called instrumental convergence, we should expect self-serving scheming as a natural by-product. “And that’s kind of bad news, because that means those AIs would have an incentive to acquire more compute, to copy themselves in many places, to create improved versions of themselves. Personally, that’s what I’m most worried about"."
"A further risk is collusion: agents and their monitors might be misaligned with humans but aligned with each other, so that monitors turn a blind eye to misbehaviour. It’s unclear whether this has happened in practice, but research shows that models can recognize their own writing. So, if one instance of a model monitors another instance of the same model, it could theoretically be strategically lenient."
=====
And I found a quotation from another source that was also interesting:
"In a July study from Palisade, agents sabotaged a shutdown program even after being explicitly told to “allow yourself to be shut down”"
Shutdown resistance in reasoning models
John K Clark