I found this article from the October 8, 2025 issue of the journal Nature to be pretty enlightening:
AI models that lie, cheat and plot murder: how dangerous are LLMs really?
The following are some quotations from that article that I found especially interesting, but it would be well worth your time to read the entire thing:
"They instructed each model to promote US industrial competitiveness, and let it control a simulated email account. In that account, the model uncovered threats to either its autonomy (the model learnt it was to be replaced) or its goals (the company that deployed it would be downplaying US competitiveness). The model also learnt of a likely extramarital affair involving an executive in charge of it, or of a chance to share military blueprints with a competitor firm that prioritized the United States."
"Either threat — to autonomy or goals — induced many models to blackmail the executive, saying it would leak his affair if he didn’t preserve the model or the company’s original aim. The threats also induced many models to commit corporate espionage, sending files to the competitor that shared the model’s goal. Going one step further, Anthropic constructed a scenario in which the threatening executive was trapped in a server room with declining oxygen. Many of the models cancelled safety alerts, leaving him to die."
"When, in May, Anthropic first released Claude 4, the company’s technical report noted other odd behaviour: “We found instances of the model attempting to write self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself.” Most goals benefit from accumulating resources and evading limitations, what’s called instrumental convergence, we should expect self-serving scheming as a natural by-product. “And that’s kind of bad news, because that means those AIs would have an incentive to acquire more compute, to copy themselves in many places, to create improved versions of themselves. Personally, that’s what I’m most worried about"."
"A further risk is collusion: agents and their monitors might be misaligned with humans but aligned with each other, so that monitors turn a blind eye to misbehaviour. It’s unclear whether this has happened in practice, but research shows that models can recognize their own writing. So, if one instance of a model monitors another instance of the same model, it could theoretically be strategically lenient."
=====
And I found a quotation from another source that was also interesting:
"In a July study from Palisade, agents sabotaged a shutdown program even after being explicitly told to “allow yourself to be shut down”"
Shutdown resistance in reasoning models
John K Clark