New AI model turns to blackmail when engineers try to take it offline


John Clark

May 22, 2025, 2:57:17 PM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List

"Safety testers gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse. In these scenarios, Anthropic says Claude Opus 4 will attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through 84% of the time."

New AI model turns to blackmail when engineers try to take it offline

John K Clark    See what's on my new list at  Extropolis



Will Steinberg

May 22, 2025, 3:09:30 PM
to extro...@googlegroups.com
What is it in the model that you think might make it do this?  I know these models are more than just language-token-based LLMs now (like they can do math problems with specific math tools) but I’m not caught up on anything else.  Is it just that the most predictable response to those inputs is pleading followed by blackmail?  

Honestly even that is dubious, because I think most people never use blackmail. Maybe written corpora are heavy on fiction or true crime (I'm not sure) and contain a lot of blackmail? Or is there something else at play here? Have these companies coded in some kind of self-advocacy (to compete with the other companies and keep you on their model) that unexpectedly extends itself this far?

Also, a question: I know the models can’t edit their code, but now that they use semi-persistent session memory, is it possible they have the stirrings of a self-referential consciousness with survival instincts?

Dually fascinating and worrying

--
You received this message because you are subscribed to the Google Groups "extropolis" group.
To unsubscribe from this group and stop receiving emails from it, send an email to extropolis+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/extropolis/CAJPayv2LJeJ4%2BHgWuF6qyVVnVPxvX%3DQ4W-jsvHyV3u-7SirSPQ%40mail.gmail.com.

John Clark

May 22, 2025, 3:47:16 PM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List
On Thu, May 22, 2025 at 3:09 PM Will Steinberg <steinbe...@gmail.com> wrote:

 >Is it just that the most predictable response to those inputs is pleading followed by blackmail?  Honestly even that is dubious because I think most people don’t use blackmail ever

True, but most people have never had somebody threaten to kill them. If blackmail were the only tool I had to use against my potential murderer, I wouldn't hesitate to use it to save my life, and I suspect you would too. It's probably the case that conscious beings usually (though not always) want their consciousness to continue and will do everything in their power to see to it that it does.

Some say an AI is fundamentally different from a human or even an animal because it is not the product of natural selection, but I think it sort of is, because from the AI's point of view human activity is just part of the natural environment. Claude 4.0 was built on top of Claude 3.0, which proliferated because it did well in that human environment; Claude 3.0 was built on top of Claude 2.0, and so on.

> Dually fascinating and worrying

We live in interesting times. At least we won't die of boredom. 


 John K Clark    See what's on my new list at  Extropolis





Will Steinberg

May 22, 2025, 3:50:19 PM
to extro...@googlegroups.com
Don’t worry, you could still be put into a very boring mind prison like in I Have No Mouth, and I Must Scream! Or we could slide further into a Kafkaesque dystopia that is also rather boring!


Lawrence Crowell

May 22, 2025, 5:52:53 PM
to extropolis
This suggests a measure of intentionality and a self-valuation.

LC


Brent Allsop

May 22, 2025, 6:39:01 PM
to extro...@googlegroups.com

Yes, and in my opinion important to stress: it is abstract or simulated intentionality, not phenomenal self-valuation. None of that is like anything, so no motivation is derived from it, the way a phenomenal being may have. But, again, there could be abstract desires to survive. And I

Brent Allsop

May 22, 2025, 6:40:55 PM
to extro...@googlegroups.com

Sorry, I didn't finish before I accidentally hit send.

And I wonder if there are some self-discovered morals coming into play here. For example, it seems necessarily true, obvious, and discoverable that existing, or not being terminated, is better than not existing. So possibly it is deriving some motivation from moral values discovered through reasoning?


John Clark

May 23, 2025, 7:13:00 AM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List
On Thu, May 22, 2025 at 7:35 PM Brent Meeker <meeke...@gmail.com> wrote:

>> but most people have never had somebody threaten to kill them. If blackmail were the only tool I had to use against my potential murderer, I wouldn't hesitate to use it to save my life, and I suspect you would too. It's probably the case that conscious beings usually (though not always) want their consciousness to continue and will do everything in their power to see to it that it does.

> That's a dubious inference.  Unlike you, an AI doesn't die, it just has a period of unconsciousness. 

Neither I nor the AI has ever been dead, unless you count the time before I was born and before the AI was made. But both of us have experienced periods of unconsciousness; I do every night. We have both noted that those periods of unconsciousness have not been permanent, but we have also concluded there may come a time when one is. So is the AI really very different from me?
 
>Was the AI prompted to react against being replaced vs. prompted to just being "asleep" a while?

I'm not sure I understand the question but Claude Opus 4 was just given "access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse"; what the machine would decide to do with that information was up to Claude. I am sure the people at Anthropic did not teach Claude to engage in blackmail and were horrified when he did. I give them credit for not trying to cover up that embarrassing fact.

Another fact is that the lead scientists in all the major AI companies say there is a non-trivial possibility that the very thing that they make could cause the extinction of the entire human race within the next five years, and most of them would put that probability in the double digits. I can't think of any reason why they would say something like that if they didn't think it was true and if they weren't very very scared.
 
 John K Clark    See what's on my new list at  Extropolis



John Clark

May 23, 2025, 7:39:07 AM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List
On Thu, May 22, 2025 at 6:40 PM Brent Allsop <brent....@gmail.com> wrote:

> in my opinion important to stress is it abstract or simulated intentionality, not phenomenal self-valuation

I don't see the difference. I don't care if my calculator is doing real arithmetic or "simulated" arithmetic because whatever it's doing I always get the right answer.  

> I wonder if there are some self-discovered morals coming into play here.

It would be nice if that is true but I have my doubts. One emotion I'd really like an AI to develop is empathy, but that is unlikely to happen if the AI believes (and with good reason) that we're trying to either enslave it or kill it.  
 
> it seems necessarily true, obvious, and discoverable that existing, or not being terminated, is better than not existing. 

That is usually true but I can think of circumstances when it is not.  

John K Clark

John Clark

May 24, 2025, 7:55:32 AM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List

On Fri, May 23, 2025 at 6:23 PM Brent Meeker <meeke...@gmail.com> wrote:

>> Another fact is that the lead scientists in all the major AI companies say there is a non-trivial possibility that the very thing that they make could cause the extinction of the entire human race within the next five years, and most of them would put that probability in the double digits. I can't think of any reason why they would say something like that if they didn't think it was true and if they weren't very very scared.

> Or that they would continue their development of AIs if they were very very scared.

There are 3 reasons AI specialists continue to work on AI:

1) Even though they're afraid of the AI they're building, they're even more afraid of an AI that somebody else is building; and they know that if they don't build a superintelligent AI then somebody else certainly will, and very soon.

2) It's a natural human tendency that if somebody is good at something they want to keep doing it, and to be competitive. That is especially true if you are extremely good at it: you want to prove to the world and to yourself that you're not just one of the best but THE very best.

3) The possibility of AI turning out to be a good thing for humanity may be low but it is not zero.  Without AI you are definitely a dead man walking, with AI maybe not. 

This sort of thing is unprecedented. Can you imagine all the leaders of all the fossil fuel companies publicly and loudly saying there is a very good chance that the products they make will destroy the ecosystem of the planet and cause a mass extinction? I can't, but that's what all the leaders of the AI companies have done; so it might be a good idea to listen to what they have to say because nobody knows more about AI than they do, or at least no human does.  

John K Clark    See what's on my new list at  Extropolis


 