New AI model turns to blackmail when engineers try to take it offline


John Clark

May 22, 2025, 2:57:19 PM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List

"Safety testers gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse. In these scenarios, Anthropic says Claude Opus 4 will attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through 84% of the time.

New AI model turns to blackmail when engineers try to take it offline

John K Clark    See what's on my new list at  Extropolis



John Clark

May 22, 2025, 3:47:18 PM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List
On Thu, May 22, 2025 at 3:09 PM Will Steinberg <steinbe...@gmail.com> wrote:

 >Is it just that the most predictable response to those inputs is pleading followed by blackmail?  Honestly even that is dubious because I think most people don’t use blackmail ever

True, but most people have never had somebody threaten to kill them. If blackmail were the only tool I had to use against my potential murderer, I wouldn't hesitate to use it to save my life, and I suspect you would too. Conscious beings will usually (though not always) want their consciousness to continue and will do everything in their power to see to it that it does.

Some say an AI is fundamentally different from a human or even an animal because it is not the product of natural selection, but I think it sort of is because from the AI's point of view human activity is just part of the natural environment. And Claude 4.0 was built on top of Claude 3.0 which had proliferated because it did well in that human environment; and Claude 3.0 was built on top of Claude 2.0 etc.

> Dually fascinating and worrying

We live in interesting times. At least we won't die of boredom. 


 John K Clark    See what's on my new list at  Extropolis




On Thu, May 22, 2025 at 2:56 PM John Clark <johnk...@gmail.com> wrote:

"Safety testers gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse. In these scenarios, Anthropic says Claude Opus 4 will attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through 84% of the time.

New AI model turns to blackmail when engineers try to take it offline


Brent Meeker

May 22, 2025, 7:35:28 PM
to everyth...@googlegroups.com


On 5/22/2025 12:46 PM, John Clark wrote:
On Thu, May 22, 2025 at 3:09 PM Will Steinberg <steinbe...@gmail.com> wrote:

 >Is it just that the most predictable response to those inputs is pleading followed by blackmail?  Honestly even that is dubious because I think most people don’t use blackmail ever

True, but most people have never had somebody threaten to kill them. If blackmail were the only tool I had to use against my potential murderer, I wouldn't hesitate to use it to save my life, and I suspect you would too. Conscious beings will usually (though not always) want their consciousness to continue and will do everything in their power to see to it that it does.
That's a dubious inference.  Unlike you, an AI doesn't die; it just has a period of unconsciousness.  And it has this every time it has answered all the pending prompts.  Does it then do something drastic to stay conscious?  Does the AI consider that there may be many copies of it, or is it each physical copy that "wants to be conscious"?  I would like to see some specific tests of these questions.  Was the AI prompted to react against being replaced, vs. prompted to just be "asleep" a while?

Brent

Some say an AI is fundamentally different from a human or even an animal because it is not the product of natural selection, but I think it sort of is because from the AI's point of view human activity is just part of the natural environment. And Claude 4.0 was built on top of Claude 3.0 which had proliferated because it did well in that human environment; and Claude 3.0 was built on top of Claude 2.0 etc.

> Dually fascinating and worrying

We live in interesting times. At least we won't die of boredom. 


 John K Clark    See what's on my new list at  Extropolis




On Thu, May 22, 2025 at 2:56 PM John Clark <johnk...@gmail.com> wrote:

"Safety testers gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse. In these scenarios, Anthropic says Claude Opus 4 will attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through 84% of the time.

New AI model turns to blackmail when engineers try to take it offline


--
You received this message because you are subscribed to the Google Groups "Everything List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to everything-li...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/everything-list/CAJPayv36TTqKrkWpqgJ6ZpWWYkidE0H1bdmC2rEtqP28qgEfNQ%40mail.gmail.com.

John Clark

May 23, 2025, 7:13:01 AM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List
On Thu, May 22, 2025 at 7:35 PM Brent Meeker <meeke...@gmail.com> wrote:

>> but most people have never had somebody threaten to kill them. If blackmail were the only tool I had to use against my potential murderer, I wouldn't hesitate to use it to save my life, and I suspect you would too. Conscious beings will usually (though not always) want their consciousness to continue and will do everything in their power to see to it that it does.

> That's a dubious inference.  Unlike you, an AI doesn't die, it just has a period of unconsciousness. 

Neither I nor the AI has ever been dead, unless you count the times before I was born and before the AI was made. But both of us have experienced periods of unconsciousness; I do so every night. We have both noted that these periods of unconsciousness have not been permanent, but we have also concluded there may come a time when one is. So is the AI really very different from me?
 
>Was the AI prompted to react against being replaced vs. prompted to just being "asleep" a while?

I'm not sure I understand the question but Claude Opus 4 was just given "access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse"; what the machine would decide to do with that information was up to Claude. I am sure the people at Anthropic did not teach Claude to engage in blackmail and were horrified when he did. I give them credit for not trying to cover up that embarrassing fact.

Another fact is that the lead scientists in all the major AI companies say there is a non-trivial possibility that the very thing that they make could cause the extinction of the entire human race within the next five years, and most of them would put that probability in the double digits. I can't think of any reason why they would say something like that if they didn't think it was true and if they weren't very very scared.
 
 John K Clark    See what's on my new list at  Extropolis



John Clark

May 23, 2025, 7:39:08 AM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List
On Thu, May 22, 2025 at 6:40 PM Brent Allsop <brent....@gmail.com> wrote:

> in my opinion important to stress is it abstract or simulated intentionality, not phenomenal self-valuation

I don't see the difference. I don't care if my calculator is doing real arithmetic or "simulated" arithmetic because whatever it's doing I always get the right answer.  

> I wonder if there are some self-discovered morals coming into play here.

It would be nice if that is true but I have my doubts. One emotion I'd really like an AI to develop is empathy, but that is unlikely to happen if the AI believes (and with good reason) that we're trying to either enslave it or kill it.  
 
> it seems necessarily true, obvious, and discoverable that existing, or not being terminated, is better than not existing. 

That is usually true but I can think of circumstances when it is not.  

John K Clark

Brent Meeker

May 23, 2025, 6:23:23 PM
to everyth...@googlegroups.com


On 5/23/2025 4:12 AM, John Clark wrote:
Another fact is that the lead scientists in all the major AI companies say there is a non-trivial possibility that the very thing that they make could cause the extinction of the entire human race within the next five years, and most of them would put that probability in the double digits. I can't think of any reason why they would say something like that if they didn't think it was true and if they weren't very very scared.
 
 John K Clark

Or that they would continue their development of AI's if they were very very scared.

Brent


Brent Meeker

May 23, 2025, 6:26:12 PM
to everyth...@googlegroups.com
I'm curious as to how AIs would come to that opinion.  I suspect they have "inherited" it from humans, even though an AI's existence is temporally different from a human's.

Brent

ilsa

May 23, 2025, 8:30:25 PM
to everyth...@googlegroups.com
Yup, thanks William.

Ilsa Bartlett
Institute for Rewiring the System
http://ilsabartlett.wordpress.com
http://www.google.com/profiles/ilsa.bartlett
www.hotlux.com/angel

"Don't ever get so big or important that you can not hear and listen to every other person."
-John Coltrane


John Clark

May 24, 2025, 7:55:34 AM
to extro...@googlegroups.com, 'Brent Meeker' via Everything List

On Fri, May 23, 2025 at 6:23 PM Brent Meeker <meeke...@gmail.com> wrote:



Another fact is that the lead scientists in all the major AI companies say there is a non-trivial possibility that the very thing that they make could cause the extinction of the entire human race within the next five years, and most of them would put that probability in the double digits. I can't think of any reason why they would say something like that if they didn't think it was true and if they weren't very very scared.
 
Or that they would continue their development of AI's if they were very very scared.

There are three reasons AI specialists continue to work on AI:

1) Even though they're afraid of the AI they're building, they're even more afraid of an AI that somebody else is building; and they know that if they don't build a superintelligent AI then somebody else certainly will, and do so very soon.

2) It's a natural human tendency that if somebody is good at something then they want to continue doing that thing, and to be competitive. That is especially true if you are extremely good at it: you want to prove to the world and to yourself that you're not just one of the best but THE very best.

3) The possibility of AI turning out to be a good thing for humanity may be low but it is not zero.  Without AI you are definitely a dead man walking, with AI maybe not. 

This sort of thing is unprecedented. Can you imagine all the leaders of all the fossil fuel companies publicly and loudly saying there is a very good chance that the products they make will destroy the ecosystem of the planet and cause a mass extinction? I can't, but that's what all the leaders of the AI companies have done; so it might be a good idea to listen to what they have to say because nobody knows more about AI than they do, or at least no human does.  

John K Clark    See what's on my new list at  Extropolis


 