Jira (PUP-9174) 50% chance system crash on puppet apply


David Hodgkinson (JIRA)

Sep 25, 2018, 6:27:04 AM
to puppe...@googlegroups.com
David Hodgkinson created an issue
 
Puppet / Bug PUP-9174
50% chance system crash on puppet apply
Issue Type: Bug
Affects Versions: PUP 4.10.12
Assignee: Unassigned
Attachments: IMG_0699.JPG
Components: Puppet Server
Created: 2018/09/25 3:26 AM
Environment: Debian 8. Sunfire X4200. 3.16.0-4 kernel.
Priority: Normal
Reporter: David Hodgkinson

We have a fairly large farm of some 100 machines and some 10 PB of data. Maybe 80 of them are Sunfire X4200s and the rest are Dells. When doing a puppet apply on the Suns, there's a 50% chance of a kernel crash. The servers are running Debian 8 with a 3.16.0-4-amd64 kernel.

I've attached the dump. It seems to happen in get_empty_filp() in a do_sys_open() call. Google returns no hints except a possible security hole in that kernel.

Don't really want to upgrade some of the boxes as there's C code compiled against the kernel. We can do it if we have to.

This message was sent by Atlassian JIRA (v7.7.1#77002-sha1:e75ca93)

Eric Sorenson (JIRA)

Sep 27, 2018, 7:27:06 PM
to puppe...@googlegroups.com
Eric Sorenson assigned an issue to David Hodgkinson
Change By: Eric Sorenson
Assignee: David Hodgkinson

Eric Sorenson (JIRA)

Sep 27, 2018, 7:27:07 PM
to puppe...@googlegroups.com
Eric Sorenson commented on Bug PUP-9174
 
Re: 50% chance system crash on puppet apply

David Hodgkinson, this is truly a bizarre problem. I've never seen this before, and I'm pretty sure that if there were a widespread problem where running Puppet caused kernel panics, we would have heard about it previously.

The stack trace unfortunately just shows system calls and not the offending Puppet/Ruby code that's triggering the problem. Can you create a minimal reproducing manifest and run truss -f puppet apply file.pp to get more info about what's happening?

Can you upgrade the kernel on one host and see if it fixes the problem...?

David Hodgkinson (JIRA)

Sep 28, 2018, 4:43:04 AM
to puppe...@googlegroups.com

Yes, it's bizarre and sadly it's not reliably reproducible. I'll try to reproduce. And I'll try to use strace. It really does smell of an operating system problem more than anything.

Josh Cooper (JIRA)

Dec 12, 2019, 12:38:04 AM
to puppe...@googlegroups.com
Josh Cooper commented on Bug PUP-9174

Thanks for reporting this issue. However, we haven’t heard any updates, and are closing this issue now as Cannot Reproduce. If you have additional information or reproduction scenarios that may be of use, please comment in this ticket with details.
