Re: Executing puppet crash the machine


Mon

Nov 22, 2012, 12:23:06 PM
to puppet...@googlegroups.com



Hello all,

We have a problem with puppet on a certain kind of machine in our farm (300+): those with a Supermicro X8SIE motherboard. Sometimes when puppet runs, the machine crashes: we lose access to it, and logging in through IPMI shows nothing on the console, so the only thing we can do is a cold reboot. If we then run puppet again, nothing happens. If we run puppet several days later, it may or may not crash again; it is random.
I debugged the problem and concluded that the crash is triggered by running "facter": running it in an mpssh session caused 7 or 8 crashes on different machines.

Software versions:
OS: Ubuntu 8.04
facter                          1.5.4-1ubuntu1
puppet                         0.25.1-2          

After upgrading to facter 1.6.11-1 (the latest .deb from Puppet Labs for hardy), the crashes continued.


Sorry, I sent the message before finishing.

I managed to capture some traces with "strace", which I can paste if you think they would help.

Has anyone experienced something like this?


jcbollinger

Nov 26, 2012, 5:59:06 PM
to puppet...@googlegroups.com




For what it's worth, Facter itself is unlikely to be crashing your system, but it runs a variety of commands that probe system details, and it's possible that one or a combination of those sometimes crashes them.  It should be possible to crash the systems by running the same commands from the shell.

If you have straces of facter sessions that resulted in crashes then they might be illuminating.  The key thing I would be looking for is what commands Facter is trying to run when the crashes occurred.  Unfortunately, the nature of the problem precludes being certain that the last thing in the captured trace is actually the thing Facter was trying to do when the crash happened.
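One way to pull the commands out of a trace is sketched below. It assumes the trace was captured with strace's `-f` option, so that the `execve` calls made by child processes are recorded; the traces posted in this thread do not appear to include them. The function name `commands_run` is hypothetical, and the regular expression only relies on the usual `execve("path", ...)` form of strace output.

```python
import re

# Matches the program path in a strace line such as:
#   8709  execve("/usr/sbin/dmidecode", ["dmidecode"], [/* 20 vars */]) = 0
EXECVE = re.compile(r'execve\("([^"]+)"')


def commands_run(trace_lines):
    """Return the program paths passed to execve, in trace order.

    Failed attempts (for example ENOENT while searching PATH) are skipped.
    """
    commands = []
    for line in trace_lines:
        if "= -1" in line:
            continue  # the exec did not succeed, so nothing was run
        match = EXECVE.search(line)
        if match:
            commands.append(match.group(1))
    return commands
```

The last entry of the returned list is only a candidate: as noted above, the trace may stop before the command that was actually running when the machine went down.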

If there is a software bug then it is probably in a separate tool or in the OS kernel.  It might also be that you have a firmware (i.e. BIOS) bug on the affected systems, or even that the particular motherboard model that is affected has a design or fabrication flaw.


John

mseisdedos

Nov 28, 2012, 10:49:13 AM
to puppet...@googlegroups.com
Hello John,
Thanks for your answer. I have opened an issue with my hardware manufacturer and will do the same with my OS vendor.
In any case, I am pasting the strace listings so that maybe someone can shed light on them:

server1:

BIOS: American Megatrends Inc. 1.2      
SYS: Supermicro X8SIE
CPU: Intel(R) Core(TM) i3 CPU 550 @ 3.20GHz [4 cores]
MEM:
  SLOT0  2048 MB
  SLOT1  2048 MB


open("/usr/lib/ruby/1.8/facter/osfamily.rb", O_RDONLY|O_LARGEFILE) = 3
close(3) = 0
open("/usr/lib/ruby/1.8/facter/osfamily.rb", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=800, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7297000
read(3, "# Fact: osfamily\n#\n# Purpose: Re"..., 4096) = 800
......CRASH


server2:

BIOS: American Megatrends Inc. 1.2      
SYS: Supermicro X8SIE
CPU: Intel(R) Core(TM) i3 CPU 560 @ 3.33GHz [4 cores]
MEM:
  SLOT0  2048 MB
  SLOT1  2048 MB



stat64("/usr/sbin/dmidecode", {st_mode=S_IFREG|0755, st_size=48408, ...}) = 0
pipe([3, 4]) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb74e5ba8) = 8709
close(4) = 0
fcntl64(3, F_GETFL) = 0 (flags O_RDONLY)
fstat64(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb725e000
_llseek(3, 0, 0xbf900930, SEEK_CUR) = -1 ESPIPE(Illegal seek)
fstat64(3, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 0
read(3, "# dmidecode 2.9\nSMBIOS 2.6 prese"..., 1024) = 1024
read(3, "oot is supported\n\t\tBIOS boot spe"..., 1024) = 1024
read(3, "tate: Safe\n\tThermal State: Safe\n"..., 1024) = 1024
read(3, "Maximum Size: 128 KB\n\tSupported "..., 1024) = 1024
read(3, "e 5, 28 bytes\nMemory Controller "..., 1024) = 1024
read(3, " Installed\n\tError Status: OK\n\nHa"..., 1024) = 1024
read(3, " type 8, 9 bytes\nPort Connector "..., 1024) = 1024
read(3, "ternal Reference Designator: LPT"..., 1024) = 1024
read(3, "nal Reference Designator: Not Sp"..., 1024) = 1024
read(3, "nator: Not Specified\n\tExternal C"..., 1024) = 1024
read(3, "or Type: None\n\tPort Type: Other\n"..., 1024) = 1024
read(3, "ector Information\n\tInternal Refe"..., 1024) = 1024
read(3, "\tLength: Short\n\tID: 1\n\tCharacter"..., 1024) = 1024
read(3, "escriptor 5: POST error\n\tData Fo"..., 1024) = 1024
read(3, "ype 19, 15 bytes\nMemory Array Ma"..., 1024) = 1024
read(3, " Width: Unknown\n\tSize: No Module"..., 1024) = 1024
read(3, "ry Device Mapped Address\n\tStarti"..., 1024) = 1024
read(3, "on Handle: Not Provided\n\tTotal W"..., 1024) = 1024
--- SIGCHLD (Child exited) @ 0 (0) ---
read(3, "\n\nHandle 0x0039, DMI type 20, 19"..., 1024) = 1024
read(3, "on-recoverable Threshold: 6\n\nHan"..., 1024) = 1024
read(3, "UT OF SPEC>\n\tCooling Unit Group:"..., 1024) = 1024
read(3, "ed: Yes\n\tHot Replaceable: No\n\tCo"..., 1024) = 669
read(3, "", 1024) = 0
close(3) = 0
munmap(0xb725e000, 4096) = 0
rt_sigaction(SIGHUP, {SIG_IGN}, {0xb77388f0, [HUP], SA_RESTART}, 8) = 0
rt_sigaction(SIGQUIT, {SIG_IGN}, {0xb77388f0, [QUIT], SA_RESTART}, 8) = 0
rt_sigaction(SIGINT, {SIG_IGN}, {0xb77388f0, [INT], SA_RESTART}, 8) = 0
waitpid(8709, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0) = 8709
rt_sigaction(SIGHUP, {0xb77388f0, [HUP], SA_RESTART}, {SIG_IGN}, 8) = 0
rt_sigaction(SIGQUIT, {0xb77388f0, [QUIT], SA_RESTART}, {SIG_IGN}, 8) = 0
rt_sigaction(SIGINT, {0xb77388f0, [INT], SA_RESTART}, {SIG_IGN}, 8) = 0
............
sigprocmask(SIG_SETMASK, [], NULL) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_SETMASK, [], NULL) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
.............
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_SETMASK, [], NULL) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
.........
sigprocmask(SIG_BLOCK, NULL, []) = 0
sigprocmask(SIG_BLOCK, NULL, []) = 0
.......CRASH



jcbollinger

Nov 28, 2012, 3:32:32 PM
to puppet...@googlegroups.com

I'm supposing that ".......CRASH" means "more of the same syscall, with similar results, until the trace ends on account of a system crash."

The second trace says nothing useful, as far as I can tell.  The last thing it shows before all the signal mask handling is the successful completion of a fact evaluation.

The first trace is not much more helpful.  The last thing it shows is Facter reading the Ruby code for the 'osfamily' fact.  That might indicate that it is during evaluation of that fact that the system crashed, but it's too far removed from fact evaluation for me to have any confidence in that.

My bet would be that the crash cuts off communication before its cause is reported in the trace, as I warned might be the case.

Here's another thing you could try: since facter doesn't always crash the system (if I understand correctly), you should be able to get a list of all the facts it is evaluating (and their values) by running "facter -p" from the command line.  Take that list, and use it to stress test facter on each fact individually (i.e. run facter -p <factname> many times in a loop), in a way that lets you be sure you always know which fact is currently under test.  In this way you may be able to identify one or more facts whose evaluation sometimes crashes the machine.

Note: don't neglect the "or more" above.  It is conceivable that your problem is deeper than just one fact.
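The per-fact stress test could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested tool: the function names and the log format are hypothetical, and the parser assumes the plain `name => value` output of `facter -p`. The important part is that each fact name is written to a log and forced to disk with `fsync` before the fact is evaluated, so that the log still names the fact under test after a hard crash.

```python
import os
import subprocess


def list_fact_names(facter_output):
    """Extract fact names from 'name => value' lines of `facter -p` output."""
    names = []
    for line in facter_output.splitlines():
        if " => " in line:
            names.append(line.split(" => ", 1)[0].strip())
    return names


def log_durably(log_path, message):
    """Append one line and fsync it, so it is on disk before the next step."""
    log = open(log_path, "a")
    try:
        log.write(message + "\n")
        log.flush()
        os.fsync(log.fileno())
    finally:
        log.close()


def run_facter(name):
    """Evaluate a single fact; the exit status is returned but not checked."""
    return subprocess.call(["facter", "-p", name])


def stress_facts(names, iterations, log_path, run=run_facter):
    """Evaluate each fact `iterations` times, logging before and after."""
    for name in names:
        for i in range(iterations):
            log_durably(log_path, "START %s run %d" % (name, i + 1))
            run(name)
            log_durably(log_path, "DONE %s run %d" % (name, i + 1))
```

After a crash and reboot, the last START line without a matching DONE line points at the fact that was being evaluated. Facts with multi-line values may confuse the simple parser, so the resulting name list is worth checking by eye before starting a long run.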

Once you know the facts with which the problem is associated, we can investigate the commands facter is running, and thereby narrow down the cause of the crash.


John

mseisdedos

Nov 28, 2012, 4:41:00 PM
to puppet...@googlegroups.com
Hello John,
Your assumption is correct.
I cannot run the facter loop because these machines are in a production environment. Every time I run puppet on them, I first make sure I can reach the IPMI interface so that I can reboot the machine within a few minutes.
Thanks for your help.
Regards.



Montse Seisdedos

Jun 20, 2013, 3:03:35 PM
to puppet...@googlegroups.com
Hello group:
We eventually performed the test John suggested and caught the "thief": virtual.rb.
We did not try to analyze why it hangs the machine. Since this fact is not used in our recipes, we simply removed it.
Thanks for your help.