IPMI Discover Task on OpenCrowbar hung

Soheil Eizadi

Mar 3, 2015, 1:17:24 AM
to openc...@googlegroups.com
I have a new node that has been stuck in the IPMI Discover state for hours. I can SSH to the node from my workstation; here are the messages and dmesg logs. What should I do next?
-Soheil

This is the repeating error in /var/log/messages:

Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: INFO: task modprobe:1270 blocked for more than 120 seconds.
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel:      Not tainted 2.6.32-504.3.3.el6.x86_64 #1
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: modprobe      D 000000000000000d     0  1270      1 0x00000000
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: ffff880c2bb2fca8 0000000000000082 0000000000000000 ffff880c2bb2fd28
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: ffff880c2bb2fca8 ffffffffa0194a5e 00000003bef751ff ffffffff8106d175
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: ffff880c2bb2fc28 00000000fffba750 ffff880c2bf6dab8 ffff880c2bb2ffd8
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: Call Trace:
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa0194a5e>] ? i_ipmi_request+0x28e/0x680 [ipmi_msghandler]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffff8106d175>] ? enqueue_entity+0x125/0x450
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa0194f65>] get_guid+0x115/0x140 [ipmi_msghandler]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffff81064be5>] ? wake_up_process+0x15/0x20
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa0197dbe>] ipmi_register_smi+0x45e/0xdf0 [ipmi_msghandler]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffff810874f0>] ? process_timeout+0x0/0x10
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa01a112b>] ? clear_obf+0x1b/0x20 [ipmi_si]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa01a15c1>] ? kcs_event+0x351/0x590 [ipmi_si]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa019fca3>] try_smi_init+0x523/0x890 [ipmi_si]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffff8120a90b>] ? sysfs_create_file+0x2b/0x40
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa01a413e>] init_ipmi_si+0x82b/0x9fe [ipmi_si]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa01a3913>] ? init_ipmi_si+0x0/0x9fe [ipmi_si]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffff8100204c>] do_one_initcall+0x3c/0x1d0
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffff810bfff1>] sys_init_module+0xe1/0x250
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Mar  3 02:51:28 d00-8c-fa-03-a5-9c ntpd[3287]: 0.0.0.0 0612 02 freq_set kernel -4.933 PPM
Mar  3 02:51:28 d00-8c-fa-03-a5-9c ntpd[3287]: 0.0.0.0 0615 05 clock_sync
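To see at a glance which driver a hung task is stuck in, the trace can be reduced to its reliable frames. The following is a minimal Python sketch (the function name `summarize_hung_tasks` is hypothetical, and it assumes the syslog line format shown above). It skips frames the kernel marks with `?`, since the unwinder is not sure those are on the real stack. On this trace the innermost reliable frame is `get_guid` in `ipmi_msghandler`, called from `ipmi_register_smi` while `ipmi_si` was initializing, which points at the BMC not answering during driver load.

```python
import re
from typing import Dict, List, Optional

# Kernel hung-task report, e.g.
# "INFO: task modprobe:1270 blocked for more than 120 seconds."
HUNG_RE = re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds")

# One call-trace frame, e.g.
# "[<ffffffffa0194f65>] get_guid+0x115/0x140 [ipmi_msghandler]".
# A "? " before the symbol marks a frame the unwinder is not sure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?"
)


def summarize_hung_tasks(log_text: str) -> List[Dict]:
    """Group each hung-task report with the reliable frames of its call trace.

    Frames are kept innermost first as (function, module) tuples; module is
    None for functions built into the kernel image.
    """
    reports: List[Dict] = []
    current: Optional[Dict] = None
    for line in log_text.splitlines():
        hung = HUNG_RE.search(line)
        if hung:
            current = {
                "task": hung.group(1),
                "pid": int(hung.group(2)),
                "timeout_s": int(hung.group(3)),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        # Ignore "?" frames and any frame seen before the first report.
        if frame and current is not None and frame.group(1) is None:
            current["frames"].append((frame.group(2), frame.group(3)))
    return reports


# Abbreviated excerpt of the /var/log/messages output quoted above.
sample = """\
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: INFO: task modprobe:1270 blocked for more than 120 seconds.
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa0194a5e>] ? i_ipmi_request+0x28e/0x680 [ipmi_msghandler]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa0194f65>] get_guid+0x115/0x140 [ipmi_msghandler]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa0197dbe>] ipmi_register_smi+0x45e/0xdf0 [ipmi_msghandler]
Mar  3 02:50:47 d00-8c-fa-03-a5-9c kernel: [<ffffffffa019fca3>] try_smi_init+0x523/0x890 [ipmi_si]
"""

for report in summarize_hung_tasks(sample):
    # prints: modprobe 1270 ('get_guid', 'ipmi_msghandler')
    print(report["task"], report["pid"], report["frames"][0])
```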


This is the last thing in /var/log/dmesg:

scsi 6:0:2:0: qdepth(32), tagged(1), simple(1), ordered(0), scsi_level(6), cmd_que(1)
iTCO_vendor_support: vendor-support=0
iTCO_wdt: Intel TCO WatchDog Timer Driver v1.07rh
iTCO_wdt: unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
STARTING CRC_T10DIF
sd 6:1:0:0: [sda] 2923825152 512-byte logical blocks: (1.49 TB/1.36 TiB)
sd 6:1:0:0: [sda] Write Protect is off
sd 6:1:0:0: [sda] Mode Sense: 03 00 00 08
sd 6:1:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3 sda5 sda6 sda7 sda8
sd 6:1:0:0: [sda] Attached SCSI disk
ipmi message handler version 39.2
IPMI System Interface driver.
ipmi_si: probing via ACPI
ipmi_si 00:07: [io  0x0ca2] regsize 1 spacing 1 irq 0
ipmi_si: Adding ACPI-specified kcs state machine
ipmi_si: probing via SMBIOS
ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
ipmi_si: Adding SMBIOS-specified kcs state machine duplicate interface
ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
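The dmesg tail tells the same story from the driver side: the last IPMI line is `ipmi_si` trying the KCS interface at I/O address 0xca2, and nothing follows it. Below is a small Python sketch that checks for that pattern. The helper names are hypothetical, and the heuristic (a probe that completed would have logged at least one more IPMI line) is an assumption rather than something shown in this thread.

```python
import re
from typing import List, Optional, Tuple

# Line prefixes used by the IPMI drivers in the dmesg excerpt above.
IPMI_PREFIXES = ("ipmi", "IPMI")
# Address reported by the "Trying ... at i/o address 0x..." probe line.
ADDR_RE = re.compile(r"i/o address (0x[0-9a-f]+)")


def ipmi_lines(dmesg_text: str) -> List[str]:
    """Return the IPMI driver lines of a dmesg dump, in order."""
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if line.strip().startswith(IPMI_PREFIXES)
    ]


def stalled_probe(dmesg_text: str) -> Optional[Tuple[str, Optional[str]]]:
    """If the last IPMI line is an ipmi_si 'Trying' probe, return it and its I/O address.

    Heuristic (assumption): a completed probe would log further IPMI lines.
    """
    lines = ipmi_lines(dmesg_text)
    if not lines or not lines[-1].startswith("ipmi_si: Trying"):
        return None
    addr = ADDR_RE.search(lines[-1])
    return (lines[-1], addr.group(1) if addr else None)


# Tail of the dmesg output quoted above.
sample = """\
sd 6:1:0:0: [sda] Attached SCSI disk
ipmi message handler version 39.2
IPMI System Interface driver.
ipmi_si: probing via ACPI
ipmi_si 00:07: [io  0x0ca2] regsize 1 spacing 1 irq 0
ipmi_si: Adding ACPI-specified kcs state machine
ipmi_si: probing via SMBIOS
ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
ipmi_si: Adding SMBIOS-specified kcs state machine duplicate interface
ipmi_si: Trying ACPI-specified kcs state machine at i/o address 0xca2, slave address 0x0, irq 0
"""

result = stalled_probe(sample)
if result is not None:
    print("probe never finished at", result[1])  # prints: probe never finished at 0xca2
```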

Greg Althaus

Mar 3, 2015, 9:44:02 AM
to openc...@googlegroups.com
It looks like the IPMI module crashed or hung while loading. I've seen
this when the BMC/iDRAC is wedged. Sometimes that component needs to be
power cycled (full power removal, or logging into the BMC/iDRAC and
rebooting it). Otherwise, I'm at a loss. It would be nice to know the
hardware type.
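The "reboot the BMC" option can also be driven over the network with `ipmitool mc reset cold`. Here is a minimal Python sketch that builds and (optionally) runs that command. It assumes the BMC's LAN interface is already configured and reachable from another machine; the host address, user name and password file are placeholders. Running `ipmitool` on the hung node itself would go through the same stuck `ipmi_si` driver, which is why the sketch talks to the BMC remotely over `lanplus`. A cold reset should reboot only the management controller, not the host, and how a given BMC or iDRAC responds varies by vendor, so this does not replace a full power removal if that turns out to be needed.

```python
import shlex
import subprocess
from typing import List


def bmc_cold_reset_cmd(bmc_host: str, user: str, password_file: str) -> List[str]:
    """Build an ipmitool command that cold-resets (reboots) a BMC over the LAN.

    Uses the IPMI v2.0 "lanplus" interface and reads the password from a
    file (-f) so it does not show up in the process list.
    """
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host,
        "-U", user,
        "-f", password_file,
        "mc", "reset", "cold",
    ]


def reset_bmc(cmd: List[str], timeout_s: int = 60) -> subprocess.CompletedProcess:
    """Run the command; raises CalledProcessError or TimeoutExpired on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True, timeout=timeout_s)


if __name__ == "__main__":
    # Placeholder BMC address, user and password file: substitute your own.
    cmd = bmc_cold_reset_cmd("192.0.2.10", "root", "/root/.bmc_password")
    # Print the command for review before actually calling reset_bmc(cmd).
    print(shlex.join(cmd))
```

Printing the command first, rather than running it straight away, makes it easy to confirm you are pointing at the right BMC before rebooting it.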

Thanks,
Greg Althaus

Soheil

Mar 3, 2015, 10:53:45 AM
to openc...@googlegroups.com
I'm pretty sure I power cycled that node, but I will do it again.

That node was previously managed by another instance of OC at the same IP address. Does Sledgehammer retain any state that could cause this?

-Soheil

Greg Althaus

Mar 3, 2015, 11:06:10 AM
to openc...@googlegroups.com
It shouldn't retain state. Hmm, but IPMI was already configured on it.
We have problems with IP address reuse in some cases, but that wouldn't
cause this.

Thanks,
Greg

Soheil

Mar 3, 2015, 11:48:10 AM
to openc...@googlegroups.com
Where is the source tree for Sledgehammer, in case I want to look into the call trace?
-Soheil

Greg Althaus

Mar 3, 2015, 11:51:15 AM
to openc...@googlegroups.com
It is stock CentOS 6.6 from the ISO (pre-built and cached in AWS). The
files are in the sledgehammer directory in core.

Thanks,
Greg