
ipmi_watchdog: IPMI Watchdog: response: Error 80 on cmd 22


Arkadiusz Miśkiewicz

Nov 19, 2011, 8:30:01 AM

Hello,

I have a few machines where ipmi_watchdog (currently from the 3.0.9 kernel)
reports after some time:

IPMI Watchdog: response: Error 80 on cmd 22

In less than a week
# grep "Error 80 on cmd 22" /var/log/kernel |wc -l
378681

#define IPMI_WDOG_RESET_TIMER 0x22
but no idea what error 80 is.
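
A note on reading the message: the driver prints both numbers with "%x" (see the printk in ipmi_wdog_msg_handler quoted later in this thread), so "80" and "22" are hexadecimal, i.e. completion code 0x80 for command 0x22. A minimal sketch of pulling those two values out of a log line follows. The helper name parse_wdog_error is hypothetical and not part of the kernel or any IPMI tool.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not kernel code): extract the completion code and
 * the command number from an ipmi_watchdog log line. The driver formats
 * both with "%x", so they are hexadecimal without a 0x prefix.
 * Returns 1 if both values were parsed, 0 otherwise.
 */
static int parse_wdog_error(const char *line, unsigned int *cc,
                            unsigned int *cmd)
{
    static const char marker[] = "IPMI Watchdog: response: Error ";
    const char *p;

    assert(line != NULL && cc != NULL && cmd != NULL);

    /* Skip the syslog prefix and timestamp by searching for the marker. */
    p = strstr(line, marker);
    if (p == NULL)
        return 0;

    /* Both fields are read as hex, matching the driver's "%x" format. */
    return sscanf(p + strlen(marker), "%x on cmd %x", cc, cmd) == 2;
}
```

Called on one of the lines below, this should yield cc == 0x80 and cmd == 0x22, the latter matching IPMI_WDOG_RESET_TIMER.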

It happens on systems like the Supermicro H8DMU or Supermicro X8SIL
(a few other Supermicro servers don't show this problem).

Options used on all these systems are:
options ipmi_watchdog timeout=120 action=reset nowayout=1

Any ideas?

Nov 18 22:49:36 server1 kernel: [ 27.229458] ipmi message handler version 39.2
Nov 18 22:49:36 server1 kernel: [ 27.231966] IPMI System Interface driver.
Nov 18 22:49:36 server1 kernel: [ 27.234635] ipmi_si: probing via SMBIOS
Nov 18 22:49:36 server1 kernel: [ 27.234640] ipmi_si: SMBIOS: io 0xca2 regsize 1 spacing 1 irq 0
Nov 18 22:49:36 server1 kernel: [ 27.234643] ipmi_si: Adding SMBIOS-specified kcs state machine
Nov 18 22:49:36 server1 kernel: [ 27.234648] ipmi_si: Trying SMBIOS-specified kcs state machine at i/o address 0xca2, slave address 0x20, irq 0
Nov 18 22:49:36 server1 kernel: [ 27.377128] ipmi_si ipmi_si.0: Found new BMC (man_id: 0x0028c5, prod_id: 0x0004, dev_id: 0x22)
Nov 18 22:49:36 server1 kernel: [ 27.377139] ipmi_si ipmi_si.0: IPMI kcs interface initialized
Nov 18 22:49:36 server1 kernel: [ 27.379406] ipmi device interface
Nov 18 22:50:09 server1 kernel: [ 70.153536] IPMI Watchdog: driver initialized
Nov 19 00:28:39 server1 kernel: [ 9571.486912] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:44 server1 kernel: [ 9576.488498] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:44 server1 kernel: [ 9576.489910] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:44 server1 kernel: [ 9576.491111] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:44 server1 kernel: [ 9576.492273] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:44 server1 kernel: [ 9576.493445] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:44 server1 kernel: [ 9576.494604] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:49 server1 kernel: [ 9581.496078] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:54 server1 kernel: [ 9586.497404] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:54 server1 kernel: [ 9586.498773] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:54 server1 kernel: [ 9586.500072] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:54 server1 kernel: [ 9586.501268] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:54 server1 kernel: [ 9586.502472] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:54 server1 kernel: [ 9586.503669] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:28:59 server1 kernel: [ 9591.505118] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:29:04 server1 kernel: [ 9596.506447] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:29:04 server1 kernel: [ 9596.507853] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:29:04 server1 kernel: [ 9596.509093] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:29:04 server1 kernel: [ 9596.510372] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:29:04 server1 kernel: [ 9596.511607] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:29:04 server1 kernel: [ 9596.512819] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:29:09 server1 kernel: [ 9601.514268] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:29:14 server1 kernel: [ 9606.515579] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:29:14 server1 kernel: [ 9606.516990] IPMI Watchdog: response: Error 80 on cmd 22
Nov 19 00:29:14 server1 kernel: [ 9606.518223] IPMI Watchdog: response: Error 80 on cmd 22
[...]

--
Arkadiusz Miśkiewicz PLD/Linux Team
arekm / maven.pl http://ftp.pld-linux.org/

Arkadiusz Miśkiewicz

Nov 21, 2011, 4:20:02 PM
On Monday 21 of November 2011, cminyard wrote:
> On Sat, Nov 19, 2011 at 02:22:17PM +0100, Arkadiusz Miśkiewicz wrote:
> > Hello,
> >
> > I have a few machines where ipmi_watchdog (currently from the 3.0.9 kernel)
> > reports after some time:
> >
> > IPMI Watchdog: response: Error 80 on cmd 22
> >
> > In less than a week
> > # grep "Error 80 on cmd 22" /var/log/kernel |wc -l
> > 378681
> >
> > #define IPMI_WDOG_RESET_TIMER 0x22
> > but no idea what error 80 is.
>
> That error is a command-specific error, and it means: Attempt to start
> un-initialized watchdog
>
> I'm guessing that the IPMI controller gets reset somehow and then thinks
> its watchdog timer is not initialized, and thus the reset command causes
> an issue.
>
> A fix should be pretty easy: if you get a 0x80 response from a reset,
> re-initialize the timer.
>
> Can you try the following patch?

Seems to be working fine, thanks.

[ 62.628219] IPMI Watchdog: driver initialized
[ 62.709860] watchdog (2396): /proc/2396/oom_adj is deprecated, please use /proc/2396/oom_score_adj instead.

# ipmitool mc reset cold (easy reproducer of the problem)

[ 76.002690] IPMI Watchdog: response: Error ff on cmd 22
[ 76.002826] IPMI Watchdog: response: Error ff on cmd 22
[ 76.002942] IPMI Watchdog: response: Error ff on cmd 22
[ 76.003055] IPMI Watchdog: response: Error ff on cmd 22
[ 76.003189] IPMI Watchdog: response: Error ff on cmd 22
[ 76.003301] IPMI Watchdog: response: Error ff on cmd 22
[ 77.114769] usb 1-1.2: reset full speed USB device number 3 using ehci_hcd
[ 82.173475] usb 1-1.2: device descriptor read/64, error -32
[ 82.343205] usb 1-1.2: device descriptor read/64, error -32
[ 82.512940] usb 1-1.2: reset full speed USB device number 3 using ehci_hcd
[ 82.579460] usb 1-1.2: device descriptor read/64, error -32
[ 82.749201] usb 1-1.2: device descriptor read/64, error -32
[ 82.918936] usb 1-1.2: reset full speed USB device number 3 using ehci_hcd
[ 83.015337] IPMI Watchdog: response: Error ff on cmd 22
[ 83.324851] usb 1-1.2: device not accepting address 3, error -32
[ 83.391564] usb 1-1.2: reset full speed USB device number 3 using ehci_hcd
[ 83.797438] usb 1-1.2: device not accepting address 3, error -32
[ 83.797926] usb 1-1.2: USB disconnect, device number 3
[ 83.907383] usb 1-1.2: new full speed USB device number 4 using ehci_hcd
[ 83.973904] usb 1-1.2: device descriptor read/64, error -32
[ 84.143638] usb 1-1.2: device descriptor read/64, error -32
[ 84.313374] usb 1-1.2: new full speed USB device number 5 using ehci_hcd
[ 84.380018] usb 1-1.2: device descriptor read/64, error -32
[ 84.549626] usb 1-1.2: device descriptor read/64, error -32
[ 84.719361] usb 1-1.2: new full speed USB device number 6 using ehci_hcd
[ 85.125340] usb 1-1.2: device not accepting address 6, error -32
[ 85.192002] usb 1-1.2: new full speed USB device number 7 using ehci_hcd
[ 85.597929] usb 1-1.2: device not accepting address 7, error -32
[ 85.598000] hub 1-1:1.0: unable to enumerate USB device on port 2
[ 90.686716] usb 1-1.2: new full speed USB device number 8 using ehci_hcd
[ 90.770958] usb 1-1.2: New USB device found, idVendor=0557, idProduct=2221
[ 90.770961] usb 1-1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 90.770963] usb 1-1.2: Product: Hermon USB hidmouse Device
[ 90.770965] usb 1-1.2: Manufacturer: Winbond Electronics Corp
[ 90.772695] input: Winbond Electronics Corp Hermon USB hidmouse Device as /devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.2/1-1.2:1.0/input/input2
[ 90.773358] generic-usb 0003:0557:2221.0003: input,hidraw0: USB HID v1.00 Mouse [Winbond Electronics Corp Hermon USB hidmouse Device] on usb-0000:00:1a.0-1.2/input0
[ 90.775299] input: Winbond Electronics Corp Hermon USB hidmouse Device as /devices/pci0000:00/0000:00:1a.0/usb1/1-1/1-1.2/1-1.2:1.1/input/input3
[ 90.775556] generic-usb 0003:0557:2221.0004: input,hidraw1: USB HID v1.00 Keyboard [Winbond Electronics Corp Hermon USB hidmouse Device] on usb-0000:00:1a.0-1.2/input1
[ 94.533830] IPMI Watchdog: response: Error ff on cmd 22
[ 104.016238] IPMI Watchdog: response: The IPMI controller appears to have been reset, will attempt to reinitialize the watchdog timer



>
> From 4467601416e23740fc940c31b1fffacbcb69b4a0 Mon Sep 17 00:00:00 2001
> From: Corey Minyard <cmin...@mvista.com>
> Date: Mon, 21 Nov 2011 14:26:20 -0600
> Subject: [PATCH] ipmi_watchdog: Restore settings when BMC reset
>
> If the BMC gets reset, it will return 0x80 response errors. In this case,
> it is probably a good idea to restore the IPMI settings.
> ---
> drivers/char/ipmi/ipmi_watchdog.c | 41 ++++++++++++++++++++++++++++++++++--
> 1 files changed, 38 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/char/ipmi/ipmi_watchdog.c b/drivers/char/ipmi/ipmi_watchdog.c
> index c2917ffa..34767a6 100644
> --- a/drivers/char/ipmi/ipmi_watchdog.c
> +++ b/drivers/char/ipmi/ipmi_watchdog.c
> @@ -139,6 +139,8 @@
> #define IPMI_WDOG_SET_TIMER 0x24
> #define IPMI_WDOG_GET_TIMER 0x25
>
> +#define IPMI_WDOG_TIMER_NOT_INIT_RESP 0x80
> +
> /* These are here until the real ones get into the watchdog.h interface. */
> #ifndef WDIOC_GETTIMEOUT
> #define WDIOC_GETTIMEOUT _IOW(WATCHDOG_IOCTL_BASE, 20, int)
> @@ -596,6 +598,7 @@ static int ipmi_heartbeat(void)
> struct kernel_ipmi_msg msg;
> int rv;
> struct ipmi_system_interface_addr addr;
> + int timeout_retries = 0;
>
> if (ipmi_ignore_heartbeat)
> return 0;
> @@ -616,6 +619,7 @@ static int ipmi_heartbeat(void)
>
> mutex_lock(&heartbeat_lock);
>
> +restart:
> atomic_set(&heartbeat_tofree, 2);
>
> /*
> @@ -653,7 +657,33 @@ static int ipmi_heartbeat(void)
> /* Wait for the heartbeat to be sent. */
> wait_for_completion(&heartbeat_wait);
>
> - if (heartbeat_recv_msg.msg.data[0] != 0) {
> + if (heartbeat_recv_msg.msg.data[0] == IPMI_WDOG_TIMER_NOT_INIT_RESP) {
> + timeout_retries++;
> + if (timeout_retries > 3) {
> + printk(KERN_ERR PFX ": Unable to restore the IPMI"
> + " watchdog's settings, giving up.\n");
> + rv = -EIO;
> + goto out_unlock;
> + }
> +
> + /*
> + * The timer was not initialized, that means the BMC was
> + * probably reset and lost the watchdog information. Attempt
> + * to restore the timer's info. Note that we still hold
> + * the heartbeat lock, to keep a heartbeat from happening
> + * in this process, so must say no heartbeat to avoid a
> + * deadlock on this mutex.
> + */
> + rv = ipmi_set_timeout(IPMI_SET_TIMEOUT_NO_HB);
> + if (rv) {
> + printk(KERN_ERR PFX ": Unable to send the command to"
> + " set the watchdog's settings, giving up.\n");
> + goto out_unlock;
> + }
> +
> + /* We might need a new heartbeat, so do it now */
> + goto restart;
> + } else if (heartbeat_recv_msg.msg.data[0] != 0) {
> /*
> * Got an error in the heartbeat response. It was already
> * reported in ipmi_wdog_msg_handler, but we should return
> @@ -662,6 +692,7 @@ static int ipmi_heartbeat(void)
> rv = -EINVAL;
> }
>
> +out_unlock:
> mutex_unlock(&heartbeat_lock);
>
> return rv;
> @@ -922,11 +953,15 @@ static struct miscdevice ipmi_wdog_miscdev = {
> static void ipmi_wdog_msg_handler(struct ipmi_recv_msg *msg,
> void *handler_data)
> {
> - if (msg->msg.data[0] != 0) {
> + if (msg->msg.cmd == IPMI_WDOG_RESET_TIMER &&
> + msg->msg.data[0] == IPMI_WDOG_TIMER_NOT_INIT_RESP)
> + printk(KERN_INFO PFX "response: The IPMI controller appears"
> + " to have been reset, will attempt to reinitialize"
> + " the watchdog timer\n");
> + else if (msg->msg.data[0] != 0)
> printk(KERN_ERR PFX "response: Error %x on cmd %x\n",
> msg->msg.data[0],
> msg->msg.cmd);
> - }
>
> ipmi_free_recv_msg(msg);

cminyard

Nov 21, 2011, 4:40:01 PM
On Sat, Nov 19, 2011 at 02:22:17PM +0100, Arkadiusz Miśkiewicz wrote:
>
> Hello,
>
> I have a few machines where ipmi_watchdog (currently from the 3.0.9 kernel)
> reports after some time:
>
> IPMI Watchdog: response: Error 80 on cmd 22
>
> In less than a week
> # grep "Error 80 on cmd 22" /var/log/kernel |wc -l
> 378681
>
> #define IPMI_WDOG_RESET_TIMER 0x22
> but no idea what error 80 is.

That error is a command-specific error, and it means: Attempt to start
un-initialized watchdog
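
To make the "command-specific" part concrete, here is a rough sketch of how a completion code could be classified. It assumes my reading of the IPMI specification's completion code table (0x00 is success, 0x80 to 0xBE are command-specific, 0xFF is unspecified); the function name describe_completion_code is made up for illustration, and only the 0x80 meaning for the reset command comes from this thread.

```c
#include <assert.h>
#include <string.h>

#define IPMI_WDOG_RESET_TIMER 0x22 /* from the driver source quoted above */

/*
 * Hypothetical helper: give a short description of an IPMI completion
 * code. Codes in 0x80..0xBE are command-specific, so the same value can
 * mean different things for different commands.
 */
static const char *describe_completion_code(unsigned char cmd,
                                            unsigned char cc)
{
    if (cc == 0x00)
        return "command completed normally";
    if (cc == 0xff)
        return "unspecified error";
    if (cc >= 0x80 && cc <= 0xbe) {
        /* The one command-specific meaning discussed in this thread. */
        if (cmd == IPMI_WDOG_RESET_TIMER && cc == 0x80)
            return "attempt to start un-initialized watchdog";
        return "command-specific error";
    }
    return "other generic or OEM code";
}
```

With this, the pair (0x22, 0x80) from the log maps to the un-initialized watchdog case, while 0xff falls into the unspecified bucket.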

I'm guessing that the IPMI controller gets reset somehow and then thinks
its watchdog timer is not initialized, and thus the reset command causes
an issue.

A fix should be pretty easy: if you get a 0x80 response from a reset,
re-initialize the timer.

Can you try the following patch?

1.7.4.1