Uhhuh. NMI received for unknown reason 2c on CPU 0.

Borislav Petkov

Jan 29, 2013, 3:30:02 PM
Hi,

this is rc5 + tip/master from 2 days ago, when resuming I get this fun
message:

...
[15117.684975] Restarting tasks ... done.
[15117.687201] video LNXVIDEO:00: Restoring backlight state
[15117.720469] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
[15117.721414] ehci-pci 0000:00:1a.0: power state changed by ACPI to D3cold
[15117.949185] [drm] Enabling RC6 states: RC6 on, RC6p on, RC6pp off
[15118.617192] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[15118.617198] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
[15123.971346] Uhhuh. NMI received for unknown reason 2c on CPU 0.
[15123.971353] Do you have a strange power saving mode enabled?
[15123.971356] Dazed and confused, but trying to continue

Machine is thinkpad x230. Any and all sensible suggestions are welcome.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.

Bjorn Helgaas

Jan 29, 2013, 4:40:02 PM
On Tue, Jan 29, 2013 at 1:28 PM, Borislav Petkov <b...@alien8.de> wrote:
> Hi,
>
> this is rc5 + tip/master from 2 days ago, when resuming I get this fun
> message:
>
> ...
> [15117.684975] Restarting tasks ... done.
> [15117.687201] video LNXVIDEO:00: Restoring backlight state
> [15117.720469] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
> [15117.721414] ehci-pci 0000:00:1a.0: power state changed by ACPI to D3cold
> [15117.949185] [drm] Enabling RC6 states: RC6 on, RC6p on, RC6pp off
> [15118.617192] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> [15118.617198] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> [15123.971346] Uhhuh. NMI received for unknown reason 2c on CPU 0.
> [15123.971353] Do you have a strange power saving mode enabled?
> [15123.971356] Dazed and confused, but trying to continue
>
> Machine is thinkpad x230. Any and all sensible suggestions are welcome.

Konstantin has some fixes for an e1000e power management issue related
to suspend/resume that he observed on an x220. He didn't see an NMI,
and apparently his problem has been around for a long time, so no idea
whether it could be related. I just noticed the conjunction of
thinkpad/e1000e/resume/power saving in both reports.

https://lkml.org/lkml/2013/1/18/147

Borislav Petkov

Jan 29, 2013, 10:50:02 PM
On Tue, Jan 29, 2013 at 02:32:56PM -0700, Bjorn Helgaas wrote:
> Konstantin has some fixes for an e1000e power management issue related
> to suspend/resume that he observed on an x220. He didn't see an NMI,
> and apparently his problem has been around for a long time,

Yeah, this is one of those issues you don't see *every* s/r cycle and
besides, I just got this box and haven't run 3.{6,7} on it yet (maybe
never will :-)).

> so no idea whether it could be related. I just noticed the conjunction
> of thinkpad/e1000e/resume/power saving in both reports.
>
> https://lkml.org/lkml/2013/1/18/147

Yes, thanks Bjorn, that was a good suggestion. Btw, from reading the
thread, those patches still need cooking a bit more, AFAICR people's
objections/comments. Or should I go ahead and test them?

It's just that I'm overly cautious every time I hear e1000e is involved:

www.linux-magazine.com/content/download/62169/484085/file/Security_Lessons_Ftrace.pdf

:-)

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Bjorn Helgaas

Jan 30, 2013, 12:30:03 PM
On Tue, Jan 29, 2013 at 8:42 PM, Borislav Petkov <b...@alien8.de> wrote:
> On Tue, Jan 29, 2013 at 02:32:56PM -0700, Bjorn Helgaas wrote:
>> Konstantin has some fixes for an e1000e power management issue related
>> to suspend/resume that he observed on an x220. He didn't see an NMI,
>> and apparently his problem has been around for a long time,
>
> Yeah, this is one of those issues you don't see *every* s/r cycle and
> besides, I just got this box and haven't run 3.{6,7} on it yet (maybe
> never will :-)).
>
>> so no idea whether it could be related. I just noticed the conjunction
>> of thinkpad/e1000e/resume/power saving in both reports.
>>
>> https://lkml.org/lkml/2013/1/18/147
>
> Yes, thanks Bjorn, that was a good suggestion. Btw, from reading the
> thread, those patches still need cooking a bit more, AFAICR people's
> objections/comments. Or should I go ahead and test them?

You're right, I don't think we're quite ready to merge those patches.
But if your NMI is easy to reproduce, it might be worth removing
e1000e altogether to see if it still happens. I noticed in your
original log that the NMI occurred 5 seconds after the e1000e message,
and I could imagine some CPU or PCI response timeout being 5 seconds.
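
The gap can be read straight off the timestamps in the original log. A minimal sketch, assuming the usual "[seconds.micros]" dmesg prefix; dmesg_gap is a hypothetical helper name, not an existing tool:

```shell
# dmesg_gap: read a dmesg-style log on stdin and print the number of
# seconds between the first line matching pattern $1 and the first
# later line matching pattern $2. Hypothetical helper, for illustration.
dmesg_gap() {
    awk -v first="$1" -v second="$2" '
        # Only consider lines that start with a "[ 123.456789]" timestamp.
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            ts = substr($0, 2, RLENGTH - 2) + 0
            if (!seen && $0 ~ first) { t0 = ts; seen = 1; next }
            if (seen && $0 ~ second) { printf "%.3f\n", ts - t0; exit }
        }
    '
}
```

Run as `dmesg | dmesg_gap 'disabling TSO' 'NMI received'`. Against the two lines quoted in the first mail it reports a gap of roughly 5.35 seconds.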

> It's just that I'm overly cautious every time I hear e1000e is involved:
>
> www.linux-magazine.com/content/download/62169/484085/file/Security_Lessons_Ftrace.pdf

Thanks for the pointer, that was an interesting read :)

Bjorn

Borislav Petkov

Jan 30, 2013, 12:50:01 PM
On Wed, Jan 30, 2013 at 10:27:42AM -0700, Bjorn Helgaas wrote:
> You're right, I don't think we're quite ready to merge those patches.
> But if your NMI is easy to reproduce, it might be worth removing
> e1000e altogether to see if it still happens.

That's the problem - I've seen it only once so far. I'll watch out for
it and do the above when I find a reliable way of reproducing it. Will
keep you posted.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Jiri Slaby

Jan 30, 2013, 2:50:01 PM
On 01/30/2013 06:44 PM, Borislav Petkov wrote:
> On Wed, Jan 30, 2013 at 10:27:42AM -0700, Bjorn Helgaas wrote:
>> You're right, I don't think we're quite ready to merge those patches.
>> But if your NMI is easy to reproduce, it might be worth removing
>> e1000e altogether to see if it still happens.
>
> That's the problem - I've seen it only once so far. I'll watch out for
> it and do the above when I find a reliable way of reproducing it. Will
> keep you posted.

It happens here too. Dunno what is the root cause. I *think* that it
never happened unless I used ethernet. Other than that I see no pattern.

Attaching -C 20 grep of messages over the last half year if there is
something that may help somehow.
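
A grep along these lines would produce that kind of excerpt. This is only a sketch: the log path is an assumption (it varies by distribution), and nmi_context is a hypothetical wrapper name:

```shell
# nmi_context: print every unknown-NMI message together with N lines of
# context on either side.
# $1 = log file (assumed location, e.g. /var/log/messages)
# $2 = number of context lines (defaults to 20, as in the mail)
nmi_context() {
    grep -C "${2:-20}" 'NMI received for unknown reason' "$1"
}
```

For example, `nmi_context /var/log/messages` gives 20 lines before and after each hit.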

--
js
suse labs
lll

Borislav Petkov

Jan 30, 2013, 3:10:04 PM
Cool, so it happens once a day, not every day, every time during resume,
and with e1000e. Can you try Bjorn's suggestion to remove e1000e
altogether and see if it still happens?

Jiri Slaby

Jan 30, 2013, 3:40:02 PM
On 01/30/2013 09:00 PM, Borislav Petkov wrote:
> On Wed, Jan 30, 2013 at 08:43:55PM +0100, Jiri Slaby wrote:
>> On 01/30/2013 06:44 PM, Borislav Petkov wrote:
>>> On Wed, Jan 30, 2013 at 10:27:42AM -0700, Bjorn Helgaas wrote:
>>>> You're right, I don't think we're quite ready to merge those patches.
>>>> But if your NMI is easy to reproduce, it might be worth removing
>>>> e1000e altogether to see if it still happens.
>>>
>>> That's the problem - I've seen it only once so far. I'll watch out for
>>> it and do the above when I find a reliable way of reproducing it. Will
>>> keep you posted.
>>
>> It happens here too. Dunno what is the root cause. I *think* that it
>> never happened unless I used ethernet. Other than that I see no pattern.
>>
>> Attaching -C 20 grep of messages over the last half year if there is
>> something that may help somehow.
>
> Cool, so it happens once a day, not every day, everytime during resume,
> and with e1000e. Can you try Bjorn's suggestion to remove e1000e
> altogether and see if it still happens?

No, e1000e is not to blame at all. I moved e1000e out of /lib/modules
and it still happens.

What is cool is that I have steps to reproduce:
1) boot
2) run the attached script (turn on all possible power savings -- in
fact everything that powertop suggests)
3) suspend to _disk_ (mem is not enough, BIOS apparently has to
interfere here)
4) resume from disk
5) boom

I also tried removing the wireless drivers, no change.

--
js
suse labs
power

Rafael J. Wysocki

Jan 30, 2013, 4:40:02 PM
No, I don't think it's the BIOS. Most likely the boot kernel.

> 4) resume from disk
> 5) boom
>
> I tried to remove also wireless drivers, no change.

Is the resume boot kernel the same as the one in the image?

Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Jiri Slaby

Jan 30, 2013, 5:20:02 PM
On 01/30/2013 10:39 PM, Rafael J. Wysocki wrote:
>> What is cool is that I have steps to reproduce:
>> 1) boot
>> 2) run the attached script (turn on all possible power savings -- in
>> fact everything what powertop suggests)
>> 3) suspend to _disk_ (mem is not enough, BIOS apparently has to
>> interfere here)
>
> No, I don't think it's the BIOS. Most likely the boot kernel.

Or that...

>> 4) resume from disk
>> 5) boom
>>
>> I tried to remove also wireless drivers, no change.
>
> Is the resume boot kernel the same as the one in the image?

Yeah, the same ones: 3.7.5

--
js
suse labs

Rafael J. Wysocki

Jan 30, 2013, 5:40:01 PM
On Wednesday, January 30, 2013 11:17:06 PM Jiri Slaby wrote:
> On 01/30/2013 10:39 PM, Rafael J. Wysocki wrote:
> >> What is cool is that I have steps to reproduce:
> >> 1) boot
> >> 2) run the attached script (turn on all possible power savings -- in
> >> fact everything what powertop suggests)
> >> 3) suspend to _disk_ (mem is not enough, BIOS apparently has to
> >> interfere here)
> >
> > No, I don't think it's the BIOS. Most likely the boot kernel.
>
> Or that...
>
> >> 4) resume from disk
> >> 5) boom
> >>
> >> I tried to remove also wireless drivers, no change.
> >
> > Is the resume boot kernel the same as the one in the image?
>
> Yeah, the same ones: 3.7.5

Well, I guess that we leak some state from the boot kernel to the image kernel.
I have no idea what it is, but I suspect something arch-specific.

I wonder what the affected systems have in common apart from e1000e?

Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Jiri Slaby

Jan 30, 2013, 6:20:02 PM
On 01/30/2013 11:45 PM, Rafael J. Wysocki wrote:
> On Wednesday, January 30, 2013 11:17:06 PM Jiri Slaby wrote:
>> On 01/30/2013 10:39 PM, Rafael J. Wysocki wrote:
>>>> What is cool is that I have steps to reproduce:
>>>> 1) boot
>>>> 2) run the attached script (turn on all possible power savings -- in
>>>> fact everything what powertop suggests)
>>>> 3) suspend to _disk_ (mem is not enough, BIOS apparently has to
>>>> interfere here)
>>>
>>> No, I don't think it's the BIOS. Most likely the boot kernel.
>>
>> Or that...
>>
>>>> 4) resume from disk
>>>> 5) boom
>>>>
>>>> I tried to remove also wireless drivers, no change.
>>>
>>> Is the resume boot kernel the same as the one in the image?
>>
>> Yeah, the same ones: 3.7.5
>
> Well, I guess that we leak some state from the boot kernel to the image kernel.
> I have no idea what it is, but I suspect something arch-specific.
>
> I wonder what the affected systems have in common apart from e1000e?

Everything, as I have a thinkpad x230 too :). Is there any other report
than Borislav's?

I think I will start by commenting out parts of the `power' script to see
exactly which of the power savings causes this.

--
js
suse labs

Jiri Slaby

Jan 30, 2013, 6:50:02 PM
On 01/31/2013 12:12 AM, Jiri Slaby wrote:
> I think I will start with commenting parts of `power' script to see
> exactly which of the power savings cause this.

... NMI watchdog. If I remove it from the script, the problem
disappears. If I try it alone, I have those NMIs.

Rafael J. Wysocki

Jan 30, 2013, 7:50:02 PM
On Thursday, January 31, 2013 12:47:40 AM Jiri Slaby wrote:
> On 01/31/2013 12:12 AM, Jiri Slaby wrote:
> > I think I will start with commenting parts of `power' script to see
> > exactly which of the power savings cause this.
>
> ... NMI watchdog. If I remove it from the script, the problem
> disappears. If I try it alone, I have those NMIs.

Well, beats me. :-(

I suspect that it doesn't quiesce itself sufficiently before image restoration
and we get some crosstalk between the boot kernel and the image kernel.

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Borislav Petkov

Jan 31, 2013, 2:10:02 AM
On Thu, Jan 31, 2013 at 01:54:56AM +0100, Rafael J. Wysocki wrote:
> On Thursday, January 31, 2013 12:47:40 AM Jiri Slaby wrote:
> > On 01/31/2013 12:12 AM, Jiri Slaby wrote:
> > > I think I will start with commenting parts of `power' script to see
> > > exactly which of the power savings cause this.
> >
> > ... NMI watchdog. If I remove it from the script, the problem
> > disappears. If I try it alone, I have those NMIs.
>
> Well, beats me. :-(
>
> I suspect that it doesn't quiesce itself sufficiently before image restoration
> and we get some crosstalk between the boot kernel and the image kernel.

Well, I did what Jiri said causes it:

echo 0 > /proc/sys/kernel/nmi_watchdog

No NMI.

BUT(!), if I start powertop and set all tunables in the "Tunables" tab
to "Good", then suspend to disk, when I resume I get the NMI and this
time the unknown reason is 0x3c. Sounds like this needs bisection...
Btw, this is latest -rc5 + tip/master and Jiri triggers it on 3.7-stable
...
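
For what it's worth, the two reason values can be compared bit by bit. A minimal sketch, assuming the printed byte is what the x86 NMI handler read from status port 0x61 and that only bits 0x80 (PCI SERR) and 0x40 (I/O check) identify a known source; both helper names are hypothetical:

```shell
# nmi_reason_bits: classify the hex reason byte taken from the
# "NMI received for unknown reason XX" message.
# Assumption: 0x80 flags a PCI SERR and 0x40 an I/O check error;
# anything with neither bit set is what the kernel calls "unknown".
nmi_reason_bits() {
    val=$(( 0x$1 ))
    if [ $(( val & 0x80 )) -ne 0 ]; then
        echo "SERR"
    elif [ $(( val & 0x40 )) -ne 0 ]; then
        echo "IOCHK"
    else
        echo "unknown"
    fi
}

# nmi_reason_diff: print, in hex, the bits that differ between two
# reason bytes.
nmi_reason_diff() {
    printf '0x%02x\n' $(( 0x$1 ^ 0x$2 ))
}
```

Under those assumptions neither 2c nor 3c has a source bit set, which matches the "unknown reason" wording, and `nmi_reason_diff 2c 3c` shows that the two reports differ in a single bit, 0x10.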

Btw, this e1000e thing has another problem: when I unplug the network
cable and replug it again, it cannot ping local network anymore.
Normally, when you plug the network cable back in, it does some sort of
link detection saying eth link is back up but it doesn't say it on that
box - only a reboot fixes it. Hmm.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Jiri Slaby

Jan 31, 2013, 3:20:02 AM
On 01/31/2013 08:09 AM, Borislav Petkov wrote:
> On Thu, Jan 31, 2013 at 01:54:56AM +0100, Rafael J. Wysocki wrote:
>> On Thursday, January 31, 2013 12:47:40 AM Jiri Slaby wrote:
>>> On 01/31/2013 12:12 AM, Jiri Slaby wrote:
>>>> I think I will start with commenting parts of `power' script to see
>>>> exactly which of the power savings cause this.
>>>
>>> ... NMI watchdog. If I remove it from the script, the problem
>>> disappears. If I try it alone, I have those NMIs.
>>
>> Well, beats me. :-(
>>
>> I suspect that it doesn't quiesce itself sufficiently before image restoration
>> and we get some crosstalk between the boot kernel and the image kernel.
>
> Well, I did what Jiri said causes it:
>
> echo 0 > /proc/sys/kernel/nmi_watchdog
>
> No NMI.
>
> BUT(!), if I start powertop and set all tunables in the "Tunables" tab
> to "Good", then suspend to disk, when I resume I get the NMI and this
> time the unknown reason is 0x3c. Sounds like this needs bisection...
> Btw, this is latest -rc5 + tip/master and Jiri triggers it on 3.7-stable
> ...

And 3.6(.0) was the first one I _tried_ and had that issue too. Not sure
if there is any bisect-good kernel to start with.

> Btw, this e1000e thing has another problem: when I unplug the network
> cable and replug it again, it cannot ping local network anymore.
> Normally, when you plug the network cable back in, it does some sort of
> link detection saying eth link is back up but it doesn't say it on that
> box - only a reboot fixes it. Hmm.

I think this is what Konstantin fixes with his patches.

--
js
suse labs

Jiri Slaby

Jan 31, 2013, 3:30:02 AM
On 01/31/2013 08:09 AM, Borislav Petkov wrote:
> On Thu, Jan 31, 2013 at 01:54:56AM +0100, Rafael J. Wysocki wrote:
>> On Thursday, January 31, 2013 12:47:40 AM Jiri Slaby wrote:
>>> On 01/31/2013 12:12 AM, Jiri Slaby wrote:
>>>> I think I will start with commenting parts of `power' script to see
>>>> exactly which of the power savings cause this.
>>>
>>> ... NMI watchdog. If I remove it from the script, the problem
>>> disappears. If I try it alone, I have those NMIs.
>>
>> Well, beats me. :-(
>>
>> I suspect that it doesn't quiesce itself sufficiently before image restoration
>> and we get some crosstalk between the boot kernel and the image kernel.
>
> Well, I did what Jiri said causes it:
>
> echo 0 > /proc/sys/kernel/nmi_watchdog
>
> No NMI.
>
> BUT(!), if I start powertop and set all tunables in the "Tunables" tab
> to "Good", then suspend to disk, when I resume I get the NMI and this
> time the unknown reason is 0x3c. Sounds like this needs bisection...

And, does it happen if you switch all of them but NMI wtd in there?

And if I pass nmi_watchdog=0 to the image kernel, it should be gone I guess.

--
js
suse labs

Rafael J. Wysocki

Jan 31, 2013, 8:10:02 AM
Yes, there are two bugs in e1000e, it appears. Konstantin's patch [2/5]
fixes one of them, but the other one has to be fixed differently.

Boris, would you be able to test a couple of e1000e patches for me?

Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Borislav Petkov

Jan 31, 2013, 8:20:01 AM
On Thu, Jan 31, 2013 at 09:28:12AM +0100, Jiri Slaby wrote:
> And, does it happen if you switch all of them but NMI wtd in there?

No, but something else happens. Here's the whole dance:

1. Switch all tunables except "NMI watchdog should be turned off" to "Good"
2. suspend to disk
3. resume... all good
4. switch "NMI watchdog should be turned off" to "Good"
5. suspend to disk
6. resume... all good
7. start powertop, toggle "Wireless Power Saving for interface wlan0" twice.
I.e., "Good" -> "Bad"; "Bad" -> "Good".

-> Boom! Unknown reason NMI. It happened right during the toggle because
it appeared in the framebuffer console (no X) right during me toggling
this.

So, it is something getting fishy *after* the watchdog gets disabled.
Something remains funny and dangling, causing it to fire an NMI because
it is an NMI watchdog (doh!)... Could it be that the watchdog_disable
fact doesn't get communicated to the image kernel somehow, or maybe
delayed?

> And if I pass nmi_watchdog=0 to the image kernel, it should be gone I
> guess.

How do you pass options to the image kernel?

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Borislav Petkov

Jan 31, 2013, 8:20:03 AM
On Thu, Jan 31, 2013 at 02:12:58PM +0100, Rafael J. Wysocki wrote:
> Yes, there are two bugs in e1000e, it appears. Konstantin's
> patch [2/5] fixes one of them, but the other one has to be fixed
> differently.
>
> Boris, would you be able to test a couple of e1000e patches for me?

Sure, send them on. You can add 2/5 in the mix too.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Borislav Petkov

Jan 31, 2013, 8:30:01 AM
On Thu, Jan 31, 2013 at 02:18:05PM +0100, Borislav Petkov wrote:
> > And if I pass nmi_watchdog=0 to the image kernel, it should be gone I
> > guess.
>
> How do you pass options the image kernel?

Yep, passing "nmi_watchdog=0" to the kernel (both when you boot and
when you resume) fixes the issue - no more unknown NMIs. Did only 3 s/r
cycles though.
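
A quick way to confirm what a given kernel actually ended up with, before suspending and again after resume. A sketch only; nmi_watchdog_state is a hypothetical helper, and the two default paths are the usual procfs locations:

```shell
# nmi_watchdog_state: report whether the NMI watchdog was disabled on
# the kernel command line and whether it is disabled right now.
# $1 = command line file (default /proc/cmdline)
# $2 = sysctl file (default /proc/sys/kernel/nmi_watchdog)
nmi_watchdog_state() {
    cmdline=${1:-/proc/cmdline}
    sysctl=${2:-/proc/sys/kernel/nmi_watchdog}
    # Split the command line into one option per line and look for an
    # exact "nmi_watchdog=0" entry.
    if tr ' ' '\n' < "$cmdline" | grep -qx 'nmi_watchdog=0'; then
        echo "cmdline: disabled"
    else
        echo "cmdline: not disabled"
    fi
    if [ "$(cat "$sysctl")" = "0" ]; then
        echo "runtime: disabled"
    else
        echo "runtime: enabled"
    fi
}
```

Calling `nmi_watchdog_state` with no arguments checks the running kernel.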

Rafael J. Wysocki

Jan 31, 2013, 8:30:02 AM
On Thursday, January 31, 2013 02:18:05 PM Borislav Petkov wrote:
> On Thu, Jan 31, 2013 at 09:28:12AM +0100, Jiri Slaby wrote:
> > And, does it happen if you switch all of them but NMI wtd in there?
>
> No, but something else happens. Here's the whole dance:
>
> 1. Switch all tunables except "NMI watchdog should be turned off" to "Good"
> 2. suspend to disk
> 3. resume... all good
> 4. switch "NMI watchdog should be turned off" to "Good"
> 5. suspend to disk
> 6. resume... all good
> 7. start powertop, toggle "Wireless Power Saving for interface wlan0" twice.
> I.e., "Good" -> "Bad"; "Bad" -> "Good".
>
> -> Boom! Unknown reason NMI. It happened right during the toggle because
> it appeared in the framebuffer console (no X) right during me toggling
> this.
>
> So, it is something getting fishy *after* the watchdog gets disabled.
> Something remains funny and dangling, causing it to fire an NMI because
> it is an NMI watchdog (doh!)... Could it be that the watchdog_disable
> fact doesn't get communicated to the image kernel somehow, or maybe
> delayed?

The image kernel has no idea whether or not the watchdog has been disabled in
the boot kernel. It needs to be disabled in both.

>
> > And if I pass nmi_watchdog=0 to the image kernel, it should be gone I
> > guess.
>
> How do you pass options the image kernel?

The image kernel has the same set of command line options that was used by
that kernel before hibernation.

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Rafael J. Wysocki

Feb 2, 2013, 6:00:01 PM
On Thursday, January 31, 2013 02:10:34 PM Borislav Petkov wrote:
> On Thu, Jan 31, 2013 at 02:12:58PM +0100, Rafael J. Wysocki wrote:
> > Yes, there are two bugs in e1000e, it appears. Konstantin's
> > patch [2/5] fixes one of them, but the other one has to be fixed
> > differently.
> >
> > Boris, would you be able to test a couple of e1000e patches for me?
>
> Sure, send them on. You can add 2/5 in the mix too.

There you go.

The [2/5] is at: https://patchwork.kernel.org/patch/2001211/

The other two are attached. I suppose the ordering doesn't matter.

Thanks,
pci-pm-fix-e1000e-runtime-suspend.patch
pci-pm-clear-state_saved-during-suspend.patch

Borislav Petkov

Feb 3, 2013, 9:50:02 AM
On Sun, Feb 03, 2013 at 12:04:46AM +0100, Rafael J. Wysocki wrote:
> The [2/5] is at: https://patchwork.kernel.org/patch/2001211/
>
> The other two are attached. I suppose the ordering doesn't matter.

Ok, the eth link cable hotplugging issue seems fixed, plugging and
unplugging the cable works as expected.

The issue I triggered earlier:

> BUT(!), if I start powertop and set all tunables in the "Tunables" tab
> to "Good", then suspend to disk, when I resume I get the NMI and this
> time the unknown reason is 0x3c.

... still happens:

[ 123.250870] PM: Creating hibernation image:
[ 123.504940] PM: Need to copy 95667 pages <--- suspend to disk
[ 123.252841] Enabling non-boot CPUs ... <--- resume
[ 123.254021] SMP alternatives: lockdep: fixing up alternatives
[ 123.254026] smpboot: Booting Node 0 Processor 1 APIC 0x1
[ 123.275566] CPU1 is up
[ 123.275697] SMP alternatives: lockdep: fixing up alternatives
[ 123.275699] smpboot: Booting Node 0 Processor 2 APIC 0x2
[ 123.297581] CPU2 is up
[ 123.297699] SMP alternatives: lockdep: fixing up alternatives
[ 123.297701] smpboot: Booting Node 0 Processor 3 APIC 0x3
[ 123.319358] CPU3 is up
[ 123.321928] i915 0000:00:02.0: power state changed by ACPI to D0
[ 123.321992] xhci_hcd 0000:00:14.0: power state changed by ACPI to D0
[ 123.333256] ehci-pci 0000:00:1a.0: power state changed by ACPI to D0
[ 123.344541] ehci-pci 0000:00:1d.0: power state changed by ACPI to D0
[ 123.345012] sdhci-pci 0000:02:00.0: MMC controller base frequency changed to 50Mhz.
[ 123.345744] PM: noirq restore of devices complete after 24.061 msecs
[ 123.346684] PM: early restore of devices complete after 0.836 msecs
[ 123.389863] i915 0000:00:02.0: setting latency timer to 64
[ 123.389870] xhci_hcd 0000:00:14.0: setting latency timer to 64
[ 123.389887] ehci-pci 0000:00:1a.0: setting latency timer to 64
[ 123.389907] usb usb3: root hub lost power or was reset
[ 123.389908] usb usb1: root hub lost power or was reset
[ 123.389909] usb usb2: root hub lost power or was reset
[ 123.390034] e1000e 0000:00:19.0: irq 44 for MSI/MSI-X
[ 123.390171] xhci_hcd 0000:00:14.0: irq 45 for MSI/MSI-X
[ 123.390308] snd_hda_intel 0000:00:1b.0: irq 47 for MSI/MSI-X
[ 123.391013] ehci-pci 0000:00:1d.0: setting latency timer to 64
[ 123.391038] usb usb4: root hub lost power or was reset
[ 123.393798] ehci-pci 0000:00:1a.0: cache line size of 64 is not supported
[ 123.394115] ahci 0000:00:1f.2: setting latency timer to 64
[ 123.394229] iwlwifi 0000:03:00.0: RF_KILL bit toggled to disable radio.
[ 123.394923] ehci-pci 0000:00:1d.0: cache line size of 64 is not supported
[ 123.697314] usb 3-1: reset high-speed USB device number 2 using ehci-pci
[ 123.698252] ata2: SATA link down (SStatus 0 SControl 300)
[ 123.699286] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 123.701259] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 123.701287] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
[ 123.701291] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[ 123.702699] ata3.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
[ 123.702703] ata3.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[ 123.703222] ata5: SATA link down (SStatus 0 SControl 300)
[ 123.704603] ata3.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
[ 123.704606] ata3.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[ 123.705938] ata3.00: configured for UDMA/100
[ 123.706033] sd 2:0:0:0: [sdb] Starting disk
[ 123.706041] ata1.00: ACPI cmd ef/02:00:00:00:00:a0 (SET FEATURES) succeeded
[ 123.706045] ata1.00: ACPI cmd f5/00:00:00:00:00:a0 (SECURITY FREEZE LOCK) filtered out
[ 123.735336] ata1.00: configured for UDMA/133
[ 123.735662] sd 0:0:0:0: [sda] Starting disk
[ 123.912740] usb 4-1: reset high-speed USB device number 2 using ehci-pci
[ 124.129520] PM: restore of devices complete after 741.589 msecs
[ 124.174684] Restarting tasks ... done.
[ 124.177521] video LNXVIDEO:00: Restoring backlight state
[ 124.186033] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3cold
[ 124.214931] ehci-pci 0000:00:1a.0: power state changed by ACPI to D3cold
[ 124.214970] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
[ 124.394882] Uhhuh. NMI received for unknown reason 3c on CPU 0. <--- FUN.
[ 124.394890] Do you have a strange power saving mode enabled?
[ 124.394892] Dazed and confused, but trying to continue
[ 124.407438] [drm] Enabling RC6 states: RC6 on, RC6p on, RC6pp off
[ 127.035581] ehci-pci 0000:00:1a.0: power state changed by ACPI to D0
[ 127.135668] ehci-pci 0000:00:1a.0: setting latency timer to 64
[ 127.135910] ehci-pci 0000:00:1d.0: power state changed by ACPI to D0
[ 127.146500] ehci-pci 0000:00:1a.0: power state changed by ACPI to D3cold
[ 127.236381] ehci-pci 0000:00:1d.0: setting latency timer to 64
[ 127.236658] xhci_hcd 0000:00:14.0: power state changed by ACPI to D0
[ 127.247244] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
[ 127.337137] xhci_hcd 0000:00:14.0: setting latency timer to 64
[ 127.348286] e1000e 0000:00:19.0: irq 44 for MSI/MSI-X
[ 127.348975] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3cold
[ 129.255203] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[ 129.255215] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO

Rafael J. Wysocki

Feb 3, 2013, 3:10:03 PM
On Sunday, February 03, 2013 03:46:56 PM Borislav Petkov wrote:
> On Sun, Feb 03, 2013 at 12:04:46AM +0100, Rafael J. Wysocki wrote:
> > The [2/5] is at: https://patchwork.kernel.org/patch/2001211/
> >
> > The other two are attached. I suppose the ordering doesn't matter.
>
> Ok, the eth link cable hotplugging issue seems fixed, plugging and
> unplugging the cable works as expected.

Cool, thanks.
Is suspend-to-RAM triggering that too?

Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Borislav Petkov

Feb 3, 2013, 4:00:03 PM
On Sun, Feb 03, 2013 at 09:15:12PM +0100, Rafael J. Wysocki wrote:
> Is suspend-to-RAM triggering that as too?

Nope, not really. But, just to confirm: s2r is

echo "shutdown" > /sys/power/disk
echo "mem" > /sys/power/state

right?

Btw, this bug is very strange. So I did a couple more s2disk runs, i.e.

echo "shutdown" > /sys/power/disk
echo "disk" > /sys/power/state

and it seemed to me that when the eth cable is plugged in, it would
suspend and resume fine. When I then boot, unplug the cable, set all
tunables to "Good", suspend to disk and resume, no NMI message. When I
plug the cable back, only *then* the message triggered.

I need to play with this a bit more to get a better sense of when
exactly it happens.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Borislav Petkov

Feb 3, 2013, 4:10:03 PM
On Sun, Feb 03, 2013 at 09:58:57PM +0100, Borislav Petkov wrote:
> and it seemed to me that when the eth cable is plugged in, it would
> suspend and resume fine. When I then boot, unplug the cable, set all
> tunables to "Good", suspend to disk and resume, no NMI message. When I
> plug the cable back, only *then* the message triggered.
>
> I need to play with this a bit more to get a better sense of when
> exactly it happens.

Ok, not really.

It is not influenced by the cable being plugged - it happens when I plug
in the cable or simply shortly after resume, without the cable.

Borislav Petkov

Feb 3, 2013, 4:20:02 PM
On Sun, Feb 03, 2013 at 10:06:45PM +0100, Borislav Petkov wrote:
> On Sun, Feb 03, 2013 at 09:58:57PM +0100, Borislav Petkov wrote:
> > and it seemed to me that when the eth cable is plugged in, it would
> > suspend and resume fine. When I then boot, unplug the cable, set all
> > tunables to "Good", suspend to disk and resume, no NMI message. When I
> > plug the cable back, only *then* the message triggered.
> >
> > I need to play with this a bit more to get a better sense of when
> > exactly it happens.
>
> Ok, not really.
>
> It is not influenced by the cable being plugged - it happens when I plug
> in the cable or simply shortly after resume, without the cable.

Ok, just did 10 s2ram cycles back-to-back - no issue whatsoever, no
matter when I (un-)plug the cable. Changed the suspend script to

echo "disk" > /sys/power/state

and did an 11th suspend-resume run. It triggered right after resuming
from disk. So I'd guess the image kernel might be the required condition
for the triggering of the issue.
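
The comparison above could be scripted roughly like this. A sketch under stated assumptions: sr_cycles is a hypothetical helper, it has to run as root, and the machine still has to be woken (or powered back on, for "disk") by hand on every cycle:

```shell
# sr_cycles: run N suspend/resume cycles and, after each one, print how
# many unknown-NMI messages the kernel log contains.
# $1 = sleep state to request: "mem" (s2ram) or "disk" (s2disk)
# $2 = number of cycles
# $3 = state file (default /sys/power/state)
# $4 = command that prints the kernel log (default: dmesg)
sr_cycles() {
    state=$1
    count=$2
    state_file=${3:-/sys/power/state}
    log_cmd=${4:-dmesg}
    i=1
    while [ "$i" -le "$count" ]; do
        # On a real system this write returns only after resume.
        echo "$state" > "$state_file"
        # grep -c exits non-zero when the count is 0, hence "|| true".
        hits=$($log_cmd | grep -c 'NMI received for unknown reason' || true)
        echo "cycle $i ($state): $hits unknown NMI message(s)"
        i=$((i + 1))
    done
}
```

`sr_cycles mem 10` followed by `sr_cycles disk 1` mirrors the run described above. The count is cumulative for as long as the old messages stay in the kernel log buffer, so what matters is whether it grows from one cycle to the next.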

Thanks.

Jiri Slaby

Feb 3, 2013, 4:30:02 PM
On 01/31/2013 02:18 PM, Borislav Petkov wrote:
> On Thu, Jan 31, 2013 at 09:28:12AM +0100, Jiri Slaby wrote:
>> And, does it happen if you switch all of them but NMI wtd in there?
>
> No, but something else happens. Here's the whole dance:
>
> 1. Switch all tunables except "NMI watchdog should be turned off" to "Good"
> 2. suspend to disk
> 3. resume... all good
> 4. switch "NMI watchdog should be turned off" to "Good"
> 5. suspend to disk
> 6. resume... all good
> 7. start powertop, toggle "Wireless Power Saving for interface wlan0" twice.
> I.e., "Good" -> "Bad"; "Bad" -> "Good".
>
> -> Boom! Unknown reason NMI. It happened right during the toggle because
> it appeared in the framebuffer console (no X) right during me toggling
> this.

Right, for me too. Even if I disable nmi watchdog in both the boot and
image kernel, the NMI eventually occurs (dunno what's the trigger now
though).

Given the above I'm thinking about switching that intel wi-fi card to
ath9k which I have at hand and retest...

--
js
suse labs

Jiri Slaby

Feb 6, 2013, 9:00:02 AM
On 02/03/2013 12:04 AM, Rafael J. Wysocki wrote:
> On Thursday, January 31, 2013 02:10:34 PM Borislav Petkov wrote:
>> On Thu, Jan 31, 2013 at 02:12:58PM +0100, Rafael J. Wysocki wrote:
>>> Yes, there are two bugs in e1000e, it appears. Konstantin's
>>> patch [2/5] fixes one of them, but the other one has to be fixed
>>> differently.
>>>
>>> Boris, would you be able to test a couple of e1000e patches for me?
>>
>> Sure, send them on. You can add 2/5 in the mix too.
>
> There you go.
>
> The [2/5] is at: https://patchwork.kernel.org/patch/2001211/
>
> The other two are attached. I suppose the ordering doesn't matter.

Just a side question, are these going to be merged some time soon? I
don't even see them in -next and I would like to backport them to
opensuse as they also affect 3.7...

--
js
suse labs

Rafael J. Wysocki

Feb 6, 2013, 4:30:03 PM
On Wednesday, February 06, 2013 02:54:00 PM Jiri Slaby wrote:
> On 02/03/2013 12:04 AM, Rafael J. Wysocki wrote:
> > On Thursday, January 31, 2013 02:10:34 PM Borislav Petkov wrote:
> >> On Thu, Jan 31, 2013 at 02:12:58PM +0100, Rafael J. Wysocki wrote:
> >>> Yes, there are two bugs in e1000e, it appears. Konstantin's
> >>> patch [2/5] fixes one of them, but the other one has to be fixed
> >>> differently.
> >>>
> >>> Boris, would you be able to test a couple of e1000e patches for me?
> >>
> >> Sure, send them on. You can add 2/5 in the mix too.
> >
> > There you go.
> >
> > The [2/5] is at: https://patchwork.kernel.org/patch/2001211/
> >
> > The other two are attached. I suppose the ordering doesn't matter.
>
> Just a side question, are these going to be merged some time soon? I
> don't even see them in -next and I would like to backport them to
> opensuse as they affect also 3.7...

Not these particular patches, but there's a series on linux-pci from
Konstantin Khlebnikov that is functionally equivalent. I'm not sure who's
going to take that, though. I've acked it.

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Borislav Petkov

Feb 14, 2013, 9:40:02 AM
On Sun, Feb 03, 2013 at 03:46:56PM +0100, Borislav Petkov wrote:
> On Sun, Feb 03, 2013 at 12:04:46AM +0100, Rafael J. Wysocki wrote:
> > The [2/5] is at: https://patchwork.kernel.org/patch/2001211/
> >
> > The other two are attached. I suppose the ordering doesn't matter.
>
> Ok, the eth link cable hotplugging issue seems fixed, plugging and
> unplugging the cable works as expected.

Q: what happened to those, are they going upstream for 3.9 or are you
sending them now for 3.8?

They fix at least the cable hotplugging issue so at least one thing is
covered.

Bjorn Helgaas

Feb 14, 2013, 12:20:01 PM
On Thu, Feb 14, 2013 at 7:39 AM, Borislav Petkov <b...@suse.de> wrote:
> On Sun, Feb 03, 2013 at 03:46:56PM +0100, Borislav Petkov wrote:
>> On Sun, Feb 03, 2013 at 12:04:46AM +0100, Rafael J. Wysocki wrote:
>> > The [2/5] is at: https://patchwork.kernel.org/patch/2001211/
>> >
>> > The other two are attached. I suppose the ordering doesn't matter.
>>
>> Ok, the eth link cable hotplugging issue seems fixed, plugging and
>> unplugging the cable works as expected.
>
> Q: what happened to those, are they going upstream for 3.9 or are you
> sending them now for 3.8?
>
> They fix at least the cable hotplugging issue so at least one thing is
> covered.

I haven't done anything with the e1000e patches; I assume the e1000e
maintainers will take care of those.

I merged the following patches for v3.9:

* pci/konstantin-runtime-pm:
PCI/PM: Clear state_saved during suspend
PCI: Use atomic_inc_return() rather than atomic_add_return()
PCI: Catch attempts to disable already-disabled devices
PCI: Disable Bus Master unconditionally in pci_device_shutdown()

You can see the actual patches I merged at:
http://git.kernel.org/?p=linux/kernel/git/helgaas/pci.git;a=shortlog;h=refs/heads/next

It's pretty late for v3.8, but let me know if you think they're critical.

Bjorn

Borislav Petkov

Feb 14, 2013, 2:20:03 PM
On Thu, Feb 14, 2013 at 10:17:46AM -0700, Bjorn Helgaas wrote:
> On Thu, Feb 14, 2013 at 7:39 AM, Borislav Petkov <b...@suse.de> wrote:
> > On Sun, Feb 03, 2013 at 03:46:56PM +0100, Borislav Petkov wrote:
> >> On Sun, Feb 03, 2013 at 12:04:46AM +0100, Rafael J. Wysocki wrote:
> >> > The [2/5] is at: https://patchwork.kernel.org/patch/2001211/
> >> >
> >> > The other two are attached. I suppose the ordering doesn't matter.
> >>
> >> Ok, the eth link cable hotplugging issue seems fixed, plugging and
> >> unplugging the cable works as expected.
> >
> > Q: what happened to those, are they going upstream for 3.9 or are you
> > sending them now for 3.8?
> >
> > They fix at least the cable hotplugging issue so at least one thing is
> > covered.
>
> I haven't done anything with the e1000e patches; I assume the e1000e
> maintainers will take care of those.
>
> I merged the following patches for v3.9:
>
> * pci/konstantin-runtime-pm:
> PCI/PM: Clear state_saved during suspend
> PCI: Use atomic_inc_return() rather than atomic_add_return()
> PCI: Catch attempts to disable already-disabled devices
> PCI: Disable Bus Master unconditionally in pci_device_shutdown()
>
> You can see the actual patches I merged at:
> http://git.kernel.org/?p=linux/kernel/git/helgaas/pci.git;a=shortlog;h=refs/heads/next
>
> It's pretty late for v3.8, but let me know if you think they're critical.

Ok, I meant those:

http://marc.info/?l=linux-kernel&m=135984592927219

They fix the link detection issue on my x230. So let's see. The first one is:

* https://patchwork.kernel.org/patch/2001211/ (e1000e: fix pci device enable
counter balance)

Rafael said this one is a real bugfix. Looks like e1000e maintainers are
picking that one?

* pci-pm-fix-e1000e-runtime-suspend.patch

I don't see that one in your tree.

* pci-pm-clear-state_saved-during-suspend.patch

I can see this one in your tree: http://git.kernel.org/?p=linux/kernel/git/helgaas/pci.git;a=commitdiff;h=82fee4d67ab86d6fe5eb0f9a9e988ca9d654d765

With the imminence of the 3.8 release, we probably want to wait until
after the merge window and retest, then apply and backport stuff,
if needed.

And the NMI issue is still unfixed so this needs more work, AFAICT. Oh
well, after the merge window.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Konstantin Khlebnikov

Feb 15, 2013, 4:00:01 AM
please use this instead:

[PATCH v2 1/7] e1000e: fix pci-device enable-counter balance
https://lkml.org/lkml/2013/2/4/190

from v2 patchset: https://lkml.org/lkml/2013/2/4/185

>
> Rafael said this one is a real bugfix. Looks like e1000e maintainers are
> picking that one?
>
> * pci-pm-fix-e1000e-runtime-suspend.patch
>
> I don't see that one in your tree.
>
> * pci-pm-clear-state_saved-during-suspend.patch
>
> I can see this one in your tree: http://git.kernel.org/?p=linux/kernel/git/helgaas/pci.git;a=commitdiff;h=82fee4d67ab86d6fe5eb0f9a9e988ca9d654d765
>
> With the imminence of the 3.8 release, we probably want to wait until
> after the merge window and retest, then apply and backport stuff,
> if needed.
>
> And the NMI issue is still unfixed so this needs more work, AFAICT. Oh
> well, after the merge window.
>

--

Borislav Petkov

Feb 15, 2013, 4:20:02 AM
On Fri, Feb 15, 2013 at 12:54:12PM +0400, Konstantin Khlebnikov wrote:
> >* https://patchwork.kernel.org/patch/2001211/ (e1000e: fix pci device enable
> >counter balance)
>
> please use this instead:
>
> [PATCH v2 1/7] e1000e: fix pci-device enable-counter balance
> https://lkml.org/lkml/2013/2/4/190
>
> from v2 patchset: https://lkml.org/lkml/2013/2/4/185

So it looks like Bjorn has taken most of them and the e1000e one will go
through the e1000e maintainers. I'll test after the merge window is
done.

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Jiri Slaby

Mar 1, 2013, 8:00:01 AM
On 01/31/2013 02:24 PM, Borislav Petkov wrote:
> On Thu, Jan 31, 2013 at 02:18:05PM +0100, Borislav Petkov wrote:
>>> And if I pass nmi_watchdog=0 to the image kernel, it should be gone I
>>> guess.
>>
>> How do you pass options to the image kernel?
>
> Yep, passing "nmi_watchdog=0" to the kernel (both when you boot and
> when you resume) fixes the issue - no more unknown NMIs. Did only 3 s/r
> cycles though.

FWIW the last time I saw the unhandled NMI was on 31st Jan. Since I
disabled NMI watchdog no more unhandled NMIs. I have to add that I don't
use ethernet at all.

--
js
suse labs

Jiri Slaby

Mar 1, 2013, 8:00:02 AM
On 02/15/2013 09:54 AM, Konstantin Khlebnikov wrote:
> Borislav Petkov wrote:
>> On Thu, Feb 14, 2013 at 10:17:46AM -0700, Bjorn Helgaas wrote:
>>> It's pretty late for v3.8, but let me know if you think they're
>>> critical.
>>
>> Ok, I meant those:
>>
>> http://marc.info/?l=linux-kernel&m=135984592927219
>>
>> They fix the link detection issue on my x230. So let's see. The first
>> one is:
>>
>> * https://patchwork.kernel.org/patch/2001211/ (e1000e: fix pci device
>> enable
>> counter balance)
>
> please use this instead:

Hi, I am a bit confused. Is this fixed in -next yet? And if so, is it
known which commit IDs are needed to fix the issue in 3.7 (see below)?

> [PATCH v2 1/7] e1000e: fix pci-device enable-counter balance
> https://lkml.org/lkml/2013/2/4/190
>
> from v2 patchset: https://lkml.org/lkml/2013/2/4/185

So this is now in -next as:
commit e34f7147d93afe5efc574734bbff6584c0cc4a02
Author: Konstantin Khlebnikov <khleb...@openvz.org>
Date: Mon Feb 25 09:19:04 2013 +0400

e1000e: fix pci-device enable-counter balance

>> I don't see that one in your tree.
>>
>> * pci-pm-clear-state_saved-during-suspend.patch

This is:
commit 82fee4d67ab86d6fe5eb0f9a9e988ca9d654d765
Author: Rafael J. Wysocki <rafael.j...@intel.com>
Date: Mon Feb 4 15:56:05 2013 +0400

PCI/PM: Clear state_saved during suspend

>> Rafael said this one is a real bugfix. Looks like e1000e maintainers are
>> picking that one?
>>
>> * pci-pm-fix-e1000e-runtime-suspend.patch

Is this one replaced by a different fix in the end? Which one? I don't
think it is in -next yet, right?

confused,
--
js
suse labs

Borislav Petkov

Mar 4, 2013, 5:00:03 PM
On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
> So it looks Bjorn has taken most of them and the e1000e one will go
> through the e1000e maintainers. I'll test after the merge window is
> done.

Issue still persists on 3.9-rc1 :-( :

Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue

Bjorn Helgaas

Mar 4, 2013, 7:20:02 PM
[+cc e1000-devel, Jeff, Bruce]

On Mon, Mar 4, 2013 at 2:50 PM, Borislav Petkov <b...@alien8.de> wrote:
> On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
>> So it looks Bjorn has taken most of them and the e1000e one will go
>> through the e1000e maintainers. I'll test after the merge window is
>> done.
>
> Issue still persists on 3.9-rc1 :-( :
>
> Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
> Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
> Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue

The e1000e changes didn't get merged, did they? I don't see the
following changes mentioned at https://lkml.org/lkml/2013/2/4/185 in
3.9-rc1:

e1000e: fix pci-device enable-counter balance
e1000e: fix runtime power management transitions
e1000e: fix accessing to suspended device

Jiri Slaby

Mar 5, 2013, 4:50:02 AM
On 03/05/2013 01:16 AM, Bjorn Helgaas wrote:
> [+cc e1000-devel, Jeff, Bruce]
>
> On Mon, Mar 4, 2013 at 2:50 PM, Borislav Petkov <b...@alien8.de> wrote:
>> On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
>>> So it looks Bjorn has taken most of them and the e1000e one will go
>>> through the e1000e maintainers. I'll test after the merge window is
>>> done.
>>
>> Issue still persists on 3.9-rc1 :-( :
>>
>> Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
>> Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
>> Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
>> Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
>> Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue
>
> The e1000e changes didn't get merged, did they? I don't see the
> following changes mentioned at https://lkml.org/lkml/2013/2/4/185 in
> 3.9-rc1:
>
> e1000e: fix pci-device enable-counter balance
> e1000e: fix runtime power management transitions
> e1000e: fix accessing to suspended device

You're right. They are not even in -next :(.

--
js
suse labs

Borislav Petkov

Mar 5, 2013, 5:00:02 AM
On Tue, Mar 05, 2013 at 10:42:17AM +0100, Jiri Slaby wrote:
> On 03/05/2013 01:16 AM, Bjorn Helgaas wrote:
> > [+cc e1000-devel, Jeff, Bruce]
> >
> > On Mon, Mar 4, 2013 at 2:50 PM, Borislav Petkov <b...@alien8.de> wrote:
> >> On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
> >>> So it looks Bjorn has taken most of them and the e1000e one will go
> >>> through the e1000e maintainers. I'll test after the merge window is
> >>> done.
> >>
> >> Issue still persists on 3.9-rc1 :-( :
> >>
> >> Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> >> Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue
> >
> > The e1000e changes didn't get merged, did they? I don't see the
> > following changes mentioned at https://lkml.org/lkml/2013/2/4/185 in
> > 3.9-rc1:
> >
> > e1000e: fix pci-device enable-counter balance
> > e1000e: fix runtime power management transitions
> > e1000e: fix accessing to suspended device
>
> You're right. They are not even in -next :(.

Oh, and there's another issue with this driver I reported yesterday:
http://marc.info/?l=linux-kernel&m=136243374114892&w=2:

"Trying to free already-free IRQ 20"

which happens during suspend so it also seems related.

Rafael, what's the state of those patches here:
https://lkml.org/lkml/2013/2/4/185, are they ready to be tested or do you
still have issues with them?

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Jiri Slaby

Mar 5, 2013, 5:10:01 AM
On 03/05/2013 10:58 AM, Borislav Petkov wrote:
> Rafael, what's the state of those patches here:
> https://lkml.org/lkml/2013/2/4/185, are they ready to be tested or you
> still have issues with them?

Note there is a resend version:
https://lkml.org/lkml/2013/2/25/3

with a note from Jeff Kirsher:
I have added this patch to my e1000e patch queue.

thanks,
--
js
suse labs

Jiri Slaby

Mar 5, 2013, 5:10:01 AM
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 03/05/2013 11:01 AM, Jeff Kirsher wrote:
> On Tue, 2013-03-05 at 10:42 +0100, Jiri Slaby wrote:
>>> The e1000e changes didn't get merged, did they? I don't see
>>> the following changes mentioned at
>>> https://lkml.org/lkml/2013/2/4/185 in 3.9-rc1:
>>>
>>> e1000e: fix pci-device enable-counter balance e1000e: fix
>>> runtime power management transitions e1000e: fix accessing to
>>> suspended device
>>
>> You're right. They are not even in -next :(.
>>
>
> I have them in my queue for net, so I should be pushing them later
> this week once validation has a chance to look at them.

Yeah, I've just noticed that here
https://lkml.org/lkml/2013/2/25/3

Thanks a lot.

- --
js
suse labs
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQIcBAEBAgAGBQJRNcMQAAoJEL0lsQQGtHBJ+ZkP/3AokrLy82YOmecMvuFssino
jpS9MSjr3Fq8H6WvDmqyhFkKiL8wW0liQU1ZHU8csAFOmTCYUUhrN7QyjZZLt3Ek
QeUhPCi40uaL+jjfDh2TFy6dI/kvtiLxwUfQ4YcGOnNoJSMsN14E4PFiwWcQ/vfX
rOsw9z+MkqJ4je2ZuDFBxZBcUYgdb1Mlrk7gPTVwADz+DnE3PN7DKIYWy3grI5/U
uI9QkyESv4YEdpBBEphqdK3TNWWZS4QyiOq2glNgllnoksybI1JnYAWt+O2Khcef
Os9O/ccZcUiQK6K6HvEYvJvp9eGhPNVt7Fyr+JBV3bzKoPlIcHOIgktahuisUuiZ
zZsxshj3pFYBhCGlGkjbkMkB74hkgenJoT9e36JMPtov00E11B+DazqGodZm1jto
e70821Y6MQ5gavTZrrdcmzJmzSwEsdww7ALs+FCTIBpc8Re0MrZMIp+XrTFnue2L
aA23fYLu6/1uqd11PGNb+82P5s6dYpFCR9NHV29TPuXk50yH60z1Me8n3wMCzm8Y
rIvrk6Xd3XATqepM6qG6O/cDPpvxo9itZldKBvi1SD088n3qEUdJWmLRzpaxisrt
v0pCuUNx+pZE6gTE+tsxbv2k5d0RtNYPsnDJrds7EKMyhIwam7NDJcX490tu9pU8
VLndALzYj0O07N4wCQP1
=MGO1
-----END PGP SIGNATURE-----

Jeff Kirsher

Mar 5, 2013, 5:10:02 AM
They are in my queue of e1000e patches for net and are currently being
tested. I should be able to push them upstream this week.

Jeff Kirsher

Mar 5, 2013, 5:10:02 AM
On Tue, 2013-03-05 at 10:42 +0100, Jiri Slaby wrote:
> On 03/05/2013 01:16 AM, Bjorn Helgaas wrote:
> > [+cc e1000-devel, Jeff, Bruce]
> >
> > On Mon, Mar 4, 2013 at 2:50 PM, Borislav Petkov <b...@alien8.de> wrote:
> >> On Fri, Feb 15, 2013 at 10:16:41AM +0100, Borislav Petkov wrote:
> >>> So it looks Bjorn has taken most of them and the e1000e one will go
> >>> through the e1000e maintainers. I'll test after the merge window is
> >>> done.
> >>
> >> Issue still persists on 3.9-rc1 :-( :
> >>
> >> Mar 4 21:47:34 nazgul vmunix: [ 3223.412541] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> >> Mar 4 21:47:34 nazgul vmunix: [ 3223.412554] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034158] Uhhuh. NMI received for unknown reason 2c on CPU 0.
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034166] Do you have a strange power saving mode enabled?
> >> Mar 4 21:47:35 nazgul vmunix: [ 3224.034168] Dazed and confused, but trying to continue
> >
> > The e1000e changes didn't get merged, did they? I don't see the
> > following changes mentioned at https://lkml.org/lkml/2013/2/4/185 in
> > 3.9-rc1:
> >
> > e1000e: fix pci-device enable-counter balance
> > e1000e: fix runtime power management transitions
> > e1000e: fix accessing to suspended device
>
> You're right. They are not even in -next :(.
>


Borislav Petkov

Mar 5, 2013, 5:20:01 AM
On Tue, Mar 05, 2013 at 02:02:48AM -0800, Jeff Kirsher wrote:
> They are in my queue of e1000e patches for net and are being testing
> currently. I should be able to push them upstream this week.

Right, if you'd like me to run them here too, let me know.

Jeff Kirsher

Mar 5, 2013, 5:30:02 AM
On Tue, 2013-03-05 at 11:14 +0100, Borislav Petkov wrote:
>
> On Tue, Mar 05, 2013 at 02:02:48AM -0800, Jeff Kirsher wrote:
> > They are in my queue of e1000e patches for net and are being testing
> > currently. I should be able to push them upstream this week.
>
> Right, if you'd like me to run them here too, let me know.

Any additional testing is very much appreciated, so feel free to test
the patches with what hardware you have.

Thanks!

Borislav Petkov

Mar 5, 2013, 6:30:02 AM
Yep, it looks good, machine suspends ok again. I'll watch it in the next
couple of days.

The only problem that remains is this:

[ 103.137024] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3cold
[ 103.161032] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
[ 103.462328] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[ 103.462342] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
[ 108.472847] Uhhuh. NMI received for unknown reason 3c on CPU 0. <---
[ 108.472850] Do you have a strange power saving mode enabled?
[ 108.472851] Dazed and confused, but trying to continue

AFAIR, Rafael said it had something to do with the suspend kernel not
picking up settings done to the main kernel on time. Or something to
that effect, my memory is hazy.
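For reading the "2c" and "3c" values in these logs: as far as I know, on x86 the number printed in the "unknown reason" message is the byte read from the NMI status and control port (0x61), and only bit 7 (SERR) and bit 6 (IOCHK) count as known reasons. A rough sketch under that assumption follows. The function name `decode_nmi_reason` and the short bit labels are made up for illustration, following the conventional PC layout of that port:

```shell
#!/bin/sh
# Assumption: the reason value is the byte read from I/O port 0x61.
# Bit labels follow the conventional PC layout and are illustrative only.
decode_nmi_reason() {
    r=$((0x$1))
    out=""
    for entry in 128:serr 64:iochk 32:timer2-out 16:refresh-toggle \
                 8:clear-iochk 4:clear-serr 2:speaker 1:timer2-gate; do
        mask=${entry%%:*}
        name=${entry#*:}
        if [ $((r & mask)) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}

decode_nmi_reason 2c
decode_nmi_reason 3c
```

If the assumption holds, neither value has the SERR or IOCHK bit set, and the two differ only in bit 4, which toggles continuously. That would make "2c" and "3c" the same event rather than two different ones.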

Jeff Kirsher

Mar 5, 2013, 6:40:02 AM
On Tue, 2013-03-05 at 12:27 +0100, Borislav Petkov wrote:
> On Tue, Mar 05, 2013 at 02:29:01AM -0800, Jeff Kirsher wrote:
> > On Tue, 2013-03-05 at 11:14 +0100, Borislav Petkov wrote:
> > >
> > > On Tue, Mar 05, 2013 at 02:02:48AM -0800, Jeff Kirsher wrote:
> > > > They are in my queue of e1000e patches for net and are being testing
> > > > currently. I should be able to push them upstream this week.
> > >
> > > Right, if you'd like me to run them here too, let me know.
> >
> > Any additional testing is very much appreciated, so feel free to test
> > the patches with what hardware you have.
>
> Yep, it looks good, machine suspends ok again. I'll watch it in the next
> couple of days.
>
> The only problem that remains is this:
>
> [ 103.137024] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3cold
> [ 103.161032] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
> [ 103.462328] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> [ 103.462342] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> [ 108.472847] Uhhuh. NMI received for unknown reason 3c on CPU 0. <---
> [ 108.472850] Do you have a strange power saving mode enabled?
> [ 108.472851] Dazed and confused, but trying to continue
>
> AFAIR, Rafael said it had something to do with the suspend kernel not
> picking up settings done to the main kernel on time. Or something to
> that effect, my memory is hazy.
>

Would you like me to add your Tested-by: to the patches?

Borislav Petkov

Mar 5, 2013, 6:50:01 AM
On Tue, Mar 05, 2013 at 03:33:45AM -0800, Jeff Kirsher wrote:
> Would you like me to add your Tested-by: to the patches?

Sure, if you'd like to:

Tested-by: Borislav Petkov <b...@suse.de>

Thanks.

Rafael J. Wysocki

Mar 5, 2013, 7:10:01 PM
On Tuesday, March 05, 2013 12:27:37 PM Borislav Petkov wrote:
> On Tue, Mar 05, 2013 at 02:29:01AM -0800, Jeff Kirsher wrote:
> > On Tue, 2013-03-05 at 11:14 +0100, Borislav Petkov wrote:
> > >
> > > On Tue, Mar 05, 2013 at 02:02:48AM -0800, Jeff Kirsher wrote:
> > > > They are in my queue of e1000e patches for net and are being testing
> > > > currently. I should be able to push them upstream this week.
> > >
> > > Right, if you'd like me to run them here too, let me know.
> >
> > Any additional testing is very much appreciated, so feel free to test
> > the patches with what hardware you have.
>
> Yep, it looks good, machine suspends ok again. I'll watch it in the next
> couple of days.
>
> The only problem that remains is this:
>
> [ 103.137024] xhci_hcd 0000:00:14.0: power state changed by ACPI to D3cold
> [ 103.161032] ehci-pci 0000:00:1d.0: power state changed by ACPI to D3cold
> [ 103.462328] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> [ 103.462342] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> [ 108.472847] Uhhuh. NMI received for unknown reason 3c on CPU 0. <---
> [ 108.472850] Do you have a strange power saving mode enabled?
> [ 108.472851] Dazed and confused, but trying to continue
>
> AFAIR, Rafael said it had something to do with the suspend kernel not
> picking up settings done to the main kernel on time. Or something to
> that effect, my memory is hazy.

I suspected that during resume from hibernation the boot kernel (the one that
loaded the image) did something to hardware and the restored kernel didn't
handle that change properly. It is hard to say what piece of hardware that
was, however (it might or might not be the NIC, it may be pure coincidence
that the NMI messages appear in the log at this point).

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

Borislav Petkov

Mar 5, 2013, 7:20:01 PM
On Wed, Mar 06, 2013 at 01:13:23AM +0100, Rafael J. Wysocki wrote:
> I suspected that during resume from hibernation the boot kernel (the
> one that loaded the image) did something to hardware and the restored
> kernel didn't handle that change properly. It is hard to say what
> piece of hardware that was, however (it might or might not be the NIC,
> it may be pure coincidence that the NMI messages appear in the log at
> this point).

Agreed with the second part. As for the first part, who communicates what
to whom: come to think of it, it might not be related to any devices at
all.

Here's why I think so:

So one of the things I did to trigger this is boot the machine, run
powertop and set all the knobs in the "Tunables" tab to "Good". One of
the tunables is the turn-off-nmi-watchdog one, which turns off the
watchdog. The watchdog uses the perf infrastructure, which generates
NMIs when the counter overflows.

Now, imagine I do that in the "normal" kernel, then suspend,
...<something happens or does not happen>, then resume back into the
normal kernel and it somehow "forgets" the fact that we disabled the NMI
watchdog before the suspend cycle. And boom, it gets a single spurious
NMI.

Does it make sense? I dunno - I'm just connecting the dots here between
the observation points which are most likely.

Anyway, it's getting late, good night. :)

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Borislav Petkov

Mar 8, 2013, 11:50:02 AM
Exactly as I thought: so I'm running the machine with NMI watchdog
enabled, i.e. powertop says:


PowerTOP v2.0 Overview Idle stats Frequency stats Device stats Tunables

>> Bad NMI watchdog should be turned off
Good VM writeback timeout
....

and no more spurious NMIs.

I'd say the plot thickens: disabling the NMI watchdog and suspending to
disk right afterwards doesn't seem to really disable the watchdog. Or the
disable gets delayed, leading to one last spurious NMI when resuming... I
probably need to go stare at the code though...
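One way to look at this from userspace, before staring at the code, would be to compare the watchdog setting with the NMI counters around a suspend cycle. This is only a sketch: the helper names `nmi_count` and `watchdog_state` are invented here, and it assumes the usual "NMI:" line with per-CPU counters in /proc/interrupts:

```shell
#!/bin/sh
# Hypothetical helpers for comparing state before suspend and after resume.

# Sum the per-CPU counters on the "NMI:" line of an interrupts file.
nmi_count() {
    awk '$1 == "NMI:" { for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) s += $i }
         END { print s + 0 }' "${1:-/proc/interrupts}"
}

# Print the current watchdog setting (1 = enabled, 0 = disabled).
watchdog_state() {
    cat "${1:-/proc/sys/kernel/nmi_watchdog}"
}
```

Recording `watchdog_state` and `nmi_count` just before suspending and again after resume would show whether the counter still moved while the watchdog reads 0, which is what a single stray NMI would look like.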

Jiri Slaby

Apr 4, 2013, 3:40:02 AM
On 03/01/2013 01:55 PM, Jiri Slaby wrote:
> On 01/31/2013 02:24 PM, Borislav Petkov wrote:
>> On Thu, Jan 31, 2013 at 02:18:05PM +0100, Borislav Petkov wrote:
>>>> And if I pass nmi_watchdog=0 to the image kernel, it should be gone I
>>>> guess.
>>>
>>> How do you pass options to the image kernel?
>>
>> Yep, passing "nmi_watchdog=0" to the kernel (both when you boot and
>> when you resume) fixes the issue - no more unknown NMIs. Did only 3 s/r
>> cycles though.
>
> FWIW the last time I saw the unhandled NMI was on 31st Jan. Since I
> disabled NMI watchdog no more unhandled NMIs. I have to add that I don't
> use ethernet at all.

And yesterday I plugged in an ethernet cable for a while and guess what
happened today:
Uhhuh. NMI received for unknown reason 2c on CPU 0.

Still holds that this is the first time since Jan.

Borislav Petkov

Apr 4, 2013, 5:40:01 AM
On Thu, Apr 04, 2013 at 09:32:09AM +0200, Jiri Slaby wrote:
> And yesterday I plugged in an ethernet cable for a while and guess
> what happened today: Uhhuh. NMI received for unknown reason 2c on CPU
> 0.
>
> Still holds that this is the first time since Jan.

Yeah, you could try my sure-fire way to trigger it:

* boot the box without any "nmi_watchdog" tweaks on the kernel cmdline -
i.e. it should be enabled.

* turn off NMI watchdog through powertop or directly through
/proc/sys/kernel/nmi_watchdog

* suspend to disk

Now when you resume, you should either see unknown reason 2c or 3c.
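The steps above can be written down as a small script. This is a sketch only: the `run` and `repro` names and the `RUN` variable are invented for illustration, it prints the commands unless RUN=1 is set (which needs root), and the final dmesg check is an assumption about where the message shows up:

```shell
#!/bin/sh
# Dry run by default: prints each command; set RUN=1 as root to execute.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        sh -c "$1"
    else
        echo "$1"
    fi
}

repro() {
    # Precondition: booted without any nmi_watchdog= option on the cmdline.
    run 'echo 0 > /proc/sys/kernel/nmi_watchdog'   # turn the watchdog off
    run 'echo disk > /sys/power/state'             # suspend to disk
    # Execution continues here after resume: look for the message.
    run 'dmesg | grep "NMI received for unknown reason"'
}

repro
```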

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

Jiri Slaby

Apr 4, 2013, 5:40:01 AM
On 04/04/2013 11:33 AM, Borislav Petkov wrote:
> On Thu, Apr 04, 2013 at 09:32:09AM +0200, Jiri Slaby wrote:
>> And yesterday I plugged in an ethernet cable for a while and guess
>> what happened today: Uhhuh. NMI received for unknown reason 2c on CPU
>> 0.
>>
>> Still holds that this is the first time since Jan.
>
> Yeah, you could try my sure-fire way to trigger it:
>
> * boot the box without any "nmi_watchdog" tweaks on the kernel cmdline -
> i.e. it should be enabled.
>
> * turn off NMI watchdog through powertop or directly through
> /proc/sys/kernel/nmi_watchdog
>
> * suspend to disk
>
> Now when you resume, you should either see unknown reason 2c or 3c.

Oh, this reminds me that this time it might be unrelated to yesterday's
use of ethernet. Because today, I resumed the system with a kernel to
which I didn't pass nmi_watchdog=0.

Hmm. So you can silently ignore the report I sent today :).

And sure, the way you describe above "works" for me to trigger the
issue... I just wanted to note the ethernet may interfere.

Anyway, I will bake some hack to disable NMI before jumping to the
resumed kernel and will see what happens...

--
js
suse labs

Borislav Petkov

Apr 4, 2013, 6:00:02 AM
On Thu, Apr 04, 2013 at 11:38:21AM +0200, Jiri Slaby wrote:
> Oh, this reminds me that this time it might be unrelated to
> yesterday's use of ethernet. Because today, I resumed the system by a
> kernel which I didn't pass nmi_watchdog=0 to.
>
> Hmm. So you can silently ignore the report I sent today :).
>
> And sure, the way you describe above "works" for me to trigger the
> issue... I just wanted to note the ethernet may interfere.
>
> Anyway, I will bake some hack to disable NMI before jumping to the
> resumed kernel and will see what happens...

Yeah, Rafael said something about the resume kernel not getting the
disabling of the watchdog in time or so... The issue looks like a last
forgotten NMI which fires although we've disabled the watchdog already.
There certainly aren't any others coming up after this last one.

Hmm.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--