on my aging Thinkpad T42p resume from hibernation fails in 2.6.30. There is a backtrace on suspend prior to writing out the disk image, but I cannot capture it due to lack of a serial port on the T42p. On resume the machine is dead after reading the image from disk.
On Tue, Jun 16, 2009 at 02:16:28AM +0200, Rafael J. Wysocki wrote:
> On Tuesday 16 June 2009, Johannes Stezenbach wrote:
> > on my aging Thinkpad T42p resume from hibernation > > fails in 2.6.30. There is a backtrace on suspend prior > > to writing out the disk image, but I cannot capture > > it due to lack of a serial port on the T42p. On > > resume the machine is dead after reading the image > > from disk.
> > cpufreq: use smp_call_function_[single|many]() in acpi-cpufreq.c
> > I see in git log that this commit is known broken, but the > > resume on my machine is still broken in 2.6.30.
> > If I disable CONFIG_X86_ACPI_CPUFREQ suspend/resume works in 2.6.30.
> Thanks a lot for bisecting this!
> Is it the reason for the enabling of interrupts during cpufreq_suspend()?
> /me wonders
> Is there anything we can do to fix this quickly?
I think your guess was right. The patch below fixes the problem for me (hang after resume and backtrace on suspend).
Johannes
-----------------------------
Fix swsusp failure on !SMP
Commit 01599fca6758d2cd133e78f87426fc851c9ea725 introduced a regression which caused a backtrace on suspend and a hang on resume on a Thinkpad T42p (Pentium M CPU).
Signed-off-by: Johannes Stezenbach <j...@sig21.net>
Johannes Stezenbach <j...@sig21.net> wrote:
> Fix swsusp failure on !SMP
> Commit 01599fca6758d2cd133e78f87426fc851c9ea725 introduced > a regression which caused a backtrace on suspend and > a hang on resume on a Thinkpad T42p (Pentium M CPU).
> Signed-off-by: Johannes Stezenbach <j...@sig21.net>
ok, what's going on here? The patch implies that someone (presumably acpi-cpufreq) is calling smp_call_function_single() with local interrupts disabled. That's a bug on SMP kernels. And it'll generate a trace if it happens:
	/* Can deadlock when called with interrupts disabled */
	WARN_ON_ONCE(irqs_disabled() && !oops_in_progress);
but nobody has reported such a trace AFAIK?
Also, prior to 01599fca6758d2cd133e78f87426fc851c9ea725, acpi-cpufreq was using work_on_cpu(). If it was calling work_on_cpu() with local interrupts disabled then that would have been a bug too, which could generate might_sleep() or scheduling-while-atomic warnings.
Because it is a bug to call the SMP version of smp_call_function_single() with local interrupts disabled, I don't think we should need to apply the above patch.
But I don't know what we _should_ do because I don't know what the bug is. Are you able to get us a copy of that stack trace?
On Tue, Jun 16, 2009 at 11:55:40AM -0700, Andrew Morton wrote:
> On Tue, 16 Jun 2009 16:22:17 +0200
> Johannes Stezenbach <j...@sig21.net> wrote:
> > Fix swsusp failure on !SMP
> > Commit 01599fca6758d2cd133e78f87426fc851c9ea725 introduced > > a regression which caused a backtrace on suspend and > > a hang on resume on a Thinkpad T42p (Pentium M CPU).
> > Signed-off-by: Johannes Stezenbach <j...@sig21.net>
> ok, what's going on here? The patch implies that someone (presumably > acpi-cpufreq) is calling smp_call_function_single() with local > interrupts disabled. That's a bug on SMP kernels. And it'll generate > a trace if it happens:
> 	/* Can deadlock when called with interrupts disabled */
> 	WARN_ON_ONCE(irqs_disabled() && !oops_in_progress);
> but nobody has reported such a trace AFAIK?
This problem apparently only exists on !SMP kernels...
> Also, prior to 01599fca6758d2cd133e78f87426fc851c9ea725, acpi-cpufreq > was using work_on_cpu(). If it was calling work_on_cpu() with local > interrupts disabled then that would have been a bug too, which could > generate might_sleep() or scheduling-while-atomic warnings.
> Because it is a bug to call the SMP version of > smp_call_function_single() with local interrupts disabled, I don't > think we should need to apply the above patch.
> But I don't know what we _should_ do because I don't know what the bug > is. Are you able to get us a copy of that stack trace?
Unfortunately my laptop doesn't have a serial port, and the stack trace is large and scrolls off the screen, I can only see the last part of it and I would need to find someone with a camera to take a picture...
On Tue, Jun 16, 2009 at 12:57:50PM -0700, Johannes Stezenbach wrote:
> On Tue, Jun 16, 2009 at 11:55:40AM -0700, Andrew Morton wrote:
> > On Tue, 16 Jun 2009 16:22:17 +0200
> > Johannes Stezenbach <j...@sig21.net> wrote:
> > > Fix swsusp failure on !SMP
> > > Commit 01599fca6758d2cd133e78f87426fc851c9ea725 introduced > > > a regression which caused a backtrace on suspend and > > > a hang on resume on a Thinkpad T42p (Pentium M CPU).
> > > Signed-off-by: Johannes Stezenbach <j...@sig21.net>
> > ok, what's going on here? The patch implies that someone (presumably > > acpi-cpufreq) is calling smp_call_function_single() with local > > interrupts disabled. That's a bug on SMP kernels. And it'll generate > > a trace if it happens:
> > 	/* Can deadlock when called with interrupts disabled */
> > 	WARN_ON_ONCE(irqs_disabled() && !oops_in_progress);
> > but nobody has reported such a trace AFAIK?
> This problem apparently only exists on !SMP kernels...
> > Also, prior to 01599fca6758d2cd133e78f87426fc851c9ea725, acpi-cpufreq > > was using work_on_cpu(). If it was calling work_on_cpu() with local > > interrupts disabled then that would have been a bug too, which could > > generate might_sleep() or scheduling-while-atomic warnings.
> > Because it is a bug to call the SMP version of > > smp_call_function_single() with local interrupts disabled, I don't > > think we should need to apply the above patch.
> > But I don't know what we _should_ do because I don't know what the bug > > is. Are you able to get us a copy of that stack trace?
> Unfortunately my laptop doesn't have a serial port, and the > stack trace is large and scrolls off the screen, I can only > see the last part of it and I would need to find someone with > a camera to take a picture...
Can you try the patch below (your changes + a warnon). That should give the stack trace with successful suspend-resume.
acpi-cpufreq will not directly disable interrupt and call these routines. So, it will be interesting to see how we are ending up in this state.
Thanks, Venki
diff --git a/kernel/up.c b/kernel/up.c
index 1ff27a2..a4318ff 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -10,11 +10,15 @@
 int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
 				int wait)
 {
+	unsigned long flags;
+
 	WARN_ON(cpu != 0);
PS: It seems like a good idea to apply this patch with the warning even if the root cause of the hibernate problem is elsewhere, for better debuggability of such issues?
Johannes Stezenbach <j...@sig21.net> wrote:
> On Tue, Jun 16, 2009 at 01:25:58PM -0700, Pallipadi, Venkatesh wrote:
> > Can you try the patch below (your changes + a warnon). That should give > > the stack trace with successful suspend-resume.
> > acpi-cpufreq will not directly disable interrupt and call these routines. > > So, it will be interesting to see how we are ending up in this state.
Right, so it's the suspend-must-disable-local-interrupts thing. Again. create_image()'s local_irq_disable().
It was wrong to call work_on_cpu() with local interrupts disabled, and it's now wrong to call smp_call_function_single() with local interrupts disabled. It's just that smp_call_function_single() warns while work_on_cpu() didn't.
That all explains the warning. But afaik we still don't know why your machine actually failed. Perhaps it is a side effect of emitting the warning when the console is in a weird state?
So.. what to do? Possibly we could hack cpufreq to not use smp_call_function_single() if the call is to be done on the local CPU. But SMP might still be broken - if it really does want to do a cross-cpu call.
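A rough userspace sketch of the "skip the cross-CPU machinery for the local CPU" idea, for illustration only. Everything here is a hypothetical stand-in: `model_this_cpu`, `model_cross_cpu_call()` and `model_call_on_cpu()` are not kernel functions, and the real cpufreq code would need to handle preemption and CPU hotplug, which this model ignores.

```c
#include <assert.h>

/* Hypothetical stand-in for the id of the CPU we are running on. */
static int model_this_cpu = 0;

/* Counts how often the (modelled) cross-CPU path was taken. */
static int cross_calls = 0;

/* Stand-in for a real cross-CPU call; here it only counts and runs func. */
static int model_cross_cpu_call(int cpu, void (*func)(void *), void *info)
{
	(void)cpu;
	cross_calls++;
	func(info);
	return 0;
}

/* Run func directly when the target is the local CPU, otherwise go remote. */
static int model_call_on_cpu(int cpu, void (*func)(void *), void *info)
{
	if (cpu == model_this_cpu) {
		func(info);
		return 0;
	}
	return model_cross_cpu_call(cpu, func, info);
}

/* Example callback: increments the int that info points to. */
static void bump(void *info)
{
	++*(int *)info;
}
```

With this shape a call that targets the local CPU never enters the cross-CPU path, so whatever that path does to the interrupt state cannot affect it. It does nothing for a caller that really needs a remote CPU.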
Why does cpufreq need to do a cross-CPU get_cur_freq_on_cpu() call at suspend time _anyway_? Surely cpufreq knows the target CPU's frequency from its internal in-main-memory state?
On Tue, 2009-06-16 at 14:09 -0700, Andrew Morton wrote:
> On Tue, 16 Jun 2009 22:40:39 +0200
> Johannes Stezenbach <j...@sig21.net> wrote:
> > On Tue, Jun 16, 2009 at 01:25:58PM -0700, Pallipadi, Venkatesh wrote:
> > > Can you try the patch below (your changes + a warnon). That should give > > > the stack trace with successful suspend-resume.
> > > acpi-cpufreq will not directly disable interrupt and call these routines. > > > So, it will be interesting to see how we are ending up in this state.
> Right, so it's the suspend-must-disable-local-interrupts thing. Again. > create_image()'s local_irq_disable().
> It was wrong to call work_on_cpu() with local interrupts disabled, and > it's now wrong to call smp_call_function_single() with local interrupts > disabled. It's just that smp_call_function_single() warns while > work_on_cpu() didn't.
> That all explains the warning. But afaik we still don't know why your > machine actually failed. Perhaps it is a side effect of emitting the > warning when the console is in a weird state?
> So.. what to do? Possibly we could hack cpufreq to not use > smp_call_function_single() if the call is to be done on the local CPU. > But SMP might still be broken - if it really does want to do a cross-cpu > call.
We surely do not need a cross-CPU call here, as all secondary CPUs will be offline at this point.
> Why does cpufreq need to do a cross-CPU get_cur_freq_on_cpu() call at > suspend time _anyway_? Surely cpufreq knows the target CPU's frequency > from its internal in-main-memory state?
That was what I was wondering as well. Looks like this part of cpufreq_suspend came from
In order to properly fix some issues with cpufreq vs. sleep on PowerBooks, I had to add a suspend callback to the pmac_cpufreq driver. I must force a switch to full speed before sleep and I switch back to previous speed on resume.
I also added a driver flag to disable the warnings in suspend/resume since it is expected in this case to have different speed (and I want it to fixup the jiffies properly).
Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org> Signed-off-by: Andrew Morton <a...@osdl.org> Signed-off-by: Linus Torvalds <torva...@osdl.org>
benh: Do you think we still need this cpufreq_driver->get() and return error on (!cur_freq || !cpu_policy->cur) stuff? Maybe we should do all the checks only if CPUFREQ_PM_NO_WARN is set?
> On Tue, 2009-06-16 at 14:09 -0700, Andrew Morton wrote: > > On Tue, 16 Jun 2009 22:40:39 +0200
> > Johannes Stezenbach <j...@sig21.net> wrote: > > > On Tue, Jun 16, 2009 at 01:25:58PM -0700, Pallipadi, Venkatesh wrote: > > > > Can you try the patch below (your changes + a warnon). That should > > > > give the stack trace with successful suspend-resume.
> > > > acpi-cpufreq will not directly disable interrupt and call these > > > > routines. So, it will be interesting to see how we are ending up in > > > > this state.
> > Right, so it's the suspend-must-disable-local-interrupts thing. Again. > > create_image()'s local_irq_disable().
> > It was wrong to call work_on_cpu() with local interrupts disabled, and > > it's now wrong to call smp_call_function_single() with local interrupts > > disabled. It's just that smp_call_function_single() warns while > > work_on_cpu() didn't.
> > That all explains the warning. But afaik we still don't know why your > > machine actually failed. Perhaps it is a side effect of emitting the > > warning when the console is in a weird state?
> > So.. what to do? Possibly we could hack cpufreq to not use > > smp_call_function_single() if the call is to be done on the local CPU. > > But SMP might still be broken - if it really does want to do a cross-cpu > > call.
> We surely do not need a cross-CPU call here, as all secondary CPUs > will be offline at this point.
> > Why does cpufreq need to do a cross-CPU get_cur_freq_on_cpu() call at > > suspend time _anyway_? Surely cpufreq knows the target CPU's frequency > > from its internal in-main-memory state?
> That was what I was wondering as well. Looks like this part of > cpufreq_suspend came from
> In order to properly fix some issues with cpufreq vs. sleep on > PowerBooks, I had to add a suspend callback to the pmac_cpufreq > driver. > I must force a switch to full speed before sleep and I switch back > to > previous speed on resume.
> I also added a driver flag to disable the warnings in suspend/resume > since it is expected in this case to have different speed (and I > want it > to fixup the jiffies properly).
> Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org> > Signed-off-by: Andrew Morton <a...@osdl.org> > Signed-off-by: Linus Torvalds <torva...@osdl.org>
> benh: Do you think we still need this cpufreq_driver->get() and return > error on (!cur_freq || !cpu_policy->cur) stuff? > Maybe we should do all the checks only if CPUFREQ_PM_NO_WARN is set?
In fact, we need to do this entire thing differently.
The basic problem is that cpufreq_suspend() is a sysdev thing, so it will always be called with interrupts off and *only* for CPU0. So it looks like most of what we do there is unnecessary, at the very least.
On Tue, Jun 16, 2009 at 02:09:23PM -0700, Andrew Morton wrote:
> Right, so it's the suspend-must-disable-local-interrupts thing. Again. > create_image()'s local_irq_disable().
> It was wrong to call work_on_cpu() with local interrupts disabled, and > it's now wrong to call smp_call_function_single() with local interrupts > disabled. It's just that smp_call_function_single() warns while > work_on_cpu() didn't.
> That all explains the warning. But afaik we still don't know why your > machine actually failed. Perhaps it is a side effect of emitting the > warning when the console is in a weird state?
smp_call_function_single() enables irqs and hibernate doesn't like that?
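To make that suspicion concrete, here is a small userspace model, not kernel code. It assumes the !SMP helper brackets the callback with local_irq_disable()/local_irq_enable(), and contrasts that with a save/restore variant of the kind the `unsigned long flags` in the patch suggests. The `irqs_on` flag, the `model_*` helpers, both `call_single_*` functions and the frequency value are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the CPU interrupt flag; true means IRQs enabled. */
static bool irqs_on = true;

static void model_irq_disable(void) { irqs_on = false; }
static void model_irq_enable(void)  { irqs_on = true; }

/* Model of local_irq_save(): remember the old state, then disable. */
static unsigned long model_irq_save(void)
{
	unsigned long flags = irqs_on ? 1UL : 0UL;

	irqs_on = false;
	return flags;
}

/* Model of local_irq_restore(): put back whatever was saved. */
static void model_irq_restore(unsigned long flags)
{
	irqs_on = (flags != 0);
}

/* Variant that unconditionally re-enables interrupts on the way out. */
static int call_single_enable(void (*func)(void *), void *info)
{
	model_irq_disable();
	func(info);
	model_irq_enable();
	return 0;
}

/* Variant that leaves the caller's interrupt state as it found it. */
static int call_single_restore(void (*func)(void *), void *info)
{
	unsigned long flags = model_irq_save();

	func(info);
	model_irq_restore(flags);
	return 0;
}

/* Example callback: stores a made-up frequency in kHz. */
static void read_freq(void *info)
{
	*(unsigned int *)info = 1800000;
}
```

If the caller enters with `irqs_on == false`, as the suspend path does after create_image()'s local_irq_disable(), the first variant returns with interrupts enabled while the second leaves them off. Whether that alone explains the hang on resume is not established here.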
BTW, I have no other UP machine to test with, but I reported in another thread that a !SMP kernel (or an SMP kernel with the maxcpus=0 parameter) does not boot at all on my desktop machine, see http://lkml.org/lkml/2009/6/12/468
No idea if I should be worried about this since the SMP kernel now works fine; another hibernate problem was solved in http://lkml.org/lkml/2009/6/14/156