[RFC PATCH 0/4] clockevents: fix clockevent_devices list corruption after cpu hotplug

Xiaotian Feng

unread,

Dec 10, 2009, 8:10:01 AM12/10/09

to

I've met a list_del corruption, which was reported in
http://lkml.org/lkml/2009/11/27/45. But no response, so I try to debug it
by myself.

After I added some printks to show all elements in clockevent_devices, I
found kernel hangs when I tried to resume from s2ram.

In clockevents_register_device, clockevents_do_notify ADD is always followed
by clockevents_notify_released. Although clockevents_do_notify ADD will use
tick_check_new_device to add new devices and replace old devices to the
clockevents_released list, clockevents_notify_released add them back to
clockevent_devices list.

My system is Quad-Core x86_64, with apic and hpet enables, after boot up,
the elements in clockevent_devices list is :
clockevent_device->lapic(3)->hpet5(3)->lapic(2)->hpet4(2)->lapic(1)->hpet3(1)-
->lapic(0)->hpet2(0)->hpet(0)
* () means cpu id

But active clock_event_device is hpet2,hpet3,hpet4,hpet5. Then at s2ram stage,
cpu 1,2,3 is down, then notify CLOCK_EVT_NOTIFY_CPU_DEAD will calls tick_shutdown,
then hpet2,hpet3,hpet4,hpet5 was deleted from clockevent_device list.
So after s2ram, elements in clockevent_device list is:
clockevent_device->lapic(3)->lapic(2)->lapic(1)->lapic(0)->hpet2(0)->hpet(0)

Then at resume stage, cpu 1,2,3 is up, it will register lapic again, and then
perform list_add lapic on clockevent_device list, e.g. list_add lapic(1) on
above list, lapic will move to the clockevent_device->next, but lapic(2)->next
is still point to lapic(1), the list is circular and corrupted then.

This patchset aims to fixes above behaviour by:
- on clockevents_register_device, if notify ADD success, move new devices
to the clockevent_devices list, otherwise move to clockevents_released
list.
- on clockevents_notify_released, same behaviour as above.
- on clockevents_notify CPU_DEAD, remove related devices on dead cpu from
clockevents_released list.

It makes sure that only active devices on each cpu is on clockevent_devices list.
With this patchset, the list_del corruption disappeared, and suspend/resume, cpu
hotplug works fine on my system.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Thomas Gleixner

unread,

Dec 10, 2009, 9:40:02 AM12/10/09

to

Great detective work !

> This patchset aims to fixes above behaviour by:
> - on clockevents_register_device, if notify ADD success, move new devices
> to the clockevent_devices list, otherwise move to clockevents_released
> list.
> - on clockevents_notify_released, same behaviour as above.
> - on clockevents_notify CPU_DEAD, remove related devices on dead cpu from
> clockevents_released list.
>
> It makes sure that only active devices on each cpu is on clockevent_devices list.
> With this patchset, the list_del corruption disappeared, and suspend/resume, cpu
> hotplug works fine on my system.

I'm not happy about that churn. Why don't we simply scan the
clockevent_devices list for leftovers of the dead CPU ?

Untested patch below solves the same problem.

Thanks,

tglx
----
diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 20a8920..5dd857f 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -238,8 +238,9 @@ void clockevents_exchange_device(struct clock_event_device *old,
*/
void clockevents_notify(unsigned long reason, void *arg)
{
- struct list_head *node, *tmp;
+ struct clock_event_device *dev, *tmp;
unsigned long flags;
+ int cpu;

spin_lock_irqsave(&clockevents_lock, flags);
clockevents_do_notify(reason, arg);
@@ -250,8 +251,19 @@ void clockevents_notify(unsigned long reason, void *arg)
* Unregister the clock event devices which were
* released from the users in the notify chain.
*/
- list_for_each_safe(node, tmp, &clockevents_released)
- list_del(node);
+ list_for_each_entry_safe(dev, tmp, &clockevents_released, list)
+ list_del(&dev->list);
+ /*
+ * Now check whether the CPU has left unused per cpu devices
+ */
+ cpu = *((int *)arg);
+ list_for_each_entry_safe(dev, tmp, &clockevent_devices, list) {
+ if (cpumask_test_cpu(cpu, dev->cpumask) &&
+ cpumask_weight(dev->cpumask) == 1) {
+ BUG_ON(dev->mode != CLOCK_EVT_MODE_UNUSED);
+ list_del(&dev->list);
+ }
+ }
break;
default:
break;

Xiaotian Feng

unread,

Dec 10, 2009, 9:40:01 PM12/10/09

to

My only thought is to make clockevent_devices list only store active
devices on each cpu, all other inactive devices stored on
clockevents_released, but this make things complex. Your patch is better.

>
> Untested patch below solves the same problem.

Yes, this also resolves my list_del warning. Thanks

>
> Thanks,
>
> tglx
> ----
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 20a8920..5dd857f 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -238,8 +238,9 @@ void clockevents_exchange_device(struct clock_event_device *old,
> */
> void clockevents_notify(unsigned long reason, void *arg)
> {
> - struct list_head *node, *tmp;
> + struct clock_event_device *dev, *tmp;
> unsigned long flags;
> + int cpu;
>
> spin_lock_irqsave(&clockevents_lock, flags);
> clockevents_do_notify(reason, arg);
> @@ -250,8 +251,19 @@ void clockevents_notify(unsigned long reason, void *arg)
> * Unregister the clock event devices which were
> * released from the users in the notify chain.
> */
> - list_for_each_safe(node, tmp,&clockevents_released)
> - list_del(node);

> + list_for_each_entry_safe(dev, tmp,&clockevents_released, list)

> + list_del(&dev->list);
> + /*
> + * Now check whether the CPU has left unused per cpu devices
> + */
> + cpu = *((int *)arg);

> + list_for_each_entry_safe(dev, tmp,&clockevent_devices, list) {

Ozan Çağlayan

unread,

Jan 17, 2010, 4:30:02 AM1/17/10

to

Thomas Gleixner wrote:

>
> I'm not happy about that churn. Why don't we simply scan the
> clockevent_devices list for leftovers of the dead CPU ?
>
> Untested patch below solves the same problem.

Hi, I have 3 bug reports about a clockevent failure while halting the system (2.6.31.11 kernel).
It exactly pinpoints to the line 262 which got changed with this patch merged to 2.6.31.10 (and also 2.6.32.3):

[4406.986777] kernel BUG at kernel/time/clockevents.c:262!
[4406.986777] invalid opcode: 0000 [#1] SMP
[4406.986777] last sysfs file: /sysfs/module/ip_tables/initstate

All 3 systems have Pentium 4 processors with different clock speeds.

Here's a screenshot of the full stacktrace:
http://bugs.pardus.org.tr/attachment.cgi?id=4820

and the relevant bugzilla report:
http://bugzilla.kernel.org/show_bug.cgi?id=15073

Thanks,
Ozan Caglayan
Pardus Linux -- http://www.pardus.org.tr/eng

Xiaotian Feng

unread,

Jan 17, 2010, 9:40:01 PM1/17/10

to

On 01/17/2010 05:28 PM, Ozan Çağlayan wrote:
> Thomas Gleixner wrote:
>
>>
>> I'm not happy about that churn. Why don't we simply scan the
>> clockevent_devices list for leftovers of the dead CPU ?
>>
>> Untested patch below solves the same problem.
>
>
> Hi, I have 3 bug reports about a clockevent failure while halting the system (2.6.31.11 kernel).
> It exactly pinpoints to the line 262 which got changed with this patch merged to 2.6.31.10 (and also 2.6.32.3):
>
> [4406.986777] kernel BUG at kernel/time/clockevents.c:262!
> [4406.986777] invalid opcode: 0000 [#1] SMP
> [4406.986777] last sysfs file: /sysfs/module/ip_tables/initstate

I think this one is duplicated of
http://bugzilla.kernel.org/show_bug.cgi?id=15037
Thomas is working on the fix. I've sent a workaround patch on
http://patchwork.kernel.org/patch/71537/

Thomas Gleixner

unread,

Jan 18, 2010, 9:00:03 AM1/18/10

to

On Mon, 18 Jan 2010, Xiaotian Feng wrote:

> On 01/17/2010 05:28 PM, Ozan Çağlayan wrote:
> > Thomas Gleixner wrote:
> >
> > >
> > > I'm not happy about that churn. Why don't we simply scan the
> > > clockevent_devices list for leftovers of the dead CPU ?
> > >
> > > Untested patch below solves the same problem.
> >
> >
> > Hi, I have 3 bug reports about a clockevent failure while halting the system
> > (2.6.31.11 kernel).
> > It exactly pinpoints to the line 262 which got changed with this patch
> > merged to 2.6.31.10 (and also 2.6.32.3):
> >
> > [4406.986777] kernel BUG at kernel/time/clockevents.c:262!
> > [4406.986777] invalid opcode: 0000 [#1] SMP
> > [4406.986777] last sysfs file: /sysfs/module/ip_tables/initstate
>
> I think this one is duplicated of
> http://bugzilla.kernel.org/show_bug.cgi?id=15037
> Thomas is working on the fix. I've sent a workaround patch on
> http://patchwork.kernel.org/patch/71537/

I just applied your patch, but kept the cpuweight check. This is the
least intrusive solution for now. The logic needs an overhaul, but
thats neither rc4 nor stable material.

Thanks,

tglx