[PATCH] avoid race condition in pick_next_task_fair in kernel/sched

shenghui

unread,

Jun 29, 2010, 3:10:38 AM6/29/10

to kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, pet...@infradead.org, Greg KH

Hi,

I walked some code in kernel/sched_fair.c in version 2.6.35-rc3, and
got the following potential failure:

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
...
if (!cfs_rq->nr_running)
return NULL;

do {
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
...
}

/*
* The dequeue_task method is called after nr_running is
* decreased. We remove the task from the rbtree and
* update the fair scheduling stats:
*/
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
...
dequeue_entity(cfs_rq, se, flags);
...
}

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
...
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
account_entity_dequeue(cfs_rq, se);
update_min_vruntime(cfs_rq);
...
}

dequeue_task_fair() is used to dequeue some task, and it calls
dequeue_entity() to finish the job. While dequeue_entity() calls
__dequeue_entity() first, then calls account_entity_dequeue()
to do substraction on cfs_rq->nr_running.
Here, if one task is dequeued, then cfs_rq->nr_running will be
changed later, e.g 1 to 0.
In the pick_next_task_fair(), the if check will get passed before
nr_running is set 0 and the following while structure is executed.

do {
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

We may get se set NULL here, and set_next_entity may deference
NULL pointer, which can lead to Oops.

I think some lock on the metadata can fix this issue, but we may
change plenty of code to add support for lock. I think the easist
way is just substacting nr_running before dequing tasks.

Following is my patch, please check it.

From 4fe38deac173c7777cd02096950e979749170873 Mon Sep 17 00:00:00 2001
From: Wang Sheng-Hui <crosslo...@gmail.com>
Date: Tue, 29 Jun 2010 14:49:05 +0800
Subject: [PATCH] avoid race condition in pick_next_task_fair in
kernel/sched_fair.c

Signed-off-by: Wang Sheng-Hui <crosslo...@gmail.com>
---
kernel/sched_fair.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index eed35ed..93073ff 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -823,9 +823,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct
sched_entity *se, int flags)

clear_buddies(cfs_rq, se);

+ account_entity_dequeue(cfs_rq, se);
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
- account_entity_dequeue(cfs_rq, se);
update_min_vruntime(cfs_rq);

/*
@@ -1059,7 +1059,7 @@ enqueue_task_fair(struct rq *rq, struct
task_struct *p, int flags)
}

/*
- * The dequeue_task method is called before nr_running is
+ * The dequeue_task method is called after nr_running is
* decreased. We remove the task from the rbtree and
* update the fair scheduling stats:
*/
--
1.6.3.3

--

Thanks and Best Regards,
shenghui
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Peter Zijlstra

unread,

Jun 29, 2010, 6:43:35 AM6/29/10

to shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Tue, 2010-06-29 at 15:10 +0800, shenghui wrote:
> I think some lock on the metadata can fix this issue, but we may
> change plenty of code to add support for lock. I think the easist
> way is just substacting nr_running before dequing tasks.

But all that is fully serialized by the rq->lock.. so I'm really not
seeing how this can happen.

shenghui

unread,

Jun 29, 2010, 7:24:17 AM6/29/10

to Peter Zijlstra, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

2010/6/29 Peter Zijlstra <pet...@infradead.org>:

> On Tue, 2010-06-29 at 15:10 +0800, shenghui wrote:
>> I think some lock on the metadata can fix this issue, but we may
>> change plenty of code to add support for lock. I think the easist
>> way is just substacting nr_running before dequing tasks.
>
> But all that is fully serialized by the rq->lock.. so I'm really not
> seeing how this can happen.
>
>
>

I wonder is there any chance set_next_entity() can get NULL for
parameter se if so?
And will you please give me some instructions on where rq->lock
is required?

--

Thanks and Best Regards,
shenghui

Peter Zijlstra

unread,

Jun 29, 2010, 7:35:16 AM6/29/10

to shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Tue, 2010-06-29 at 19:24 +0800, shenghui wrote:

> I wonder is there any chance set_next_entity() can get NULL for
> parameter se if so?

Well, if your machine crashes that way, maybe, but I haven't seen that
happen in a long while.

> And will you please give me some instructions on where rq->lock
> is required?

Pretty much everywhere, if you look at sched.c the only sched_class
method not called with rq->lock held is ::task_fork().

The interesting bits are that schedule()->pre_schedule()/idle_balance()
can drop rq->lock as well as ->select_task_rq().

shenghui

unread,

Jun 29, 2010, 8:44:37 AM6/29/10

to Peter Zijlstra, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

2010/6/29 Peter Zijlstra <pet...@infradead.org>:

> Pretty much everywhere, if you look at sched.c the only sched_class
> method not called with rq->lock held is ::task_fork().
>
> The interesting bits are that schedule()->pre_schedule()/idle_balance()
> can drop rq->lock as well as ->select_task_rq().
>
>
>
>

Thanks for your instructions!

--

Thanks and Best Regards,
shenghui

Miklos Vajna

unread,

Dec 18, 2010, 9:03:13 PM12/18/10

to Peter Zijlstra, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Tue, Jun 29, 2010 at 12:43:35PM +0200, Peter Zijlstra <pet...@infradead.org> wrote:
> On Tue, 2010-06-29 at 15:10 +0800, shenghui wrote:
> > I think some lock on the metadata can fix this issue, but we may
> > change plenty of code to add support for lock. I think the easist
> > way is just substacting nr_running before dequing tasks.
>
> But all that is fully serialized by the rq->lock.. so I'm really not
> seeing how this can happen.

Hi,

Here is a panic I got today:

http://frugalware.org/~vmiklos/pics/bug/2.6.37-rc6.png

More details:

I get this sometimes on boot or shutdown when testing systemd. I did not
get it with sysvinit, so I guess it may be related to systemd's heavy
cgroups usage, but I'm not sure. Sadly it isn't 100% reproducible but I
usually hit it at least once a day.

The config is here:
http://frugalware.org/~vmiklos/logs/2.6.37-rc6.config (I just did a
yes "" | make config to update it to 2.6.37-rc6.)

I got something similar with 2.6.36.1 as well:

http://frugalware.org/~vmiklos/pics/bug/2.6.36.1.png

Ah, and this is on i686 in VMware - though given that I never had this
problem with systemd, I guess it won't be an emulator bug. :)

I'm not familiar with the sched code, is it possible that this is
related?

Thanks,

Miklos

Miklos Vajna

unread,

Dec 21, 2010, 7:22:48 PM12/21/10

to Peter Zijlstra, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, system...@lists.freedesktop.org

On Sun, Dec 19, 2010 at 03:03:13AM +0100, Miklos Vajna <vmi...@frugalware.org> wrote:
> Here is a panic I got today:
>
> http://frugalware.org/~vmiklos/pics/bug/2.6.37-rc6.png
>
> More details:
>
> I get this sometimes on boot or shutdown when testing systemd. I did not
> get it with sysvinit, so I guess it may be related to systemd's heavy
> cgroups usage, but I'm not sure. Sadly it isn't 100% reproducible but I
> usually hit it at least once a day.
>
> The config is here:
> http://frugalware.org/~vmiklos/logs/2.6.37-rc6.config (I just did a
> yes "" | make config to update it to 2.6.37-rc6.)
>
> I got something similar with 2.6.36.1 as well:
>
> http://frugalware.org/~vmiklos/pics/bug/2.6.36.1.png
>
> Ah, and this is on i686 in VMware - though given that I never had this
> problem with systemd, I guess it won't be an emulator bug. :)
>
> I'm not familiar with the sched code, is it possible that this is
> related?

A few more info:

- I reproduced the issue on my Lenovo S12 netbook as well, so it's not a
virtual machine-related bug. Here is a poor shot:
http://frugalware.org/~vmiklos/pics/bug/p1040730.jpg
- I'm also attaching the full dmesg in the VM that is finally a text
using serial console.
- On "how to reproduce": sometimes I just get this during boot. If it
does not occur, then it does when using "systemctl emergency".

Please let me know if you need more info.

Thanks,

Miklos

dmesg.txt

Peter Zijlstra

unread,

Dec 22, 2010, 3:29:49 AM12/22/10

to Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, system...@lists.freedesktop.org

On Wed, 2010-12-22 at 01:22 +0100, Miklos Vajna wrote:
> Please let me know if you need more info.

What distro are you using? it looks like systemd is involved and I'm
actively avoiding anything using that crap. It now triggering a bug
means I need to get myself a 32bit image using it :/

If you happen to have a qemu image that reproduces.. :-)

Peter Zijlstra

unread,

Dec 22, 2010, 3:41:00 AM12/22/10

to Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Wed, 2010-12-22 at 09:29 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 01:22 +0100, Miklos Vajna wrote:
> > Please let me know if you need more info.
>
> What distro are you using? it looks like systemd is involved and I'm
> actively avoiding anything using that crap. It now triggering a bug
> means I need to get myself a 32bit image using it :/
>
> If you happen to have a qemu image that reproduces.. :-)

Removed the systemd list from CC, please don't cross post with lists
that send bounces for non-members, that's considered rude.

Mike Galbraith

unread,

Dec 22, 2010, 3:41:51 AM12/22/10

to Peter Zijlstra, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, system...@lists.freedesktop.org

On Wed, 2010-12-22 at 09:29 +0100, Peter Zijlstra wrote:

> On Wed, 2010-12-22 at 01:22 +0100, Miklos Vajna wrote:
> > Please let me know if you need more info.
>
> What distro are you using? it looks like systemd is involved and I'm
> actively avoiding anything using that crap.

Heh, gearing up to hunt a bug it triggers, just doing all I had to do to
get the thing built and installed blew my box _all up_. I still have
smoldering bits lying about a week later.

-Mike

Peter Zijlstra

unread,

Dec 22, 2010, 4:07:22 AM12/22/10

to Mike Galbraith, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Wed, 2010-12-22 at 09:41 +0100, Mike Galbraith wrote:
> On Wed, 2010-12-22 at 09:29 +0100, Peter Zijlstra wrote:
> > On Wed, 2010-12-22 at 01:22 +0100, Miklos Vajna wrote:
> > > Please let me know if you need more info.
> >
> > What distro are you using? it looks like systemd is involved and I'm
> > actively avoiding anything using that crap.
>
> Heh, gearing up to hunt a bug it triggers, just doing all I had to do to
> get the thing built and installed blew my box _all up_. I still have
> smoldering bits lying about a week later.

There's a reason I was asking for a qemu image ;-), I'm not letting that
stuff anywhere near my regular machines.

Miklos Vajna

unread,

Dec 22, 2010, 8:31:55 AM12/22/10

to Peter Zijlstra, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Wed, Dec 22, 2010 at 10:07:22AM +0100, Peter Zijlstra <pet...@infradead.org> wrote:
> On Wed, 2010-12-22 at 09:41 +0100, Mike Galbraith wrote:
> > On Wed, 2010-12-22 at 09:29 +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-12-22 at 01:22 +0100, Miklos Vajna wrote:
> > > > Please let me know if you need more info.
> > >
> > > What distro are you using? it looks like systemd is involved and I'm
> > > actively avoiding anything using that crap.

First, sorry about the systemd list, I did not know it's
subscriber-only.

This is Frugalware Linux, and it's disabled by default here as well.

> > Heh, gearing up to hunt a bug it triggers, just doing all I had to do to
> > get the thing built and installed blew my box _all up_. I still have
> > smoldering bits lying about a week later.
>
> There's a reason I was asking for a qemu image ;-), I'm not letting that
> stuff anywhere near my regular machines.

Sure - here is a qcow image (871M):

http://frugalware.org/~vmiklos/files/systemd.img

Here is how I started it:

qemu-kvm -vnc :0 -m 1G -hda systemd.img

$ qemu-kvm -version
QEMU PC emulator version 0.12.3 (qemu-kvm-0.12.3), Copyright (c) 2003-2008 Fabrice Bellard

There are two menu entries in grub config - first is sysvinit, that
boots up correctly here, you can login using root / root. The second is
systemd, that just panics here almost all the time. In case it would
not, use 'systemctl emergency' as root and that will trigger the bug. :)

If you need more packages in the virtual machine and the pacman package
manager sounds exotic, here is a quick help:

http://frugalware.org/docs/pacman-g2#_apt_pacman_g2_cross_reference

Thanks,

Miklos

Peter Zijlstra

unread,

Dec 22, 2010, 9:00:22 AM12/22/10

to Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Wed, 2010-12-22 at 14:31 +0100, Miklos Vajna wrote:
>
> http://frugalware.org/~vmiklos/files/systemd.img
>
> Here is how I started it:
>
> qemu-kvm -vnc :0 -m 1G -hda systemd.img

awesome, thanks!

I started it with something like:
qemu -kernel foo-build/arch/x86/boot/bzImage -append "root=/dev/sda1
debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0"
-hda systemd.img -serial stdio -m 1G

Where foo-build/ contains a kernel build using your .config.

I'll have a poke at it..

Peter Zijlstra

unread,

Dec 22, 2010, 9:11:52 AM12/22/10

to Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Wed, 2010-12-22 at 15:00 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 14:31 +0100, Miklos Vajna wrote:
> >
> > http://frugalware.org/~vmiklos/files/systemd.img
> >
> > Here is how I started it:
> >
> > qemu-kvm -vnc :0 -m 1G -hda systemd.img
>
> awesome, thanks!
>
> I started it with something like:
> qemu -kernel foo-build/arch/x86/boot/bzImage -append "root=/dev/sda1
> debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0"
> -hda systemd.img -serial stdio -m 1G
>
> Where foo-build/ contains a kernel build using your .config.
>
> I'll have a poke at it..

Hrm,. its not really wanting to start properly..

---
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
VFS: Mounted root (ext4 filesystem) readonly on device 8:1.
Freeing unused kernel memory: 520k freed
EXT4-fs (sda1): re-mounted. Opts: (null)
md: stopping all md devices.
Restarting system.
machine restart
---

Does it _require_ initrd like muck? Or are there some modules that need
to get built-in in order for the thing to boot?

This is an utter lack of error reporting here, no idea what's wrong.

Miklos Vajna

unread,

Dec 22, 2010, 10:14:34 AM12/22/10

to Peter Zijlstra, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Wed, Dec 22, 2010 at 03:11:52PM +0100, Peter Zijlstra <pet...@infradead.org> wrote:
> > I started it with something like:
> > qemu -kernel foo-build/arch/x86/boot/bzImage -append "root=/dev/sda1
> > debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0"
> > -hda systemd.img -serial stdio -m 1G
> >
> > Where foo-build/ contains a kernel build using your .config.
> >
> > I'll have a poke at it..
>
> Hrm,. its not really wanting to start properly..
>
> ---
> EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
> VFS: Mounted root (ext4 filesystem) readonly on device 8:1.
> Freeing unused kernel memory: 520k freed
> EXT4-fs (sda1): re-mounted. Opts: (null)
> md: stopping all md devices.
> Restarting system.
> machine restart
> ---
>
> Does it _require_ initrd like muck? Or are there some modules that need
> to get built-in in order for the thing to boot?

Nope.

> This is an utter lack of error reporting here, no idea what's wrong.

I tried to do something really similar to your commandline:

qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0" -hda systemd.img -serial stdio -m 1G -vnc :0

This boots up properly here, I can login using root/root from vnc.

qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0 init=/bin/systemd" -hda systemd.img -serial stdio -m 1G -vnc :0

^ Only init=/bin/systemd added, and this results in a panic in most
cases. I'm attaching the stdout of qemu, showing the fail in
put_prev_task_fair.

$ qemu -version
QEMU emulator version 0.13.0, Copyright (c) 2003-2008 Fabrice Bellard

kernel-build is a git build using the config I already sent and after a
'git checkout v2.6.36'. I can try to build master as well, so far what I
saw is that the bug occurs there as well, but less frequently, so maybe
that's a bit harder to debug.

I also want to note that - at least on my machine - if I drop
-enable-kvm the bug is hard to reproduce, maybe that's because that way
it does not trigger a race condition or my machine is just too slow
without kvm and it triggers some timeout, changing the behaviour; I'm
not exactly sure. (But again, I can reproduce the bug on real hardware,
so I don't think we have a kvm bug here.)

Thanks,

Miklos

qemu.log

Peter Zijlstra

unread,

Dec 22, 2010, 10:25:30 AM12/22/10

to Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
>
> qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append
> "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled
> console=ttyS0 init=/bin/systemd" -hda systemd.img -serial stdio -m 1G
> -vnc :0
>
> ^ Only init=/bin/systemd added, and this results in a panic in most
> cases. I'm attaching the stdout of qemu, showing the fail in
> put_prev_task_fair.

OK, that does seem to work, maybe it really wants the vnc thing, dunno.

Anyway, it looks like I need to go build a kernel with kvm in for that
machine ;-) Thanks for the effort so far.

Peter Zijlstra

unread,

Dec 22, 2010, 12:08:38 PM12/22/10

to Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:

> kernel-build is a git build using the config I already sent and after a
> 'git checkout v2.6.36'. I can try to build master as well, so far what I
> saw is that the bug occurs there as well, but less frequently, so maybe
> that's a bit harder to debug.

Right, so I've got v2.6.36 exploding, current -git and -tip won't
explode for me, tried both about 20 times. So I'll try and figure out
wth makes .36 explode and then see if further kernel still suffer that
problem.

Ingo Molnar

unread,

Dec 22, 2010, 12:16:55 PM12/22/10

to Peter Zijlstra, Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, Greg KH

* Peter Zijlstra <pet...@infradead.org> wrote:

> On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
> > kernel-build is a git build using the config I already sent and after a
> > 'git checkout v2.6.36'. I can try to build master as well, so far what I
> > saw is that the bug occurs there as well, but less frequently, so maybe
> > that's a bit harder to debug.
>
> Right, so I've got v2.6.36 exploding, current -git and -tip won't
> explode for me, tried both about 20 times. So I'll try and figure out
> wth makes .36 explode and then see if further kernel still suffer that
> problem.

Does it explode if you change some of the CONFIG_CC flags, such as
CONFIG_CC_OPTIMIZE_FOR_SIZE? I.e. is the crash dependent on the precise kernel image
layout perhaps? If it's sensitive to that then this might keep you from chasing
shadows ...

Thanks,

Ingo

Peter Zijlstra

unread,

Dec 22, 2010, 12:25:20 PM12/22/10

to Ingo Molnar, Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, Greg KH

On Wed, 2010-12-22 at 18:16 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <pet...@infradead.org> wrote:
>
> > On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
> > > kernel-build is a git build using the config I already sent and after a
> > > 'git checkout v2.6.36'. I can try to build master as well, so far what I
> > > saw is that the bug occurs there as well, but less frequently, so maybe
> > > that's a bit harder to debug.
> >
> > Right, so I've got v2.6.36 exploding, current -git and -tip won't
> > explode for me, tried both about 20 times. So I'll try and figure out
> > wth makes .36 explode and then see if further kernel still suffer that
> > problem.
>
> Does it explode if you change some of the CONFIG_CC flags, such as
> CONFIG_CC_OPTIMIZE_FOR_SIZE? I.e. is the crash dependent on the precise kernel image
> layout perhaps? If it's sensitive to that then this might keep you from chasing
> shadows ...

I've since changed the .config slightly (enabled frame pointers, added
debug info, enabled the function tracer) and its still exploding. Also,
I suspect Miklos and me have different gcc versions.

So I suspect its really something weird, I just can't really make sense
of it yet.

Peter Zijlstra

unread,

Dec 22, 2010, 3:36:13 PM12/22/10

to Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner

On Wed, 2010-12-22 at 18:08 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
> > kernel-build is a git build using the config I already sent and after a
> > 'git checkout v2.6.36'. I can try to build master as well, so far what I
> > saw is that the bug occurs there as well, but less frequently, so maybe
> > that's a bit harder to debug.
>
> Right, so I've got v2.6.36 exploding, current -git and -tip won't
> explode for me, tried both about 20 times. So I'll try and figure out
> wth makes .36 explode and then see if further kernel still suffer that
> problem.

OK, so something goes wrong with load accounting,

Full log at: http://programming.kicks-ass.net/sekrit/log4

I print the entire runqueue before and after each op, the last good one
is a post-print:

kworker/-9 0d..2. 2013644us : pick_next_task_fair: picked: f6a8de5c, kworker/u:1/1251
kworker/-9 0d..2. 2013644us : print_runqueue <-pick_next_task_fair
kworker/-9 0d..2. 2013645us : __print_runqueue: cfs_rq: c2407c34, nr: 3, load: 3072
kworker/-9 0d..2. 2013645us : __print_runqueue: curr: f6a8de5c, comm: kworker/u:1/1251, load: 1024
kworker/-9 0d..2. 2013646us : __print_runqueue: se: f69e6300, load: 1024,
kworker/-9 0d..2. 2013646us : __print_runqueue: cfs_rq: f69e6540, nr: 2, load: 2048
kworker/-9 0d..2. 2013646us : __print_runqueue: curr: (null)
kworker/-9 0d..2. 2013647us : __print_runqueue: se: f69e65a0, load: 1024,
kworker/-9 0d..2. 2013647us : __print_runqueue: cfs_rq: f69e6240, nr: 1, load: 1024
kworker/-9 0d..2. 2013648us : __print_runqueue: curr: (null)
kworker/-9 0d..2. 2013648us : __print_runqueue: se: f725916c, comm: loadkeys/1225, load: 1024
kworker/-9 0d..2. 2013649us : __print_runqueue: se: f69e6660, load: 1024,
kworker/-9 0d..2. 2013649us : __print_runqueue: cfs_rq: f69e6600, nr: 1, load: 1024
kworker/-9 0d..2. 2013650us : __print_runqueue: curr: (null)
kworker/-9 0d..2. 2013650us : __print_runqueue: se: f6a8da0c, comm: systemd-tmpfile/1246, load: 1024
kworker/-9 0d..2. 2013651us : __print_runqueue: se: f6a8cd1c, comm: kworker/u:1/1252, load: 1024

And then the next pre-print looks like:

systemd--1251 0d..5. 2015398us : enqueue_task_fair <-enqueue_task
systemd--1251 0d..5. 2015398us : print_runqueue <-enqueue_task_fair
systemd--1251 0d..5. 2015399us : __print_runqueue: cfs_rq: c2407c34, nr: 3, load: 3072
systemd--1251 0d..5. 2015400us : __print_runqueue: curr: f6a8de5c, comm: systemd-cgroups/1251, load: 1024
systemd--1251 0d..5. 2015401us : __print_runqueue: se: f69e6300, load: 1024,
systemd--1251 0d..5. 2015401us : __print_runqueue: cfs_rq: f69e6540, nr: 2, load: 2048
systemd--1251 0d..5. 2015402us : __print_runqueue: curr: (null)
systemd--1251 0d..5. 2015402us : __print_runqueue: se: f69e65a0, load: 4137574976,
systemd--1251 0d..5. 2015402us : __print_runqueue: cfs_rq: f69e6240, nr: 1, load: 4137575360
systemd--1251 0d..5. 2015403us : __print_runqueue: curr: (null)
systemd--1251 0d..5. 2015404us : __print_runqueue: se: f725916c, comm: loadkeys/1225, load: 1024
systemd--1251 0d..5. 2015404us : __print_runqueue: se: f69e6660, load: 1024,
systemd--1251 0d..5. 2015404us : __print_runqueue: cfs_rq: f69e6600, nr: 1, load: 1024
systemd--1251 0d..5. 2015405us : __print_runqueue: curr: (null)
systemd--1251 0d..5. 2015406us : __print_runqueue: se: f6a8da0c, comm: systemd-tmpfile/1246, load: 1024
systemd--1251 0d..5. 2015406us : __print_runqueue: se: f6a8cd1c, comm: kworker/u:1/1252, load: 1024

The load value being crazy like this leads to dequeue_task_fair() making
weird decisions since it checks for load==0 to dequeue groups.

Not doing that, then leads to 'dead' groups still being enqueued and
eventually referencing freed memory etc..

I've no clue why any of this happens, it could be random scribbles,
except the se/cfs_rq combo are always off together (although not always
by the same amount).

Anyway, my brain needs sleep, I'll continue poking tomorrow.

---
kernel/sched_fair.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..8b8b038 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -333,6 +333,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
s64 key = entity_key(cfs_rq, se);
int leftmost = 1;

+ trace_printk("%p to %p, %ld tasks\n", se, cfs_rq, cfs_rq->nr_running);
+
/*
* Find the right place in the rbtree:
*/
@@ -364,6 +366,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ trace_printk("%p from %p, %ld left\n", se, cfs_rq, cfs_rq->nr_running);
+
if (cfs_rq->rb_leftmost == &se->run_node) {
struct rb_node *next_node;

@@ -394,6 +398,53 @@ static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
return rb_entry(last, struct sched_entity, run_node);
}

+#define INDENT(i) ((i)*2), " "
+
+static void notrace __print_runqueue(struct cfs_rq *cfs_rq, int indent);
+
+static void __always_inline
+__print_entity(struct sched_entity *se, char *name, int indent)
+{
+ if (!se) {
+ trace_printk("%.*s %s: %p\n", INDENT(indent), name, se);
+ return;
+ }
+
+ if (entity_is_task(se)) {
+ struct task_struct *p = task_of(se);
+ trace_printk("%.*s %s: %p, comm: %s/%d, load: %ld\n",
+ INDENT(indent), name,
+ se, p->comm, p->pid, se->load.weight);
+ } else {
+ trace_printk("%.*s %s: %p, load: %ld, ",
+ INDENT(indent), name,
+ se, se->load.weight);
+ __print_runqueue(group_cfs_rq(se), indent+1);
+ }
+}
+
+static void notrace __print_runqueue(struct cfs_rq *cfs_rq, int indent)
+{
+ struct rb_node *n;
+
+ trace_printk("%.*s cfs_rq: %p, nr: %ld, load: %ld\n",
+ INDENT(indent),
+ cfs_rq, cfs_rq->nr_running,
+ cfs_rq->load.weight);
+
+ __print_entity(cfs_rq->curr, "curr", indent);
+
+ for (n = cfs_rq->rb_leftmost; n; n = rb_next(n)) {
+ struct sched_entity *se = rb_entry(n, struct sched_entity, run_node);
+ __print_entity(se, "se", indent);
+ }
+}
+
+void print_runqueue(struct cfs_rq *cfs_rq)
+{
+ __print_runqueue(cfs_rq, 0);
+}
+
/**************************************************************
* Scheduling class statistics methods:
*/
@@ -1047,6 +1098,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;

+ print_runqueue(&rq->cfs);
for_each_sched_entity(se) {
if (se->on_rq)
break;
@@ -1055,6 +1107,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
flags = ENQUEUE_WAKEUP;
}

+ trace_printk("add task: %p, %s/%d\n", &p->se, p->comm, p->pid);
+ print_runqueue(&rq->cfs);
+
hrtick_update(rq);
}

@@ -1068,6 +1123,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;

+ print_runqueue(&rq->cfs);
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);
@@ -1077,6 +1133,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
flags |= DEQUEUE_SLEEP;
}

+ trace_printk("del task: %p, %s/%d\n", &p->se, p->comm, p->pid);
+ print_runqueue(&rq->cfs);
+
hrtick_update(rq);
}

@@ -1719,6 +1778,8 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;

+ print_runqueue(&rq->cfs);
+

if (!cfs_rq->nr_running)
return NULL;

@@ -1731,6 +1792,9 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
p = task_of(se);
hrtick_start_fair(rq, p);

+ trace_printk("picked: %p, %s/%d\n", se, p->comm, p->pid);
+ print_runqueue(&rq->cfs);
+
return p;
}

@@ -1742,10 +1806,15 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;

+ print_runqueue(&rq->cfs);
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
+
+ trace_printk("put: %p, %s/%d\n", se, prev->comm, prev->pid);
+ print_runqueue(&rq->cfs);
+
}

#ifdef CONFIG_SMP
@@ -3819,8 +3888,10 @@ static void set_curr_task_fair(struct rq *rq)
{
struct sched_entity *se = &rq->curr->se;

+ print_runqueue(&rq->cfs);
for_each_sched_entity(se)
set_next_entity(cfs_rq_of(se), se);
+ print_runqueue(&rq->cfs);
}

#ifdef CONFIG_FAIR_GROUP_SCHED

Miklos Vajna

unread,

Dec 22, 2010, 4:11:51 PM12/22/10

to Peter Zijlstra, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Wed, Dec 22, 2010 at 06:08:38PM +0100, Peter Zijlstra <pet...@infradead.org> wrote:
> On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
> > kernel-build is a git build using the config I already sent and after a
> > 'git checkout v2.6.36'. I can try to build master as well, so far what I
> > saw is that the bug occurs there as well, but less frequently, so maybe
> > that's a bit harder to debug.
>
> Right, so I've got v2.6.36 exploding, current -git and -tip won't
> explode for me, tried both about 20 times. So I'll try and figure out
> wth makes .36 explode and then see if further kernel still suffer that
> problem.

I reported the issue as I tested 2.6.36 and 2.6.37-rc6 and both failed.
Now I tried current -git and indeed I can't reproduce the issue there.

I'm now bisecting between master and rc6 to see which commit fixes the
problem - just to make sure it's not something irrelevant and the bug is
still hidden somewhere.

Miklos Vajna

unread,

Dec 22, 2010, 6:39:56 PM12/22/10

to Peter Zijlstra, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH

On Wed, Dec 22, 2010 at 10:11:51PM +0100, Miklos Vajna <vmi...@frugalware.org> wrote:
> I reported the issue as I tested 2.6.36 and 2.6.37-rc6 and both failed.
> Now I tried current -git and indeed I can't reproduce the issue there.
>
> I'm now bisecting between master and rc6 to see which commit fixes the
> problem - just to make sure it's not something irrelevant and the bug is
> still hidden somewhere.

I spoke too early. After some trying, it seems I can reproduce with
current -git as well. (But as Peter points out it's harder to
reproduce.)

Just in case it's interesting, I attached the -git dmesg. From a quick
look the trace is the same as with v2.6.36, which is easy to reproduce.

dmesg.txt

Yong Zhang

unread,

Dec 22, 2010, 9:08:56 PM12/22/10

to Peter Zijlstra, Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner

the load == f69e65a0 == address of se, odd

Thanks,
Yong

--
Only stand for myself.

Peter Zijlstra

unread,

Dec 23, 2010, 7:12:10 AM12/23/10

to Yong Zhang, Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner

On Thu, 2010-12-23 at 10:08 +0800, Yong Zhang wrote:
> > systemd--1251 0d..5. 2015398us : enqueue_task_fair <-enqueue_task
> > systemd--1251 0d..5. 2015398us : print_runqueue <-enqueue_task_fair
> > systemd--1251 0d..5. 2015399us : __print_runqueue: cfs_rq: c2407c34, nr: 3, load: 3072
> > systemd--1251 0d..5. 2015400us : __print_runqueue: curr: f6a8de5c, comm: systemd-cgroups/1251, load: 1024
> > systemd--1251 0d..5. 2015401us : __print_runqueue: se: f69e6300, load: 1024,
> > systemd--1251 0d..5. 2015401us : __print_runqueue: cfs_rq: f69e6540, nr: 2, load: 2048
> > systemd--1251 0d..5. 2015402us : __print_runqueue: curr: (null)
> > systemd--1251 0d..5. 2015402us : __print_runqueue: se: f69e65a0, load: 4137574976,
>
> the load == f69e65a0 == address of se, odd

This appears to be consistently true, I've also found that in between
these two prints, there is a free_sched_group() freeing that exact
entry. So post-print is a use-after-free artifact.

What's interesting is that its freeing a cfs_rq struct with
nr_running=1, that should not be possible...

/me goes stare at the whole cgroup task attach vs cgroup destruction
muck.

Peter Zijlstra

unread,

Dec 23, 2010, 7:33:44 AM12/23/10

to Yong Zhang, Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner

On Thu, 2010-12-23 at 13:12 +0100, Peter Zijlstra wrote:
> On Thu, 2010-12-23 at 10:08 +0800, Yong Zhang wrote:
> > > systemd--1251 0d..5. 2015398us : enqueue_task_fair <-enqueue_task
> > > systemd--1251 0d..5. 2015398us : print_runqueue <-enqueue_task_fair
> > > systemd--1251 0d..5. 2015399us : __print_runqueue: cfs_rq: c2407c34, nr: 3, load: 3072
> > > systemd--1251 0d..5. 2015400us : __print_runqueue: curr: f6a8de5c, comm: systemd-cgroups/1251, load: 1024
> > > systemd--1251 0d..5. 2015401us : __print_runqueue: se: f69e6300, load: 1024,
> > > systemd--1251 0d..5. 2015401us : __print_runqueue: cfs_rq: f69e6540, nr: 2, load: 2048
> > > systemd--1251 0d..5. 2015402us : __print_runqueue: curr: (null)
> > > systemd--1251 0d..5. 2015402us : __print_runqueue: se: f69e65a0, load: 4137574976,
> >
> > the load == f69e65a0 == address of se, odd
>
> This appears to be consistently true, I've also found that in between
> these two prints, there is a free_sched_group() freeing that exact
> entry. So post-print is a use-after-free artifact.
>
> What's interesting is that its freeing a cfs_rq struct with
> nr_running=1, that should not be possible...
>
> /me goes stare at the whole cgroup task attach vs cgroup destruction
> muck.

systemd-1 0d..1. 2070793us : sched_destroy_group: se: f69e43c0, load: 1024
systemd-1 0d..1. 2070794us : sched_destroy_group: cfs_rq: f69e4720, nr: 1, load: 1024
systemd-1 0d..1. 2070794us : __print_runqueue: cfs_rq: f69e4720, nr: 1, load: 1024
systemd-1 0d..1. 2070795us : __print_runqueue: curr: (null)
systemd-1 0d..1. 2070796us : __print_runqueue: se: f6a8eb4c, comm: systemd-tmpfile/1243, load: 1024
systemd-1 0d..1. 2070796us : _raw_spin_unlock_irqrestore <-sched_destroy_group

So somehow it manages to destroy a group with a task attached.

Peter Zijlstra

unread,

Dec 23, 2010, 1:24:30 PM12/23/10

to Yong Zhang, Miklos Vajna, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner, Li Zefan

On Thu, 2010-12-23 at 13:33 +0100, Peter Zijlstra wrote:

> systemd-1 0d..1. 2070793us : sched_destroy_group: se: f69e43c0, load: 1024
> systemd-1 0d..1. 2070794us : sched_destroy_group: cfs_rq: f69e4720, nr: 1, load: 1024
> systemd-1 0d..1. 2070794us : __print_runqueue: cfs_rq: f69e4720, nr: 1, load: 1024
> systemd-1 0d..1. 2070795us : __print_runqueue: curr: (null)
> systemd-1 0d..1. 2070796us : __print_runqueue: se: f6a8eb4c, comm: systemd-tmpfile/1243, load: 1024
> systemd-1 0d..1. 2070796us : _raw_spin_unlock_irqrestore <-sched_destroy_group
>
> So somehow it manages to destroy a group with a task attached.

Its even weirder:

systemd-1 0d..1. 1663489us : sched_destroy_group: se: f69e7360, load: 1024
systemd-1 0d..1. 1663489us : sched_destroy_group: cfs_rq: f69e72a0, nr: 1, load: 1024
systemd-1 0d..1. 1663491us : __print_runqueue: cfs_rq: f69e72a0, nr: 1, load: 1024, cgroup: /system/systemd-sysctl.service
systemd-1 0d..1. 1663491us : __print_runqueue: curr: (null)
systemd-1 0d..1. 1663493us : __print_runqueue: se: f69d95bc, comm: systemd-sysctl/1209, load: 1024, cgroup: /
systemd-1 0d..1. 1663496us : do_invalid_op <-error_code

The task enqueued to the cfs_rq doesn't match the cgroup, the thing is,
I don't see a cpu_cgroup_attach/sched_move_task call in the log, nor
does a BUG_ON() validating the task's cgroup against the cfs_rq's cgroup
on account_entity_enqueue() trigger.

So it looks like a task changes cgroup without passing through the
cgroup_subsys::attach method, which afaict isn't supposed to happen.

---
kernel/sched.c | 26 ++++++++++++++-
kernel/sched_fair.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..bc30efb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8033,10 +8033,20 @@ static void free_fair_sched_group(struct task_group *tg)
int i;

for_each_possible_cpu(i) {
- if (tg->cfs_rq)
+ if (tg->cfs_rq) {
+ struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+ if (cfs_rq) {
+ trace_printk("cfs_rq: %p, nr: %ld, load: %ld\n",
+ cfs_rq, cfs_rq->nr_running, cfs_rq->load.weight);
+ }
kfree(tg->cfs_rq[i]);
- if (tg->se)
+ }
+ if (tg->se) {
+ struct sched_entity *se = tg->se[i];
+ if (se)
+ trace_printk("se: %p, load: %ld\n", se, se->load.weight);
kfree(tg->se[i]);
+ }
}

kfree(tg->cfs_rq);
@@ -8092,6 +8102,18 @@ static inline void register_fair_sched_group(struct task_group *tg, int cpu)

static inline void unregister_fair_sched_group(struct task_group *tg, int cpu)
{
+ struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+ struct sched_entity *se = tg->se[cpu];
+
+ trace_printk("se: %p, load: %ld\n", se, se->load.weight);
+ trace_printk("cfs_rq: %p, nr: %ld, load: %ld\n",
+ cfs_rq, cfs_rq->nr_running, cfs_rq->load.weight);
+
+ if (cfs_rq->nr_running) {
+ print_runqueue(cfs_rq);
+ BUG();
+ }
+
list_del_rcu(&tg->cfs_rq[cpu]->leaf_cfs_rq_list);
}
#else /* !CONFG_FAIR_GROUP_SCHED */
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..37d918a 100644

--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -333,6 +333,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
s64 key = entity_key(cfs_rq, se);
int leftmost = 1;

+ trace_printk("%p to %p, %ld tasks\n", se, cfs_rq, cfs_rq->nr_running);
+
/*
* Find the right place in the rbtree:
*/
@@ -364,6 +366,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
+ trace_printk("%p from %p, %ld left\n", se, cfs_rq, cfs_rq->nr_running);
+
if (cfs_rq->rb_leftmost == &se->run_node) {
struct rb_node *next_node;

@@ -394,6 +398,70 @@ static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)

return rb_entry(last, struct sched_entity, run_node);
}

+#define INDENT(i) ((i)*2), " "
+
+static void notrace __print_runqueue(struct cfs_rq *cfs_rq, int indent);
+

+static __always_inline
+char *task_group_path(struct task_group *tg, char *buf, int buflen)
+{
+ /* may be NULL if the underlying cgroup isn't fully-created yet */
+ if (!tg->css.cgroup)
+ buf[0] = '\0';
+ else
+ cgroup_path(tg->css.cgroup, buf, buflen);
+
+ return buf;
+}

+
+static void __always_inline
+__print_entity(struct sched_entity *se, char *name, int indent)
+{
+ if (!se) {
+ trace_printk("%.*s %s: %p\n", INDENT(indent), name, se);
+ return;
+ }
+
+ if (entity_is_task(se)) {
+ struct task_struct *p = task_of(se);

+ char buf[64];
+
+ trace_printk("%.*s %s: %p, comm: %s/%d, load: %ld, cgroup: %s\n",
+ INDENT(indent), name,
+ se, p->comm, p->pid, se->load.weight,
+ task_group_path(task_group(p), buf, sizeof(buf)));

+ } else {
+ trace_printk("%.*s %s: %p, load: %ld, ",
+ INDENT(indent), name,
+ se, se->load.weight);
+ __print_runqueue(group_cfs_rq(se), indent+1);
+ }
+}
+
+static void notrace __print_runqueue(struct cfs_rq *cfs_rq, int indent)
+{
+ struct rb_node *n;

+ char buf[64];
+
+ trace_printk("%.*s cfs_rq: %p, nr: %ld, load: %ld, cgroup: %s\n",

+ INDENT(indent),
+ cfs_rq, cfs_rq->nr_running,

+ cfs_rq->load.weight,
+ task_group_path(cfs_rq->tg, buf, sizeof(buf)));

+
+ __print_entity(cfs_rq->curr, "curr", indent);
+
+ for (n = cfs_rq->rb_leftmost; n; n = rb_next(n)) {
+ struct sched_entity *se = rb_entry(n, struct sched_entity, run_node);
+ __print_entity(se, "se", indent);
+ }
+}
+

+void notrace print_runqueue(struct cfs_rq *cfs_rq)

+{
+ __print_runqueue(cfs_rq, 0);
+}
+
/**************************************************************
* Scheduling class statistics methods:
*/

@@ -631,6 +699,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (entity_is_task(se)) {
add_cfs_task_weight(cfs_rq, se->load.weight);
list_add(&se->group_node, &cfs_rq->tasks);
+ BUG_ON(task_group(task_of(se)) != cfs_rq->tg);
}
cfs_rq->nr_running++;
se->on_rq = 1;
@@ -1047,6 +1116,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;

+ print_runqueue(&rq->cfs);
for_each_sched_entity(se) {
if (se->on_rq)
break;

@@ -1055,6 +1125,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)

flags = ENQUEUE_WAKEUP;
}

+ trace_printk("add task: %p, %s/%d\n", &p->se, p->comm, p->pid);
+ print_runqueue(&rq->cfs);
+
hrtick_update(rq);
}

@@ -1068,6 +1141,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;

+ print_runqueue(&rq->cfs);
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);

@@ -1077,6 +1151,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)

flags |= DEQUEUE_SLEEP;
}

+ trace_printk("del task: %p, %s/%d\n", &p->se, p->comm, p->pid);
+ print_runqueue(&rq->cfs);
+
hrtick_update(rq);
}

@@ -1719,6 +1796,8 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)

struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;

+ print_runqueue(&rq->cfs);
+
if (!cfs_rq->nr_running)
return NULL;

@@ -1731,6 +1810,9 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)

p = task_of(se);
hrtick_start_fair(rq, p);

+ trace_printk("picked: %p, %s/%d\n", se, p->comm, p->pid);
+ print_runqueue(&rq->cfs);
+
return p;
}

@@ -1742,10 +1824,15 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)

struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;

+ print_runqueue(&rq->cfs);
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
+
+ trace_printk("put: %p, %s/%d\n", se, prev->comm, prev->pid);
+ print_runqueue(&rq->cfs);
+
}

#ifdef CONFIG_SMP

@@ -3819,8 +3906,10 @@ static void set_curr_task_fair(struct rq *rq)

{
struct sched_entity *se = &rq->curr->se;

+ print_runqueue(&rq->cfs);
for_each_sched_entity(se)
set_next_entity(cfs_rq_of(se), se);
+ print_runqueue(&rq->cfs);
}

#ifdef CONFIG_FAIR_GROUP_SCHED

--

Peter Zijlstra

unread,

Dec 24, 2010, 10:59:13 AM12/24/10

to Mike Galbraith, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

On Fri, 2010-12-24 at 13:16 +0100, Mike Galbraith wrote:
> On Fri, 2010-12-24 at 11:54 +0100, Peter Zijlstra wrote:

> > Right, so the cgroup core is supposed to already emit -EBUSY when there
> > are associated tasks with the cgroup, that _should_ be sufficient, the
> > pre_destroy() method is to frob some extra constraints or somesuch.
> >
> > Our problem looks to be that a task (afaict usually current) changes
> > cgroups without us getting notified of it. On destruction the task is
> > still enqueued in the cfs_rq being destroyed but is not actually part of
> > that cgroup according to the task->css bits.
>
> Could it be an exiting task? We're still preemptible, and iirc, you run
> a CONFIG_PREEMPT kernel. (grasp at all straws;)
>
> cgroup_exit:
> /* Reassign the task to the init_css_set. */
> task_lock(tsk);
> cg = tsk->cgroups;
> tsk->cgroups = &init_css_set;
> task_unlock(tsk);
> if (cg)
> put_css_set_taskexit(cg);
>

This straw appears true:

$ grep -e cpu_cgroup\\\|f491447c log9

...

kworker/-1196 0d..2. 1601180us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service
kworker/-1196 0d..2. 1601186us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service
kworker/-1196 0d..2. 1601188us : __dequeue_entity: f491447c from f492a480, 1 left
kworker/-1196 0d..2. 1601188us : pick_next_task_fair: picked: f491447c, modprobe/1210
kworker/-1196 0d..2. 1601192us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /system/systemd-modules-load.service
modprobe-1210 0d..5. 1601802us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..5. 1601807us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601817us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601819us : __enqueue_entity: f491447c to f492a480, 1 tasks
modprobe-1210 0d..2. 1601826us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601832us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601839us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601848us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601854us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601860us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601865us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601871us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
kworker/-1196 0d..2. 1601872us : __dequeue_entity: f491447c from f492a480, 1 left
kworker/-1196 0d..2. 1601873us : pick_next_task_fair: picked: f491447c, modprobe/1210
kworker/-1196 0d..2. 1601876us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 0, load: 1024, cgroup: /
modprobe-1210 0d..7. 1601895us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..7. 1601900us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601909us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601911us : __enqueue_entity: f491447c to f492a480, 1 tasks
modprobe-1210 0d..2. 1601918us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601924us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..2. 1601931us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602071us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602080us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602089us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602097us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602105us : __print_runqueue: se: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
kworker/-1196 0d..2. 1602107us : __dequeue_entity: f491447c from f492a480, 1 left
kworker/-1196 0d..2. 1602108us : pick_next_task_fair: picked: f491447c, modprobe/1210
kworker/-1196 0d..2. 1602114us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 16, load: 1024, cgroup: /
modprobe-1210 0d..3. 1602128us : __print_runqueue: curr: f491447c, comm: modprobe/1210, state: 80, load: 1024, cgroup: /

So cgroup moves a task without calling cgroup_subsys::attach() which is
odd, but it does have an ::exit method, sadly it calls that _before_
re-assigning the task, which means we have to jump through some hoops.

The below seems to fix the problem for me..

---
Subject: sched, cgroup: Use exit hook to avoid use-after-free crash

By not notifying the controller of the on-exit move back to
init_css_set, we fail to move the task out of the previous cgroup's
cfs_rq. This leads to an opportunity for a cgroup-destroy to come in and
free the cgroup (there are no active tasks left in it after all) to
which the not-quite dead task is still enqueued.

Cc: sta...@kernel.org
Reported-by: Miklos Vajna <vmi...@frugalware.org>
Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
---
kernel/sched.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 7e401f8..572625c 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -611,6 +611,9 @@ static inline struct task_group *task_group(struct task_struct *p)
struct task_group *tg;
struct cgroup_subsys_state *css;

+ if (p->flags & PF_EXITING)
+ return &root_task_group;
+
css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
tg = container_of(css, struct task_group, css);
@@ -8887,6 +8890,12 @@ cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,
}
}

+static void
+cpu_cgroup_exit(struct cgroup_subsys *ss, struct task_struct *task)
+{
+ sched_move_task(task);
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
u64 shareval)
@@ -8959,6 +8968,7 @@ struct cgroup_subsys cpu_cgroup_subsys = {
.destroy = cpu_cgroup_destroy,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
+ .exit = cpu_cgroup_exit,
.populate = cpu_cgroup_populate,
.subsys_id = cpu_cgroup_subsys_id,
.early_init = 1,

Miklos Vajna

unread,

Dec 24, 2010, 11:40:43 AM12/24/10

to Peter Zijlstra, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

> Subject: sched, cgroup: Use exit hook to avoid use-after-free crash
>
> By not notifying the controller of the on-exit move back to
> init_css_set, we fail to move the task out of the previous cgroup's
> cfs_rq. This leads to an opportunity for a cgroup-destroy to come in and
> free the cgroup (there are no active tasks left in it after all) to
> which the not-quite dead task is still enqueued.
>
> Cc: sta...@kernel.org
> Reported-by: Miklos Vajna <vmi...@frugalware.org>
> Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
> ---
> kernel/sched.c | 10 ++++++++++
> 1 files changed, 10 insertions(+), 0 deletions(-)

Thanks! :)

Reported-and-tested-by: Miklos Vajna <vmi...@frugalware.org>

Mike Galbraith

unread,

Dec 24, 2010, 11:48:17 AM12/24/10

to Peter Zijlstra, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

On Fri, 2010-12-24 at 16:59 +0100, Peter Zijlstra wrote:

> So cgroup moves a task without calling cgroup_subsys::attach() which is
> odd, but it does have an ::exit method, sadly it calls that _before_
> re-assigning the task, which means we have to jump through some hoops.

Could you do the move in cgroup_exit() in the CONFIG_PREEMPT case?

-Mike

Peter Zijlstra

unread,

Dec 24, 2010, 12:07:37 PM12/24/10

to Mike Galbraith, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

On Fri, 2010-12-24 at 17:48 +0100, Mike Galbraith wrote:
> On Fri, 2010-12-24 at 16:59 +0100, Peter Zijlstra wrote:
>
> > So cgroup moves a task without calling cgroup_subsys::attach() which is
> > odd, but it does have an ::exit method, sadly it calls that _before_
> > re-assigning the task, which means we have to jump through some hoops.
>
> Could you do the move in cgroup_exit() in the CONFIG_PREEMPT case?

I'm not really comfortable relying on that.. voluntary might just grow a
cond_resched() somewhere in the exit path and lead us down the same
path, also I think that !preempt smp might have the same race.

Mike Galbraith

unread,

Dec 24, 2010, 12:24:51 PM12/24/10

to Peter Zijlstra, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

On Fri, 2010-12-24 at 18:07 +0100, Peter Zijlstra wrote:
> On Fri, 2010-12-24 at 17:48 +0100, Mike Galbraith wrote:
> > On Fri, 2010-12-24 at 16:59 +0100, Peter Zijlstra wrote:
> >
> > > So cgroup moves a task without calling cgroup_subsys::attach() which is
> > > odd, but it does have an ::exit method, sadly it calls that _before_
> > > re-assigning the task, which means we have to jump through some hoops.
> >
> > Could you do the move in cgroup_exit() in the CONFIG_PREEMPT case?
>
> I'm not really comfortable relying on that.. voluntary might just grow a
> cond_resched() somewhere in the exit path and lead us down the same
> path, also I think that !preempt smp might have the same race.

Yeah, good point.

-Mike

Balbir Singh

unread,

Dec 25, 2010, 12:55:25 PM12/25/10

to Peter Zijlstra, Mike Galbraith, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Srivatsa Vaddagiri

* Peter Zijlstra <pet...@infradead.org> [2010-12-24 16:59:13]:

Very good catch!

Looks very reasonable and correct to me

Acked-by: Balbir Singh <bal...@linux.vnet.ibm.com>

--
Three Cheers,
Balbir

Paul Menage

unread,

Dec 25, 2010, 3:59:07 PM12/25/10

to Peter Zijlstra, Mike Galbraith, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Balbir Singh, Srivatsa Vaddagiri

On Fri, Dec 24, 2010 at 3:59 PM, Peter Zijlstra <pet...@infradead.org> wrote:
> Subject: sched, cgroup: Use exit hook to avoid use-after-free crash
>
> By not notifying the controller of the on-exit move back to
> init_css_set, we fail to move the task out of the previous cgroup's
> cfs_rq. This leads to an opportunity for a cgroup-destroy to come in and
> free the cgroup (there are no active tasks left in it after all) to
> which the not-quite dead task is still enqueued.

While this patch is likely fine for solving the problem, it does add
extra work into the task exit path.

Could you instead just use the pre_destroy callback to return -EBUSY
if there are still any tasks on the cfs_rq? That way there'd only be a
penalty on cgroup destruction, which is a much rarer operation.

Paul

Ingo Molnar

unread,

Dec 29, 2010, 10:25:22 AM12/29/10

to Peter Zijlstra, Mike Galbraith, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

I tried this patch, but it causes a boot crash:

udevd[109]: delete_path: rmdir(/dev/.udev/failed) failed: Read-only file sgeneral protection fault: 0000 [#1] SMP
last sysfs file: /sys/bus/mdio_bus/drivers/Generic PHY/uevent
CPU 0
Modules linked in:

ystem
udevd[10Pid: 109, comm: udevd Not tainted 2.6.37-rc8-tip-01815-g4e7b4f4-dirty #76296 A8N-E/System Product Name
RIP: 0010:[<ffffffff810276ff>] 9]: delete_path: [<ffffffff810276ff>] task_group+0x4d/0x53
RSP: 0018:ffff88003d33bd50 EFLAGS: 00010046
RAX: 6b6b6b6b6b6b6b6b RBX: ffff88003d06f460 RCX: ffffffff00000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88003d06f460
rmdir(/dev/.udeRBP: ffff88003d33bd50 R08: ffff88003e40ad68 R09: 000000000000005a
R10: ffff88003f7d62c8 R11: ffff88003d086d80 R12: 0000000000000000
v/failed) failedR13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f0b23077780(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
: Read-only fileCS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f0b222cb000 CR3: 000000003d36d000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
system
udevd[DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process udevd (pid: 109, threadinfo ffff88003d33a000, task ffff88003e63ee40)
Stack:
ffff88003d33bd70 ffffffff8102771c ffff88003d06f460 0000000000000000
ffff88003d33bd90 ffffffff81027c9c ffff88003d06f460 ffff88003fc11000
ffff88003d33bdd0 ffffffff81034d6d109]: delete_pat ffff88003d06f460 0000000000000246
Call Trace:
[<ffffffff8102771c>] set_task_rq+0x17/0x73
h: rmdir(/dev/.u [<ffffffff81027c9c>] task_move_group_fair+0x37/0x53
dev/failed) fail [<ffffffff81034d6d>] sched_move_task+0x71/0xbc
ed: Read-only fi [<ffffffff81034dc9>] cpu_cgroup_exit+0x11/0x13
[<ffffffff81070e88>] cgroup_exit+0x35/0xd4
le system
udev [<ffffffff8103706a>] copy_process+0xf3a/0xfc7
d[109]: delete_p [<ffffffff81037283>] do_fork+0x163/0x2d7
[<ffffffff810b456a>] ? do_munmap+0x2f2/0x30b
[<ffffffff8100a0c2>] sys_clone+0x28/0x2a
ath: rmdir(/dev/ [<ffffffff81002cf3>] stub_clone+0x13/0x20
[<ffffffff81002a8b>] ? system_call_fastpath+0x16/0x1b
Code: 15 cb 0f 4e 00 85 .udev/failed) fad2 74 12 48 3d 30 03 5a 81 75 0a 48 81 7f 30 90 ad 33 81 iled: Read-only 74 02 c9 c3 48 8b 87 00 04 00 00 48 8b 80 70 01 00 00 <48> 8b 40 08 eb ea 55 48 89 e5 41 54 53 e8 6f b1 fd ff 48 89 fb
RIP [<ffffffff810276ff>] task_group+0x4d/0x53
RSP <ffff88003d33bd50>
---[ end trace cde1903a0ae5345d ]---
Kernel panic - not syncing: Fatal exception
Pid: 109, comm: udevd Tainted: G D 2.6.37-rc8-tip-01815-g4e7b4f4-dirty #76296
Call Trace:
[<ffffffff8132a7d6>] ? panic+0x91/0x194
[<ffffffff8103931c>] ? kmsg_dump+0x11a/0x136
[<ffffffff8132d357>] ? oops_end+0xae/0xbe
[<ffffffff810060eb>] ? die+0x5a/0x63
[<ffffffff8132cdaa>] ? do_general_protection+0x133/0x13b
[<ffffffff8132c8af>] ? general_protection+0x1f/0x30
[<ffffffff810276ff>] ? task_group+0x4d/0x53
[<ffffffff8102771c>] ? set_task_rq+0x17/0x73
[<ffffffff81027c9c>] ? task_move_group_fair+0x37/0x53
[<ffffffff81034d6d>] ? sched_move_task+0x71/0xbc
[<ffffffff81034dc9>] ? cpu_cgroup_exit+0x11/0x13
[<ffffffff81070e88>] ? cgroup_exit+0x35/0xd4
[<ffffffff8103706a>] ? copy_process+0xf3a/0xfc7
[<ffffffff81037283>] ? do_fork+0x163/0x2d7
[<ffffffff810b456a>] ? do_munmap+0x2f2/0x30b
[<ffffffff8100a0c2>] ? sys_clone+0x28/0x2a
[<ffffffff81002cf3>] ? stub_clone+0x13/0x20
[<ffffffff81002a8b>] ? system_call_fastpath+0x16/0x1b

Config attached.

Ingo

config

Miklos Vajna

unread,

Dec 29, 2010, 6:07:44 PM12/29/10

to Ingo Molnar, Peter Zijlstra, Mike Galbraith, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

On Wed, Dec 29, 2010 at 04:25:22PM +0100, Ingo Molnar <mi...@elte.hu> wrote:
> I tried this patch, but it causes a boot crash:

Hm, indeed. (I get a crash in qemu, but not on the host machine.)

qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0 init=/bin/systemd" -hda systemd.img -serial stdio -m 1G -vnc :0

does not crash here, but

qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0" -hda systemd.img -serial stdio -m 1G -vnc :0

does.

I'm attaching the config (what I already sent earlier in this thread)
and the output of the above two commands just in case that helps Peter.

config.txt

out-systemd.txt

out-sysvinit.txt

Mike Galbraith

unread,

Dec 31, 2010, 3:32:30 AM12/31/10

to Ingo Molnar, Peter Zijlstra, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

On Wed, 2010-12-29 at 16:25 +0100, Ingo Molnar wrote:

> I tried this patch, but it causes a boot crash:

The below should fix it.

sched: fix autogroup reference leak and cpu_cgroup_exit() explosion

In the event of a fork failure, the new cpu_cgroup_exit() method tries to
move an unhashed task. Since PF_EXITING isn't set in that case, autogroup
will dig aground in a freed signal_struct. Neither cgroups nor autogroup
has anything it needs to do with this shade, so don't go there.

This also uncovered a struct autogroup reference leak. copy_process() was
simply freeing vs putting the signal_struct, stranding a reference.

Signed-off-by: Mike Galbraith <efa...@gmx.de>

---
kernel/fork.c | 2 +-
kernel/sched.c | 10 ++++++++++
2 files changed, 11 insertions(+), 1 deletion(-)

Index: linux-2.6.37.git/kernel/fork.c
===================================================================
--- linux-2.6.37.git.orig/kernel/fork.c
+++ linux-2.6.37.git/kernel/fork.c
@@ -1318,7 +1318,7 @@ bad_fork_cleanup_mm:
}
bad_fork_cleanup_signal:
if (!(clone_flags & CLONE_THREAD))
- free_signal_struct(p->signal);
+ put_signal_struct(p->signal);
bad_fork_cleanup_sighand:
__cleanup_sighand(p->sighand);
bad_fork_cleanup_fs:
Index: linux-2.6.37.git/kernel/sched.c
===================================================================
--- linux-2.6.37.git.orig/kernel/sched.c
+++ linux-2.6.37.git/kernel/sched.c
@@ -9193,6 +9193,16 @@ cpu_cgroup_attach(struct cgroup_subsys *
static void

cpu_cgroup_exit(struct cgroup_subsys *ss, struct task_struct *task)
{

+ /*
+ * cgroup_exit() is called in the copy_process failure path.
+ * The task isn't hashed, and we don't want to make autogroup
+ * dig into a freed signal_struct, so just go away.
+ *
+ * XXX: why are cgroup methods diddling unattached tasks?
+ */
+ if (!(task->flags & PF_EXITING))
+ return;
+
sched_move_task(task);

Mike Galbraith

unread,

Dec 31, 2010, 5:04:19 AM12/31/10

to Miklos Vajna, Ingo Molnar, Peter Zijlstra, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

Your crash seems to be completely independent of Peter's patch. It
crashes here with your image/config/2.6.36 here regardless, and not 100%
repeatably. Sometimes it boots, sometimes not.

-Mike

Miklos Vajna

unread,

Dec 31, 2010, 5:46:38 AM12/31/10

to Mike Galbraith, Ingo Molnar, Peter Zijlstra, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

On Fri, Dec 31, 2010 at 11:04:19AM +0100, Mike Galbraith <efa...@gmx.de> wrote:
> Your crash seems to be completely independent of Peter's patch. It
> crashes here with your image/config/2.6.36 here regardless, and not 100%
> repeatably. Sometimes it boots, sometimes not.

Oh, then ignore that. It's possible it's something qemu-specific, I
can't reproduce this one on my real machine.

Peter Zijlstra

unread,

Jan 3, 2011, 2:06:41 AM1/3/11

to Paul Menage, Mike Galbraith, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, mi...@elte.hu, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Balbir Singh, Srivatsa Vaddagiri

On Sat, 2010-12-25 at 20:59 +0000, Paul Menage wrote:

> While this patch is likely fine for solving the problem, it does add
> extra work into the task exit path.
>
> Could you instead just use the pre_destroy callback to return -EBUSY
> if there are still any tasks on the cfs_rq? That way there'd only be a
> penalty on cgroup destruction, which is a much rarer operation.

No, that's broken! That would mean userspace gets the cgroup-empty
notification thing and then encounters -EBUSY, which then forces it to
go poll the state, rendering the whole notification thing pointless.

Peter Zijlstra

unread,

Jan 3, 2011, 3:21:25 AM1/3/11

to Mike Galbraith, Ingo Molnar, Miklos Vajna, shenghui, kernel-...@vger.kernel.org, linux-...@vger.kernel.org, Greg KH, Paul Turner, Yong Zhang, Li Zefan, Paul Menage, Balbir Singh, Srivatsa Vaddagiri

Ah, that looks plausible. I've folded this chunk into my patch and kept
your fork-fail mod in a separate patch.

tip-bot for Mike Galbraith

unread,

Jan 4, 2011, 9:19:13 AM1/4/11

to linux-ti...@vger.kernel.org, linux-...@vger.kernel.org, h...@zytor.com, mi...@redhat.com, a.p.zi...@chello.nl, efa...@gmx.de, ol...@redhat.com, tg...@linutronix.de, mi...@elte.hu

Commit-ID: 101e5f77bf35679809586e250b6c62193d2ed179
Gitweb: http://git.kernel.org/tip/101e5f77bf35679809586e250b6c62193d2ed179
Author: Mike Galbraith <efa...@gmx.de>
AuthorDate: Fri, 31 Dec 2010 09:32:30 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Tue, 4 Jan 2011 15:10:36 +0100

sched, autogroup: Fix reference leak

The cgroup exit mess also uncovered a struct autogroup reference leak.

copy_process() was simply freeing vs putting the signal_struct,
stranding a reference.

Signed-off-by: Mike Galbraith <efa...@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: Oleg Nesterov <ol...@redhat.com>
LKML-Reference: <1293784350....@marge.simson.net>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
---
kernel/fork.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index b6f2475..0672444 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1317,7 +1317,7 @@ bad_fork_cleanup_mm:

}
bad_fork_cleanup_signal:
if (!(clone_flags & CLONE_THREAD))
- free_signal_struct(p->signal);
+ put_signal_struct(p->signal);
bad_fork_cleanup_sighand:
__cleanup_sighand(p->sighand);
bad_fork_cleanup_fs:

Oleg Nesterov

unread,

Jan 4, 2011, 9:57:22 AM1/4/11

to tip-bot for Mike Galbraith, linux-ti...@vger.kernel.org, linux-...@vger.kernel.org, h...@zytor.com, mi...@redhat.com, a.p.zi...@chello.nl, tg...@linutronix.de, mi...@elte.hu

On 01/04, tip-bot for Mike Galbraith wrote:
>
> The cgroup exit mess also uncovered a struct autogroup reference leak.
> copy_process() was simply freeing vs putting the signal_struct,
> stranding a reference.
>
> Signed-off-by: Mike Galbraith <efa...@gmx.de>
> Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
> Cc: Oleg Nesterov <ol...@redhat.com>
> LKML-Reference: <1293784350....@marge.simson.net>
> Signed-off-by: Ingo Molnar <mi...@elte.hu>
> ---
> kernel/fork.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b6f2475..0672444 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1317,7 +1317,7 @@ bad_fork_cleanup_mm:
> }
> bad_fork_cleanup_signal:
> if (!(clone_flags & CLONE_THREAD))
> - free_signal_struct(p->signal);
> + put_signal_struct(p->signal);

Well, free_signal_struct() was correct. Without CLONE_THREAD
sig->sigcnt must be equal to 1.

But yes, autogroup puts sched_autogroup_exit() into put_signal_struct(),
so this patch looks fine.

Although I must admit, to me it would be more clean to simply move
sched_autogroup_exit() from put_signal_struct() into free_signal_struct()
instead.

Oleg.

Mike Galbraith

unread,

Jan 4, 2011, 2:06:24 PM1/4/11

to Oleg Nesterov, linux-ti...@vger.kernel.org, linux-...@vger.kernel.org, h...@zytor.com, mi...@redhat.com, a.p.zi...@chello.nl, tg...@linutronix.de, mi...@elte.hu

On Tue, 2011-01-04 at 15:57 +0100, Oleg Nesterov wrote:

> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index b6f2475..0672444 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1317,7 +1317,7 @@ bad_fork_cleanup_mm:
> > }
> > bad_fork_cleanup_signal:
> > if (!(clone_flags & CLONE_THREAD))
> > - free_signal_struct(p->signal);
> > + put_signal_struct(p->signal);
>
> Well, free_signal_struct() was correct. Without CLONE_THREAD
> sig->sigcnt must be equal to 1.

Yeah, it was only about the autogroup reference leak.

> But yes, autogroup puts sched_autogroup_exit() into put_signal_struct(),
> so this patch looks fine.
>
> Although I must admit, to me it would be more clean to simply move
> sched_autogroup_exit() from put_signal_struct() into free_signal_struct()
> instead.

OK, I'll send a move it patchlet.

-Mike

tip-bot for Peter Zijlstra

unread,

Jan 19, 2011, 2:04:06 PM1/19/11

to linux-ti...@vger.kernel.org, linux-...@vger.kernel.org, h...@zytor.com, mi...@redhat.com, a.p.zi...@chello.nl, efa...@gmx.de, pet...@infradead.org, vmi...@frugalware.org, sta...@kernel.org, tg...@linutronix.de, mi...@elte.hu

Commit-ID: 068c5cc5ac7414a8e9eb7856b4bf3cc4d4744267
Gitweb: http://git.kernel.org/tip/068c5cc5ac7414a8e9eb7856b4bf3cc4d4744267
Author: Peter Zijlstra <pet...@infradead.org>
AuthorDate: Wed, 19 Jan 2011 12:26:11 +0100
Committer: Ingo Molnar <mi...@elte.hu>
CommitDate: Wed, 19 Jan 2011 12:51:32 +0100

sched, cgroup: Use exit hook to avoid use-after-free crash

By not notifying the controller of the on-exit move back to
init_css_set, we fail to move the task out of the previous
cgroup's cfs_rq. This leads to an opportunity for a
cgroup-destroy to come in and free the cgroup (there are no
active tasks left in it after all) to which the not-quite dead
task is still enqueued.

Reported-by: Miklos Vajna <vmi...@frugalware.org>
Fixed-by: Mike Galbraith <efa...@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zi...@chello.nl>
Cc: <sta...@kernel.org>
Cc: Mike Galbraith <efa...@gmx.de>
Signed-off-by: Ingo Molnar <mi...@elte.hu>
LKML-Reference: <1293206353.29444.205.camel@laptop>
---
kernel/sched.c | 18 ++++++++++++++++++
1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0a169a8..fa5272a 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -606,6 +606,9 @@ static inline struct task_group *task_group(struct task_struct *p)

struct task_group *tg;
struct cgroup_subsys_state *css;

+ if (p->flags & PF_EXITING)
+ return &root_task_group;
+
css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
tg = container_of(css, struct task_group, css);

@@ -8880,6 +8883,20 @@ cpu_cgroup_attach(struct cgroup_subsys *ss, struct cgroup *cgrp,

}
}

+static void
+cpu_cgroup_exit(struct cgroup_subsys *ss, struct task_struct *task)
+{

+ /*
+ * cgroup_exit() is called in the copy_process() failure path.
+ * Ignore this case since the task hasn't ran yet, this avoids
+ * trying to poke a half freed task state from generic code.

+ */
+ if (!(task->flags & PF_EXITING))
+ return;

+
+ sched_move_task(task);
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
u64 shareval)

@@ -8952,6 +8969,7 @@ struct cgroup_subsys cpu_cgroup_subsys = {

Reply all

Reply to author

Forward

[PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

shenghui

Peter Zijlstra

shenghui

Peter Zijlstra

shenghui

Miklos Vajna

Miklos Vajna

Peter Zijlstra

Peter Zijlstra

Mike Galbraith

Peter Zijlstra

Miklos Vajna

Peter Zijlstra

Peter Zijlstra

Miklos Vajna

Peter Zijlstra

Peter Zijlstra

Ingo Molnar

Peter Zijlstra

Peter Zijlstra

Miklos Vajna

Miklos Vajna

Yong Zhang

Peter Zijlstra

Peter Zijlstra

Peter Zijlstra

Peter Zijlstra

Miklos Vajna

Mike Galbraith

Peter Zijlstra

Mike Galbraith

Balbir Singh

Paul Menage

Ingo Molnar

Miklos Vajna

Mike Galbraith

Mike Galbraith

Miklos Vajna

Peter Zijlstra

Peter Zijlstra

tip-bot for Mike Galbraith

Oleg Nesterov

Mike Galbraith

tip-bot for Peter Zijlstra