Kernel crash in icq_free_icq_rcu

Vivek Goyal

Jan 17, 2012, 3:18:23 PM

to linux kernel mailing list, Tejun Heo, Jens Axboe

Hi Tejun,

With the latest Linus kernel, I see the following crash. I was running some
scripts which create cgroups and launch fio jobs in those groups. In
a separate window I ran a script that changes the IO scheduler on the
device in a loop while IO was happening on the device. After a few
seconds I see the crash below. So far I have tried it twice and reproduced
it both times within the first few seconds.

Thanks
Vivek

[ 94.217015] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[ 94.218004] IP: [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[ 94.218004] PGD 13abda067 PUD 137d52067 PMD 0
[ 94.218004] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 94.218004] CPU 0
[ 94.218004] Modules linked in: [last unloaded: scsi_wait_scan]
[ 94.218004]
[ 94.218004] Pid: 0, comm: swapper/0 Not tainted 3.2.0+ #16 Hewlett-Packard HP xw6600 Workstation/0A9Ch
[ 94.218004] RIP: 0010:[<ffffffff81142fae>] [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[ 94.218004] RSP: 0018:ffff88013fc03de0 EFLAGS: 00010006
[ 94.218004] RAX: ffffffff81e0d020 RBX: ffff880138b3c680 RCX: 00000001801c001b
[ 94.218004] RDX: 00000000003aac1d RSI: ffff880138b3c680 RDI: ffffffff81142fae
[ 94.218004] RBP: ffff88013fc03e10 R08: ffff880137830238 R09: 0000000000000001
[ 94.218004] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 94.218004] R13: ffffea0004e2cf00 R14: ffffffff812f6eb6 R15: 0000000000000246
[ 94.218004] FS: 0000000000000000(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
[ 94.218004] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 94.218004] CR2: 000000000000001c CR3: 00000001395ab000 CR4: 00000000000006f0
[ 94.218004] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 94.218004] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 94.218004] Process swapper/0 (pid: 0, threadinfo ffffffff81e00000, task ffffffff81e0d020)
[ 94.218004] Stack:
[ 94.218004] 0000000000000102 ffff88013fc0db20 ffffffff81e22700 ffff880139500f00
[ 94.218004] 0000000000000001 000000000000000a ffff88013fc03e20 ffffffff812f6eb6
[ 94.218004] ffff88013fc03e90 ffffffff810c8da2 ffffffff81e01fd8 ffff880137830240
[ 94.218004] Call Trace:
[ 94.218004] <IRQ>
[ 94.218004] [<ffffffff812f6eb6>] icq_free_icq_rcu+0x16/0x20
[ 94.218004] [<ffffffff810c8da2>] __rcu_process_callbacks+0x1c2/0x420
[ 94.218004] [<ffffffff810c9038>] rcu_process_callbacks+0x38/0x250
[ 94.218004] [<ffffffff810405ee>] __do_softirq+0xce/0x3e0
[ 94.218004] [<ffffffff8108ed04>] ? clockevents_program_event+0x74/0x100
[ 94.218004] [<ffffffff81090104>] ? tick_program_event+0x24/0x30
[ 94.218004] [<ffffffff8183ed1c>] call_softirq+0x1c/0x30
[ 94.218004] [<ffffffff8100422d>] do_softirq+0x8d/0xc0
[ 94.218004] [<ffffffff81040c3e>] irq_exit+0xae/0xe0
[ 94.218004] [<ffffffff8183f4be>] smp_apic_timer_interrupt+0x6e/0x99
[ 94.218004] [<ffffffff8183e330>] apic_timer_interrupt+0x70/0x80
[ 94.218004] <EOI>
[ 94.218004] [<ffffffff8100a806>] ? mwait_idle+0xb6/0x4c0
[ 94.218004] [<ffffffff8100a7fd>] ? mwait_idle+0xad/0x4c0
[ 94.218004] [<ffffffff810011e6>] cpu_idle+0x96/0xe0
[ 94.218004] [<ffffffff818064df>] rest_init+0x133/0x144
[ 94.218004] [<ffffffff81806425>] ? rest_init+0x79/0x144
[ 94.218004] [<ffffffff81ed4b51>] start_kernel+0x35b/0x366
[ 94.218004] [<ffffffff81ed4321>] x86_64_start_reservations+0x131/0x135
[ 94.218004] [<ffffffff81ed4415>] x86_64_start_kernel+0xf0/0xf7
[ 94.218004] Code: f3 e8 97 cb ee ff 48 c1 e8 0c 48 c1 e0 06 49 01 c5 49 8b 45 00 f6 c4 80 0f 85 99 00 00 00 4c 8b 75 08 9c 41 5f fa e8 12 f8 f4 ff <49> 63 74 24 1c 48 89 df e8 e5 4b f5 ff 41 f7 c7 00 02 00 00 74
[ 94.218004] RIP [<ffffffff81142fae>] kmem_cache_free+0x5e/0x200
[ 94.218004] RSP <ffff88013fc03de0>
[ 94.218004] CR2: 000000000000001c

Jens Axboe

Jan 17, 2012, 3:19:28 PM

to Vivek Goyal, linux kernel mailing list, Tejun Heo

Can you try this?

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 163263d..ee55019 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3117,18 +3117,17 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
  */
 static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	struct cfq_queue *old_cfqq = cfqd->active_queue;
-
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
-	cfq_slice_expired(cfqd, 1);

 	/*
 	 * workload type is changed, don't save slice, otherwise preempt
 	 * doesn't happen
 	 */
-	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
+	if (cfqq_type(cfqd->active_queue) != cfqq_type(cfqq))
 		cfqq->cfqg->saved_workload_slice = 0;

+	cfq_slice_expired(cfqd, 1);
+
 	/*
 	 * Put the new queue at the front of the of the current list,
 	 * so we know that it will be selected next.

--
Jens Axboe

Vivek Goyal

Jan 17, 2012, 3:40:53 PM

to Jens Axboe, linux kernel mailing list, Tejun Heo

Nope, this does not help either. I can still reproduce the issue with the
patch below applied.

>
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 163263d..ee55019 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3117,18 +3117,17 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
>   */
>  static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  {
> -	struct cfq_queue *old_cfqq = cfqd->active_queue;
> -
>  	cfq_log_cfqq(cfqd, cfqq, "preempt");
> -	cfq_slice_expired(cfqd, 1);
>
>  	/*
>  	 * workload type is changed, don't save slice, otherwise preempt
>  	 * doesn't happen
>  	 */
> -	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
> +	if (cfqq_type(cfqd->active_queue) != cfqq_type(cfqq))
>  		cfqq->cfqg->saved_workload_slice = 0;
>
> +	cfq_slice_expired(cfqd, 1);
> +

cfq_slice_expired() will overwrite the value of
cfqq->cfqg->saved_workload_slice. So we need to set it to zero after
cfq_slice_expired.

I was thinking of just saving the workload type of cfqq before
cfq_slice_expired() so that we don't access old cfqq after
cfq_slice_expired(). But then I noticed that we don't drop a cfqq
reference in cfq_slice_expired(). So I am not sure how cfq_slice_expired()
can lead to the queue being freed. That should happen only when the process
has exited and the last request on the queue is finished.

Thanks
Vivek

Jens Axboe

Jan 17, 2012, 3:42:40 PM

to Vivek Goyal, linux kernel mailing list, Tejun Heo

Good point, let's just fix that up afterwards; the use-after-free needs
to go asap.

> I was thinking of just saving the workload type of cfqq before
> cfq_slice_expired() so that we don't access old cfqq after
> cfq_slice_expired(). But then I noticed that we don't drop a cfqq
> reference in cfq_slice_expired(). So I am not sure how cfq_slice_expired()
> can lead to the queue being freed. That should happen only when the process
> has exited and the last request on the queue is finished.

It does, it drops a ref to the cic which in turn drops the active async
and sync queues.

Out for tonight, will pick this up in the morning.

--
Jens Axboe
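
The use-after-free discussed above can be illustrated with a small user-space sketch. It is not the kernel code: `struct queue`, `struct sched`, `queue_put()`, `slice_expired()` and `preempt_changes_type()` are hypothetical stand-ins, and the only assumption carried over from the thread is that the expiry path may drop the last reference to the active queue. Under that assumption, anything needed from the old queue has to be read before the expiry call.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum wl_type { ASYNC_WL, SYNC_NOIDLE_WL, SYNC_WL };

struct queue {
	int ref;              /* reference count */
	enum wl_type type;    /* workload type of this queue */
};

struct sched {
	struct queue *active; /* currently active queue, may be NULL */
};

/* Drop one reference; free the queue when the last one goes away. */
static void queue_put(struct queue *q)
{
	if (--q->ref == 0)
		free(q);
}

/* Models an expiry path that may release the last reference. */
static void slice_expired(struct sched *s)
{
	struct queue *q = s->active;

	s->active = NULL;
	if (q)
		queue_put(q);
}

/*
 * Safe ordering: read the type while the queue is still referenced,
 * then expire. Returns 1 if the workload type changed, 0 otherwise.
 */
static int preempt_changes_type(struct sched *s, const struct queue *newq)
{
	enum wl_type old_type;

	assert(s->active != NULL);
	old_type = s->active->type; /* captured before expiry */

	slice_expired(s);           /* the old queue may be freed here */
	return old_type != newq->type;
}
```

Keeping only the copied `old_type` after the call means the result does not depend on whether the old queue still exists.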

Vivek Goyal

Jan 17, 2012, 3:58:16 PM

to Jens Axboe, linux kernel mailing list, Tejun Heo, Chris Mason

Ok, I see it now. Thanks.

I modified your patch a bit. It does not seem to solve my problem but
might help with Chris Mason's boot issue.

Chris, can you please give it a try.

Thanks
Vivek

---
block/cfq-iosched.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c 2012-01-18 02:49:33.000000000 -0500
+++ linux-2.6/block/cfq-iosched.c 2012-01-18 02:50:31.000000000 -0500
@@ -3117,7 +3117,7 @@ cfq_should_preempt(struct cfq_data *cfqd
  */
 static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	struct cfq_queue *old_cfqq = cfqd->active_queue;
+	enum wl_type_t old_cfqq_wl_type = cfqq_type(cfqd->active_queue);

 	cfq_log_cfqq(cfqd, cfqq, "preempt");
 	cfq_slice_expired(cfqd, 1);
@@ -3126,7 +3126,7 @@ static void cfq_preempt_queue(struct cfq
 	 * workload type is changed, don't save slice, otherwise preempt
 	 * doesn't happen
 	 */
-	if (cfqq_type(old_cfqq) != cfqq_type(cfqq))
+	if (old_cfqq_wl_type != cfqq_type(cfqq))
 		cfqq->cfqg->saved_workload_slice = 0;

 	/*

Vivek Goyal

Jan 17, 2012, 4:01:14 PM

to Jens Axboe, linux kernel mailing list, Tejun Heo, Chris Mason

On Tue, Jan 17, 2012 at 03:58:16PM -0500, Vivek Goyal wrote:

[..]

Oops, old mail id of chris. Fixing it now.

Thanks
Vivek

Tejun Heo

Jan 17, 2012, 4:48:34 PM

to Vivek Goyal, linux kernel mailing list, Jens Axboe

On Tue, Jan 17, 2012 at 03:18:23PM -0500, Vivek Goyal wrote:
> Hi Tejun,
>
> With the latest Linus kernel, I see the following crash. I was running some
> scripts which create cgroups and launch fio jobs in those groups. In
> a separate window I ran a script that changes the IO scheduler on the
> device in a loop while IO was happening on the device. After a few
> seconds I see the crash below. So far I have tried it twice and reproduced
> it both times within the first few seconds.

Does the following patch make any difference?

diff --git a/block/elevator.c b/block/elevator.c
index 91e18f8..b5a94ed 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -105,6 +105,7 @@ int elv_rq_merge_ok(struct request *rq, struct bio *bio)
 	if (bio_integrity(bio) != blk_integrity_rq(rq))
 		return 0;

+	return 0;
 	if (!elv_iosched_allow_merge(rq, bio))
 		return 0;

Vivek Goyal

Jan 17, 2012, 5:07:15 PM

to Tejun Heo, linux kernel mailing list, Jens Axboe

On Tue, Jan 17, 2012 at 01:48:34PM -0800, Tejun Heo wrote:
> On Tue, Jan 17, 2012 at 03:18:23PM -0500, Vivek Goyal wrote:
> > Hi Tejun,
> >
> > With the latest Linus kernel, I see the following crash. I was running some
> > scripts which create cgroups and launch fio jobs in those groups. In
> > a separate window I ran a script that changes the IO scheduler on the
> > device in a loop while IO was happening on the device. After a few
> > seconds I see the crash below. So far I have tried it twice and reproduced
> > it both times within the first few seconds.
>
> Does the following patch make any difference?

Nope. Still can reproduce the issue.

Thanks
Vivek

Shaohua Li

Jan 17, 2012, 8:01:00 PM

to Vivek Goyal, Tejun Heo, linux kernel mailing list, Jens Axboe

2012/1/18 Vivek Goyal <vgo...@redhat.com>:

> On Tue, Jan 17, 2012 at 01:48:34PM -0800, Tejun Heo wrote:
>> On Tue, Jan 17, 2012 at 03:18:23PM -0500, Vivek Goyal wrote:
>> > Hi Tejun,
>> >
>> > With the latest Linus kernel, I see the following crash. I was running some
>> > scripts which create cgroups and launch fio jobs in those groups. In
>> > a separate window I ran a script that changes the IO scheduler on the
>> > device in a loop while IO was happening on the device. After a few
>> > seconds I see the crash below. So far I have tried it twice and reproduced
>> > it both times within the first few seconds.
>>
>> Does the following patch make any difference?
>
> Nope. Still can reproduce the issue.

I had a similar issue too. It looks like ioc_create_icq finds an icq (in
ioc_lookup_icq) that is already being RCU-freed (it has __rcu_icq_cache set).

Tejun Heo

Jan 17, 2012, 8:03:23 PM

to Shaohua Li, Vivek Goyal, linux kernel mailing list, Jens Axboe

Hello,

On Wed, Jan 18, 2012 at 09:01:00AM +0800, Shaohua Li wrote:
> 2012/1/18 Vivek Goyal <vgo...@redhat.com>:

> >> Does the following patch make any difference?
> >
> > Nope. Still can reproduce the issue.
>
> I had a similar issue too. It looks like ioc_create_icq finds an icq (in
> ioc_lookup_icq) that is already being RCU-freed (it has __rcu_icq_cache set).

Vivek is seeing the problem while switching elevators. Are you too?
Or is it during normal operation?

Thanks.

--
tejun

Shaohua Li

Jan 17, 2012, 8:05:26 PM

to Tejun Heo, Vivek Goyal, linux kernel mailing list, Jens Axboe

2012/1/18 Tejun Heo <t...@kernel.org>:

> Hello,
>
> On Wed, Jan 18, 2012 at 09:01:00AM +0800, Shaohua Li wrote:
>> 2012/1/18 Vivek Goyal <vgo...@redhat.com>:
>> >> Does the following patch make any difference?
>> >
>> > Nope. Still can reproduce the issue.
>>
>> I had a similar issue too. It looks like ioc_create_icq finds an icq (in
>> ioc_lookup_icq) that is already being RCU-freed (it has __rcu_icq_cache set).
>
> Vivek is seeing the problem while switching elevators.  Are you too?
> Or is it during normal operation?

Same here. I had some problems while debugging my ioscheduler, but
eventually found that even switching between cfq and noop can trigger the oops.

Tejun Heo

Jan 17, 2012, 8:11:12 PM

to Shaohua Li, Vivek Goyal, linux kernel mailing list, Jens Axboe

On Wed, Jan 18, 2012 at 09:05:26AM +0800, Shaohua Li wrote:
> > Vivek is seeing the problem while switching elevators.  Are you too?
> > Or is it during normal operation?
> Same here. I had some problems while debugging my ioscheduler, but
> eventually found that even switching between cfq and noop can trigger the oops.

Hmmm... maybe quiescing isn't working as expected and kmem cache is
being destroyed with live icq's. I'll try to reproduce it.

Thanks.

--
tejun

Shaohua Li

Jan 17, 2012, 8:30:53 PM

to Tejun Heo, Vivek Goyal, linux kernel mailing list, Jens Axboe

On Tue, 2012-01-17 at 17:11 -0800, Tejun Heo wrote:
> On Wed, Jan 18, 2012 at 09:05:26AM +0800, Shaohua Li wrote:
> > > Vivek is seeing the problem while switching elevators. Are you too?
> > > Or is it during normal operation?
> > Same here. I had some problems while debugging my ioscheduler, but
> > eventually found that even switching between cfq and noop can trigger the oops.
>
> Hmmm... maybe quiescing isn't working as expected and kmem cache is
> being destroyed with live icq's. I'll try to reproduce it.

This debug patch seems to fix it for me.

diff --git a/block/blk-core.c b/block/blk-core.c
index e6c05a9..c6a8ef5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -872,11 +872,11 @@ retry:
 	spin_unlock_irq(q->queue_lock);

 	/* create icq if missing */
-	if (unlikely(et->icq_cache && !icq))
+	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
 		icq = ioc_create_icq(q, gfp_mask);

 	/* rqs are guaranteed to have icq on elv_set_request() if requested */
-	if (likely(!et->icq_cache || icq))
+	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
 		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);

 	if (unlikely(!rq)) {

Shaohua Li

Jan 17, 2012, 9:26:38 PM

to Tejun Heo, Vivek Goyal, linux kernel mailing list, Jens Axboe

2012/1/18 Shaohua Li <shaoh...@intel.com>:

> On Tue, 2012-01-17 at 17:11 -0800, Tejun Heo wrote:
>> On Wed, Jan 18, 2012 at 09:05:26AM +0800, Shaohua Li wrote:
>> > > Vivek is seeing the problem while switching elevators.  Are you too?
>> > > Or is it during normal operation?
>> > Same here. I had some problems while debugging my ioscheduler, but
>> > eventually found that even switching between cfq and noop can trigger the oops.
>>
>> Hmmm... maybe quiescing isn't working as expected and kmem cache is
>> being destroyed with live icq's.  I'll try to reproduce it.
> This debug patch seems to fix it for me.
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index e6c05a9..c6a8ef5 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -872,11 +872,11 @@ retry:
>  	spin_unlock_irq(q->queue_lock);
>
>  	/* create icq if missing */
> -	if (unlikely(et->icq_cache && !icq))
> +	if (unlikely(et->icq_cache && !icq && (rw_flags & REQ_ELVPRIV)))
>  		icq = ioc_create_icq(q, gfp_mask);
>
>  	/* rqs are guaranteed to have icq on elv_set_request() if requested */
> -	if (likely(!et->icq_cache || icq))
> +	if (likely(!et->icq_cache || icq || !(rw_flags & REQ_ELVPRIV)))
>  		rq = blk_alloc_request(q, icq, rw_flags, gfp_mask);
>
>  	if (unlikely(!rq)) {

This passed my test, but I don't understand why it helps ...
blk_alloc_request doesn't use the icq if REQ_ELVPRIV isn't set, so the code
has a problem for sure.
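
The two conditions changed by the debug patch can be written out as plain predicates to see what they do for a request without REQ_ELVPRIV. This is only a sketch: the function names and boolean parameters are hypothetical, and the expressions simply mirror the `+` lines of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Mirrors the patched "create icq if missing" condition. */
static bool should_create_icq(bool has_icq_cache, bool have_icq, bool elvpriv)
{
	return has_icq_cache && !have_icq && elvpriv;
}

/* Mirrors the patched condition guarding request allocation. */
static bool may_alloc_request(bool has_icq_cache, bool have_icq, bool elvpriv)
{
	return !has_icq_cache || have_icq || !elvpriv;
}

/*
 * For requests without elevator-private data, the patched conditions
 * never create an icq and never block allocation on a missing icq.
 */
static void check_non_elvpriv_path(void)
{
	for (int cache = 0; cache <= 1; cache++) {
		for (int icq = 0; icq <= 1; icq++) {
			assert(!should_create_icq(cache, icq, false));
			assert(may_alloc_request(cache, icq, false));
		}
	}
}
```

With the pre-patch conditions, the `elvpriv` argument would not appear at all, so a request that never uses an icq could still trigger `ioc_create_icq`.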

Shaohua Li

Jan 17, 2012, 11:23:24 PM

to Tejun Heo, Vivek Goyal, linux kernel mailing list, Jens Axboe

2012/1/18 Shaohua Li <shaoh...@intel.com>:

OK, I found the reason:

CPU 0                                    CPU 1
get_request
  ioc_create_icq
    allocate icq with cfq ioscheduler
                                         switch to noop
    insert icq to ioc
                                         switch to cfq
                                           ioc_clear_queue

In ioc_clear_queue, the ioc has the icq in its list, but the current elevator
is noop, so ioc_exit_icq will get a NULL et->icq_cache.

I hit a kmem_cache_alloc_node oops too, because the et in ioc_create_icq
might not be the et of get_request (we drop the lock before calling
ioc_create_icq).

put_io_context has a workqueue to run ioc_exit_icq; I suspect there is a
problem there too.
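
The interleaving described in this message can be replayed deterministically in a single-threaded user-space model. Everything here is a simplified assumption for illustration: `struct cache`, `struct elevator_type`, `struct icq`, `struct queue` and the helper functions are hypothetical and do not match the kernel's definitions. The sketch only shows how an icq allocated from one elevator's cache can end up being torn down while the current elevator has no cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model; not the kernel's structures. */
struct cache { const char *name; };

struct elevator_type {
	const char *name;
	struct cache *icq_cache; /* NULL for schedulers without icqs, e.g. noop */
};

struct icq {
	struct cache *allocated_from; /* cache the icq came from */
	int linked;                   /* 1 once inserted into the ioc */
};

struct queue {
	struct elevator_type *et;     /* current elevator of the queue */
};

static struct cache cfq_cache = { "cfq_icq" };
static struct elevator_type cfq = { "cfq", &cfq_cache };
static struct elevator_type noop = { "noop", NULL };

/* CPU 0: the elevator was sampled under the lock; allocation runs unlocked. */
static struct icq create_icq_unlocked(const struct elevator_type *sampled)
{
	struct icq icq = { sampled->icq_cache, 0 };

	return icq;
}

/* CPU 1: an elevator switch replaces q->et. */
static void switch_elevator(struct queue *q, struct elevator_type *to)
{
	q->et = to;
}

/* Teardown frees with whatever cache the current elevator has. */
static struct cache *cache_used_for_free(const struct queue *q)
{
	return q->et->icq_cache;
}

/* Replays the interleaving; returns 1 if the free would see a NULL cache. */
static int replay_race(void)
{
	struct queue q = { &cfq };
	struct elevator_type *sampled = q.et;          /* CPU 0: get_request */
	struct icq icq;

	assert(sampled == &cfq);
	icq = create_icq_unlocked(sampled);            /* CPU 0: allocate with cfq */

	switch_elevator(&q, &noop);                    /* CPU 1: switch to noop */
	icq.linked = 1;                                /* CPU 0: insert icq to ioc */

	/* CPU 1: switching back to cfq clears the ioc while noop is current. */
	return icq.linked && icq.allocated_from != NULL &&
	       cache_used_for_free(&q) == NULL;
}
```

In this model the mismatch comes from using the elevator sampled before the lock was dropped; re-checking the current elevator at insert time would be one way to make the model consistent, though the thread does not say which fix was adopted.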
