[Iscsitarget-devel] mod_timer kernel crash

Steffen Plotner

unread,

Feb 12, 2012, 10:59:05 PM2/12/12

to iscsitar...@lists.sourceforge.net

Hello,

I have encountered a mod_timer kernel crash under the following condition.

Using ietadm, I modify a target's NOP-in as in (the intial configuration was set to NOPInterval=0 and NOPTimeout=0):

ietadm --op update --tid=1 --params=NOPInterval=6

ietadm --op update --tid=1 --params=NOPTimeout=8

2 hardware platforms did not crash and where fine (they are lightly loaded), 1 hardware platform crashed immediatelly after setting these parameters on a target and I finally got one crash in a vmware virtual machine running iscsi-target for testing purposes.

It seems that the crash is related to the number of targets and possibly the number of active connections. On the hardware we have 52 targets with 372 connections and on the virtual side we have 43 targets with 164 connections.

I was able to produce the crash on the hardware every time I made the config change (thank goodness for HA and the targets failed over to node2 without this change in place "yet" - thankfully).

After configuring netconsole in the virtual machine, I was able to get a complete kernel bug trace as follows (the bug trace ends like this also on the hardware (I only have a screenshot of that and nothing in syslog)).

It appears to be much "harder" to produce this problem in the virtual machine - basically I let the thing sit over night being configured with NOPInterval=0 and NOPTimeout=0 and 24 hours later I configure it to 6 and 8 and boom. Doing it immediatelly does not seem to produce this problem at all in the virtual machine.

There are 3 places in which mod_timer is invoked:

iscsi.c:        mod_timer(&cmnd->req->timer, jiffies + HZ * t);
nthread.c:              mod_timer(&conn->nop_timer,
nthread.c:      mod_timer(&conn->nop_timer,

Thoughts:

My sense is that the &conn->no_timer somehow is null when mod_timer is invoked, is that because the connection has disappeared?

The targets involved are configured at the initiator (vsphere4) with multipathing. I noticed, for example, when changing the NOP settings to 0 that connections get thrown out, of course the initiator reconnects and all is ok.

Any ideas?

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at kernel/timer.c:488
invalid opcode: 0000 [1] SMP
last sysfs file: /block/ram0/dev
CPU 0
Modules linked in: netconsole iscsi_trgt(U) autofs4 ip_conntrack_netbios_ns ipt_REJECT xt_tcpudp xt_state ip_conntrack nfnetlink iptable_filter ip_tables x_tables vsock(U) vmci(U) vmmemctl(U) acpiphp dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac lp floppy i2c_piix4 pcspkr sg i2c_core serio_raw e1000 parport_pc shpchp parport tpm_tis ide_cd tpm tpm_bios cdrom dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 5506, comm: istd30 Tainted: G      2.6.18-238.19.1.el5 #1
RIP: 0010:[<ffffffff80048658>] [<ffffffff80048658>] mod_timer+0x7/0x28
RSP: 0018:ffff81001309dde8 EFLAGS: 00010246
RAX: 0000000104d90a62 RBX: ffff8100138724c8 RCX: 0000003000000001
RDX: 0000000000000001 RSI: 0000000104d90e4a RDI: ffff810013386088
RBP: ffff810013386000 R08: ffff81001309c000 R09: 00000000000001b6
R10: ffff810013b816c0 R11: ffff81001309dddc R12: 00000000fffffff5
R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffffffff80426000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000002f94890 CR3: 0000000004256000 CR4: 00000000000006e0
Process istd30 (pid: 5506, threadinfo ffff81001309c000, task ffff8100130487a0)
Stack: ffffffff8847b173 ffff810013669000 ffff8100136691a0 ffff810004138000
000000001c8b5840 ffff810013669000 0000000400000001 0000100000000000
ffff8100040d63f8 ffffffff802582f8 ffff810012a60d00 ffff810001364680
Call Trace:
[<ffffffff8847b173>] :iscsi_trgt:istd+0x5a1/0x1056
[<ffffffff802582f8>] tcp_sendpage+0x0/0x64c
[<ffffffff8847abd2>] :iscsi_trgt:istd+0x0/0x1056
[<ffffffff800a26e3>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032b26>] kthread+0xfe/0x132
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff800a26e3>] keventd_create_kthread+0x0/0xc4
[<ffffffff80032a28>] kthread+0x0/0x132
[<ffffffff8005dfa7>] child_rip+0x0/0x11

Code: 0f 0b 68 34 5c 2b 80 c2 e8 01 48 39 77 10 75 06 48 83 3f 00
RIP [<ffffffff80048658>] mod_timer+0x7/0x28
RSP <ffff81001309dde8>
<0>Kernel panic - not syncing: Fatal exception

Steffen

_______________________________________________________________________________________________

Steffen Plotner Amherst College Tel (413) 542-2348

Systems/Network Administrator/Programmer PO BOX 5000 Fax (413) 542-2626

Systems & Networking Amherst, MA 01002-5000 swpl...@amherst.edu

Arne Redlich

unread,

Feb 13, 2012, 10:52:46 AM2/13/12

to Steffen Plotner, iscsitar...@lists.sourceforge.net

2012/2/13 Steffen Plotner <swpl...@amherst.edu>:

Steffen,

This doesn't look like a null-ptr dereference to me, but rather like a
violated assertion (a BUG() or BUG_ON()). So the first step would be
to look at the source code (kernel/timer.c:488). The upstream git tree
or an LXR site doesn't get me any further in that case as the redhat
kernel you're running is heavily patched compared to the vanilla
kernel. Could you have a look at the patched kernel source and post
the code of the function that contains the aforementioned line?

Thanks,
Arne

> ------------------------------------------------------------------------------
> Try before you buy = See our experts in action!
> The most comprehensive online learning library for Microsoft developers
> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
> Metro Style Apps, more. Free future releases when you subscribe now!
> http://p.sf.net/sfu/learndevnow-dev2
> _______________________________________________
> Iscsitarget-devel mailing list
> Iscsitar...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/iscsitarget-devel
>

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2
_______________________________________________
Iscsitarget-devel mailing list
Iscsitar...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/iscsitarget-devel

Ross S. W. Walker

unread,

Feb 13, 2012, 11:08:08 AM2/13/12

to Arne Redlich, Steffen Plotner, iscsitar...@lists.sourceforge.net

Arne Redlich [mailto:arne.r...@googlemail.com] wrote:
> 2012/2/13 Steffen Plotner <swpl...@amherst.edu>:

> >
> > ----------- [cut here ] --------- [please bite here ] ---------
> > Kernel BUG at kernel/timer.c:488
> > invalid opcode: 0000 [1] SMP
>
> Steffen,
>
> This doesn't look like a null-ptr dereference to me, but rather like a
> violated assertion (a BUG() or BUG_ON()). So the first step would be
> to look at the source code (kernel/timer.c:488). The upstream git tree
> or an LXR site doesn't get me any further in that case as the redhat
> kernel you're running is heavily patched compared to the vanilla
> kernel. Could you have a look at the patched kernel source and post
> the code of the function that contains the aforementioned line?

I googled Redhat LXR and it brought me to a site with RH kernels.

It didn't have 2.6.18-238 yet but it has 2.6.18-194.

I couldn't find a BUG on 488, maybe new code was inserted between 194-238?

I did find this though:

399 * mod_timer - modify a timer's timeout
400 * @timer: the timer to be modified
401 *
402 * mod_timer is a more efficient way to update the expire field of an
403 * active timer (if the timer is inactive it will be activated)
404 *
405 * mod_timer(timer, expires) is equivalent to:
406 *
407 * del_timer(timer); timer->expires = expires; add_timer(timer);
408 *
409 * Note that if there are multiple unserialized concurrent users of the
410 * same timer, then mod_timer() is the only safe way to modify the timeout,
411 * since add_timer() cannot modify an already running timer.
412 *
413 * The function returns whether it has modified a pending timer or not.
414 * (ie. mod_timer() of an inactive timer returns 0, mod_timer() of an
415 * active timer returns 1.)
416 */
417int mod_timer(struct timer_list *timer, unsigned long expires)
418{
419 BUG_ON(!timer->function);
420
421 /*
422 * This is a common optimization triggered by the
423 * networking code - if the timer is re-modified
424 * to be the same thing then just return:
425 */
426 if (timer->expires == expires && timer_pending(timer))
427 return 1;
428
429 return __mod_timer(timer, expires);
430}

It looks like maybe the connection and it's timer were cleared
without making sure all active timers were purged first?

-Ross
______________________________________________________________________
This e-mail, and any attachments thereto, is intended only for use by
the addressee(s) named herein and may contain legally privileged
and/or confidential information. If you are not the intended recipient
of this e-mail, you are hereby notified that any dissemination,
distribution or copying of this e-mail, and any attachments thereto,
is strictly prohibited. If you have received this e-mail in error,
please immediately notify the sender and permanently delete the
original and any copy or printout thereof.

Steffen Plotner

unread,

Feb 13, 2012, 11:22:54 AM2/13/12

to Arne Redlich, iscsitar...@lists.sourceforge.net

Arne,

Thank you for responding, I have retrieved the source code from the patched centos kernel:

> >
> > ----------- [cut here ] --------- [please bite here ] ---------
> > Kernel BUG at kernel/timer.c:488
> > invalid opcode: 0000 [1] SMP
>
> Steffen,
>
> This doesn't look like a null-ptr dereference to me, but rather like a
> violated assertion (a BUG() or BUG_ON()). So the first step would be
> to look at the source code (kernel/timer.c:488). The upstream git tree
> or an LXR site doesn't get me any further in that case as the redhat
> kernel you're running is heavily patched compared to the vanilla
> kernel. Could you have a look at the patched kernel source and post
> the code of the function that contains the aforementioned line?
>

The line number 488 is BUG_ON(!timer->function)

int mod_timer(struct timer_list *timer, unsigned long expires)

{
BUG_ON(!timer->function);

/*

* This is a common optimization triggered by the

* networking code - if the timer is re-modified

* to be the same thing then just return:

*/

if (timer->expires == expires && timer_pending(timer))

return 1;

return __mod_timer(timer, expires);
}

EXPORT_SYMBOL(mod_timer);

------------------------------------------------------------------------------

Ross S. W. Walker

unread,

Feb 13, 2012, 11:48:06 AM2/13/12

to Steffen Plotner, Arne Redlich, iscsitar...@lists.sourceforge.net

Steffen Plotner [mailto:swpl...@amherst.edu] wrote:
>
> Arne,
>
> Thank you for responding, I have retrieved the source code
> from the patched centos kernel:
>

I think I know what is happenning.

The connection is being torn down in nthread at the exact same time
wthread is in the middle of processing a NOP packet and trying to
reset a timer that no longer exists.

I think a test of the timer's continued existence in nop_in_tx_end
before executing mod_timer would fix this.

-Ross
______________________________________________________________________
This e-mail, and any attachments thereto, is intended only for use by
the addressee(s) named herein and may contain legally privileged
and/or confidential information. If you are not the intended recipient
of this e-mail, you are hereby notified that any dissemination,
distribution or copying of this e-mail, and any attachments thereto,
is strictly prohibited. If you have received this e-mail in error,
please immediately notify the sender and permanently delete the
original and any copy or printout thereof.

Arne Redlich

unread,

Feb 14, 2012, 5:02:32 PM2/14/12

to Ross S. W. Walker, iscsitar...@lists.sourceforge.net

2012/2/13 Ross S. W. Walker <RWa...@medallion.com>:

> Steffen Plotner [mailto:swpl...@amherst.edu] wrote:
>>
>> Arne,
>>
>> Thank you for responding, I have retrieved the source code
>> from the patched centos kernel:
>>
>
> I think I know what is happenning.
>
> The connection is being torn down in nthread at the exact same time
> wthread is in the middle of processing a NOP packet and trying to
> reset a timer that no longer exists.

I don't see how this can happen - AFAICT all timer operations are done
within the nthread, or am I missing something?

Cheers,
Arne

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!

The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!

http://p.sf.net/sfu/learndevnow-d2d

Ross S. W. Walker

unread,

Feb 14, 2012, 5:32:04 PM2/14/12

to Arne Redlich, iscsitar...@lists.sourceforge.net

Arne Redlich [mailto:arne.r...@googlemail.com] wrote:

> 2012/2/13 Ross S. W. Walker <RWa...@medallion.com>:
> > Steffen Plotner [mailto:swpl...@amherst.edu] wrote:
> >>
> >> Arne,
> >>
> >> Thank you for responding, I have retrieved the source code
> >> from the patched centos kernel:
> >>
> >
> > I think I know what is happenning.
> >
> > The connection is being torn down in nthread at the exact same time
> > wthread is in the middle of processing a NOP packet and trying to
> > reset a timer that no longer exists.
>
> I don't see how this can happen - AFAICT all timer operations are done
> within the nthread, or am I missing something?

You may be right, I didn't trace out the code path just a cursory
glance at the code.

Your right, in fact nop_in_tx_end is called from cmnd_tx_end which
is called in nthread.

Well the BUG is hit cause the timer is no longer valid when
mod_timer() is called which is called in nop_in_tx_end(),
conn_reset_nop_timer() and conn_start_nop_timer(), all of
them are run in the nthread context.

-Ross

______________________________________________________________________
This e-mail, and any attachments thereto, is intended only for use by
the addressee(s) named herein and may contain legally privileged
and/or confidential information. If you are not the intended recipient
of this e-mail, you are hereby notified that any dissemination,
distribution or copying of this e-mail, and any attachments thereto,
is strictly prohibited. If you have received this e-mail in error,
please immediately notify the sender and permanently delete the
original and any copy or printout thereof.

Arne Redlich

unread,

Feb 14, 2012, 5:39:11 PM2/14/12

to Ross S. W. Walker, iscsitar...@lists.sourceforge.net

2012/2/14 Ross S. W. Walker <RWa...@medallion.com>:

> Arne Redlich [mailto:arne.r...@googlemail.com] wrote:
>> 2012/2/13 Ross S. W. Walker <RWa...@medallion.com>:
>> > Steffen Plotner [mailto:swpl...@amherst.edu] wrote:
>> >>
>> >> Arne,
>> >>
>> >> Thank you for responding, I have retrieved the source code
>> >> from the patched centos kernel:
>> >>
>> >
>> > I think I know what is happenning.
>> >
>> > The connection is being torn down in nthread at the exact same time
>> > wthread is in the middle of processing a NOP packet and trying to
>> > reset a timer that no longer exists.
>>
>> I don't see how this can happen - AFAICT all timer operations are done
>> within the nthread, or am I missing something?
>
> You may be right, I didn't trace out the code path just a cursory
> glance at the code.
>
> Your right, in fact nop_in_tx_end is called from cmnd_tx_end which
> is called in nthread.
>
> Well the BUG is hit cause the timer is no longer valid when
> mod_timer() is called which is called in nop_in_tx_end(),
> conn_reset_nop_timer() and conn_start_nop_timer(), all of
> them are run in the nthread context.

Steffen,

is there anything else related to IET's NOP timer in the netconsole output?

Arne

Steffen Plotner

unread,

Feb 14, 2012, 10:31:29 PM2/14/12

to Arne Redlich, Ross S. W. Walker, iscsitar...@lists.sourceforge.net

Hi,

> is there anything else related to IET's NOP timer in the netconsole
> output?
>

I have prepared the netconsole output using debug flags:
options iscsi_trgt debug_enable_flags=4

and I have a text dump of this here: http://www3.amherst.edu/~swplotner/debug/trace_mod_timer_03.txt

I was hoping to find a case where the timer argument of nop_in_tx_end turns null - I don't see that.

Steffen
------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/

Steffen Plotner

unread,

Feb 21, 2012, 10:07:48 PM2/21/12

to Ross S. W. Walker, Arne Redlich, iscsitar...@lists.sourceforge.net

Hi,
_______________________________________
To: Ross S. W. Walker; Arne Redlich
Cc: iscsitar...@lists.sourceforge.net
Subject: RE: [Iscsitarget-devel] mod_timer kernel crash

> -----Original Message-----
> From: Ross S. W. Walker [mailto:RWa...@medallion.com]
> Sent: Monday, February 13, 2012 11:48 AM
> To: Steffen Plotner; Arne Redlich
> Cc: iscsitar...@lists.sourceforge.net
> Subject: RE: [Iscsitarget-devel] mod_timer kernel crash
> Sensitivity: Personal
>
> Steffen Plotner [mailto:swpl...@amherst.edu] wrote:
> >
> > Arne,
> >
> > Thank you for responding, I have retrieved the source code
> > from the patched centos kernel:
> >
>
> I think I know what is happenning.
>
> The connection is being torn down in nthread at the exact same time
> wthread is in the middle of processing a NOP packet and trying to
> reset a timer that no longer exists.
>
> I think a test of the timer's continued existence in nop_in_tx_end
> before executing mod_timer would fix this.
>
> -Ross
> ______________________________________________________________________

This sounds promissing - I am having a difficult time tracking it down. I placed the following statements in-front of mod_timer() calls in the nthread.c and iscsi.c modules:

dprintk(D_BUGTIMER, "BUGTIMER> &conn->nop_timer.function=%p\n", &conn->nop_timer.function);

The mod_timer() function invokes the following:

BUG_ON(!timer->function);

meaning that is not the conn->nop_timer that is null, but rather the element "function" is null. So, somehow the connection's nop_timer.function changes to null. The dprintk output did not reveal any null addresses.

I assume the problem happens between dprintk and mod_timer function, hence validating the nop_timer.function value won't work. Something else is influencing it.

For reference, this is what is looks like under your referenced nop_in_tx_end function:

static void nop_in_tx_end(struct iscsi_cmnd *cmnd)
{
struct iscsi_conn *conn = cmnd->conn;
u32 t;

if (cmnd->pdu.bhs.ttt == cpu_to_be32(ISCSI_RESERVED_TAG))
return;

/*
* NOP-In ping issued by the target.
* FIXME: Sanitize the NOP timeout earlier, during configuration
*/
t = conn->session->target->trgt_param.nop_timeout;

if (!t || t > conn->session->target->trgt_param.nop_interval) {
eprintk("Adjusting NOPTimeout of tid %u from %u to %u "
"(== NOPInterval)\n", conn->session->target->tid,
t,
conn->session->target->trgt_param.nop_interval);
t = conn->session->target->trgt_param.nop_interval;
conn->session->target->trgt_param.nop_timeout = t;
}

dprintk(D_GENERIC, "NOP-In %p, %x: timer %p\n", cmnd, cmnd_ttt(cmnd),
&cmnd->req->timer);

set_cmnd_timer_active(cmnd->req);

dprintk(D_BUGTIMER, "BUGTIMER> &cmnd->req->timer.function=%p\n", &cmnd->req->timer.function);

mod_timer(&cmnd->req->timer, jiffies + HZ * t);
}

Steffen

------------------------------------------------------------------------------
Virtualization & Cloud Management Using Capacity Planning
Cloud computing makes use of virtualization - but cloud computing
also focuses on allowing computing to be delivered as a service.
http://www.accelacomm.com/jaw/sfnl/114/51521223/

Arne Redlich

unread,

Feb 22, 2012, 2:28:11 PM2/22/12

to Steffen Plotner, iscsitar...@lists.sourceforge.net, Ross S. W. Walker

2012/2/22 Steffen Plotner <swpl...@amherst.edu>:

Yes, that's also where I am stuck. I've been looking at the log trace
you provided and also tried to reproduce it here. So far no joy.

Arne

Reply all

Reply to author

Forward