Steffen,
This doesn't look like a null-ptr dereference to me, but rather like a
violated assertion (a BUG() or BUG_ON()). So the first step would be
to look at the source code (kernel/timer.c:488). The upstream git tree
or an LXR site doesn't get me any further in that case as the redhat
kernel you're running is heavily patched compared to the vanilla
kernel. Could you have a look at the patched kernel source and post
the code of the function that contains the aforementioned line?
Thanks,
Arne
> ------------------------------------------------------------------------------
> Try before you buy = See our experts in action!
> The most comprehensive online learning library for Microsoft developers
> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
> Metro Style Apps, more. Free future releases when you subscribe now!
> http://p.sf.net/sfu/learndevnow-dev2
> _______________________________________________
> Iscsitarget-devel mailing list
> Iscsitar...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/iscsitarget-devel
>
I googled "Redhat LXR" and it brought me to a site with RH kernels.
It didn't have 2.6.18-238 yet, but it has 2.6.18-194.
I couldn't find a BUG on line 488; maybe new code was inserted between -194 and -238?
I did find this though:
/**
 * mod_timer - modify a timer's timeout
 * @timer: the timer to be modified
 *
 * mod_timer is a more efficient way to update the expire field of an
 * active timer (if the timer is inactive it will be activated)
 *
 * mod_timer(timer, expires) is equivalent to:
 *
 *     del_timer(timer); timer->expires = expires; add_timer(timer);
 *
 * Note that if there are multiple unserialized concurrent users of the
 * same timer, then mod_timer() is the only safe way to modify the timeout,
 * since add_timer() cannot modify an already running timer.
 *
 * The function returns whether it has modified a pending timer or not.
 * (ie. mod_timer() of an inactive timer returns 0, mod_timer() of an
 * active timer returns 1.)
 */
int mod_timer(struct timer_list *timer, unsigned long expires)
{
	BUG_ON(!timer->function);

	/*
	 * This is a common optimization triggered by the
	 * networking code - if the timer is re-modified
	 * to be the same thing then just return:
	 */
	if (timer->expires == expires && timer_pending(timer))
		return 1;

	return __mod_timer(timer, expires);
}
It looks like maybe the connection and its timer were cleared
without making sure all active timers were purged first?
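To make the assertion concrete, here is a minimal userspace sketch of the failure mode suggested above. It is not kernel code: struct mock_timer, mock_mod_timer() and mock_teardown() are hypothetical stand-ins, and the function returns an error code where the kernel would hit BUG_ON(). It also assumes that teardown zeroes the structure embedding the timer, which is only a guess at this point.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's struct timer_list. */
struct mock_timer {
	void (*function)(unsigned long);	/* callback; NULL means "not set up" */
	unsigned long expires;
	int pending;
};

/* Result codes used instead of crashing, so the sketch can be exercised. */
enum mock_result { MOCK_BUG = -1, MOCK_WAS_INACTIVE = 0, MOCK_WAS_PENDING = 1 };

static void mock_callback(unsigned long data)
{
	(void)data;
}

/* Initialise the timer and give it a callback, as a setup path would. */
static void mock_setup_timer(struct mock_timer *t)
{
	memset(t, 0, sizeof(*t));
	t->function = mock_callback;
}

/* Mirrors the shape of mod_timer(): check the callback first, then (re)arm. */
static int mock_mod_timer(struct mock_timer *t, unsigned long expires)
{
	int was_pending;

	if (!t->function)		/* corresponds to BUG_ON(!timer->function) */
		return MOCK_BUG;

	if (t->expires == expires && t->pending)
		return MOCK_WAS_PENDING;

	was_pending = t->pending;
	t->expires = expires;
	t->pending = 1;
	return was_pending;
}

/* Assumed teardown: wipes the structure that embeds the timer. */
static void mock_teardown(struct mock_timer *t)
{
	memset(t, 0, sizeof(*t));
}
```

Calling mock_mod_timer() after mock_teardown() takes the MOCK_BUG path, which is the userspace analogue of the "Kernel BUG at kernel/timer.c:488" report.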
-Ross
______________________________________________________________________
This e-mail, and any attachments thereto, is intended only for use by
the addressee(s) named herein and may contain legally privileged
and/or confidential information. If you are not the intended recipient
of this e-mail, you are hereby notified that any dissemination,
distribution or copying of this e-mail, and any attachments thereto,
is strictly prohibited. If you have received this e-mail in error,
please immediately notify the sender and permanently delete the
original and any copy or printout thereof.
Thank you for responding. I have retrieved the source code from the patched CentOS kernel:
> >
> > ----------- [cut here ] --------- [please bite here ] ---------
> > Kernel BUG at kernel/timer.c:488
> > invalid opcode: 0000 [1] SMP
>
> Steffen,
>
> This doesn't look like a null-ptr dereference to me, but rather like a
> violated assertion (a BUG() or BUG_ON()). So the first step would be
> to look at the source code (kernel/timer.c:488). The upstream git tree
> or an LXR site doesn't get me any further in that case as the redhat
> kernel you're running is heavily patched compared to the vanilla
> kernel. Could you have a look at the patched kernel source and post
> the code of the function that contains the aforementioned line?
>
Line 488 is the BUG_ON(!timer->function) check:
int mod_timer(struct timer_list *timer, unsigned long expires)
{
	BUG_ON(!timer->function);

	/*
	 * This is a common optimization triggered by the
	 * networking code - if the timer is re-modified
	 * to be the same thing then just return:
	 */
	if (timer->expires == expires && timer_pending(timer))
		return 1;

	return __mod_timer(timer, expires);
}
EXPORT_SYMBOL(mod_timer);
I think I know what is happening.
The connection is being torn down in nthread at the exact same time
wthread is in the middle of processing a NOP packet and trying to
reset a timer that no longer exists.
I think a test of the timer's continued existence in nop_in_tx_end
before executing mod_timer would fix this.
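As a rough illustration of the kind of test proposed here, the following userspace sketch re-arms a timer only when its callback is still set. The names struct mock_timer and rearm_if_valid() are hypothetical and do not exist in IET or the kernel. Note the assumption built into it: the check and the re-arm are two separate steps, so if another context really can clear the timer concurrently, a guard like this only narrows the window unless both steps run under the same lock.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's struct timer_list. */
struct mock_timer {
	void (*function)(unsigned long);	/* NULL once the timer is torn down */
	unsigned long expires;
};

/*
 * Re-arm the timer only if it still looks valid.
 * Returns 1 if the timer was re-armed, 0 if the call was skipped.
 */
static int rearm_if_valid(struct mock_timer *t, unsigned long expires)
{
	if (t == NULL || t->function == NULL)
		return 0;	/* timer is gone: skip instead of asserting */

	t->expires = expires;
	return 1;
}
```

In the real code the equivalent check would sit directly in front of the mod_timer() call.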
-Ross
I don't see how this can happen - AFAICT all timer operations are done
within the nthread, or am I missing something?
Cheers,
Arne
You may be right; I didn't trace out the code path, I only took a
cursory glance at the code.
You're right: nop_in_tx_end is in fact called from cmnd_tx_end, which
is called in nthread.
Well, the BUG is hit because the timer is no longer valid when
mod_timer() is called. mod_timer() is called in nop_in_tx_end(),
conn_reset_nop_timer() and conn_start_nop_timer(), and all of
them run in the nthread context.
-Ross
Steffen,
is there anything else related to IET's NOP timer in the netconsole output?
Arne
> is there anything else related to IET's NOP timer in the netconsole
> output?
>
I have prepared the netconsole output using debug flags:
options iscsi_trgt debug_enable_flags=4
and I have a text dump of this here: http://www3.amherst.edu/~swplotner/debug/trace_mod_timer_03.txt
I was hoping to find a case where the timer argument of nop_in_tx_end turns null - I don't see that.
Steffen
> -----Original Message-----
> From: Ross S. W. Walker [mailto:RWa...@medallion.com]
> Sent: Monday, February 13, 2012 11:48 AM
> To: Steffen Plotner; Arne Redlich
> Cc: iscsitar...@lists.sourceforge.net
> Subject: RE: [Iscsitarget-devel] mod_timer kernel crash
> Sensitivity: Personal
>
> Steffen Plotner [mailto:swpl...@amherst.edu] wrote:
> >
> > Arne,
> >
> > Thank you for responding, I have retrieved the source code
> > from the patched centos kernel:
> >
>
> I think I know what is happening.
>
> The connection is being torn down in nthread at the exact same time
> wthread is in the middle of processing a NOP packet and trying to
> reset a timer that no longer exists.
>
> I think a test of the timer's continued existence in nop_in_tx_end
> before executing mod_timer would fix this.
>
> -Ross
> ______________________________________________________________________
This sounds promising, but I am having a difficult time tracking it down. I placed the following statement in front of the mod_timer() calls in the nthread.c and iscsi.c modules:
dprintk(D_BUGTIMER, "BUGTIMER> &conn->nop_timer.function=%p\n", &conn->nop_timer.function);
The mod_timer() function invokes the following:
BUG_ON(!timer->function);
meaning that it is not conn->nop_timer itself that is null, but rather its member "function". So, somehow the connection's nop_timer.function changes to null. The dprintk output did not reveal any null addresses.
I assume the problem happens between the dprintk and the mod_timer() call, hence validating the nop_timer.function value beforehand won't work. Something else is influencing it.
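One detail may be worth double-checking in the debug statement: an expression of the form &conn->nop_timer.function is the address of the member inside the structure, not the callback pointer stored in it. For a valid conn that address is never null, whatever the member contains. The userspace sketch below shows the difference; struct mock_conn, struct mock_timer and both helpers are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the timer and the connection embedding it. */
struct mock_timer {
	void (*function)(unsigned long);
};

struct mock_conn {
	int id;
	struct mock_timer nop_timer;
};

/* What "&conn->nop_timer.function" evaluates to: the member's address. */
static const void *member_address(const struct mock_conn *conn)
{
	return (const void *)&conn->nop_timer.function;
}

/* What BUG_ON(!timer->function) tests: the value stored in the member. */
static int callback_is_null(const struct mock_conn *conn)
{
	return conn->nop_timer.function == NULL;
}
```

If that reading is right, passing conn->nop_timer.function (without the &) to the debug print would log the value that the assertion actually checks.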
For reference, this is what it looks like in the nop_in_tx_end function you referenced:
static void nop_in_tx_end(struct iscsi_cmnd *cmnd)
{
	struct iscsi_conn *conn = cmnd->conn;
	u32 t;

	if (cmnd->pdu.bhs.ttt == cpu_to_be32(ISCSI_RESERVED_TAG))
		return;

	/*
	 * NOP-In ping issued by the target.
	 * FIXME: Sanitize the NOP timeout earlier, during configuration
	 */
	t = conn->session->target->trgt_param.nop_timeout;

	if (!t || t > conn->session->target->trgt_param.nop_interval) {
		eprintk("Adjusting NOPTimeout of tid %u from %u to %u "
			"(== NOPInterval)\n", conn->session->target->tid,
			t,
			conn->session->target->trgt_param.nop_interval);
		t = conn->session->target->trgt_param.nop_interval;
		conn->session->target->trgt_param.nop_timeout = t;
	}

	dprintk(D_GENERIC, "NOP-In %p, %x: timer %p\n", cmnd, cmnd_ttt(cmnd),
		&cmnd->req->timer);

	set_cmnd_timer_active(cmnd->req);
	dprintk(D_BUGTIMER, "BUGTIMER> &cmnd->req->timer.function=%p\n", &cmnd->req->timer.function);
	mod_timer(&cmnd->req->timer, jiffies + HZ * t);
}
Steffen
Yes, that's also where I am stuck. I've been looking at the log trace
you provided and also tried to reproduce it here. So far no joy.
Arne