This tries to balance the two by reducing the
wait to 3 seconds from 15. If we have waited 3 secs
to send a pdu then perf is already taking a hit so
grabbing the session lock again is not going to make a
difference. And waiting up to 3 secs for the xmit thread
to flush and suspend is not that long (at least a lot better
than 15).
Cool. I'm running with 1 sec here, but the principle is
the same. Especially for a multipathed setup you really
want this.
Oh, what about making this setting dependent on the
transport class timeout?
Worst case, sendpage/sendmsg will now take up to 3 seconds
before it even returns an error.
So having a transport class timeout lower than that
is pointless: we have no means of terminating
a call stuck in sendpage/sendmsg, and the
transport class will always terminate the command.
So we should either limit the transport class timeout
to not being able to be set lower than 3 seconds or
make this timeout set by the transport class timeout.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
ha...@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Good point! Let me back out my patch, and do some more digging on why I
cannot just do
signal(xmit thread)
to wake it from sendpage/sendmsg right away.
If I cannot get that to work, then I will send a patch to implement what
you describe.
I'm running with open-iscsi.git HEAD + the check suspend bit patch +
the wake xmit on error patch. If I disconnect the cable on the
initiator side (even while not running IO), I see that after sending
the signal, the iscsi_q_XX thread reaches 100% cpu. I ran it over
several 1 Gb/10 Gb drivers and got the same results.
If I remove the wake xmit on error patch, I don't see this behavior.
Erez
Shoot, I have been running the xmit wakeup and suspend bit patch here
fine. Let me do some more testing.
Is this something you always hit? Could you send me the final patch you
ended up using?
I see this every time. Note that I'm not running with
linux-2.6-iscsi.git. I'm using the open-iscsi.git tree + the 2 patches
that I took without any change (using git-show) from the
linux-2.6-iscsi.git tree. Which tree did you test it on?
I added some printks to the code and saw that the signal does get sent
from iscsi_sw_tcp_conn_stop, but I didn't see (rc == -EINTR ||
rc == -EAGAIN) hit in iscsi_sw_tcp_xmit(), even when I ran IO on that
session.
Erez
Is r == 0 in iscsi_sw_tcp_xmit_segment?
If not, I think you need a different patch. In one of the patch versions
iscsi_sw_tcp_xmit_segment could return -ENODATA (this is when I had a
check for suspend_tx in there). iscsi_sw_tcp_xmit did not check this and
so I think we can loop.
Could you try the attached patch. It was made over open-iscsi.git for
you. I dropped the suspend bit check in iscsi_sw_tcp_xmit_segment,
because it is not needed. If we end up blocking the signal will wake us.
No, it is never zero.
> If not, I think you need a different patch. In one of the patch versions
> iscsi_sw_tcp_xmit_segment could return -ENODATA (this is when I had a
> check for suspend_tx in there). iscsi_sw_tcp_xmit did not check this and
> so I think we can loop.
>
> Could you try the attached patch. It was made over open-iscsi.git for
> you. I dropped the suspend bit check in iscsi_sw_tcp_xmit_segment,
> because it is not needed. If we end up blocking the signal will wake us.
I ran it and got the same 100% cpu usage. Did you try to run it on
your machines with open-iscsi.git? Did you see a different behavior?
Erez
I just ran it. Maybe I am looking for the wrong thing though.
For your problem, when the signal is sent does the recovery go ok and we
end up reconnecting? But the problem is just that the xmit thread takes
up 100% of the cpu?
Or.
For your problem, when the signal is sent does the recovery stall and we
do not reconnect, because the xmit thread is just spinning and taking
100% of the cpu?
Ignore this. I see the problem now. I was thinking you did not
reconnect. I see the cpu usage. Let me do some digging.
I found it. The problem is that we will send the signal whether the xmit
thread is running or not. If it is not running, the workqueue code will
keep getting woken up to handle the signal, but because we have not
called queue_work the workqueue code will not let the thread run, so we
never get to flush the signal until we reconnect and send down a login
pdu (the login pdu finally does a queue_work).
When you say "the xmit thread is running", I guess that you mean that
the xmit thread is busy with IO, right? Note that I said that this
happens whether I'm running IO or everything's idle. Two more things
that I forgot to mention:
1. I didn't try to reconnect the cable (actually, I disabled the port
in the switch) and see if the problem goes away.
2. When I log out (while the port is still disconnected), everything
goes back to normal, but I guess that this is because the xmit thread
dies.
Erez
No. workqueue.c:worker_thread() is spinning. It is looping because there
is a signal pending, but the iscsi work code which has the flush_signals
is not getting run because there is no work queued.
So you could add

	if (signal_pending(current))
		flush_signals(current);

to the worker_thread() loop and I think this will fix the problem.
Looks like this solves the problem. I've added the following patch to
the CentOS 5.3 kernel (2.6.18-128.1.6.el5):
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8594efb..e148ed8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -253,6 +253,9 @@ static int worker_thread(void *__cwq)
 	set_current_state(TASK_INTERRUPTIBLE);
 	while (!kthread_should_stop()) {
+		if (signal_pending(current))
+			flush_signals(current);
+
 		add_wait_queue(&cwq->more_work, &wait);
 		if (list_empty(&cwq->worklist))
 			schedule();
I'm running with open-iscsi.git + 2 commits from linux-2.6-iscsi.git
(9c302cc45b70ecc4b606d65a445902381066061b &
75be23dc40ba2f215779d5ba60fda9a762271bbe).
Will you push it upstream & into the RHEL kernel?
Thanks,
Erez
I am not sure. I was thinking that switching from a workqueue to a
thread is the right thing to do. The drawback is that the workqueue is
nice when there are multiple sessions per host, like is done with
bnx2i, cxgb3i and be2iscsi: I can just queue_work and pass the
connection to send on. If I switch to a thread I have to add my own
code to do that.
I am going to post a patch like you did to linux-kernel and see what
people say is best. If it goes in then I will port to RHEL.
Mike,
We had this discussion a long time ago. I don't remember what
eventually happened with it. Did you push the workqueue patch to the
kernel? What about the suspend-and-wake patch?
Thanks,
Erez
It looks like I posted it at Red Hat and never got a response, and I
probably then forgot about it and never asked upstream. Will send mail
upstream now.
I encountered this problem ~6 months ago and found some workaround.
Now, I moved to new (and faster) HW, and I'm hitting this again and
again in scenarios with lots of I/O + killing the target machine.
Erez
Which list are you sending it to? I thought it was lkml, but didn't
find any discussion there.
Erez
I think I found a nicer solution. See the attached patch made over
Linus's tree. I am just not sure if we are allowed to set the sk_err
field - maybe it is supposed to be internal to the socket code. The
patch seems to be working for me.
Works great for me.
Erez
Ok. I am going to post it to netdev today/tomorrow, to make sure they
are ok with how I am accessing the sock struct.
Mike,
Did you get any response from the netdev list?
Thanks,
Erez
I just sent it to linux-scsi after looking at some similar code. It
ended up getting merged in James's tree.
Thanks!