This tries to balance the two by reducing the
wait to 3 seconds from 15. If we have waited 3 secs
to send a pdu then perf is already taking a hit so
grabbing the session lock again is not going to make a
difference. And waiting up to 3 secs for the xmit thread
to flush and suspend is not that long (at least a lot better
than 15).
Cool. I'm running with 1 sec here, but the principle is
the same. Especially for a multipathed setup you really
want this.
Oh, what about making this setting dependent on the
transport class timeout?
Worst case, sendpage/sendmsg will now take up to 3 seconds
before it even returns an error.
So having a transport class timeout lower than that
is pointless: we have no means of terminating
a call stuck in sendpage/sendmsg, and the
transport class will always terminate the command.
So we should either limit the transport class timeout
to not being able to be set lower than 3 seconds or
make this timeout set by the transport class timeout.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
ha...@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Good point! Let me back out my patch, and do some more digging on why I
cannot just do
signal(xmit thread)
to wake it from sendpage/sendmsg right away.
If I cannot get that to work, then I will send a patch to implement what
you describe.
I'm running with open-iscsi.git HEAD + the check suspend bit patch +
the wake xmit on error patch. If I disconnect the cable on the
initiator side (even while not running IO), I see that after sending
the signal, the iscsi_q_XX thread reaches 100% cpu. I ran it over
several 1 Gb/10 Gb drivers and got the same results.
If I remove the wake xmit on error patch, I don't see this behavior.
Erez
Shoot, I have been running the xmit wakeup and suspend bit patch here
fine. Let me do some more testing.
Is this something you always hit? Could you send me the final patch you
ended up using?
I see this every time. Note that I'm not running with
linux-2.6-iscsi.git. I'm using the open-iscsi.git tree + the 2 patches
that I took without any change (using git-show) from the
linux-2.6-iscsi.git tree. Which tree did you test it on?
I added some printks to the code and saw that the signal does get sent
from iscsi_sw_tcp_conn_stop, but I didn't see (rc == -EINTR ||
rc == -EAGAIN) hit in iscsi_sw_tcp_xmit(), even when I ran IO on that
session.
Erez
Is r == 0 in iscsi_sw_tcp_xmit_segment?
If not, I think you need a different patch. In one of the patch versions
iscsi_sw_tcp_xmit_segment could return -ENODATA (this is when I had a
check for suspend_tx in there). iscsi_sw_tcp_xmit did not check this and
so I think we can loop.
Could you try the attached patch. It was made over open-iscsi.git for
you. I dropped the suspend bit check in iscsi_sw_tcp_xmit_segment,
because it is not needed. If we end up blocking the signal will wake us.
No, it is never zero.
> If not, I think you need a different patch. In one of the patch versions
> iscsi_sw_tcp_xmit_segment could return -ENODATA (this is when I had a
> check for suspend_tx in there). iscsi_sw_tcp_xmit did not check this and
> so I think we can loop.
>
> Could you try the attached patch. It was made over open-iscsi.git for
> you. I dropped the suspend bit check in iscsi_sw_tcp_xmit_segment,
> because it is not needed. If we end up blocking the signal will wake us.
I ran it and got the same 100% cpu usage. Did you try to run it on
your machines with open-iscsi.git? Did you see a different behavior?
Erez
I just ran it. Maybe I am looking for the wrong thing though.
For your problem, when the signal is sent does the recovery go ok and we
end up reconnecting? But the problem is just that the xmit thread takes
up 100% of the cpu?
Or.
For your problem, when the signal is sent does the recovery stall and we
do not reconnect, because the xmit thread is just spinning and taking
100% of the cpu?
Ignore this. I see the problem now. I was thinking you did not
reconnect. I see the cpu usage. Let me do some digging.
I found it. The problem is that we will send the signal whether the xmit
thread is running or not. If it is not running, the workqueue code will
keep getting woken up to handle the signal, but because we have not
called queue_work the workqueue code will not let the thread run, so we
never get to flush the signal until we reconnect and send down a login
pdu (the login pdu finally does a queue_work).
When you say "the xmit thread is running", I guess that you mean that
the xmit thread is busy with IO, right? Note that I said that this
happens whether I'm running IO or everything's idle. Two more things
that I forgot to mention:
1. I didn't try to reconnect the cable (actually, I disabled the port
in the switch) and see if the problem goes away.
2. When I log out (while the port is still disconnected), everything
goes back to normal, but I guess that this is because the xmit thread
dies.
Erez
No. workqueue.c:worker_thread() is spinning. It is looping because there
is a signal pending, but the iscsi work code which has the flush_signals
is not getting run because there is no work queued.
So you could add

	if (signal_pending(current))
		flush_signals(current);

to the worker_thread() loop and I think this will fix the problem.
Looks like this solves the problem. I've added the following patch to
the CentOS 5.3 kernel (2.6.18-128.1.6.el5):
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8594efb..e148ed8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -253,6 +253,9 @@ static int worker_thread(void *__cwq)
 	set_current_state(TASK_INTERRUPTIBLE);
 	while (!kthread_should_stop()) {
+		if (signal_pending(current))
+			flush_signals(current);
+
 		add_wait_queue(&cwq->more_work, &wait);
 		if (list_empty(&cwq->worklist))
 			schedule();
I'm running with open-iscsi.git + 2 commits from linux-2.6-iscsi.git
(9c302cc45b70ecc4b606d65a445902381066061b &
75be23dc40ba2f215779d5ba60fda9a762271bbe).
Will you push it upstream & into the RHEL kernel?
Thanks,
Erez
I am not sure. I was thinking that switching from a workqueue to a
thread is the right thing to do. The drawback is that the workqueue is
nice when there are multiple sessions per host, like is done with
bnx2i, cxgb3i and be2iscsi: I can just queue_work and pass the
connection to send on. If I switch to a thread I have to add my own
code to do that.
I am going to post a patch like you did to linux-kernel and see what
people say is best. If it goes in then I will port to RHEL.
Mike,
We had this discussion a long time ago. I don't remember what
eventually happened with it. Did you push the workqueue patch to the
kernel? What about the suspend-and-wake patch?
Thanks,
Erez
It looks like I posted it at Red Hat and never got a response, and I
probably then forgot about it and never asked upstream. Will send mail
upstream now.
I encountered this problem ~6 months ago and found some workaround.
Now, I moved to new (and faster) HW, and I'm hitting this again and
again in scenarios with lots of I/O + killing the target machine.
Erez
Which list are you sending it to? I thought it was lkml, but didn't
find any discussion there.
Erez
I think I found a nicer solution. See the attached patch made over
Linus's tree. I am just not sure if we are allowed to set the sk_err
field - maybe it is supposed to be internal to the socket code. The
patch seems to be working for me.
Works great for me.
Erez
Ok. I am going to post it to netdev today/tomorrow, to make sure they
are ok with how I am accessing the sock struct.
Mike,
Did you get any response from the netdev list?
Thanks,
Erez
I just sent it to linux-scsi after looking at some similar code. It
ended up getting merged in James's tree.
Thanks!