[PATCH] I/O stall finally solved


Hannes Reinecke

Aug 14, 2009, 8:01:37 AM
to Mike Christie, open-iscsi
Hi Mike,

finally I've found the mysterious I/O stall I've been chasing
for the last two months.

Problem is iscsi_conn_queue_work(); we're just calling
queue_work() without checking the return value.

However, queue_work() will be effectively a no-op when
there is some work already running, so we'll be missing
quite a few invocations (on my NetApp target I've been
counting up to 120000 missed invocations ...).

In addition to that, iscsi_check_cmdsn_window_closed()
doesn't actually check if the cmdsn window has been closed
(in the sense that we're not allowed to send any PDUs),
but rather that the new PDU _to be queued_ will be rejected.
There might still be PDUs in the queue, waiting to be
transferred.
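
For reference, its semantics are roughly this (a simplified sketch from memory, not verbatim libiscsi code):

        /*
         * Returns non-zero when the _next_ PDU to be queued would
         * fall outside the window; it says nothing about PDUs
         * already sitting in the queues.
         */
        static int iscsi_check_cmdsn_window_closed(struct iscsi_conn *conn)
        {
                struct iscsi_session *session = conn->session;

                /* iscsi_sna_lte() compares serial numbers with wrap-around */
                if (!iscsi_sna_lte(session->queued_cmdsn, session->max_cmdsn))
                        return -ENOSPC;
                return 0;
        }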

So if we're hitting queuecommand hard enough we might be running into this
scenario:

- iscsi_data_xmit:
Transfers last Data-out PDU in requeue list; all queues are empty.
-> xmit_task()
-> unlock session

- queuecommand()
being called repeatedly until iscsi_check_cmdsn triggers, giving
xmit thread no chance to grab the session lock
-> iscsi_conn_queue_work() does nothing as work is already pending

- iscsi_data_xmit:
-> locks session
-> returns from xmit_task()
-> end of iscsi_data_xmit()

-> I/O stall.

So we really should check if we've missed some queue_work calls,
and restart iscsi_data_xmit() if so.
Proposed patch is attached.

Note that in theory we might loop here when ->xmit_task() returns
an error. But the loop should terminate, since the lists will have
changed once someone has called queue_work in the meantime.
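
To illustrate the general shape of the idea (a sketch only, not the attached patch; iscsi_xmit_queues() and XMIT_RESCAN are invented names):

        /*
         * Sketch of the idea only, not the attached patch.  A producer
         * that sees queue_work() report "already pending" would set
         * XMIT_RESCAN, and the work function loops until it observes
         * the flag clear.
         */
        static int iscsi_data_xmit(struct iscsi_conn *conn)
        {
                int rc;

        again:
                rc = iscsi_xmit_queues(conn);   /* drain cmdqueue/requeue lists */
                if (rc)
                        return rc;

                /* did we miss a queue_work() call while transmitting? */
                if (test_and_clear_bit(XMIT_RESCAN, &conn->flags))
                        goto again;

                return 0;
        }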

Comments etc welcome.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
ha...@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)

0001-libiscsi-I-O-stall-under-heavy-load.patch

Mike Christie

Aug 15, 2009, 5:01:17 PM
to open-...@googlegroups.com
On 08/14/2009 07:01 AM, Hannes Reinecke wrote:
> Hi Mike,
>
> finally I've found the mysterious I/O stall I've been chasing
> for the last two months.
>
> Problem is iscsi_conn_queue_work(); we're just calling
> queue_work() without checking the return value.
>
> However, queue_work() will be effectively a no-op when
> there is some work already running, so we'll be missing
> quite a few invocations (on my NetApp target I've been
> counting up to 120000 missed invocations ...).
>

Huh, I think I am misunderstanding you, or misreading the workqueue
code, or probably both :(

When you say running, do you mean the work fn is running, or that the
work is queued (the WORK_STRUCT_PENDING bit is set but the work fn has
not yet been called)?


I thought this happens when queue_work returns 1:

run_workqueue
    work_clear_pending
    f(work)

so if we called queue_work while iscsi_data_xmit was running, then
queue_work's pending test:

        if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
                BUG_ON(!list_empty(&work->entry));
                __queue_work(wq_per_cpu(wq, cpu), work);
                ret = 1;
        }


would actually queue the work to be run. So I thought we could actually
be getting extra calls to our work function, because we could call
queue_work right after the WORK_STRUCT_PENDING bit is cleared but
before the work fn starts running.

If queue_work returns 0, then I thought iscsi_data_xmit has not started
running yet, so when it does run it will handle the task added to the
list by the caller of queue_work (we do list_add then queue_work, so
the task should be on the list by the time the work function runs, if
the work was still pending when we called queue_work).
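
For reference, the producer side is roughly this (simplified from libiscsi, not verbatim):

        /* iscsi_queuecommand(), simplified */
        spin_lock(&session->lock);
        list_add_tail(&task->running, &conn->cmdqueue);
        iscsi_conn_queue_work(conn);    /* queue_work() on the xmit workqueue */
        spin_unlock(&session->lock);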


Hey, for the list_add+queue_work sequence do I need to add some sort of
barrier?

> In addition to that, iscsi_check_cmdsn_window_closed()
> doesn't actually check if the cmdsn window has been closed
> (in the sense that we're not allowed to send any PDUs),
> but rather that the new PDU _to be queued_ will be rejected.
> There might still be PDUs in the queue, waiting to be
> transferred.
>
> So if we're hitting queuecommand hard enough we might be running into this
> scenario:
>
> - iscsi_data_xmit:
> Transfers last Data-out PDU in requeue list; all queues are empty.
> -> xmit_task()
> -> unlock session
>
> - queuecommand()
> being called repeatedly until iscsi_check_cmdsn triggers, giving
> xmit thread no chance to grab the session lock
> -> iscsi_conn_queue_work() does nothing as work is already pending

Yeah, so I thought that here, because iscsi_data_xmit is running, we hit this:

run_workqueue
    work_clear_pending
    /* WORK_STRUCT_PENDING is now cleared, so queue_work() will
       queue the work (queue_work returns 1) */
    f(work)

Hannes Reinecke

Aug 17, 2009, 11:00:34 AM
to open-...@googlegroups.com
Yeah, you are right. My assumptions were too simplistic.
(Read: next time I should read the code more thoroughly ...)

However, I still have the feeling that there is potential
for an I/O stall; the iscsi_check_cmdsn_window_closed() check
simply doesn't feel safe.

>
> Hey, for the list_add+queue_work sequence do I need to add some sort of
> barrier?
>

Not sure; we might need one, as the work is potentially run on another CPU.

>
>
>> In addition to that, iscsi_check_cmdsn_window_closed()
>> doesn't actually check if the cmdsn window has been closed
>> (in the sense that we're not allowed to send any PDUs),
>> but rather that the new PDU _to be queued_ will be rejected.
>> There might still be PDUs in the queue, waiting to be
>> transferred.
>>
>> So if we're hitting queuecommand hard enough we might be running into this
>> scenario:
>>
>> - iscsi_data_xmit:
>> Transfers last Data-out PDU in requeue list; all queues are empty.
>> -> xmit_task()
>> -> unlock session
>>
>> - queuecommand()
>> being called repeatedly until iscsi_check_cmdsn triggers, giving
>> xmit thread no chance to grab the session lock
>> -> iscsi_conn_queue_work() does nothing as work is already pending
>
> Yeah, so I thought that here, because iscsi_data_xmit is running, we hit this:
>
> run_workqueue
>     work_clear_pending
>     /* WORK_STRUCT_PENDING is now cleared, so queue_work() will
>        queue the work (queue_work returns 1) */
>     f(work)
>
>
>

But we will still be stuck if, e.g., sendpage() returns a retryable
error other than -EAGAIN (if such a thing even exists; I will have
to check).
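
(Purely hypothetical sketch; whether anything besides -EAGAIN belongs in such a whitelist is exactly the open question, and -ENOBUFS below is an assumption:)

        /* hypothetical helper, invented name */
        static inline int xmit_error_is_retryable(int err)
        {
                return err == -EAGAIN || err == -ENOBUFS;
        }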

Anyway, hope to have some more details tomorrow.
Running failover tests against a NetApp Filer.

Mike Christie

Aug 18, 2009, 12:52:42 PM
to open-...@googlegroups.com

The NetApp target you are using has a fixed window for your test, right?
You could just set can_queue to that window and see if the problem still occurs.
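
(With open-iscsi that can be done from userspace; as far as I know the cmds_max session setting is what sizes can_queue, and 64 below is just a placeholder for the target's window:)

        # /etc/iscsi/iscsid.conf
        # cap the host queue depth to the target's window size
        node.session.cmds_max = 64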


On your other patch, where you modified the queue_work call in
iscsi_update_cmdsn, I was asking whether iscsi_sw_tcp_write_space is
always called when space finally opens up after a sendpage. If so,
then iscsi_sw_tcp_write_space will call iscsi_conn_queue_work, and
that should start us up again.

If iscsi_sw_tcp_write_space does not always get called, then we have to
add some code to iscsi_complete_command to wake up the xmit thread when
a command still needs to be sent, and we have to modify iscsi_data_xmit
to do a delayed queue_work when there are no commands in flight.
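
(A sketch of that delayed re-queue, with invented names; xmitwork_delayed would be a new struct delayed_work member and session_nothing_in_flight() a new helper:)

        /*
         * Sketch only: if the window is closed and nothing is in
         * flight, no completion will ever kick the xmit thread,
         * so poll again after a short delay.
         */
        if (iscsi_check_cmdsn_window_closed(conn) &&
            session_nothing_in_flight(session))
                queue_delayed_work(ihost->workq, &conn->xmitwork_delayed,
                                   msecs_to_jiffies(100));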

Mike Christie

Aug 18, 2009, 1:12:16 PM
to open-...@googlegroups.com
On 08/17/2009 10:00 AM, Hannes Reinecke wrote:
> However, I still have the feeling that there is potential
> for an I/O stall; the iscsi_check_cmdsn_window_closed() check
> simply doesn't feel safe.
>

Do you still have that test in iscsi_data_xmit or iscsi_xmit_task? If so,
there are the other problems I mentioned in the patch where you modified
iscsi_update_cmdsn's iscsi_conn_queue_work call.

If you just mean the iscsi_check_cmdsn_window_closed call in
iscsi_queuecommand, then note that I have been hitting problems with
fcoe.ko when it returns SCSI_MLQUEUE_HOST_BUSY. The throughput drops
from 1XX MB/s to a couple of MB/s. I thought there might be a bug in
the SCSI_MLQUEUE_HOST_BUSY handling code.
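
For context, the call sits at the top of the queuecommand path and looks roughly like this (simplified, not verbatim):

        if (iscsi_check_cmdsn_window_closed(conn)) {
                reason = FAILURE_WINDOW_CLOSED;
                goto reject;    /* makes the midlayer requeue and retry */
        }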

If you know the window size of your target, you can remove the
iscsi_check_cmdsn_window_closed call and set can_queue to it. Or you
can just set can_queue to 1 command and see if you still hit the same
problem.
