Finally I've found the mysterious I/O stall I've been chasing
for the last two months.
The problem is in iscsi_conn_queue_work(): we just call
queue_work() without checking the return value.
However, queue_work() will be effectively a no-op when
there is some work already running, so we'll be missing
quite a few invocations (on my NetApp target I've
counted up to 120000 missed invocations ...).
In addition to that, iscsi_check_cmdsn_window_closed()
doesn't actually check if the cmdsn window has been closed
(in the sense that we're not allowed to send any PDUs),
but rather that the new PDU _to be queued_ will be rejected.
There might still be PDUs in the queue, waiting to be
transferred.
So if we're hitting queuecommand hard enough we might be running into this
scenario:
- iscsi_data_xmit:
Transfers last Data-out PDU in requeue list; all queues are empty.
-> xmit_task()
-> unlock session
- queuecommand()
being called repeatedly until iscsi_check_cmdsn triggers, giving
xmit thread no chance to grab the session lock
-> iscsi_conn_queue_work() does nothing as work is already pending
- iscsi_data_xmit:
-> locks session
-> returns from xmit_task()
-> end of iscsi_data_xmit()
-> I/O stall.
So we really should check if we've missed some queue_work calls,
and restart iscsi_data_xmit() if so.
Proposed patch is attached.
Note that in theory we might loop here when ->xmit_task() returns
an error. But the loop should terminate: we only restart because
someone called queue_work in the meantime, which means the lists
have changed and we can make progress.
Comments etc welcome.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
ha...@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
Huh, I think I am misunderstanding you or misreading the workqueue
code, or probably both :(
When you say running, do you mean the work fn is running, or that the
work is queued (WORK_STRUCT_PENDING bit is set but the work fn is not
yet called)?
I thought this is what happens after queue_work returns 1:
run_workqueue
work_clear_pending
f(work)
so if we called queue_work while iscsi_data_xmit was running then
queue_work's pending test:
if (!test_and_set_bit(WORK_STRUCT_PENDING, work_data_bits(work))) {
BUG_ON(!list_empty(&work->entry));
__queue_work(wq_per_cpu(wq, cpu), work);
would actually queue the work to be run. So I thought we could actually
be getting extra calls to our work function, because we could call
queue_work right after the WORK_STRUCT_PENDING bit is cleared but
before the work fn runs.
If queue_work returns 0, then I thought iscsi_data_xmit is not yet
running, so when it does run it will handle the task added to the list
by the caller of queue_work (we do list_add then queue_work, so the
task is on the list when the work function runs if the work was
pending when we called queue_work).
Hey, for the list_add+queue_work sequence do I need to add some sort of
barrier?
> In addition to that, iscsi_check_cmdsn_window_closed()
> doesn't actually check if the cmdsn window has been closed
> (in the sense that we're not allowed to send any PDUs),
> but rather that the new PDU _to be queued_ will be rejected.
> There might still be PDUs in the queue, waiting to be
> transferred.
>
> So if we're hitting queuecommand hard enough we might be running into this
> scenario:
>
> - iscsi_data_xmit:
> Transfers last Data-out PDU in requeue list; all queues are empty.
> -> xmit_task()
> -> unlock session
>
> - queuecommand()
> being called repeatedly until iscsi_check_cmdsn triggers, giving
> xmit thread no chance to grab the session lock
> -> iscsi_conn_queue_work() does nothing as work is already pending
Yeah, so I thought here because iscsi_data_xmit is running we hit this:
run_workqueue
work_clear_pending
//WORK_STRUCT_PENDING is now cleared so queue_work will queue the work
(queue_work returns 1).
f(work)
However, I still have the feeling there is potential
for an I/O stall; the iscsi_check_cmdsn_window_closed() check
simply doesn't feel safe.
>
> Hey, for the list_add+queue_work sequence do I need to add some sort of
> barrier?
>
Not sure; we might have to, as the work is potentially run on another CPU.
>
>
>> In addition to that, iscsi_check_cmdsn_window_closed()
>> doesn't actually check if the cmdsn window has been closed
>> (in the sense that we're not allowed to send any PDUs),
>> but rather that the new PDU _to be queued_ will be rejected.
>> There might still be PDUs in the queue, waiting to be
>> transferred.
>>
>> So if we're hitting queuecommand hard enough we might be running into this
>> scenario:
>>
>> - iscsi_data_xmit:
>> Transfers last Data-out PDU in requeue list; all queues are empty.
>> -> xmit_task()
>> -> unlock session
>>
>> - queuecommand()
>> being called repeatedly until iscsi_check_cmdsn triggers, giving
>> xmit thread no chance to grab the session lock
>> -> iscsi_conn_queue_work() does nothing as work is already pending
>
> Yeah, so I thought here because iscsi_data_xmit is running we hit this:
>
> run_workqueue
> work_clear_pending
>
> //WORK_STRUCT_PENDING is now cleared so queue_work will queue the work
> (queue_work returns 1).
>
> f(work)
>
>
>
But we will still be stuck if, e.g., sendpage() returns a retryable
error other than -EAGAIN.
If such a thing exists; I will have to check.
Anyway, hope to have some more details tomorrow.
Running failover tests against a NetApp Filer.
The NetApp target you are using has a fixed window for your test, right?
You could just set can_queue to that window, and see if the problem occurs.
I was asking, on your other patch where you modified the queue_work in
iscsi_update_cmdsn, whether iscsi_sw_tcp_write_space is always called
when write space finally opens up after a sendpage failure. If so,
then iscsi_sw_tcp_write_space will call iscsi_conn_queue_work and
that should start us up again.
If iscsi_sw_tcp_write_space does not always get called then we have to
add some code to iscsi_complete_command to wake up the xmit thread when
a command still needs to be sent, and we have to modify iscsi_data_xmit
to do a delayed queue_work when there are no commands in flight.
Do you still have that test in iscsi_data_xmit or iscsi_xmit_task? If so,
there are the other problems I mentioned in the patch where you
modified iscsi_update_cmdsn's iscsi_conn_queue_work call.
If you just mean the iscsi_check_cmdsn_window_closed call in
iscsi_queuecommand, then I have been hitting problems with fcoe.ko when
it returns SCSI_MLQUEUE_HOST_BUSY. The throughput drops from 1XX MB/s to
a couple MB/s. I thought there might be bug in the
SCSI_MLQUEUE_HOST_BUSY handling code.
If you know the window size of your target, you can remove the
iscsi_check_cmdsn_window_closed call and set can_queue to it. Or you
can just set can_queue to 1 command and see if you still hit the
same problem.