Question: iscsi_tcp: fix locking around iscsi sk user data

Sebastian Riemer

Feb 2, 2012, 9:44:06 AM

to Mike Christie, open-...@googlegroups.com

Hi Mike,

we had a locking issue yesterday with iscsi_tcp (open-iscsi 2.0-869 from
OFED-1.5.4) on Ethernet and a defective SSD on the target storage system
at the same time.

The storage system crashed because its OS was on the SSD without
replication, and another server, which runs QEMU/KVM VMs on that iSCSI
target, crashed too. This is just a testing environment, hence no
replication and no InfiniBand.

I've seen commit 52734d26ffca727da0e687963333ae88056ad84b from
linux-stable and wanted to apply it to that 2.6.30-based open-iscsi
kernel code. But when testing without the patch, I can't trigger the
panic again.
I've already tried running many parallel I/O- and CPU-intensive
processes and then pulling out the Ethernet plug.

Do you have an idea how to trigger this? I couldn't find anything in the
list archives related to that commit.

I've attached the screenshot of the panic.

Cheers,

Sebastian

iscsi_tcp server panic.png

Mike Christie

Feb 2, 2012, 6:02:34 PM

to Sebastian Riemer, open-...@googlegroups.com

In theory you should be able to hit it when there is IO running and the
socket is closed. This happens during error recovery, like when you pull
a cable or reboot a target, and when you shut down the iscsi service.

Users have only been hitting it during shutdown at reboot time though. I
think what was happening was the initiator sent a logout, got a logout
response, but then the target still sent data/responses for other IO
that was running at the time. The initiator assumed that once the logout
response was sent the target would stop sending data.

> I've attached the screenshot of the panic.
>

Need the beginning of the panic.

Ulrich Windl

Feb 3, 2012, 2:41:09 AM

to open-...@googlegroups.com

>>> Mike Christie <mich...@cs.wisc.edu> wrote on 03.02.2012 at 00:02 in
message <4F2B160A...@cs.wisc.edu>:

[...]

> Users have only been hitting it during shutdown at reboot time though. I
> think what was happening was the initiator sent a logout, got a logout
> response, but then the target still sent data/responses for other IO
> that was running at the time. The initiator assumed that once the logout
> response was sent the target would stop sending data.

[...]

In my experience, this can happen with multipath: when you use "multipath -f map", the command returns quickly, but the path checkers are actually removed in the background afterwards. So if you disconnect your iSCSI immediately after flushing your maps, you may hit this. I haven't seen this with iSCSI, but I have with a regular SAN (where I also had hard I/O hangs).

Regards,
Ulrich

Sebastian Riemer

Feb 3, 2012, 5:17:49 AM

to open-...@googlegroups.com

On 03/02/12 08:41, Ulrich Windl wrote:
> [...]
>
>> > Users have only been hitting it during shutdown at reboot time though. I
>> > think what was happening was the initiator sent a logout, got a logout
>> > response, but then the target still sent data/responses for other IO
>> > that was running at the time. The initiator assumed that once the logout
>> > response was sent the target would stop sending data.
>>
> [...]
>
> In my experience, this can happen with multipath: when you use "multipath -f map", the command returns quickly, but the path checkers are actually removed in the background afterwards. So if you disconnect your iSCSI immediately after flushing your maps, you may hit this. I haven't seen this with iSCSI, but I have with a regular SAN (where I also had hard I/O hangs).
>

Thanks for your responses. I'll try this. Luckily, this doesn't have too
much impact on us.

Unfortunately, we have AUFS on our diskless servers. After a reboot,
everything is gone, so the display from IPMI is the only thing I have at
the moment. We need to change this.

Cheers,

Sebastian

Mike Christie

Feb 3, 2012, 9:00:11 AM

to open-...@googlegroups.com, Sebastian Riemer

On 02/03/2012 04:17 AM, Sebastian Riemer wrote:
> On 03/02/12 08:41, Ulrich Windl wrote:
>> [...]
>>
>>>> Users have only been hitting it during shutdown at reboot time though. I
>>>> think what was happening was the initiator sent a logout, got a logout
>>>> response, but then the target still sent data/responses for other IO
>>>> that was running at the time. The initiator assumed that once the logout
>>>> response was sent the target would stop sending data.
>>>
>> [...]
>>
>> In my experience, this can happen with multipath: when you use "multipath -f map", the command returns quickly, but the path checkers are actually removed in the background afterwards. So if you disconnect your iSCSI immediately after flushing your maps, you may hit this. I haven't seen this with iSCSI, but I have with a regular SAN (where I also had hard I/O hangs).
>>
>
> Thanks for your responses. I'll try this. Luckily, this doesn't have too
> much impact on us.
>
> Unfortunately, we have AUFS on our disk-less servers.

Are you doing iSCSI root? I think that is where I have seen it the most:
the system uses iSCSI root, the user reboots, and they hit the problem
during the reboot.

Sebastian Riemer

Feb 3, 2012, 9:11:29 AM

to Mike Christie, open-...@googlegroups.com

On 03/02/12 15:00, Mike Christie wrote:
> Are you doing iSCSI root? I think that is where I have seen it the most:
> the system uses iSCSI root, the user reboots, and they hit the problem
> during the reboot.
>

Yes, but only in the VMs: QEMU/KVM gets the sdX device from the host as
storage. The host has its rootfs on AUFS via a PXE-booted live image.

OK then, I should try to hit it with VMs on the iSCSI storage.
Thanks for the hint.
