We had a locking issue yesterday with iscsi_tcp (open-iscsi 2.0-869 from
OFED-1.5.4) on Ethernet, and a defective SSD on the target storage system
at the same time.
The storage crashed because its OS was on the SSD without replication,
and another server, which runs QEMU/KVM VMs on that iSCSI target,
crashed too. This is just a testing environment - therefore no
replication and no InfiniBand.
I've seen commit 52734d26ffca727da0e687963333ae88056ad84b from
linux-stable and wanted to apply it to that 2.6.30-based open-iscsi
kernel code. But when testing without the patch I can't trigger the
panic again.
I've already tried running many parallel I/O- and CPU-intensive processes
and then pulling out the Ethernet plug.
Do you have an idea how to trigger this? I couldn't find anything in the
list archives related to that commit.
I've attached the screenshot of the panic.
Cheers,
Sebastian
In theory you should be able to hit it when there is IO running and the
socket is closed. This happens during error recovery, like when you pull
a cable or reboot a target, and when you shut down the iscsi service.
Users have only been hitting it during shutdown at reboot time though. I
think what was happening was the initiator sent a logout, got a logout
response, but then the target still sent data/responses for other IO
that was running at the time. The initiator assumed that once the logout
response was sent the target would stop sending data.
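A minimal sketch of a test harness for the scenario described above: keep reads in flight on the iSCSI disk, then log the session out while they are still running. The device path, target IQN, and portal below are placeholders, and the function names are made up for illustration. It defaults to a dry run that only returns the commands it would execute, and there is no guarantee it reproduces the panic.

```python
import shlex
import subprocess
import time


def build_commands(device, target_iqn, portal):
    """Return (io_cmd, logout_cmd) as argument lists."""
    # Direct reads bypass the page cache, so IO really goes over the wire.
    io_cmd = ["dd", f"if={device}", "of=/dev/null", "bs=1M", "iflag=direct"]
    # Log out of a single node record while the readers are still active.
    logout_cmd = ["iscsiadm", "-m", "node", "-T", target_iqn,
                  "-p", portal, "--logout"]
    return io_cmd, logout_cmd


def logout_under_io(device, target_iqn, portal,
                    readers=4, warmup=5.0, dry_run=True):
    """Start `readers` dd processes, wait `warmup` seconds, then log out.

    Returns the planned commands as shell strings. With dry_run=True
    (the default) nothing is executed.
    """
    io_cmd, logout_cmd = build_commands(device, target_iqn, portal)
    plan = [shlex.join(io_cmd)] * readers + [shlex.join(logout_cmd)]
    if dry_run:
        return plan

    procs = [subprocess.Popen(io_cmd) for _ in range(readers)]
    try:
        time.sleep(warmup)                       # let the IO ramp up
        subprocess.run(logout_cmd, check=False)  # close the session under load
    finally:
        for proc in procs:
            proc.terminate()
            proc.wait()
    return plan


if __name__ == "__main__":
    # Placeholder values; replace them with a disposable test LUN.
    for line in logout_under_io("/dev/sdX", "iqn.2000-01.example:test",
                                "192.0.2.10:3260", readers=2):
        print(line)
```

Running it in a loop on a scratch LUN, with dry_run=False, gives the logout many chances to overlap with in-flight responses.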
> I've attached the screenshot of the panic.
>
Need the beginning of the panic.
[...]
> Users have only been hitting it during shutdown at reboot time though. I
> think what was happening was the initiator sent a logout, got a logout
> response, but then the target still sent data/responses for other IO
> that was running at the time. The initiator assumed that once the logout
> response was sent the target would stop sending data.
[...]
In my experience this can happen with multipath: when you use "multipath -f map", the command returns quickly, but the path checkers are actually removed in the background after that. So if you disconnect your iSCSI immediately after flushing your maps, you may hit this. I haven't had this with iSCSI, but with a regular SAN (where I also had hard hangs in I/O).
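One way to avoid disconnecting too early is to poll until the paths are no longer being checked before logging out. The sketch below is only an illustration: wait_until and device_in_listing are hypothetical helpers, and it assumes that a path listing such as the output of "multipathd show paths" mentions the device name for as long as its checker exists, which may differ between multipath-tools versions.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns True or `timeout` seconds pass.

    Returns True if the predicate became true, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def device_in_listing(listing, device):
    """Return True if `device` (e.g. "sdc") appears as a word in `listing`.

    `listing` is the text of a path listing. Assumption: the device name
    shows up as a whitespace-separated field while the path is checked.
    """
    return any(device in line.split() for line in listing.splitlines())


if __name__ == "__main__":
    # Canned listings standing in for repeated calls to the real command.
    samples = iter(["hcil dev\n3:0:0:1 sdc\n", "hcil dev\n3:0:0:1 sdc\n", "hcil dev\n"])
    gone = wait_until(lambda: not device_in_listing(next(samples), "sdc"),
                      timeout=5.0, interval=0.01)
    print("path checker gone:", gone)
```

In real use the predicate would run the listing command through subprocess and pass its output to device_in_listing; the iSCSI logout would only follow once wait_until returns True.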
Regards,
Ulrich
Thanks for your responses. I'll try this. Luckily, this doesn't have too
high an impact for us.
Unfortunately, we have AUFS on our disk-less servers. After rebooting
everything is gone. So the display from IPMI is the only thing I have at
the moment. We need to change this.
Cheers,
Sebastian
Are you doing iscsi root? I think that is where I have seen it the most:
iscsi root, then the user reboots, and during the reboot they hit the problem.
O.K. then, I should test hitting that with VMs on the iSCSI storage.
Thanks for the hint.