You can set the device back to running by hand with
echo running > /sys/block/sdX/device/state
but you might not want to, because the device may not actually be back yet.
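If you do go the manual route, a small guard around that echo can avoid touching a device that is not actually offline. This is only a sketch: the function name and the SYSFS_BLOCK override are made up for illustration (the override is there so the function can be tried against a scratch directory), and it does not check whether the disk behind the device really is healthy again.

```shell
#!/bin/sh
# Hypothetical helper: set an offlined SCSI disk back to "running".
# SYSFS_BLOCK is an assumed override for dry runs against a scratch
# directory; on a real host the default /sys/block is used.
set_running_if_offline() {
    dev=$1
    state_file="${SYSFS_BLOCK:-/sys/block}/$dev/device/state"
    # Bail out if the device (or its state attribute) does not exist.
    [ -r "$state_file" ] || { echo "no state file for $dev" >&2; return 2; }
    state=$(cat "$state_file")
    if [ "$state" = "offline" ]; then
        # Same write as the manual echo above.
        echo running > "$state_file"
        echo "$dev: offline -> running"
    else
        # Leave devices in any other state alone.
        echo "$dev: left as $state"
    fi
}
```

Called as `set_running_if_offline sdX` (as root on a real host), it only writes when the state file currently reads offline.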
> Is this a known problem, and has it been fixed in newer open-iscsi
> versions?
Are you using an older version of the Sun target?
>
> Mar 18 18:21:33 eq1-vz2 kernel: connection1:0: detected conn error
> (1011)
> Mar 18 18:21:36 eq1-vz2 kernel: session1: host reset succeeded
When we log back in we tell scsi-ml that we are ok.
> Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined -
> not ready after error recovery
scsi-ml will send a Test unit ready (TUR) command to check that the
device is ready to go. The TUR seems to be failing and so the scsi layer
sets the device offline.
I think there was a target issue that was fixed in newer versions.
If you can easily replicate this then you should take a wireshark/ethereal
trace and send the trace here so we can see why the TUR failed and make
sure it is not our fault before you go to the trouble of updating.
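One possible way to get such a trace from the initiator side is tcpdump, whose capture file wireshark can open. The sketch below only prints the command so it can be checked before running it as root. The helper name, the eth0 interface and the output file name are placeholders, and port 3260 (the default iSCSI port) is an assumption about this setup.

```shell
#!/bin/sh
# Hypothetical helper that only prints a capture command for review.
# Interface and file name are placeholders; 3260 is the default iSCSI
# port and is assumed here, adjust it if the portal listens elsewhere.
iscsi_capture_cmd() {
    iface=${1:-eth0}
    outfile=${2:-iscsi-tur.pcap}
    port=${3:-3260}
    # -s 0 captures whole packets so the SCSI response to the TUR is kept.
    echo "tcpdump -i $iface -s 0 -w $outfile port $port"
}

iscsi_capture_cmd eth0 iscsi-tur.pcap
```

Start the printed command before reproducing the problem and stop it once the "Device offlined" message shows up.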
First the scsi command timer would expire. You can see/set this in
/sys/block/sdX/device/timeout (there is also a udev rule). This causes
the scsi eh to run. That will try to abort the tasks on the device. If
that fails we try a lu reset. If that fails we drop the sessions on the
host and relogin (that is where the host reset messages come from). So
for a disk failure, we can log back in quickly because the target is
fine. The scsi eh will then send a TUR to the device to verify it is
back. The TUR would/could then fail quickly like you saw because the
disk really is bad. In that case, once you know the disk is back online,
you would want to manually set the state to running. Eventually
multipathd will then set the path back online in the multipath device.
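To make the order of those steps easier to follow, here is a toy model of the sequence. It is not kernel code and it is simplified: the function name is invented, each step's outcome is passed in as an argument, and it assumes the TUR is sent after the first step that succeeds and that a failed host reset also leaves the device offline.

```shell
#!/bin/sh
# Toy model of the recovery ladder described above; NOT kernel code.
# Arguments are the assumed outcome ("ok" or "fail") of each step:
#   $1 abort, $2 lu reset, $3 host reset (session drop + relogin), $4 TUR
eh_ladder() {
    abort=$1
    lu_reset=$2
    host_reset=$3
    tur=$4
    echo "command timer expired, scsi eh starts"
    echo "abort tasks: $abort"
    if [ "$abort" != ok ]; then
        echo "lu reset: $lu_reset"
        if [ "$lu_reset" != ok ]; then
            echo "host reset: $host_reset"
            if [ "$host_reset" != ok ]; then
                # Assumption: nothing left to try, device goes offline.
                echo "device state: offline"
                return 1
            fi
        fi
    fi
    # Some step worked, so check the device with a TUR.
    echo "TUR: $tur"
    if [ "$tur" = ok ]; then
        echo "device state: running"
    else
        echo "device state: offline"
        return 1
    fi
}
```

The case from the log corresponds to `eh_ladder fail fail ok fail`: the host reset succeeds because the target is fine, the TUR fails because the disk is not, and the device ends up offline.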
>
>>> Is this a known problem, and has it been fixed in newer open-iscsi
>>> versions?
>> Are you using an older version of the Sun target?
>
> I am. I am running OpenSolaris SXCE build 93, which is about 8 months
> old. I'll be upgrading this soon.
>
>>
>>
>>> Mar 18 18:21:33 eq1-vz2 kernel: connection1:0: detected conn error
>>> (1011)
>>> Mar 18 18:21:36 eq1-vz2 kernel: session1: host reset succeeded
>> When we log back in we tell scsi-ml that we are ok.
>
> At what level does the connection receive an error and reset (can't
> log in to target, read/write errors, etc), and what functionality is
> needed to be considered ok? If the device wasn't really ready to be
> used again, shouldn't iscsi know this and attempt another recovery?
> I'm not particularly well versed in iscsi protocol.
iSCSI does not know this and does not really deal with the device. It
deals with the connections/session to the target port/portal. The
target itself seems fine, so we can log back in quickly. The connections
are fine and we can send iscsi level IOs like logins and nops to the
target and it will respond ok. The target could tell the initiator that
it is temporarily unavailable when we try to login again, but if it can
still allow IO to other disks while the problem on the one bad disk is
going on, it probably would not want to do this.
If the target is returning something in the TUR that indicates that the
device is only temporarily gone, then maybe we would want to change the
scsi layer so that instead of failing and setting the device offline
right away it retries its eh a little later.
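The "retry a little later" idea can be sketched as a plain retry loop. This only illustrates the policy and is not how the scsi layer is written; the function name is made up, and the probe command stands in for whatever check (such as a TUR) would report that the device is ready.

```shell
#!/bin/sh
# Sketch of a "retry a little later" policy; names are hypothetical.
# retry_probe ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is used up.
retry_probe() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        # Wait before the next try, but not after the last one.
        [ "$i" -lt "$attempts" ] && sleep "$delay"
        i=$((i + 1))
    done
    echo "still not ready after $attempts attempt(s), giving up"
    return 1
}
```

With a policy like this the device would only be given up on after the last attempt, instead of after the first failed check.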
>
>>> Mar 18 18:22:16 eq1-vz2 kernel: sd 6:0:0:0: scsi: Device offlined -
>>> not ready after error recovery
>> scsi-ml will send a Test unit ready (TUR) command to check that the
>> device is ready to go. The TUR seems to be failing and so the scsi layer
>> sets the device offline.
>
> Is there only one TUR sent? I would have assumed a more robust
> recovery procedure here.
Only a TUR is sent to check if the aborts or resets worked.
>
>> I think there was a target issue that was fixed in newer versions.
>>
>> If you can easily replicate this then you should take a wireshark/ethereal
>> trace and send the trace here so we can see why the TUR failed and make
>> sure it is not our fault before you go to the trouble of updating.
>
> I'll see what I can do to get a wire trace next time I have an
> opportunity to intentionally hiccup the iscsi target.
>
You probably do not need to worry about this. It is working as expected.
But if you could get a trace, we can see what the TUR failed with and
maybe see if we can add some code so that if the device is telling us it
is only a temporary problem then we do not fail right away.