Hang during logout & cache-sync with Promise array

304 views
Skip to first unread message

benjamin9999

unread,
Aug 12, 2007, 9:52:20 PM8/12/07
to open-iscsi

kernel 2.6.22.2 vanilla
open-iscsi-2.0-865.9

login to a promise target with many LUs, one might look like this
i have a bunch of LU on this one, this /dev/sdal is the last one

scsi 4:0:0:35: Direct-Access Promise VTrak M500i 020E PQ: 0
ANSI: 4
sd 4:0:0:35: [sdal] 419430400 512-byte hardware sectors (214748 MB)
sd 4:0:0:35: [sdal] Write Protect is off
sd 4:0:0:35: [sdal] Mode Sense: 97 00 10 08
sd 4:0:0:35: [sdal] Write cache: enabled, read cache: enabled,
supports DPO and FUA
sd 4:0:0:35: [sdal] 419430400 512-byte hardware sectors (214748 MB)
sd 4:0:0:35: [sdal] Write Protect is off
sd 4:0:0:35: [sdal] Mode Sense: 97 00 10 08
sd 4:0:0:35: [sdal] Write cache: enabled, read cache: enabled,
supports DPO and FUA
sdal: sdal1
sd 4:0:0:35: [sdal] Attached SCSI disk
sd 4:0:0:35: Attached scsi generic sg37 type 0

notice Write cache is enabled, in contrast to a (common) IET target
does not enable this.

iscsiadm -m session -r 1 -u and on logout, you get these messages

sd 4:0:0:0: [sdc] Synchronizing SCSI cache
sd 4:0:0:1: [sdd] Synchronizing SCSI cache
sd 4:0:0:2: [sde] Synchronizing SCSI cache
< hangs here >

(remember, this target has many more LU than sde)

iscsiadm logout is just sitting

and iscsid -f is sitting also

output from the logout command on looks like


iscsid: match session [iqn.2006-09.com.domain.storage,
192.168.111.1,3260][ tcp,default,192.168.111.12]
iscsid: to [iqn.2006-09.com.domain.storage,192.168.111.1,3260][default
tcp,default,]
iscsid: sending pdu opcode 0x46:
iscsid: in ksend_pdu_begin
iscsid: send PDU began for hdr 48 bytes and data 0 bytes
iscsid: in kwritev
iscsid: wrote 48 bytes of PDU header
iscsid: in ksend_pdu_end
iscsid: in __kipc_call
iscsid: in kwritev
iscsid: in nlpayload_read
iscsid: in nlpayload_read
iscsid: send PDU finished for conn 1:0
iscsid: thread 00538ac0 schedule: delay 60 state 2
iscsid: logout timeout timer 0x538ac0 start

iscsid: poll result 1
iscsid: in ctldev_handle
iscsid: in nl_read
iscsid: ctldev_handle got event type 101

iscsid: message real length is 120 bytes, recv_handle 0x53eb60
iscsid: in nlpayload_read
iscsid: event_type: 1 adding to the tail:
0x0x53ab50:0x0x53ab50:0x0x53ab80:0x0x53eb50 elem 0x0x53ab50 length 48
iscsid: thread 0053aa80 schedule: delay 0 state 2
iscsid: exec thread 0053aa80 callback
iscsid: event_type: 1 removing from the head:
0x0x53ab50:0x0x53ab80:0x0x53ab80:0x0x53eb50 elem 0x0x53ab50 length 48
iscsid: in krecv_pdu_begin
iscsid: recv PDU began, pdu handle 0x0x53eb60
iscsid: in kread
iscsid: read 48 bytes of PDU header
iscsid: read 48 PDU header bytes, opcode 0x26, dlength 0, data
0x536638, max 8192
iscsid: in krecv_pdu_end
iscsid: recv PDU finished for pdu handle 0x0x53eb60
iscsid: Recv: logout response

iscsid: thread 00538ac0 delete: state 0
iscsid: deleting a scheduled/waiting thread!
iscsid: in kstop_conn
iscsid: in __kipc_call
iscsid: in kwritev
iscsid: in nlpayload_read
iscsid: in nlpayload_read
iscsid: thread 00538b00 delete: state 0
iscsid: deleting a scheduled/waiting thread!
iscsid: thread 00538b40 delete: state 2
iscsid: conn noop out timer 0x538b00 stopped

iscsid: thread 00538a40 delete: state 2
iscsid: in kdestroy_conn
iscsid: in __kipc_call
iscsid: in kwritev
iscsid: in nlpayload_read
iscsid: in nlpayload_read
iscsid: disconnecting conn 0x536578, fd 6
iscsid: in kdestroy_session
iscsid: in __kipc_call
iscsid: in kwritev

< no more output after this ever >

now back to the kernel messages...

after a few minutes have passed,

iscsi: cmd 0x35 is not queued (6)
iscsi: cmd 0x35 is not queued (6)
iscsi: cmd 0x35 is not queued (6)
sd 4:0:0:2: [sde] Result: hostbyte=DID_NO_CONNECT
driverbyte=DRIVER_OK,SUGGEST_OK
sd 4:0:0:3: [sdf] Synchronizing SCSI cache
iscsi: scsi conn_destroy(): host_busy 0 host_failed 0

iscsiadm and iscsid still sitting, and doesn't seem like any more
kernel messages will appear either now.


it looks like logout does not wait for the cache to sync and removes
the devices in an unfriendly way while it is trying to complete.

-benjamin

Mike Christie

unread,
Aug 13, 2007, 12:47:01 PM8/13/07
to open-...@googlegroups.com
benjamin9999 wrote:
>
> kernel 2.6.22.2 vanilla
> open-iscsi-2.0-865.9
>
> login to a promise target with many LUs, one might look like this
> i have a bunch of LU on this one, this /dev/sdal is the last one


....

>
> it looks like logout does not wait for the cache to sync and removes
> the devices in an unfriendly way while it is trying to complete.
>

Nice analysis! You are right. We went over this a week or so ago, and we
have to fix up the shutdown code for the new scsi-ml behavior or maybe
fix up scsi-ml if others depended on the old behavior.

The problem was that in older kernels if you removed the device from the
userspace sysfs interface, the operation would block until the sync
cache failed or completed. In 2.6.22, this behavior changed so that when
you write to /sys/block/sdX/device/delete it only initiates the removal
and then returns right away.

So before userspace knew that the device was removed when the delete
opertation was done. And so once the the devices were deleted we could
go and logout the connection. Now userspace only knows that the
operation was started and does not know the result or when it has completed.

benjamin9999

unread,
Aug 13, 2007, 2:44:21 PM8/13/07
to open-iscsi

> have to fix up the shutdown code for the new scsi-ml behavior or maybe
> fix up scsi-ml if others depended on the old behavior.

sounds like this will take some time then...

perhaps a sleep() could be inserted into iscsid someplace as a
workaround for now? or some other polling-type wait ?

-benjamin

benjamin9999

unread,
Aug 13, 2007, 2:46:40 PM8/13/07
to open-iscsi

> userspace sysfs interface, the operation would block until the sync
> cache failed or completed. In 2.6.22, this behavior changed so that when

also somewhat OT, but could you provide some details on the write-
cache & sync and what's going on here anyway? i assume that the
target is telling the OS that it is OK to buffer writes client-side?
maybe there is more to it than that, seems like something that could
improve performance if IET allowed this ?

-benjamin

Mike Christie

unread,
Aug 13, 2007, 2:48:36 PM8/13/07
to open-...@googlegroups.com

Yeah, I was looking into a polling hack over the weekend. I thought we
could check if the device's sysfs dir was still around. I was not sure
though if the sysfs dir is removed then we send the sync cahce or the
other way around. I am still looking at the code. If you know already
let me know :)

Mike Christie

unread,
Aug 13, 2007, 2:55:02 PM8/13/07
to open-...@googlegroups.com

I believe IET supports either. Check the IOMode= and Type= arguments in
the ietd.conf man page.

You should ask the IET list for sure which versions do what.

benjamin9999

unread,
Aug 13, 2007, 3:01:40 PM8/13/07
to open-iscsi

On Aug 13, 2:55 pm, Mike Christie <micha...@cs.wisc.edu> wrote:
> benjamin9999 wrote:
>
> >> userspace sysfs interface, the operation would block until the sync
> >> cache failed or completed. In 2.6.22, this behavior changed so that when
>
> > also somewhat OT, but could you provide some details on the write-
> > cache & sync and what's going on here anyway? i assume that the
>

> I believe IET supports either. Check the IOMode= and Type= arguments in
> the ietd.conf man page.
>
> You should ask the IET list for sure which versions do what.

is the Sync cache that occurs at logout actually sending data? or is
it just sending a scsi msg to flush/checkpoint the target side?

if that is what this is, then i understand what's going on with IET
here.


Mike Christie

unread,
Aug 13, 2007, 3:20:18 PM8/13/07
to open-...@googlegroups.com
benjamin9999 wrote:
>
>
> On Aug 13, 2:55 pm, Mike Christie <micha...@cs.wisc.edu> wrote:
>> benjamin9999 wrote:
>>
>>>> userspace sysfs interface, the operation would block until the sync
>>>> cache failed or completed. In 2.6.22, this behavior changed so that when
>>> also somewhat OT, but could you provide some details on the write-
>>> cache & sync and what's going on here anyway? i assume that the
>> I believe IET supports either. Check the IOMode= and Type= arguments in
>> the ietd.conf man page.
>>
>> You should ask the IET list for sure which versions do what.
>
> is the Sync cache that occurs at logout actually sending data? or is
> it just sending a scsi msg to flush/checkpoint the target side?
>

No data. It is just telling the target side to basically flush what it
has in its cache. For example tgt does a sync() to write out data to the
backing store. I believe IET does the equivalent of a sync() but it is
done from the kernel since IET is implemtented there. Look at the source
for IET or just fire off a mail to that list if you want to know all the
details of what it does.

Brett Neumeier

unread,
Aug 15, 2007, 2:08:47 PM8/15/07
to open-...@googlegroups.com
On 8/13/07, Mike Christie <mich...@cs.wisc.edu> wrote:
> > it looks like logout does not wait for the cache to sync and removes
> > the devices in an unfriendly way while it is trying to complete.
> [..]

> The problem was that in older kernels if you removed the device from the
> userspace sysfs interface, the operation would block until the sync
> cache failed or completed. In 2.6.22, this behavior changed so that when
> you write to /sys/block/sdX/device/delete it only initiates the removal
> and then returns right away.

Hi,

We are experiencing exactly the same problem at our site (that is:
linux-2.6.22, Promise VTrak M500i, open-iscsi 865.9, multiple LUs).

I thought I would mention, after reviewing this thread, we tried
downgrading our server to 2.6.21.6. With that kernel, we were
/occasionally/ able to shut down iSCSI sessions without a perpetual
hang during synchronization of the SCSI cache; but two out of three
attempts still wound up with the same hanging condition.

Please let me know if it would be useful to repeat the exercise with
debugging enabled in the iscsid, or in the kernel modules, or both;
I'd be delighted to do so.

Cheers,

bn

--
Brett Neumeier (bneu...@gmail.com)

Mike Christie

unread,
Aug 15, 2007, 2:46:43 PM8/15/07
to open-...@googlegroups.com
Brett Neumeier wrote:
> On 8/13/07, Mike Christie <mich...@cs.wisc.edu> wrote:
>>> it looks like logout does not wait for the cache to sync and removes
>>> the devices in an unfriendly way while it is trying to complete.
>> [..]
>> The problem was that in older kernels if you removed the device from the
>> userspace sysfs interface, the operation would block until the sync
>> cache failed or completed. In 2.6.22, this behavior changed so that when
>> you write to /sys/block/sdX/device/delete it only initiates the removal
>> and then returns right away.
>
> Hi,
>
> We are experiencing exactly the same problem at our site (that is:
> linux-2.6.22, Promise VTrak M500i, open-iscsi 865.9, multiple LUs).
>
> I thought I would mention, after reviewing this thread, we tried
> downgrading our server to 2.6.21.6. With that kernel, we were
> /occasionally/ able to shut down iSCSI sessions without a perpetual
> hang during synchronization of the SCSI cache; but two out of three
> attempts still wound up with the same hanging condition.
>

This seems like it might be something different because the code that is
the problem is upstream 2.6.22 and I do not think 2.6.21.6. has it.

Could you run iscsiadm with "-d 8" and could run iscsid with "-d 8" and
send all the output when you run the logout command.

Also what do you mean by hang? Is the box locking up hard or does the
iscsiadm command just not complete? Also how long do you wait for
iscsiadm to complete? It may take up to replacement_timeout seconds
which is 2 minutes by default.

Mike Christie

unread,
Aug 15, 2007, 2:51:07 PM8/15/07
to open-...@googlegroups.com
benjamin9999 wrote:
> iscsid: thread 00538a40 delete: state 2
> iscsid: in kdestroy_conn
> iscsid: in __kipc_call
> iscsid: in kwritev
> iscsid: in nlpayload_read
> iscsid: in nlpayload_read
> iscsid: disconnecting conn 0x536578, fd 6
> iscsid: in kdestroy_session
> iscsid: in __kipc_call
> iscsid: in kwritev
>
> < no more output after this ever >
>

Hey after this message how long to did you wait? It should not block for
longer than replacement_timeout seconds which is 2 minutes by default.
Also are you using the tools and kernel modules from
open-iscsi-2-0.866.9 or just the tools? The logout should actually not
take as long as 2 minutes if you are using the kernel modules from the
open-iscsi-2-0.866.9 release. If you use the kernel modules from the
upstream kernel then it could take some time because
iscsi_session_teardown was missing a call to iscsi_unblock_session.

Mike Christie

unread,
Aug 15, 2007, 2:53:56 PM8/15/07
to open-...@googlegroups.com

Oh yeah, same question I asked to Ben. Are you using the kernel modules
from open-iscsi 865.9 or the ones from the upstream kernel? The upstream
kernel modules in .22 and .21 have a bug (well maybe not a bug, but it
makes shutdown at this time really slow (at least 2 minutes like I said
above).

Also when iscsiadm hangs, could you do

cat /sys/class/scsi_host/hostX/state

where X is the number of the iscsi host. You can get the iscsi host
number by running "iscsiadm -m session -P 3".

Brett Neumeier

unread,
Aug 15, 2007, 5:14:40 PM8/15/07
to open-...@googlegroups.com
> Mike Christie wrote:
> > This seems like it might be something different because the code that is
> > the problem is upstream 2.6.22 and I do not think 2.6.21.6. has it.

OK, these logs were generated with linux 2.6.21.7,
open-iscsi-2.0-865.10 (both the userspace tools and the kernel modules
from open-iscsi), Promise VTrak M500i.

(Incidentally, exactly the same operations using an EMC AX150i work fine.)

> > Could you run iscsiadm with "-d 8" and could run iscsid with "-d 8" and
> > send all the output when you run the logout command.

Absolutely. For good measure I'm attaching four files: the output of
iscsiadm login with -d 8; the output of iscsiadm logout with -d 8; the
iscsid output with -d8 (this is the output from iscsid for both the
login and the logout); and the relevant portions of the kernel log
messages.

> > Also what do you mean by hang? Is the box locking up hard or does the
> > iscsiadm command just not complete?

The iscsiadm command does not complete. Also, the iscsi subsystem
appears to become completely non-functional when this occurs -- that
is, if we have iSCSI sessions with our Promise unit and EMC unit, and
do a logout of the Promise unit, then none of the EMC LUs are
accessible any more either. I'm not completely sure my memory on that
point is correct, though.

The box otherwise continues to work fine, but when we try to do a
reboot it winds up hanging at some point, at which point we have to
power-cycle the server.

> > Also how long do you wait for
> > iscsiadm to complete? It may take up to replacement_timeout seconds
> > which is 2 minutes by default.

During this test we waited three minutes, and iscsiadm never completed.

> Also when iscsiadm hangs, could you do
> cat /sys/class/scsi_host/hostX/state

Yep, it says "running".

Thanks very much for your time and attention! We appreciate it greately.

Please let me know if we can do any additional testing for you.

iscsiadm-login-promise-1.bz2
iscsiadm-logout-promise-1.bz2
iscsid-output-promise.bz2
kernel-log.bz2

benjamin9999

unread,
Aug 15, 2007, 5:15:23 PM8/15/07
to open-iscsi

> Also when iscsiadm hangs, could you do
> cat /sys/class/scsi_host/hostX/state

i'll get this info.

i was using both kernel & userspace from the open-iscsi-2.0-865.9
release, i've never used the kernel-included iscsi modules.

i was aware of the 2-minutes counter, i kept waiting until 10 minutes
had passed before i collected my cut-n-paste to post.

i can check the sysfs data and post that.

as for the "hang", the system still works, but you cannot shut down
the system properly since the iscsi_tcp module will never unload.

i have to use 'b' > sysrq-trigger to restart the system.

i only did this test against 2.6.22, i could test an earlier kernel.


Mike Christie

unread,
Aug 15, 2007, 5:23:04 PM8/15/07
to open-...@googlegroups.com

Thanks for the info guys. I am hitting something here while trying to
rpelicate. When I run iscsiadm logout the console goes crazy, but it
eventually logs out. Maybe a locking bug. Let me see what I can find out.

Mike Christie

unread,
Aug 15, 2007, 5:32:19 PM8/15/07
to open-...@googlegroups.com
Brett Neumeier wrote:
>> Mike Christie wrote:
>>> This seems like it might be something different because the code that is
>>> the problem is upstream 2.6.22 and I do not think 2.6.21.6. has it.
>
> OK, these logs were generated with linux 2.6.21.7,
> open-iscsi-2.0-865.10 (both the userspace tools and the kernel modules
> from open-iscsi), Promise VTrak M500i.
>
> (Incidentally, exactly the same operations using an EMC AX150i work fine.)
>
>>> Could you run iscsiadm with "-d 8" and could run iscsid with "-d 8" and
>>> send all the output when you run the logout command.
>
> Absolutely. For good measure I'm attaching four files: the output of
> iscsiadm login with -d 8; the output of iscsiadm logout with -d 8; the
> iscsid output with -d8 (this is the output from iscsid for both the
> login and the logout); and the relevant portions of the kernel log
> messages.

Ok, for some reason both of you are getting stuck in the session
teardown. It looks like we locking or thread/workq waiting on some work
problem. Could either of you run this patch? It applies to
open-iscsi-2.0-865.9 or .10. Just send the kernel log output. You do not
have to send the -d 8 output. Thanks.

If you are not sure how to apply just untar the iscsi tarball. cd to
that dir. Then run "patch -p1 -i
path-to-patch/add-teardown-debug.patch". Then build and install like normal.

add-teardown-debug.patch

Brett Neumeier

unread,
Aug 15, 2007, 6:03:24 PM8/15/07
to open-...@googlegroups.com
On 8/15/07, Mike Christie <mich...@cs.wisc.edu> wrote:
> Ok, for some reason both of you are getting stuck in the session
> teardown. It looks like we locking or thread/workq waiting on some work
> problem. Could either of you run this patch? It applies to
> open-iscsi-2.0-865.9 or .10. Just send the kernel log output. You do not
> have to send the -d 8 output. Thanks.

OK, I applied the patch. While I was at it I switched back to 2.6.22.2...

The iscsiadm command prints:
Logout session [sid: 1, target:
iqn.1994-12.com.promise.39.63.4.55.1.0.0.20, portal: 10.4.0.2,3260]

and then hangs indefinitely (more than 2 minutes).

The kernel log reports:
[ 1548.620272] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[ 1548.620909] sd 1:0:0:1: [sdc] Synchronizing SCSI cache
[ 1548.863997] iscsi_session_teardown unblocking session
[ 1548.864081] iscsi remove host

--
Brett Neumeier (bneu...@gmail.com)

benjamin9999

unread,
Aug 15, 2007, 6:07:28 PM8/15/07
to open-iscsi
> problem. Could either of you run this patch? It applies to

logout from IET target with no cache
Aug 15 18:05:13 [kernel] iscsi_session_teardown unblocking session
Aug 15 18:05:13 [kernel] iscsi remove host
Aug 15 18:05:13 [kernel] host removed
Aug 15 18:05:13 [kernel] cancel delayed work
Aug 15 18:05:13 [kernel] flush work
Aug 15 18:05:13 [kernel] grab ihost mutex
Aug 15 18:05:13 [kernel] remove targets
Aug 15 18:05:13 [kernel] unreg transport dev
Aug 15 18:05:13 [kernel] device del
Aug 15 18:05:13 [kernel] put host


logout from promise target with cache
Aug 15 18:05:52 [kernel] sd 4:0:0:0: [sdc] Synchronizing SCSI cache
Aug 15 18:05:52 [kernel] sd 4:0:0:1: [sdd] Synchronizing SCSI cache
Aug 15 18:05:52 [kernel] sd 4:0:0:2: [sde] Synchronizing SCSI cache
Aug 15 18:05:52 [kernel] sd 4:0:0:3: [sdf] Synchronizing SCSI cache
Aug 15 18:05:53 [kernel] iscsi_session_teardown unblocking session
Aug 15 18:05:53 [kernel] iscsi remove host

also, the /sys/class/scsi_host/hostX/state for this host is 'running'
even after this, and after 5 minutes have passed.

Mike Christie

unread,
Aug 15, 2007, 7:17:20 PM8/15/07
to open-...@googlegroups.com
Mike Christie wrote:
> Brett Neumeier wrote:
>> On 8/13/07, Mike Christie <mich...@cs.wisc.edu> wrote:
>>>> it looks like logout does not wait for the cache to sync and removes
>>>> the devices in an unfriendly way while it is trying to complete.
>>> [..]
>>> The problem was that in older kernels if you removed the device from the
>>> userspace sysfs interface, the operation would block until the sync
>>> cache failed or completed. In 2.6.22, this behavior changed so that when
>>> you write to /sys/block/sdX/device/delete it only initiates the removal
>>> and then returns right away.
>> Hi,
>>
>> We are experiencing exactly the same problem at our site (that is:
>> linux-2.6.22, Promise VTrak M500i, open-iscsi 865.9, multiple LUs).
>>
>> I thought I would mention, after reviewing this thread, we tried
>> downgrading our server to 2.6.21.6. With that kernel, we were
>> /occasionally/ able to shut down iSCSI sessions without a perpetual
>> hang during synchronization of the SCSI cache; but two out of three
>> attempts still wound up with the same hanging condition.
>>
>
> This seems like it might be something different because the code that is
> the problem is upstream 2.6.22 and I do not think 2.6.21.6. has it.
>

Bah I was wrong. 2.6.22 and 2.6.21 return before completely deleting the
device. iscsiadm should not hang though. So still send the output with
that patch.

benjamin9999

unread,
Aug 15, 2007, 11:31:57 PM8/15/07
to open-iscsi

> device. iscsiadm should not hang though. So still send the output with
> that patch.

patched output near the end of this thread, maybe you missed it ?


Mike Christie

unread,
Aug 16, 2007, 12:03:53 PM8/16/07
to open-...@googlegroups.com

I got it now. Thanks.

Mike Christie

unread,
Aug 16, 2007, 12:22:49 PM8/16/07
to open-...@googlegroups.com
benjamin9999 wrote:
> logout from promise target with cache
> Aug 15 18:05:52 [kernel] sd 4:0:0:0: [sdc] Synchronizing SCSI cache
> Aug 15 18:05:52 [kernel] sd 4:0:0:1: [sdd] Synchronizing SCSI cache
> Aug 15 18:05:52 [kernel] sd 4:0:0:2: [sde] Synchronizing SCSI cache
> Aug 15 18:05:52 [kernel] sd 4:0:0:3: [sdf] Synchronizing SCSI cache
> Aug 15 18:05:53 [kernel] iscsi_session_teardown unblocking session
> Aug 15 18:05:53 [kernel] iscsi remove host
>
> also, the /sys/class/scsi_host/hostX/state for this host is 'running'
> even after this, and after 5 minutes have passed.
>

Looks like we are getting stuck on the scan_mutex in scsi_remove_host.
When the cache sync is sent, it is sent with the scan_mutex held so
maybe the command is getting lost.

Could you guys run once with this patch? It works for open-iscsi
2.0-865.9 or greater. It should basically just always fail the cache
sync command so there would be no way to get into the scan_mutex mess.

Could you also try _without_ the patch but build the open-iscsi package
with:

make DEBUG_SCSI=1
make DEBUG_SCSI=1 install

Then login and logout and send just kernel output.

dont-partially-delete-from-up.patch

Mike Christie

unread,
Aug 16, 2007, 1:43:05 PM8/16/07
to open-...@googlegroups.com

Oh yeah, ,one other thing, could you check to see if the device is still
in sysfs when iscsiadm hangs. So if you see


Aug 15 18:05:52 [kernel] sd 4:0:0:0: [sdc] Synchronizing SCSI cache

and the scsi device is sdc then do

cat /sys/block/sdc/device/state

for each device.

Mike Christie

unread,
Aug 15, 2007, 2:07:15 AM8/15/07
to open-...@googlegroups.com

Actually both of you guys do not worry about running any more tests. I
setup IET with lots and lots of disks and set it up to do wb and I now I
can finally get the timing right to hit the lock up, so I can debug here.

Mike Christie

unread,
Aug 16, 2007, 4:32:55 PM8/16/07
to open-...@googlegroups.com

Ok do not worry about testing. I found the problem. Attached is a patch
which fixes the hang during a normal shutdown. There still is a problem
if the session is down and then you try to logout or if someone uses the
scsi sysfs interface to remove the device while the session is down then
tries to logout. I am working on a fix for that now, but the bug is in
the scsi layer.

The problem is that if the session is down we block the devices on that
session. If someone deletes the device from sysfs or from some other
function than scsi_remove_host, __scsi_remove_devices sets the device
state to "cancel". It then tries to send the sync cache command. But
when we blocked the session, the scsi ml function we use also blocks the
stops the block layer queue. So when __scsi_remove_devices tries to send
the cache sync command it set the device state ok, but it does not set
the queue into the correct state to execute commands, and that command
sits in the queue forever because the block layer bit is still set to
stop. When we later try to call scsi_remove_host we wait for the command
to finish which never happens and hang.

fail-incoming-when-logging-out.patch

Brett Neumeier

unread,
Aug 17, 2007, 9:57:42 AM8/17/07
to open-...@googlegroups.com
On 8/16/07, Mike Christie <mich...@cs.wisc.edu> wrote:
> Ok do not worry about testing. I found the problem. Attached is a patch
> which fixes the hang during a normal shutdown. [...]

Mr Christie,

Thanks for your hard work! Just wanted to confirm that I have tested
this patch with our Promise unit, open-iscsi-2.0-865.10, linux
2.6.22.2; it appears to resolve the issue completely.

Cheers,

bn

benjamin9999

unread,
Aug 17, 2007, 6:21:14 PM8/17/07
to open-iscsi

>
> fail-incoming-when-logging-out.patch

sorry, on the first try, not so much luck for me.
kernel output follows, i clipped out a bit of the repeating part

Aug 17 18:15:08 [kernel] sd 4:0:0:0: [sdc] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] sd 4:0:0:1: [sdd] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] sd 4:0:0:2: [sde] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] sd 4:0:0:3: [sdf] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] sd 4:0:0:4: [sdg] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:4: [sdg] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] Unable to handle kernel NULL pointer
dereference at 0000000000000020 RIP:
Aug 17 18:15:08 [kernel] <5>sd 4:0:0:5: [sdh] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
Aug 17 18:15:08 [kernel] [<ffffffff80446e6b>] klist_del+0xa/0x46
Aug 17 18:15:08 [kernel] PGD 0
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
Aug 17 18:15:08 [kernel] SMP
Aug 17 18:15:08 [kernel] CPU 0
Aug 17 18:15:08 [kernel] Modules linked in: iscsi_tcp libiscsi
scsi_transport_iscsi crc32c crypto_algapi libcrc32c<6>sd 4:0:0:5:
[sdh] Result: hostbyte=DID_NO_CONNECT dri
Aug 17 18:15:08 [kernel] sd 4:0:0:6: [sdi] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated twice -
Aug 17 18:15:08 [kernel] e1000 tg3
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
Aug 17 18:15:08 [kernel] sd 4:0:0:6: [sdi] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] sd 4:0:0:8: [sdk] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:8: [sdk] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] sd 4:0:0:9: [sdl] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:9: [sdl] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] sd 4:0:0:10: [sdm] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:10: [sdm] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] sd 4:0:0:11: [sdn] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:11: [sdn] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] sd 4:0:0:12: [sdo] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:12: [sdo] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] sd 4:0:0:13: [sdp] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:13: [sdp] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] sd 4:0:0:14: [sdq] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:14: [sdq] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] sd 4:0:0:15: [sdr] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:15: [sdr] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] sd 4:0:0:16: [sds] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:16: [sds] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] sd 4:0:0:17: [sdt] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:17: [sdt] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
....... more of this stuff.......
Aug 17 18:15:08 [kernel] sd 4:0:0:35: [sdal] Synchronizing SCSI cache
Aug 17 18:15:08 [kernel] iscsi: cmd 0x35 is not queued (6)
- Last output repeated 2 times -
Aug 17 18:15:08 [kernel] sd 4:0:0:35: [sdal] Result:
hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK
Aug 17 18:15:08 [kernel] Pid: 7, comm: events/0 Not tainted 2.6.22.2
#3
Aug 17 18:15:08 [kernel] RIP: 0010:[<ffffffff80446e6b>]
[<ffffffff80446e6b>] klist_del+0xa/0x46
Aug 17 18:15:08 [kernel] RSP: 0018:ffff81012b4e5e20 EFLAGS: 00010286
Aug 17 18:15:08 [kernel] RAX: ffffffff8054a288 RBX: 0000000000000000
RCX: ffffffff8046c3a8
Aug 17 18:15:08 [kernel] RDX: ffff8102230a10e8 RSI: ffff8102230a10e8
RDI: ffff8102230a1110
Aug 17 18:15:08 [kernel] RBP: ffff8102230a10e8 R08: 0000000000000000
R09: ffff81012981f180
Aug 17 18:15:08 [kernel] R10: 0000000000000000 R11: ffff81012981f180
R12: ffff8102230a1110
Aug 17 18:15:08 [kernel] R13: ffff81012a2b9028 R14: ffffffff805a61e0
R15: 0000000000000000
Aug 17 18:15:08 [kernel] FS: 00002b5df689b6d0(0000)
GS:ffffffff8055e000(0000) knlGS:0000000000000000
Aug 17 18:15:08 [kernel] CS: 0010 DS: 0018 ES: 0018 CR0:
000000008005003b
Aug 17 18:15:08 [kernel] CR2: 0000000000000020 CR3: 0000000000201000
CR4: 00000000000006e0
Aug 17 18:15:08 [kernel] Stack: ffff8102230a10e8 ffff8102230a10e8
ffff81022bc1b860 ffffffff80367116
Aug 17 18:15:08 [kernel] ffff81022bc1b860 ffff8102230a1000
ffff8102230a10e8 ffff81022bc1b860
Aug 17 18:15:08 [kernel] ffffffffffffffff ffffffff8038cb7f
ffffffff8030e3a3 ffff8102230a1000
Aug 17 18:15:08 [kernel] Call Trace:
Aug 17 18:15:08 [kernel] [<ffffffff80367116>] device_del+0x22/0x2bb
Aug 17 18:15:08 [kernel] [<ffffffff8038cb7f>] __scsi_remove_device
+0x3a/0x7a
Aug 17 18:15:08 [kernel] [<ffffffff8030e3a3>] kobject_release+0x0/0x9
Aug 17 18:15:08 [kernel] [<ffffffff802a74ce>]
sysfs_schedule_callback_work+0x0/0x33
Aug 17 18:15:08 [kernel] [<ffffffff8038cbe0>] scsi_remove_device
+0x21/0x2e
Aug 17 18:15:08 [kernel] [<ffffffff80448ca0>] __mutex_unlock_slowpath
+0x2a/0x33
Aug 17 18:15:08 [kernel] [<ffffffff802a74e3>]
sysfs_schedule_callback_work+0x15/0x33
Aug 17 18:15:08 [kernel] [<ffffffff8023692d>] run_workqueue
+0x83/0x10c
Aug 17 18:15:08 [kernel] [<ffffffff802370d9>] worker_thread+0x0/0xe4
Aug 17 18:15:08 [kernel] [<ffffffff802371b3>] worker_thread+0xda/0xe4
Aug 17 18:15:08 [kernel] [<ffffffff80239d42>] autoremove_wake_function
+0x0/0x2e
Aug 17 18:15:08 [kernel] [<ffffffff80239c22>] kthread+0x47/0x75
Aug 17 18:15:08 [kernel] [<ffffffff8020a308>] child_rip+0xa/0x12
Aug 17 18:15:08 [kernel] [<ffffffff80239bdb>] kthread+0x0/0x75
Aug 17 18:15:08 [kernel] [<ffffffff8020a2fe>] child_rip+0x0/0x12
Aug 17 18:15:08 [kernel] Code: 48 8b 6b 20 48 89 df e8 43 2c 00 00 4c
89 e7 e8 4f ff ff ff
Aug 17 18:15:08 [kernel] RIP [<ffffffff80446e6b>] klist_del+0xa/0x46
Aug 17 18:15:08 [kernel] RSP <ffff81012b4e5e20>

benjamin9999

unread,
Aug 17, 2007, 6:58:12 PM8/17/07
to open-iscsi

also, i did not use the patch, looked like commit
0b6bb0da7730fc08587acdf5a51666542be47843 was the same so i just
grabbed the .11 .

http://git.kernel.org/?p=linux/kernel/git/mnc/open-iscsi.git;a=commit;h=0b6bb0da7730fc08587acdf5a51666542be47843

benjamin9999

unread,
Aug 17, 2007, 9:03:56 PM8/17/07
to open-iscsi
> also, i did not use the patch, looked like commit
> 0b6bb0da7730fc08587acdf5a51666542be47843 was the same so i just

i looked over this again and noticed i was wrong.

i patched the code, and now the logout is ok, but i suppose this is
just an interim fix, shouldn't we really successfully send the cache
sync command? good enough for now though.

but what about the strange null pointer business earlier? hmmm


Mike Christie

unread,
Aug 17, 2007, 9:41:15 PM8/17/07
to open-...@googlegroups.com
benjamin9999 wrote:
>> also, i did not use the patch, looked like commit
>> 0b6bb0da7730fc08587acdf5a51666542be47843 was the same so i just
>
> i looked over this again and noticed i was wrong.
>

yeah, .11 does not have the fix. there are some other changes that are
needed for that patch to be complete. Some of the changes are in git,
but it missing one more thing. I am trying to do the proper fix where
the cache sync gets sent and executed correctly and I have to fix
scsi-ml since the problem can occur with any transport.

> i patched the code, and now the logout is ok, but i suppose this is
> just an interim fix, shouldn't we really successfully send the cache
> sync command? good enough for now though.

Yeah, it is temp fix so we do not hang. It does not fix that sync cahnce
is not getting sent correctly. We really want to send the cache sync and
have it executed properly. I am working on it. It is more invassive fix.
I am going on vacation this weekend so hopefully I can get back to it
Monday.

Thanks for testing Brett and Ben (Ben sorry I misread your mail the
first time and missed the part about the hang - I just addressed the
part about the sync cahce).

Reply all
Reply to author
Forward
0 new messages