I'm using open-iscsi v0.5 r454 with Debian 2.6.15-1-686-smp kernel, using
the default iscsi modules from that kernel.
Discovery against the Equallogic target works OK, and I get the volume list.
But when I try to log in, I get these errors:
# iscsiadm -m node --record 6fe140 --login
iscsiadm: iscsid reported error (11 - iSCSI PDU timed out)
from syslog:
kernel: iscsi4: detected conn error (1011)
iscsid: detected iSCSI connection (handle 0xf78cd000) error (1011) state (2)
Is there a solution to this problem?
Same target/volume works with other initiators.
-- Pasi Kärkkäinen
^
. .
Linux
/ - \
Choice.of.the
.Next.Generation.
# iscsiadm -m node
[6fe140] 10.xx.yy.245:3260,0 iqn.equallogic:6-8a0900-2f3432c01-e950000001843d65-vol01
[0143ba] 10.xx.yy.245:3260,0 iqn.equallogic:6-8a0900-252432c01-a2d0000001543d65-testvol01
That's what I get.. is that "3260,0" correct?
-- Pasi
No problem.
So, you have open-iscsi working with Equallogic target? What version of
open-iscsi? What version of Equallogic firmware?
Also please post your iscsid.conf so I can try to figure out what's the
problem in my setup..
Thanks!
-- Pasi
Anyone?
Same target/volume works fine with linux-iscsi (cisco) initiator.. and also
with windows initiator.
Recommended steps for debugging this?
-- Pasi
I actually have one of these at work; I will try it out in the morning.
What model are you using? Maybe an ethereal trace would help others, or
maybe turn on the iscsi and scsi debugging to better see where the error
is coming from.
I'm using Equallogic PS-200E.
I'll try with the debug options..
Thanks!
-- Pasi
If that's a login problem, you should send 'iscsid -d9' log output.
Did you also send the kernel log parts? I ran with a "Vendor: EQLOGIC
Model: 100E-00" and it seems to work ok. It was very slow because I was
in Minneapolis, MN and the target was in Westford, MA, but I could fdisk
and do mkfs. I am wondering where in the kernel the conn error came from
that produced this in the daemon:
iscsid: detected iSCSI connection (handle 0xd2a40000) error (1011) state (2)
dmesg:
iscsi: registered transport (tcp)
scsi4 : iSCSI Initiator over TCP/IP, v.0.3
iscsi4: detected conn error (1011)
Not very informative..
-- Pasi
We should probably come up with a way to do make DEBUG, but for now
could you turn on debugging in the driver? To do this you must open
kernel/iscsi_tcp.c and uncomment these lines:
/* #define DEBUG_TCP */
/* #define DEBUG_SCSI */
Or you can apply this patch which uncomments the lines for you. To apply
the patch go to the open-iscsi root dir and do a "patch -p0 -i
path-to-patch/out.debug.patch"
$ iscsiadm -m node --record b66f1a --login
iscsiadm: iscsid reported error (11 - iSCSI PDU timed out)
dmesg with DEBUG_TCP and DEBUG_SCSI turned on:
iscsi: registered transport (tcp)
scsi5 : iSCSI Initiator over TCP/IP, v.0.3
scsi: mgmtpdu [op 0x43 hdr->itt 0xa00 datalen 492]
scsi: mtask deq [cid 0 state 4 itt 0xa00]
tcp: sendhdr 48 bytes, sent 0 res 48
tcp: sendpage: 492 bytes, sent 0 left 492 sent 0 res 492
tcp: iscsi_tcp_state_change: TCP_CLOSE|TCP_CLOSE_WAIT
iscsi5: detected conn error (1011)
tcp: in 80 bytes
tcp: conn 0 Rx suspended!
Hope that helps..
-- Pasi
Hmm.. Could you also check your open-iscsi + kernel versions? It would be
nice to know which version of open-iscsi works out-of-the-box with EqualLogic
targets.
-- Pasi
I am not sure if I am reading this correctly, but from your two logs it
looks like the target is closing the connection. Is that the correct
interpretation? I am not sure why.... Maybe there is an iscsi reason, and
for that I guess we could add some code to dump the values in the PDUs,
or could you send an ethereal trace? Or maybe it is just a target
misconfig - that is just a complete guess based on similar problems with
another type of target.
The target is (should be) OK, because it works with Linux Cisco Initiator,
and also with Microsoft initiator..
I'll capture ethereal trace tomorrow and post it.
-- Pasi
The same problem happens with Linux 2.6.15 + open-iscsi v1.0 r485.
Looking at the ethereal capture, the login process fails when EqualLogic
target sends "iSCSI Login Response (Target moved temporarily)".
That happens because of automatic connection load balancing in EqualLogic target.
Initiators connect to the group (load balancing) IP which then redirects to
the real target interfaces.
The open-iscsi initiator seems to send an ARP request for the correct target
interface IP (which it got from the "Target moved temporarily" message), but
it doesn't send *anything* to the real target IP..
So the problem is related to the "Target moved temporarily" handling in
open-iscsi.
Ideas?
-- Pasi
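For readers following along, here is a rough sketch of what an initiator has to do with such a response. This is not open-iscsi code. It assumes the RFC 3720 encoding (Status-Class 0x01 for redirection, Status-Detail 0x01 for "moved temporarily" and 0x02 for "moved permanently") and the TargetAddress text key in the form host[:port][,portal-group-tag]. The struct and function names are made up for illustration, and only hostnames and IPv4 addresses are handled (no bracketed IPv6).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define STATUS_CLASS_REDIRECT  0x01  /* RFC 3720 Status-Class: Redirection */
#define REDIRECT_TEMPORARY     0x01  /* Status-Detail: target moved temporarily */
#define REDIRECT_PERMANENT     0x02  /* Status-Detail: target moved permanently */
#define DEFAULT_ISCSI_PORT     3260

/* Hypothetical container for the portal the initiator should reconnect to. */
struct portal {
    char host[64];
    int port;
    int tpgt;  /* portal group tag, -1 if the target did not send one */
};

/* True if the login response asks the initiator to reconnect elsewhere. */
static int is_login_redirect(unsigned char status_class,
                             unsigned char status_detail)
{
    return status_class == STATUS_CLASS_REDIRECT &&
           (status_detail == REDIRECT_TEMPORARY ||
            status_detail == REDIRECT_PERMANENT);
}

/* Parse a TargetAddress value "host[:port][,tpgt]".
 * Returns 0 on success, -1 if the value is malformed. */
static int parse_target_address(const char *value, struct portal *out)
{
    assert(value != NULL && out != NULL);

    const char *comma = strchr(value, ',');
    size_t addr_len = comma ? (size_t)(comma - value) : strlen(value);
    const char *colon = memchr(value, ':', addr_len);
    size_t host_len = colon ? (size_t)(colon - value) : addr_len;

    if (host_len == 0 || host_len >= sizeof(out->host))
        return -1;
    memcpy(out->host, value, host_len);
    out->host[host_len] = '\0';

    /* atoi stops at the first non-digit, so ":3260,1" yields 3260. */
    out->port = colon ? atoi(colon + 1) : DEFAULT_ISCSI_PORT;
    out->tpgt = comma ? atoi(comma + 1) : -1;
    if (out->port <= 0 || out->port > 65535)
        return -1;
    return 0;
}
```

After a redirect the initiator would close the connection to the group IP, open a new TCP connection to the parsed host and port, and repeat the login there. That is the step that never shows up in the capture described above.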
For some reason it looks like the response pdu is not getting to the
daemon. This is not a fix, but could you try this out? It should work
with svn or open-iscsi v1.0 r485. Thanks.
I guess if this works, maybe we want to be taking the sk_callback_lock
with write_lock because we could be setting the suspend bit if we call
iscsi_conn_failure.
I tried this.. the patch didn't apply to v1.0, but I did the same changes
manually.
I also defined DEBUG_SCSI and DEBUG_TCP.
The connection still fails.
$ iscsiadm -m node --record b66f1a --login
iscsiadm: iscsid reported error (11 - iSCSI PDU timed out)
dmesg:
scsi13 : iSCSI Initiator over TCP/IP, v.0.3
iscsi13: detected conn error (1011)
iscsid -d9 log available from:
http://nrg.joroinen.fi/iscsid-v1.0-d9.log
-Pasi
Is this all you have for the kernel messages? Maybe you need to get the
KERN_DEBUG messages.
My copy of the regression test is slightly modified. I've got the
fdisk/mkfs/bonnie turned off, as I've got a need for raw disk (also
it's got a patch for parallel runs that I should submit)
Additionally I have a wrapper script that runs the regression test in a
loop, logging to a new sequential file each time.
Your patch does appear to help a lot (I can complete 3 parallel
regression runs 95% of the time now, where it died much earlier in them
before), but things still aren't 100%.
I get a LOT of 'iscsi_tcp_state_change state 8' output, esp. around the
spots that iscsiadm is changing parameters.
For 3 parallel runs, 500+ instances of the error for the course of the
hour that regression.sh takes to run.
However if I modify the regression test to just change parameters, and
not actually touch the device at all, I see
'iscsi_tcp_state_change state 7' instead.
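As a reading aid for those numbers: the state printed by the driver is the socket's sk_state, and in the Linux TCP state enum 7 is TCP_CLOSE and 8 is TCP_CLOSE_WAIT. A small lookup helper along these lines can decode the log lines. The helper name is made up, and the values are the ones from the kernel's TCP state list as far as I know them, so double-check against your kernel headers.

```c
#include <assert.h>
#include <string.h>

/* Map a Linux sk_state number to its name (hypothetical helper). */
static const char *tcp_state_name(int state)
{
    static const char *const names[] = {
        NULL,              /* 0 is unused */
        "TCP_ESTABLISHED", /* 1 */
        "TCP_SYN_SENT",    /* 2 */
        "TCP_SYN_RECV",    /* 3 */
        "TCP_FIN_WAIT1",   /* 4 */
        "TCP_FIN_WAIT2",   /* 5 */
        "TCP_TIME_WAIT",   /* 6 */
        "TCP_CLOSE",       /* 7 */
        "TCP_CLOSE_WAIT",  /* 8 */
        "TCP_LAST_ACK",    /* 9 */
        "TCP_LISTEN",      /* 10 */
        "TCP_CLOSING",     /* 11 */
    };
    const int count = (int)(sizeof(names) / sizeof(names[0]));

    if (state <= 0 || state >= count)
        return "unknown";
    assert(names[state] != NULL);
    return names[state];
}
```

So 'state 8' means the peer (the target) sent a FIN first, while 'state 7' means the socket is fully closed.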
Grepping the regression logs for other error output, here are the counts
for the hour.
2 iscsiadm: got read error (0), daemon died?
4 iscsiadm: iscsid reported error (11 - iSCSI PDU timed out)
8 iscsiadm: iscsid reported error (5 - encountered iSCSI login failure)
--
Robin Hugh Johnson
E-Mail : rob...@gentoo.org
GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85
I remember that when I did the login redirect work, Tomof had an Equallogic
and tested my patch successfully with it. I'm not sure why it doesn't work
for you. But it looks like a race: the daemon gets the TCP disconnect message
before the login response PDU. Please post a tcpdump or ethereal capture, thanks.
>>
>> For some reason it looks like the response pdu is not getting to the
>> daemon. This is not a fix, but could you try this out? It should work
>> with svn or open-iscsi v1.0 r485. Thanks.
>
>
> I tried this.. the patch didn't apply to v1.0, but I did the same
> changes manually.
>
> I also defined DEBUG_SCSI and DEBUG_TCP.
>
> The connection still fails.
>
> $ iscsiadm -m node --record b66f1a --login
> iscsiadm: iscsid reported error (11 - iSCSI PDU timed out)
>
>
> dmesg:
> scsi13 : iSCSI Initiator over TCP/IP, v.0.3
> iscsi13: detected conn error (1011)
>
>
Yep, that's all.
I tried again, started with fresh v1.0 tarball, patched iscsi_tcp.c with
your patch and defined DEBUG_SCSI and DEBUG_TCP. Compiled, and loaded the
modules. Did I miss something?
Still nothing in dmesg or syslog.. I wonder why..
- Pasi
So before my patch, you could not log in at all. Now you can log in but
we have similar problems later on? Am I understanding that right? When
you run the regression tests and you can log in initially ok but it later
fails, do you see any async message PDUs being sent from the target to the
initiator (something about the connection or session being dropped or
logged out)?
Running multiple passes in parallel triggered the error reasonably quickly,
within 10 minutes of starting, but the quantity of traffic made it very
tough to capture iSCSI packets without running out of disk space.
> When you run the regression tests and you can log in initially ok but
> it later fails, do you see any async message PDUs being sent from the
> target to the initiator (something about the connection or session
> being dropped or logged out)?
I did some initial debugging of this last November with Zhen, and found
an async logout. Zhen's async patch '[RFC] handler for reject and async
event pdu' and its later editions that added a stub
'iscsi_async_event_rsp' did help a little as well, but left more
questions unanswered than not, and I suspect that the union of them and
your testing patch from this thread may resolve the issue completely.
(Thereafter I didn't have any access to the initiator box and target
box for most of December and January, and I've come back to the fray
now).
So, you have EqualLogic PS-100E.. what firmware?
What version of open-iscsi?
I'm asking because I can't log in at all.. I have a PS-200E with v2.2.6
firmware, and open-iscsi v1.0.
Nice to have others solving these problems too.. :)
-- Pasi
> What version of open-iscsi?
I keep up with SVN HEAD, which currently matches v1.0.
> I'm asking because I can't log in at all.. I have a PS-200E with v2.2.6
> firmware, and open-iscsi v1.0.
I'm not aware of what changes were made between 2.2.3 and 2.2.6, but I
certainly haven't ever had problems logging in - but I don't use any
authentication in my environment.
$ iscsiadm -m discovery -t sendtargets -p 10.0.0.245:3260
[6f4fb5] 10.0.0.245:3260,0 iqn.equallogic:6-8a0900-c4f432c01-a210000003143e86-testvol01
$ iscsiadm -m node -r 6f4fb5 -l
iscsiadm: iscsid reported error (11 - iSCSI PDU timed out)
dmesg:
scsi5 : iSCSI Initiator over TCP/IP, v.0.3
iscsi5: detected conn error (1011)
http://nrg.joroinen.fi/iscsid-d9-equallogic.txt
http://nrg.joroinen.fi/open-iscsi-v1.0-equallogic-login-error.pcap
There you are.. and I'm running open-iscsi v1.0 with 2.6.15-1-k7 Debian
kernel.
IP-addresses:
initiator: 10.0.0.10
target group IP: 10.0.0.245
target-if1 IP: 10.0.0.241
target-if2 IP: 10.0.0.242
target-if3 IP: 10.0.0.243
OK.
> > What version of open-iscsi?
> I keep up with SVN HEAD, which currently matches v1.0.
>
OK.
> > I'm asking because I can't log in at all.. I have a PS-200E with v2.2.6
> > firmware, and open-iscsi v1.0.
> I'm not aware of what changes were made between 2.2.3 and 2.2.6, but I
> certainly haven't ever had problems logging in - but I don't use any
> authentication in my environment.
>
Hmm.. I'm not using CHAP authentication either.
I'm only using access restrictions based on initiator IP address
(currently).
Maybe I should try with unrestricted volumes (in the target)..
Currently I cannot log in at all.. I just posted an ethereal dump of the
failing login process; hopefully that will help debugging the problem.
-- Pasi
We moved to restricting on initiator name instead to prevent future
occurrences.
If that ethereal capture is not enough, please tell me.. I'm able to do new
captures quite easily.
-- Pasi
This still seems to be a problem, right, Pasi?
From your iscsid log, the code that handles the temporary target move isn't
called, but the server has closed the connection, after which the initiator
doesn't go any further.
How about this? In the tcp state change callback, we check whether we're in
the normal (logged in) state; otherwise we just ignore the callback.
---
--- kernel/iscsi_tcp.c.orig	2006-02-15 14:38:43.000000000 +0800
+++ kernel/iscsi_tcp.c	2006-02-15 15:12:26.000000000 +0800
@@ -1242,7 +1242,8 @@ iscsi_tcp_state_change(struct sock *sk)
 	if ((sk->sk_state == TCP_CLOSE_WAIT ||
 	     sk->sk_state == TCP_CLOSE) &&
-	    !atomic_read(&sk->sk_rmem_alloc)) {
+	    !atomic_read(&sk->sk_rmem_alloc) &&
+	    (session->state == ISCSI_STATE_LOGGED_IN)) {
 		debug_tcp("iscsi_tcp_state_change: TCP_CLOSE|TCP_CLOSE_WAIT\n");
 		iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED);
 	}
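To make the effect of the extra test easier to see, here is a small userspace model of the two conditions. It is only a sketch, not kernel code: the enum and function names are invented, the session states are reduced to two, and rmem_alloc stands in for atomic_read(&sk->sk_rmem_alloc).

```c
#include <assert.h>
#include <stdbool.h>

/* Values mirror the Linux sk_state numbers for the two states tested. */
enum model_tcp_state {
    MODEL_TCP_ESTABLISHED = 1,
    MODEL_TCP_CLOSE = 7,
    MODEL_TCP_CLOSE_WAIT = 8
};

/* Simplified, hypothetical session states for this model. */
enum model_session_state {
    MODEL_STATE_IN_LOGIN,
    MODEL_STATE_LOGGED_IN
};

static bool socket_closed_and_drained(int sk_state, int rmem_alloc)
{
    assert(rmem_alloc >= 0);
    return (sk_state == MODEL_TCP_CLOSE_WAIT ||
            sk_state == MODEL_TCP_CLOSE) && rmem_alloc == 0;
}

/* Original check: any close with an empty receive queue is a conn failure. */
static bool reports_failure_before_patch(int sk_state, int rmem_alloc)
{
    return socket_closed_and_drained(sk_state, rmem_alloc);
}

/* Patched check: a close during login is left for the daemon to handle. */
static bool reports_failure_after_patch(int sk_state, int rmem_alloc,
                                        enum model_session_state session)
{
    return socket_closed_and_drained(sk_state, rmem_alloc) &&
           session == MODEL_STATE_LOGGED_IN;
}
```

With the original check, a target that closes the socket right after a redirect response produces a conn error while the login is still in progress. With the patched check the same close is ignored in the kernel, and it is up to the userspace daemon to notice the redirect or to time out.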
Yep, the problem is not yet resolved.
I also sent ethereal packet capture dump last week to this list.
I'll probably be able to test your patch tomorrow. I'll let you know if it
helps!
Thanks!
- Pasi
With this patch I can now successfully mount a volume from the Equallogic target!!
Thanks!
I'll do some stress testing next week and let you know how it goes..
- Pasi
This is strange. Zhen's patch makes it so it does not fire only when we
are not yet logged in. I took the big hammer out and removed all that
code so that iscsi_conn_failure would never fire here, which would of
course include Zhen's case. But my patch did not work in handling this?
rm that "not yet"
I just figured out I had a backup copy of the original 2.6.15 iscsi modules
under /lib/modules/ (sigh), and that was the reason why I didn't get any debug
output.. the original modules were loaded :( and that's the reason why your
patch didn't help.
Should I try your patch also?
-- Pasi
no, my patch was just a hack to make sure we were racing with the login
code.
thanks for the follow up.
Mike, I just have a feeling: should we do bind_conn that early? In case of a
login redirect we'll possibly recreate the socket. I'm not quite sure about this...
But the race is truly out there, so my patch says to ignore conn failures
when the session is not logged in, and that depends on the userspace daemon to
handle all the cases (redirect, timeout, etc). Is this ok?
yeah, agree on the race.
> when session is not logged in, that depends on userspace daemon to handle all
> the cases(redirect, timeout, etc). Is this ok?
>
I am still thinking about it. We can either overhaul the conn fail path
to remove the race or for the short term go your route.
The only problem I hit with the patch is that the test you added makes some
login failures take a lot longer (minutes instead of seconds). For one of the
test setups I use to verify patches, that code was detecting network
problems and would bail us out right away. My case is probably not a big
deal if it comes down to getting this to work for eql vs making failures
faster.
At least I'd hope to get the login stuff working with Equallogic targets :)
Is there no way to get both: login working and failure detection working
without long delays?
-- Pasi