All right.
The issue is a bit different. In fact, there are two issues: one which we
can disregard completely, and one which seems to be my actual problem.
I'm cross-posting to the open-iscsi list, since I'm not sure where the
problem is (if I had to guess, I'd say it's IET's problem).
1) I had a reproducible oops on a Mandriva cooker kernel. Since it's
not a vanilla kernel, we won't analyze it and will assume it's something
broken in Mandriva's development kernel (or that the issues are easier
to trigger there).
2) IET 0.4.15 does indeed work with a vanilla 2.6.21 kernel. That doesn't
mean "Login I/O error, failed to receive a PDU" doesn't happen, though.
It's actually easy to reproduce.
a) Install IET 0.4.15 to work with 2.6.21 kernel
b) connect the initiator (open-iscsi 2.0-754)
c) play a bit, make sure everything works
d) if everything works, stop the target:
# /etc/init.d/iscsi-target stop
e) wait a couple of seconds (important!), and start the target again
f) on the initiator, issue this command:
# iscsiadm -m discovery -t sendtargets -p <TARGET_IP>
It will take a very long time to complete, and you will get "Login I/O
error, failed to receive a PDU".
If that command succeeds, do either:
- reissue that same command again (three, four times or so) - it will fail
- repeat e) and f) again.
For me, the issue is 100% reproducible on kernel 2.6.21 and IET 0.4.15.
It is *NOT* reproducible on kernel 2.6.17 and IET 0.4.15.
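Steps f) and the "reissue three, four times" variant could be scripted along these lines. This is only a minimal sketch, not something taken from open-iscsi: the function names and the 120-second timeout are hypothetical, and it assumes iscsiadm exits non-zero when discovery fails, which has not been checked against this setup.

```python
import subprocess
import time


def discovery_command(portal):
    """Build the discovery command exactly as quoted in step f)."""
    return ["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", portal]


def run_once(portal, timeout_s=120.0, runner=subprocess.run):
    """Run one discovery attempt and return (status, seconds_elapsed).

    status is "ok", "failed" (non-zero exit code) or "hung" (timeout hit).
    The runner argument exists only so the loop can be exercised without
    a real initiator; by default it is subprocess.run.
    """
    started = time.monotonic()
    try:
        result = runner(
            discovery_command(portal),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        status = "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "hung"
    return status, time.monotonic() - started


def repeat_discovery(portal, attempts=4, runner=subprocess.run):
    """Reissue the same command several times, as the report suggests."""
    return [run_once(portal, runner=runner) for _ in range(attempts)]
```

On an initiator, repeat_discovery("<TARGET_IP>") would return one (status, seconds) pair per attempt, which makes the "takes a very long time" part measurable.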
--
Tomasz Chmielewski
http://wpkg.org
Sorry, I cannot reproduce this at all. My target box is FC5 with my
own 2.6.21.1 kernel and IET 0.4.15. The initiator box is RHEL5 with
iscsi-initiator-utils-6.2.0.742-0.5.el5.
I tried a couple of times with different delays, from a couple of seconds
to a couple of minutes.
Can you run tcpdump on BOTH sides and gather the 2 logs? Also run
"netstat -t" before and after you restart IET.
ming
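Comparing the two "netstat -t" snapshots by eye gets tedious with many connections. A minimal sketch of a diff helper follows, assuming the usual column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function names are hypothetical.

```python
def parse_netstat_tcp(text):
    """Return a set of (local, remote, state) tuples from netstat -t output."""
    conns = set()
    for line in text.splitlines():
        fields = line.split()
        # Data rows look like: tcp  0  0  local:port  remote:port  STATE
        # Header lines do not start with "tcp" and are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            conns.add((fields[3], fields[4], fields[5]))
    return conns


def diff_snapshots(before, after):
    """Return (gone, new): rows only in the first / only in the second snapshot."""
    b, a = parse_netstat_tcp(before), parse_netstat_tcp(after)
    return sorted(b - a), sorted(a - b)
```

Feeding it the output saved before and after the IET restart would list connections that disappeared or changed state.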
I'll do it eventually, but the box is pretty important, so it can take
some time before I send the results.
In the meantime, I'll try to reproduce it on some other box.
Any chance it could be CPU-related? The machine where it happens
is a Thecus n5200, with a Celeron M processor:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 9
model name : Intel(R) Celeron(R) M processor 600MHz
stepping : 5
cpu MHz : 599.015
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr mce cx8 apic sep mtrr pge mca
cmov pat clflush dts acpi mmx fxsr sse sse2 tm pbe
bogomips : 1198.69
clflush size : 64
No idea. I tested in 2 VMs on my VMware server.
OK... I just reproduced it again... and there will be no tcpdump output,
because there isn't any.
Looks like it is open-iscsi that's failing after all - instead of
tcpdump output I have some strace:
# strace iscsiadm -m discovery -t sendtargets -p SOME_ABSTRACT_OR_REAL_ADDRESS
execve("/sbin/iscsiadm", ["iscsiadm", "-m", "discovery", "-t",
"sendtargets", "-p", "192.168.111.183"], [/* 51 vars */]) = 0
brk(0) = 0x8066000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7eff000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or
directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=30422, ...}) = 0
mmap2(NULL, 30422, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7ef7000
close(3) = 0
open("/lib/i686/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\240X\1"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=1220244, ...}) = 0
mmap2(NULL, 1230204, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0xb7dca000
mmap2(0xb7ef1000, 12288, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x126) = 0xb7ef1000
mmap2(0xb7ef4000, 9596, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7ef4000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7dc9000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7dc96c0,
limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xb7ef1000, 4096, PROT_READ) = 0
mprotect(0xb7f19000, 4096, PROT_READ) = 0
munmap(0xb7ef7000, 30422) = 0
rt_sigaction(SIGINT, {0x805cd70, [], 0}, {SIG_DFL}, 8) = 0
umask(0177) = 022
socket(PF_FILE, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_FILE, path=@ISCSIADM_ABSTRACT_NAMESPACE}, 110) = 0
write(3, "\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
2500) = 2500
recv(3,
And it sits there and waits :(
Or:
# strace iscsiadm -m node -T iqn.2007-05.net.my:test -p 192.168.111.177 -l
execve("/sbin/iscsiadm", ["iscsiadm", "-m", "node", "-T",
"iqn.2007-05.net.syneticon:supert"..., "-p", "192.168.111.177", "-l"],
[/* 51 vars */]) = 0
brk(0) = 0x8066000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7f04000
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or
directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=30422, ...}) = 0
mmap2(NULL, 30422, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7efc000
close(3) = 0
open("/lib/i686/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\240X\1"...,
512) = 512
fstat64(3, {st_mode=S_IFREG|0644, st_size=1220244, ...}) = 0
mmap2(NULL, 1230204, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3,
0) = 0xb7dcf000
mmap2(0xb7ef6000, 12288, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x126) = 0xb7ef6000
mmap2(0xb7ef9000, 9596, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xb7ef9000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0xb7dce000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7dce6c0,
limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xb7ef6000, 4096, PROT_READ) = 0
mprotect(0xb7f1e000, 4096, PROT_READ) = 0
munmap(0xb7efc000, 30422) = 0
rt_sigaction(SIGINT, {0x805cd70, [], 0}, {SIG_DFL}, 8) = 0
umask(0177) = 022
socket(PF_FILE, SOCK_STREAM, 0) = 3
connect(3, {sa_family=AF_FILE, path=@ISCSIADM_ABSTRACT_NAMESPACE}, 110) = 0
write(3, "\10\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
2500) = 2500
recv(3,
No tcpdump output here either.
I only get some tcpdump output when I telnet to the port directly - so
it's not a connectivity problem:
# telnet 192.168.111.177 3260
Trying 192.168.111.177...
Connected to 192.168.111.177 (192.168.111.177).
Escape character is '^]'.
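The telnet check above can also be done non-interactively. A minimal sketch using only the standard socket module; the function name and the 5-second timeout are arbitrary choices, and 3260 is the port used throughout this thread.

```python
import socket


def tcp_port_open(host, port=3260, timeout_s=5.0):
    """Return True if a TCP connection to host:port completes within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts alike.
        return False
```

Like telnet, this only proves that the TCP handshake completes; it says nothing about whether the iSCSI login would succeed.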
looks like the communication between iscsiadm and iscsid has issues. ;)
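A quick way to check whether anything is listening on the iscsiadm/iscsid socket at all is sketched below. The socket name comes from the strace lines (path=@ISCSIADM_ABSTRACT_NAMESPACE). The NUL padding is an inference from the connect() length of 110 (2 bytes of family plus a 108-byte sun_path), not something confirmed from the open-iscsi source. Linux only; the function names are hypothetical.

```python
import socket

# Name taken from the strace output: path=@ISCSIADM_ABSTRACT_NAMESPACE
ISCSID_SOCKET_NAME = "ISCSIADM_ABSTRACT_NAMESPACE"
SUN_PATH_LEN = 108  # size of sun_path in struct sockaddr_un on Linux


def abstract_address(name):
    """Build an abstract-namespace address: leading NUL, name, NUL padding.

    Assumption: the peer bound the full 108-byte sun_path, as the address
    length of 110 in the strace suggests.
    """
    return (b"\0" + name.encode("ascii")).ljust(SUN_PATH_LEN, b"\0")


def daemon_is_listening(name=ISCSID_SOCKET_NAME, timeout_s=5.0):
    """Return True if something accepts connections on the abstract socket."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.connect(abstract_address(name))
        except OSError:
            return False
        return True
```

In the traces above connect() already succeeded and the hang is in recv(), so this probe would only rule out the "iscsid is not running" case; it cannot tell why the daemon does not answer.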
>>
>> And it sits there and waits :(
>
> looks like the communication between iscsiadm and iscsid has issues. ;)
Yep.
I restarted iscsid (well, it didn't want to restart so easily), tried to
login to a target, and the machine rebooted...
Sadly, nothing in logs.
I guess my network is haunted.
Try running iscsid -d8 and see if there are any log messages. Also try to
enable some crash logging and see if you can capture one or two.
>
> Sadly, nothing in logs.
>
>
> I guess my network is haunted.
>
>
Come on, Halloween is still 5 months away!
You might have multiple iscsid or iscsiadm versions installed on your
system. Run "whereis iscsid" and "whereis iscsiadm". If you have more
than one, remove them and do a fresh reinstall.
No:
# whereis iscsid
iscsid: /sbin/iscsid /usr/share/man/man8/iscsid.8
# whereis iscsiadm
iscsiadm: /sbin/iscsiadm /usr/share/man/man8/iscsiadm.8
I'm closer to reproducing this, though :)
1. Start tgtd, with many targets
2. Start the (open-iscsi) initiator on the other side, say 2 hosts,
connect nodes
3. Kill tgtd
4. Start IET, with one target
5. On each of initiators, run:
iscsiadm -m discovery -t sendtargets -p 192.168.111.177
It will say either:
[root@syn2 manager]# iscsiadm -m discovery -t sendtargets -p 192.168.111.177
iscsiadm: connection to discovery address 192.168.111.177 failed
iscsiadm: connection to discovery address 192.168.111.177 failed
iscsiadm: connection to discovery address 192.168.111.177 failed
iscsiadm: connection to discovery address 192.168.111.177 failed
iscsiadm: connection to discovery address 192.168.111.177 failed
iscsiadm: connection login retries (reopen_max) 5 exceeded
or:
[root@syn4 usr]# iscsiadm -m discovery -t sendtargets -p 192.168.111.177
iscsiadm: socket 3 header read timed out
iscsiadm: Login I/O error, failed to receive a PDU
iscsiadm: retrying discovery login to 192.168.111.177
iscsiadm: connection to discovery address 192.168.111.177 failed
iscsiadm: connection to discovery address 192.168.111.177 failed
iscsiadm: connection to discovery address 192.168.111.177 failed
iscsiadm: connection to discovery address 192.168.111.177 failed
iscsiadm: connection login retries (reopen_max) 5 exceeded
6. Stop IET/ietd
7. Start tgtd with the same single target IET/ietd was running - now
"iscsiadm -m discovery..." runs fine.
Is this normal? If not, what options should I start tcpdump with?
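When collecting results from several initiators, the iscsiadm messages quoted in step 5 can be tallied mechanically. A minimal sketch that matches only the message strings shown above; the function name and the dictionary keys are hypothetical.

```python
def summarize_iscsiadm_output(text):
    """Count the failure messages seen in the discovery output quoted above."""
    lines = [line.rstrip() for line in text.splitlines()]
    return {
        # "iscsiadm: connection to discovery address <ip> failed"
        "connect_failures": sum(
            "connection to discovery address" in line and line.endswith("failed")
            for line in lines
        ),
        # "iscsiadm: Login I/O error, failed to receive a PDU"
        "pdu_timeouts": sum("failed to receive a PDU" in line for line in lines),
        # "iscsiadm: connection login retries (reopen_max) 5 exceeded"
        "gave_up": any("reopen_max" in line and "exceeded" in line for line in lines),
    }
```

Applied to the syn4 output it distinguishes the one attempt that connected and then timed out waiting for a PDU from the attempts that never connected at all.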
Initiator:
[root@syn4 manager]# tcpdump -i eth0 port 3260 -vv
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 96
bytes
18:18:42.530445 IP (tos 0x0, ttl 64, id 11879, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52473 >
192.168.111.177.3260: S, cksum 0x9169 (correct),
2498385728:2498385728(0) win 5840 <mss 1460,sackOK,timestamp 817447
0,nop,wscale 7>
18:18:45.520597 IP (tos 0x0, ttl 64, id 11880, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52473 >
192.168.111.177.3260: S, cksum 0x903d (correct),
2498385728:2498385728(0) win 5840 <mss 1460,sackOK,timestamp 817747
0,nop,wscale 7>
18:18:51.299384 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto:
TCP (6), length: 60) 192.168.111.177.3260 > 192.168.111.176.52472: S,
cksum 0x1b35 (correct), 2952957677:2952957677(0) ack 2462022456 win 5792
<mss 1460,sackOK,timestamp 1710516 816371,nop,wscale 5>
18:18:51.299399 IP (tos 0x0, ttl 64, id 4731, offset 0, flags [DF],
proto: TCP (6), length: 64) 192.168.111.176.52472 >
192.168.111.177.3260: ., cksum 0xa3a9 (correct), 302:302(0) ack 1 win 46
<nop,nop,timestamp 818325 1710516,nop,nop,sack 1 {0:1}>
18:18:51.518131 IP (tos 0x0, ttl 64, id 11881, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52473 >
192.168.111.177.3260: S, cksum 0x8de5 (correct),
2498385728:2498385728(0) win 5840 <mss 1460,sackOK,timestamp 818347
0,nop,wscale 7>
18:18:58.545294 IP (tos 0x0, ttl 64, id 7288, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52474 >
192.168.111.177.3260: S, cksum 0x7eb8 (correct),
2508546834:2508546834(0) win 5840 <mss 1460,sackOK,timestamp 819050
0,nop,wscale 7>
18:18:58.635212 IP (tos 0x0, ttl 64, id 4732, offset 0, flags [DF],
proto: TCP (6), length: 100) 192.168.111.176.52472 >
192.168.111.177.3260: P 1:49(48) ack 1 win 46 <nop,nop,timestamp 819059
1710516>
18:19:01.544014 IP (tos 0x0, ttl 64, id 7289, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52474 >
192.168.111.177.3260: S, cksum 0x7d8c (correct),
2508546834:2508546834(0) win 5840 <mss 1460,sackOK,timestamp 819350
0,nop,wscale 7>
18:19:07.541550 IP (tos 0x0, ttl 64, id 7290, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52474 >
192.168.111.177.3260: S, cksum 0x7b34 (correct),
2508546834:2508546834(0) win 5840 <mss 1460,sackOK,timestamp 819950
0,nop,wscale 7>
18:19:14.558726 IP (tos 0x0, ttl 64, id 17378, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52475 >
192.168.111.177.3260: S, cksum 0x1406 (correct),
2528691790:2528691790(0) win 5840 <mss 1460,sackOK,timestamp 820652
0,nop,wscale 7>
18:19:17.557439 IP (tos 0x0, ttl 64, id 17379, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52475 >
192.168.111.177.3260: S, cksum 0x12da (correct),
2528691790:2528691790(0) win 5840 <mss 1460,sackOK,timestamp 820952
0,nop,wscale 7>
18:19:23.554974 IP (tos 0x0, ttl 64, id 17380, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52475 >
192.168.111.177.3260: S, cksum 0x1082 (correct),
2528691790:2528691790(0) win 5840 <mss 1460,sackOK,timestamp 821552
0,nop,wscale 7>
18:19:30.572154 IP (tos 0x0, ttl 64, id 24694, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52476 >
192.168.111.177.3260: S, cksum 0xe765 (correct),
2546723736:2546723736(0) win 5840 <mss 1460,sackOK,timestamp 822254
0,nop,wscale 7>
18:19:33.570857 IP (tos 0x0, ttl 64, id 24695, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52476 >
192.168.111.177.3260: S, cksum 0xe639 (correct),
2546723736:2546723736(0) win 5840 <mss 1460,sackOK,timestamp 822554
0,nop,wscale 7>
18:19:39.299354 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto:
TCP (6), length: 60) 192.168.111.177.3260 > 192.168.111.176.52472: S,
cksum 0xfdf4 (correct), 2952957677:2952957677(0) ack 2462022456 win 5792
<mss 1460,sackOK,timestamp 1715316 819059,nop,wscale 5>
18:19:39.299369 IP (tos 0x0, ttl 64, id 4733, offset 0, flags [DF],
proto: TCP (6), length: 64) 192.168.111.176.52472 >
192.168.111.177.3260: ., cksum 0x7e27 (correct), 302:302(0) ack 1 win 46
<nop,nop,timestamp 823127 1715316,nop,nop,sack 1 {0:1}>
18:19:39.568394 IP (tos 0x0, ttl 64, id 24696, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52476 >
192.168.111.177.3260: S, cksum 0xe3e1 (correct),
2546723736:2546723736(0) win 5840 <mss 1460,sackOK,timestamp 823154
0,nop,wscale 7>
18:19:46.585571 IP (tos 0x0, ttl 64, id 12283, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52477 >
192.168.111.177.3260: S, cksum 0x0df1 (correct),
2563423691:2563423691(0) win 5840 <mss 1460,sackOK,timestamp 823856
0,nop,wscale 7>
18:19:49.584279 IP (tos 0x0, ttl 64, id 12284, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52477 >
192.168.111.177.3260: S, cksum 0x0cc5 (correct),
2563423691:2563423691(0) win 5840 <mss 1460,sackOK,timestamp 824156
0,nop,wscale 7>
18:19:55.581815 IP (tos 0x0, ttl 64, id 12285, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52477 >
192.168.111.177.3260: S, cksum 0x0a6d (correct),
2563423691:2563423691(0) win 5840 <mss 1460,sackOK,timestamp 824756
0,nop,wscale 7>
Target:
superthecus:/etc/init.d# tcpdump -i bond0 host 192.168.111.176 -vv
tcpdump: listening on bond0, link-type EN10MB (Ethernet), capture size
96 bytes
18:18:42.511335 IP (tos 0x0, ttl 64, id 11879, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52473 >
192.168.111.177.3260: S, cksum 0x9169 (correct),
2498385728:2498385728(0) win 5840 <mss 1460,sackOK,timestamp 817447
0,nop,wscale 7>
18:18:45.501506 IP (tos 0x0, ttl 64, id 11880, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52473 >
192.168.111.177.3260: S, cksum 0x903d (correct),
2498385728:2498385728(0) win 5840 <mss 1460,sackOK,timestamp 817747
0,nop,wscale 7>
18:18:51.280102 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto:
TCP (6), length: 60) 192.168.111.177.3260 > 192.168.111.176.52472: S,
cksum 0x1b35 (correct), 2952957677:2952957677(0) ack 2462022456 win 5792
<mss 1460,sackOK,timestamp 1710516 816371,nop,wscale 5>
18:18:51.280298 IP (tos 0x0, ttl 64, id 4731, offset 0, flags [DF],
proto: TCP (6), length: 64) 192.168.111.176.52472 >
192.168.111.177.3260: ., cksum 0xa3a9 (correct), 302:302(0) ack 1 win 46
<nop,nop,timestamp 818325 1710516,nop,nop,sack 1 {0:1}>
18:18:51.499037 IP (tos 0x0, ttl 64, id 11881, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52473 >
192.168.111.177.3260: S, cksum 0x8de5 (correct),
2498385728:2498385728(0) win 5840 <mss 1460,sackOK,timestamp 818347
0,nop,wscale 7>
18:18:58.526171 IP (tos 0x0, ttl 64, id 7288, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52474 >
192.168.111.177.3260: S, cksum 0x7eb8 (correct),
2508546834:2508546834(0) win 5840 <mss 1460,sackOK,timestamp 819050
0,nop,wscale 7>
18:18:58.616123 IP (tos 0x0, ttl 64, id 4732, offset 0, flags [DF],
proto: TCP (6), length: 100) 192.168.111.176.52472 >
192.168.111.177.3260: P 1:49(48) ack 1 win 46 <nop,nop,timestamp 819059
1710516>
18:19:01.524912 IP (tos 0x0, ttl 64, id 7289, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52474 >
192.168.111.177.3260: S, cksum 0x7d8c (correct),
2508546834:2508546834(0) win 5840 <mss 1460,sackOK,timestamp 819350
0,nop,wscale 7>
18:19:07.522445 IP (tos 0x0, ttl 64, id 7290, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52474 >
192.168.111.177.3260: S, cksum 0x7b34 (correct),
2508546834:2508546834(0) win 5840 <mss 1460,sackOK,timestamp 819950
0,nop,wscale 7>
18:19:14.539608 IP (tos 0x0, ttl 64, id 17378, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52475 >
192.168.111.177.3260: S, cksum 0x1406 (correct),
2528691790:2528691790(0) win 5840 <mss 1460,sackOK,timestamp 820652
0,nop,wscale 7>
18:19:17.538350 IP (tos 0x0, ttl 64, id 17379, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52475 >
192.168.111.177.3260: S, cksum 0x12da (correct),
2528691790:2528691790(0) win 5840 <mss 1460,sackOK,timestamp 820952
0,nop,wscale 7>
18:19:23.535881 IP (tos 0x0, ttl 64, id 17380, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52475 >
192.168.111.177.3260: S, cksum 0x1082 (correct),
2528691790:2528691790(0) win 5840 <mss 1460,sackOK,timestamp 821552
0,nop,wscale 7>
18:19:30.553043 IP (tos 0x0, ttl 64, id 24694, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52476 >
192.168.111.177.3260: S, cksum 0xe765 (correct),
2546723736:2546723736(0) win 5840 <mss 1460,sackOK,timestamp 822254
0,nop,wscale 7>
18:19:33.551734 IP (tos 0x0, ttl 64, id 24695, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52476 >
192.168.111.177.3260: S, cksum 0xe639 (correct),
2546723736:2546723736(0) win 5840 <mss 1460,sackOK,timestamp 822554
0,nop,wscale 7>
18:19:39.280094 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto:
TCP (6), length: 60) 192.168.111.177.3260 > 192.168.111.176.52472: S,
cksum 0xfdf4 (correct), 2952957677:2952957677(0) ack 2462022456 win 5792
<mss 1460,sackOK,timestamp 1715316 819059,nop,wscale 5>
18:19:39.280256 IP (tos 0x0, ttl 64, id 4733, offset 0, flags [DF],
proto: TCP (6), length: 64) 192.168.111.176.52472 >
192.168.111.177.3260: ., cksum 0x7e27 (correct), 302:302(0) ack 1 win 46
<nop,nop,timestamp 823127 1715316,nop,nop,sack 1 {0:1}>
18:19:39.549263 IP (tos 0x0, ttl 64, id 24696, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52476 >
192.168.111.177.3260: S, cksum 0xe3e1 (correct),
2546723736:2546723736(0) win 5840 <mss 1460,sackOK,timestamp 823154
0,nop,wscale 7>
18:19:44.280089 arp who-has 192.168.111.176 tell 192.168.111.177
18:19:44.280270 arp reply 192.168.111.176 is-at 00:11:25:aa:2d:b8 (oui
Unknown)
18:19:46.566424 IP (tos 0x0, ttl 64, id 12283, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52477 >
192.168.111.177.3260: S, cksum 0x0df1 (correct),
2563423691:2563423691(0) win 5840 <mss 1460,sackOK,timestamp 823856
0,nop,wscale 7>
18:19:49.565164 IP (tos 0x0, ttl 64, id 12284, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52477 >
192.168.111.177.3260: S, cksum 0x0cc5 (correct),
2563423691:2563423691(0) win 5840 <mss 1460,sackOK,timestamp 824156
0,nop,wscale 7>
18:19:55.562694 IP (tos 0x0, ttl 64, id 12285, offset 0, flags [DF],
proto: TCP (6), length: 60) 192.168.111.176.52477 >
192.168.111.177.3260: S, cksum 0x0a6d (correct),
2563423691:2563423691(0) win 5840 <mss 1460,sackOK,timestamp 824756
0,nop,wscale 7>
(...)
>> recv(3,
>
> You might have multiple iscsid or iscsiadm versions installed in your
> system. Run "whereis iscsid" and "whereis iscsiadm". If you have more
> than one than remove them and reinstall freshly.
Could it be that open-iscsi doesn't know what to do and freezes if it is
disconnected from the target rather abruptly (like, target killed with
-9 etc.)?
Another way to reproduce - compile and run the current SVN snapshot of
IET (102).
There, you're not able to connect at all.
sorry, my box is not haunted at all...
What target software are you using? tgtd isn't part of IET.
> 2. Start the (open-iscsi) initiator on the other side, say 2 hosts,
> connect nodes
> 3. Kill tgtd
> 4. Start IET, with one target
> 5. On each of initiators, run:
You seem to have two iscsi target versions running.
1) IET (iscsi_trgt.ko ietd ietadm)
2) TGT (tgtd etc)
You should only have 1 version going on the server...
That would explain a lot. Pick one: if initiators log on with one,
they will NOT be able to switch over to the other (on the same
host at least, maybe with MPIO to another host...).
> _______________________________________________
> Iscsitarget-devel mailing list
> Iscsitar...@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/iscsitarget-devel
>
I tried to reproduce the issue once again:
- stopped all initiators
- started 1, 2, 3... 10 initiators, restarting IET... still nothing
With normal operation, that particular target machine offers more than
40 targets.
5 initiator machines connect to it, each then connecting to those 40
targets. That makes more than 200 connections to the target.
It's only then that I can easily reproduce the issue - just restart IET
or IET machine, and expect trouble - the machines are not able to
connect to the target anymore (you can restart IET, but it won't help;
start tgtd - no more trouble).
Does it ring a bell?
If not, this x86 target machine has 256 MB RAM (about 70 MB used) - maybe
that has something to do with it?
I remember in my early days with iSCSI I reported instability problems
on an ARM machine with 256 MB RAM (or was it 128?) - the answer from
this list was that normally, you guys have like 1 GB on target machines.
I had to fix some /proc values on the ARM target - no more problems.
Maybe the issue is similar this time, too (too little RAM - 256 MB, too
many connections - more than 200).
I assume "nothing" here means nothing went wrong?
>
> With normal operation, that particular target machine offers more than
> 40 targets.
> 5 initiator machines connected to it, each then connecting to that 40
> initiators. That makes more than 200 connections to the target.
ok, sounds complex.
>
> It's only then when I can easily reproduce the issue - just restart IET
> or IET machine, and expect trouble - the machines are not able to
> connect to the target anymore (you can restart IET, but it won't help;
> start tgtd - no more trouble).
After unloading, can you check that the daemon is killed and the module is
unloaded? And after loading, that the daemon is actually running and the
module is loaded? Or can you run ietd -f to keep it in the foreground and
see whether it dumps core or exits?
>
> Does it ring a bell?
>
> If no, this x86 target machine has 256 MB RAM (about 70 MB used) - maybe
> that has something to do?
> I remember in my early days with iSCSI I reported instability problems
> on an ARM machine with 256 MB RAM (or was it 128?) - the answer from
> this list that normally, you guys have like 1 GB on target machines.
> I had to fix some /proc values on ARM target - no more problems.
>
> Maybe the issue is similar this time, too (too few RAM - 256 MB, too
> many targets - more than 200).
Could you configure only 1, 5, 10, 20, then 40 targets on the same box, one
step at a time, and see if it reproduces? That way we can get some clues.
Also, one possibly related issue is with unloading IET: if you have too many
targets and too many initiators, and another initiator begins to reconnect
while you are in the middle of unloading, it will fall into a deadlock-like
situation and lead to unexpected behavior.
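For the 1, 5, 10, 20, 40 target runs, the test configurations could be generated rather than edited by hand. This is only a sketch under assumptions: it emits Target/Lun stanzas in the ietd.conf syntax as I understand it, and the IQN prefix and backing-device paths are hypothetical placeholders that would have to be replaced with real ones.

```python
def ietd_conf(n_targets,
              iqn_prefix="iqn.2007-05.net.example:test",
              path_template="/dev/vg0/test{index}"):
    """Render n_targets Target/Lun stanzas in ietd.conf style.

    iqn_prefix and path_template are placeholders, not values from this thread.
    """
    stanzas = []
    for i in range(n_targets):
        stanzas.append(
            f"Target {iqn_prefix}{i}\n"
            f"\tLun 0 Path={path_template.format(index=i)},Type=fileio\n"
        )
    return "\n".join(stanzas)


# Target counts suggested above.
STEP_SIZES = (1, 5, 10, 20, 40)
```

Writing ietd_conf(n) to a file for each n in STEP_SIZES and restarting IET between runs would show at which target count the reconnect problem starts to appear.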
Yes, "nothing" == "everything working fine", nothing unexpected happened.
>> With normal operation, that particular target machine offers more than
>> 40 targets.
>> 5 initiator machines connected to it, each then connecting to that 40
>> initiators. That makes more than 200 connections to the target.
>
> ok, sounds complex.
>
>> It's only then when I can easily reproduce the issue - just restart IET
>> or IET machine, and expect trouble - the machines are not able to
>> connect to the target anymore (you can restart IET, but it won't help;
>> start tgtd - no more trouble).
>
> after unload, can u check if the daemon is killed and also the module is
> unloaded.
Yes - the daemon is not running, and the module is not loaded.
Sometimes, the daemon is stopped, but the script says "module in use"
and doesn't unload it. When I do "lsmod", I see the module is not in
use, and I can do "rmmod" manually.
netstat -tpna shows there are no established connections to port 3260,
and nothing listens on port 3260 (the default port I'm running IET on).
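The same check can be made without netstat by reading /proc/net/tcp on the target. A minimal sketch follows: it relies on the kernel's hexadecimal state codes (01 for ESTABLISHED, 0A for LISTEN) and hex-encoded ports (3260 is 0CBC); the function names are hypothetical, and IPv6 sockets live in /proc/net/tcp6, which is not covered here.

```python
LISTEN, ESTABLISHED = "0A", "01"  # TCP state codes used in /proc/net/tcp


def port_states(proc_net_tcp_text, port=3260):
    """Return the state codes of all sockets whose local port equals port."""
    states = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is "HEXADDR:HEXPORT", fields[3] is the state code.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            states.append(fields[3])
    return states


def is_idle(proc_net_tcp_text, port=3260):
    """True if nothing listens on port and no connection to it is established."""
    states = port_states(proc_net_tcp_text, port)
    return LISTEN not in states and ESTABLISHED not in states
```

On the target, is_idle(open("/proc/net/tcp").read()) should be True after IET has been stopped; other states left on the port (for example half-open connections) would still show up in port_states().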
Also - I can restart the machine running IET, and after that, some/all
initiators are not able to connect anymore.
So my reasoning is that something breaks in IET when 200 initiators want
to reconnect almost at the same time?
> after u loaded, if the daemon is actually running and the
> module is loaded. or can u run ietd -f, make it run foreground and see
> if it core dump or exit?
I tried that - running with -f, -d 9 etc. - it gave no scary messages.
And it didn't crash, coredump, or exit. The process runs, but does
something unexpected.
>> Does it ring a bell?
>>
>> If no, this x86 target machine has 256 MB RAM (about 70 MB used) - maybe
>> that has something to do?
>> I remember in my early days with iSCSI I reported instability problems
>> on an ARM machine with 256 MB RAM (or was it 128?) - the answer from
>> this list that normally, you guys have like 1 GB on target machines.
>> I had to fix some /proc values on ARM target - no more problems.
>>
>> Maybe the issue is similar this time, too (too few RAM - 256 MB, too
>> many targets - more than 200).
>
> could you in same box only configure 1, 5, 10, 20, 40 target each time
> and see if reproduce? so we can grab some clue.
It may take some time, as it's a production machine. But yes, it's a
good idea, I'll do it.
> also one might related issue is when unload iet, if you have too many
> targets and too many ini, when u unload in the middle and another ini
> begin to reconnect, it will fall into a dead lock like situation and
> lead to unexpected situation.
By "it", do you mean the target or the initiator?
Read below - the initiator only has problems connecting to that
particular target:
I rebooted the IET machine, and still some/all initiators were not able
to reconnect.
A command like:
iscsiadm -m discovery -t sendtargets -p <IET_THAT_BREAKS>
started on an initiator takes a very long time to finish, with output like
in the email subject ("Login I/O error, failed to receive a PDU").
When we connect from the same initiator machine to another IET (running
just a couple of targets), everything works fine, we receive a list of
available targets.
Or, we stop that problematic IET, and start tgtd instead (with exactly
the same list of targets) - discovery/sendtargets works then.