I use open-iscsi-2.0-754 with iscsitarget-0.4.14 on openSuSE 10.2
(kernel 2.6.18) and experience the following:
Initiator can connect just fine, basic operation seems to work. I can
create and change files, etc.
Now, for basic performance analysis, I mount the disc on the initiator
and run this command a couple of times:
dd if=/dev/zero of=largefile bs=1024k count=2000
After a few runs, the session stalls. I can still ping the
initiator, but cannot ssh into it. Not even a tty is directly accessible
(just a black screen, no keypress is recognized). After resetting the
machine, there are no messages in syslog indicating that something went
wrong, and a tcpdump running while I reproduced this behaviour did not
show anything suspicious.
This is reproducible. What can I do about it?
Regards
Dominik
initiator:
cat iscsid.conf
node.startup = automatic
node.session.auth.username = jim
node.session.auth.password = othersecret
discovery.sendtargets.auth.username = joe
discovery.sendtargets.auth.password = secret
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 10
node.conn[0].timeo.noop_out_timeout = 15
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 65536
node.session.iscsi.MaxBurstLength = 262144
node.conn[0].iscsi.MaxRecvDataSegmentLength = 65536
target:
cat ietd.conf
IncomingUser joe secret
OutgoingUser jack secret2
Target iqn.2007-04.net.in-telegence:ACD-xen02.disk
IncomingUser jim othersecret
OutgoingUser james yetanothersecret
Lun 0 Path=/dev/sdc,Type=fileio
Alias Test
HeaderDigest None
DataDigest None
MaxConnections 1
InitialR2T Yes
ImmediateData No
MaxRecvDataSegmentLength 8192
MaxXmitDataSegmentLength 8192
MaxBurstLength 262144
FirstBurstLength 65536
DefaultTime2Wait 2
DefaultTime2Retain 20
MaxOutstandingR2T 8
DataPDUInOrder Yes
DataSequenceInOrder Yes
ErrorRecoveryLevel 0
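The two configurations above do not ask for the same values: the initiator requests InitialR2T=No and ImmediateData=Yes, while the target is set to InitialR2T Yes and ImmediateData No. That is not an error by itself, because these keys are negotiated at login. The following is a minimal sketch of the result functions that RFC 3720 defines for these keys, applied to the posted values. The function names are made up for illustration, and the sketch says nothing about what the two implementations actually agreed on.

```shell
#!/bin/sh
# Result functions for a few iSCSI login keys, as defined in RFC 3720.
# The helper names are hypothetical; the values are the ones posted above.

# InitialR2T: the result is Yes if either side offers Yes (boolean OR).
negotiate_or() {
    if [ "$1" = "Yes" ] || [ "$2" = "Yes" ]; then echo Yes; else echo No; fi
}

# ImmediateData: the result is Yes only if both sides offer Yes (boolean AND).
negotiate_and() {
    if [ "$1" = "Yes" ] && [ "$2" = "Yes" ]; then echo Yes; else echo No; fi
}

# FirstBurstLength and MaxBurstLength: the result is the smaller value.
negotiate_min() {
    if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Arguments: initiator value first, target value second.
echo "InitialR2T       = $(negotiate_or No Yes)"
echo "ImmediateData    = $(negotiate_and Yes No)"
echo "FirstBurstLength = $(negotiate_min 65536 65536)"
echo "MaxBurstLength   = $(negotiate_min 262144 262144)"
```

MaxRecvDataSegmentLength is left out on purpose: it is declared separately by each side rather than negotiated to a common value.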
Since you are using a fairly new kernel, you might want to enable
kdump, capture a dump, and send it to someone who is willing to help you
with this. Otherwise you will have to wait until somebody can reproduce
this and fix it for you.
So which kernel version is recommended?
Regards
Dominik
I meant that this kernel is new enough to support kdump, which can give
developers some clues since you have no console output.
Or you can do something simpler if you have a serial cable: enable the
serial console and post the console output here.
As for the kernel version: I do not use openSUSE, so I have no idea.
Another common problem is mixing up the in-kernel modules with the
out-of-tree modules from open-iscsi 2.x. Be sure you load the
out-of-tree modules. I always run make, check with lsmod that none of
them are loaded, and then insmod ./../foo.ko to make sure.
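The check described above can be sketched as a small filter over lsmod output. This is only an illustration: the function name is made up, and it assumes the modules in question have "iscsi" in their name (as iscsi_tcp, libiscsi and scsi_transport_iscsi do).

```shell
#!/bin/sh
# Hypothetical helper: read `lsmod` output on stdin and print the names
# of all loaded modules whose name contains "iscsi".
# The first line of lsmod output is a header, so it is skipped.
iscsi_modules() {
    awk 'NR > 1 && $1 ~ /iscsi/ { print $1 }'
}

# Typical use (not run here):
#   lsmod | iscsi_modules      # should print nothing before the insmod
#   modinfo -n iscsi_tcp       # shows which .ko file modprobe would load
```

If the filter prints anything before you insmod the freshly built modules, an older copy is already loaded and should be removed with rmmod first.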
You are not running the target and initiator on the same box, are you? If
so, what NIC are you using?
Recompile open-iscsi with
make clean
make DEBUG_SCSI=1 DEBUG_TCP=1
We might get lucky and see something in the trace. I would also try to
do some of the things Ming listed.
No, I am not running initiator and target on one box.
In any case, I use Intel Primergy RX300 and Dell 82541GI/PI Gigabit
Ethernet cards.
I did this, but it does not seem to produce any additional output.
I checked the ttys and all files in /var/log.
Where is the debug output supposed to be displayed?
I did some more testing with some more kernels:
2.6.18.2-34-default (shipped with openSuSE) has its own iscsi modules.
They seem to work fine. I can run the command mentioned above (dd ...)
more than 15 times without error, whereas my other kernels fail after
about 5 runs at most.
As using iscsi with xen is my actual goal, I also tested this on:
2.6.16.33-xen (xen 3.0.4 src)
2.6.16.38-xen (xen 3.0.4-testing)
2.6.18-xen (xen unstable)
They all produce the mentioned problems (open-iscsi runs in dom0).
So if anybody has an idea to spare, or a comment on how best to use
open-iscsi with xen, feel free to share.
Regards
Dominik
Okay, it turns out that both
make DEBUG_SCSI=1 DEBUG_TCP=1
and
make DEBUG_SCSI=1 DEBUG_TCP=1 install (not just make install)
were necessary.
I attached the output in a compressed file.
Here's what I did:
linux:~ # uname -a
Linux ACD-xen01 2.6.16.33-xen #1 SMP Tue Apr 3 10:29:14 CEST 2007 i686 i686 i386 GNU/Linux
linux:/mnt # df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda3   9.9G  7.4G  2.0G   80%   /
udev        257M  204K  256M   1%    /dev
/dev/sda4   22G   6.9G  14G    35%   /home
/dev/sdb1   20G   173M  19G    1%    /mnt
linux:/mnt # ls -la
total 24
drwxr-xr-x 3 root root 4096 Apr 3 11:36 ./
drwxr-xr-x 22 root root 4096 Apr 3 10:44 ../
drwx------ 2 root root 16384 Mar 30 16:27 lost+found/
linux:/mnt # logger dk teststart
linux:/mnt # while :
> do
> dd if=/dev/zero of=largefile bs=1024k count=2000
> sleep 5
> logger dk loop done
> done
The end of the log files is when the machine actually "stalled". It was
not reachable via ssh, all ttys are dead, no keypress recognized. But
still (as I wrote earlier), it was pingable.
Regards
Dominik
The setup runs just fine until I set the xen dom0 to only use one of the
four CPUs in my machine (actually 2 HT CPUs).
So with
(dom0-cpus 0)
in /etc/xen/xend-config.sxp
this works. The while-loop actually ran fine for 2 days straight.
With
(dom0-cpus 1)
it crashes as described within a few minutes.
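For readers not familiar with the setting: in xend-config.sxp, a dom0-cpus value of 0 means that dom0 takes all available CPUs, while any other number limits dom0 to that many CPUs. The following is a minimal sketch for checking which value is active; both function names are hypothetical, and it assumes the entry is written on one line as shown above.

```shell
#!/bin/sh
# Hypothetical helper: read xend-config.sxp on stdin and print the value
# of every active (not commented out) dom0-cpus entry.
dom0_cpus() {
    sed -n 's/^[[:space:]]*(dom0-cpus[[:space:]][[:space:]]*\([0-9][0-9]*\)).*/\1/p'
}

# Hypothetical helper: turn the value into a human-readable description.
describe_dom0_cpus() {
    case "$1" in
        '') echo "no dom0-cpus entry found" ;;
        0)  echo "dom0 uses all physical CPUs" ;;
        *)  echo "dom0 is limited to $1 CPU(s)" ;;
    esac
}

# Typical use (not run here):
#   describe_dom0_cpus "$(dom0_cpus < /etc/xen/xend-config.sxp)"
```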
I will cc this to the xen-list.
Full thread here:
http://groups.google.com/group/open-iscsi/browse_thread/thread/495b17fa2ab52e3c/8acf2cc82a384646?lnk=gst&q=crashes&rnum=8
I'll be happy to supply additional information when needed.
Regards
Dominik
Dominik Klein wrote:
Shoot, maybe this is a locking bug in the iscsi code. What version of
xen are you running? I will try to set it up here and recreate the problem.
Thanks for the debugging.
I am running xen 3.0.4 with kernel 2.6.16.33
Regards
Dominik
Just so you know this is next on my list.
I had to adjust paths in the patch file. As I use 2.6.16, I changed the
path to the files to open-iscsi/kernel/2.6.16-2.6.19/<filename>
I tried to apply the patch, but it did not work. Here's what I did and got:
# tar xzf open-iscsi-2.0-754.tar.gz
# mv open-iscsi-2.0-754 open-iscsi
# patch -p0 < fix-skb-pad.patch
patching file open-iscsi/kernel/2.6.16-2.6.19/iscsi_tcp.c
Hunk #1 succeeded at 893 (offset -2 lines).
Hunk #2 succeeded at 938 (offset -2 lines).
Hunk #3 FAILED at 949.
1 out of 3 hunks FAILED -- saving rejects to file
open-iscsi/kernel/2.6.16-2.6.19/iscsi_tcp.c.rej
patching file open-iscsi/kernel/2.6.16-2.6.19/iscsi_tcp.h
Regards
Dominik
I was able to apply the patch after all; I think you meant for it to be
applied to the SVN code. I'll try that now and report later.
Regards
Dominik
So I got the latest SVN code, applied the patch, compiled and installed
it. After rebooting the machine with one CPU in dom0, the test described
earlier in this thread led to the same result: the machine hangs after
a couple of minutes.
It still works fine with all CPUs available in dom0, but that's not what
a lot of xen users want.
So your earlier patch does not seem to have fixed this issue.
Regards
Dominik
I installed the open-iscsi SVN code (from yesterday morning, April 23rd)
and applied the patch you suggested.
I rebooted with one CPU in dom0. Here's what it does:
# mount /dev/sdd /mnt
# cd /mnt/tmp
# while :; do dd if=/dev/zero of=largefile bs=1024k count=2000 && logger dk one done || break; done
...
<dd runs fine a couple of times>
...
dd: writing `largefile': Read-only file system
/var/log/messages:
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd): ext3_free_blocks_sb: bit already cleared for block 513133
Apr 23 16:21:53 ACD-xen01 kernel: Aborting journal on device sdd.
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd): ext3_free_blocks_sb: bit already cleared for block 513134
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd): ext3_free_blocks_sb: bit already cleared for block 513135
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in ext3_free_blocks_sb: Journal has aborted
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in ext3_free_blocks_sb: Journal has aborted
<message repeated about a hundred times>
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in ext3_reserve_inode_write: Journal has aborted
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in ext3_reserve_inode_write: Journal has aborted
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in ext3_orphan_del: Journal has aborted
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in ext3_truncate: Journal has aborted
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: ext3_abort called.
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd): ext3_journal_start_sb: Detected aborted journal
Apr 23 16:21:53 ACD-xen01 kernel: Remounting filesystem read-only
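Since the same message repeats many times, it can help to condense the log before posting it. The following is a minimal sketch that counts EXT3-fs errors per kernel function; the function name is made up, and it assumes each message sits on a single line, as it does in /var/log/messages.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and print
# "<count> <function>" for every function that reported an EXT3-fs error.
# Two message shapes are handled:
#   EXT3-fs error (device sdd): <function>: <text>
#   EXT3-fs error (device sdd) in <function>: <text>
ext3_error_summary() {
    sed -n \
        -e 's/.*EXT3-fs error (device [^)]*): \([A-Za-z0-9_]*\):.*/\1/p' \
        -e 's/.*EXT3-fs error (device [^)]*) in \([A-Za-z0-9_]*\):.*/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}

# Typical use (not run here):
#   ext3_error_summary < /var/log/messages
```

The output is sorted by function name, not by count, which keeps it stable between runs.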
It does not crash the entire machine, but it is reproducible. I repeated
this three times in a row with the same result.
This also happens with all physical CPUs available in dom0.
On the target side (iscsitarget-0.4.14), I see nothing in the logs.
If you need any more info, I'll be pleased to help.
Regards
Dominik
With xen 3.0.5rc2, this also happens with open-iscsi 2.0-754.
It makes no difference whether dom0 has one CPU or all of them.
Dominik Klein wrote:
Do you see those errors with a freshly formatted filesystem? Did you see
any iscsi errors? Something like a connection error?
One other question I forgot: are the problems you are having occurring
when you have the iscsi initiator running in domU or dom0?
I just set up a new ext3 partition and re-tested.
The current setup uses xen 3.0.5rc2, kernel 2.6.18, open-iscsi-2.0-754 (no
svn, no patch).
With one *AND* with all physical CPUs, the test ran fine for about
2 hours each. I will run the test for a while longer and let you know how
this works out.
Maybe this is not an open-iscsi issue, but xen-related?
I cannot say for sure whether the filesystem was clean before I first ran
the tests with xen 3.0.5, so the problems may have been due to that.
> Did you see
> any iscsi errors? Something like a connection error?
No.
> One other question I forgot: are the problems you are having occurring
> when you have the iscsi initiator running in domU or dom0?
The initiator has always run in dom0.