Initiator crashes

Dominik Klein

Apr 2, 2007, 7:05:28 AM

to open-iscsi

Hello

I use open-iscsi-2.0-754 with iscsitarget-0.4.14 on openSuSE 10.2
(kernel 2.6.18) and experience the following:

Initiator can connect just fine, basic operation seems to work. I can
create and change files, etc.

Now, for basic performance analysis, I mount the disk on the initiator
and run this command a couple of times:
dd if=/dev/zero of=largefile bs=1024k count=2000

After a few runs, the session stalls. I can still ping the initiator,
but I cannot ssh into it. Even the local ttys are inaccessible (just a
black screen, no keypress is recognized). After resetting the machine,
syslog contains no messages indicating that anything went wrong, and a
tcpdump that was running while I reproduced the behaviour showed nothing
suspicious.

This is reproducible. What can I do about it?

Regards
Dominik

initiator:
cat iscsid.conf
node.startup = automatic
node.session.auth.username = jim
node.session.auth.password = othersecret
discovery.sendtargets.auth.username = joe
discovery.sendtargets.auth.password = secret
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 10
node.conn[0].timeo.noop_out_timeout = 15
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 65536
node.session.iscsi.MaxBurstLength = 262144
node.conn[0].iscsi.MaxRecvDataSegmentLength = 65536

target:
cat ietd.conf
IncomingUser joe secret
OutgoingUser jack secret2
Target iqn.2007-04.net.in-telegence:ACD-xen02.disk
    IncomingUser jim othersecret
    OutgoingUser james yetanothersecret
    Lun 0 Path=/dev/sdc,Type=fileio
    Alias Test
    HeaderDigest None
    DataDigest None
    MaxConnections 1
    InitialR2T Yes
    ImmediateData No
    MaxRecvDataSegmentLength 8192
    MaxXmitDataSegmentLength 8192
    MaxBurstLength 262144
    FirstBurstLength 65536
    DefaultTime2Wait 2
    DefaultTime2Retain 20
    MaxOutstandingR2T 8
    DataPDUInOrder Yes
    DataSequenceInOrder Yes
    ErrorRecoveryLevel 0
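
A side note on the two files above: several of these keys are negotiated at login rather than taken from one side. Going by RFC 3720, InitialR2T resolves with OR, ImmediateData with AND, and FirstBurstLength/MaxBurstLength with the minimum of the two offers. A rough sketch of what this session should end up with (the helper names are made up for illustration and are not part of open-iscsi or iscsitarget):

```shell
#!/bin/sh
# Illustrative helpers only; names are hypothetical.
# Result functions as described in RFC 3720:
#   InitialR2T = OR, ImmediateData = AND, burst lengths = minimum.

negotiate_or() {   # Yes if either side offers Yes
    if [ "$1" = "Yes" ] || [ "$2" = "Yes" ]; then echo Yes; else echo No; fi
}

negotiate_and() {  # Yes only if both sides offer Yes
    if [ "$1" = "Yes" ] && [ "$2" = "Yes" ]; then echo Yes; else echo No; fi
}

negotiate_min() {  # numeric minimum of the two offers
    if [ "$1" -le "$2" ]; then echo "$1"; else echo "$2"; fi
}

# First argument: value from iscsid.conf, second: value from ietd.conf
echo "InitialR2T       = $(negotiate_or  No  Yes)"
echo "ImmediateData    = $(negotiate_and Yes No)"
echo "FirstBurstLength = $(negotiate_min 65536 65536)"
echo "MaxBurstLength   = $(negotiate_min 262144 262144)"
```

With the values above this reports InitialR2T as Yes and ImmediateData as No, so the target's settings win for those two keys. MaxRecvDataSegmentLength is declared separately by each side rather than negotiated, so 65536 on the initiator and 8192 on the target are not in conflict.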

Ming Zhang

Apr 2, 2007, 9:46:31 AM

to open-...@googlegroups.com

On Mon, 2007-04-02 at 13:05 +0200, Dominik Klein wrote:
> Hello
>
> I use open-iscsi-2.0-754 with iscsitarget-0.4.14 on openSuSE 10.2
> (kernel 2.6.18) and experience the following:

Since you are using a fairly new kernel, you might want to enable kdump,
capture a dump, and send it to someone who is willing to help you with
this. Otherwise you will have to wait until somebody can reproduce the
problem and fix it for you.

Dominik Klein

Apr 2, 2007, 10:28:27 AM

to open-...@googlegroups.com

>> I use open-iscsi-2.0-754 with iscsitarget-0.4.14 on openSuSE 10.2
>> (kernel 2.6.18) and experience the following:
>
> since you are using a pretty new kernel, you might want to enable
> kdump, get a dump, and send it to someone who would like to help you on
> this. otherwise you will have to wait till somebody can reproduce this
> and fix it for you.

So which kernel version is recommended?

Regards
Dominik

Ming Zhang

Apr 2, 2007, 10:41:01 AM

to open-...@googlegroups.com

I meant that this kernel is new enough to support kdump, which can give
the developers some clues since you have no console output.

Or, if you have a serial cable, you can do something simpler: enable a
serial console and post the console output here.

I do not use openSUSE, so I have no idea which kernel to recommend.

Another common problem is mixing up the in-kernel modules with the
out-of-tree modules from open-iscsi 2.x. Be sure you load the
out-of-tree modules. I always run make, check with lsmod that none of
them are loaded, and then insmod ./../foo.ko to be sure.
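
To see which iSCSI modules are loaded before doing the insmod, something along these lines can help. It is only a sketch: loaded_iscsi_modules is a made-up helper, and modinfo -n reports the file modprobe would pick for that name, which is not necessarily the file that was actually loaded with insmod.

```shell
#!/bin/sh
# Sketch: list loaded iSCSI-related modules and the .ko file that
# modprobe would resolve for each name. Helper name is hypothetical.

# Reads lsmod-style output on stdin (first line is the header) and
# prints the names of modules whose name contains "iscsi".
loaded_iscsi_modules() {
    awk 'NR > 1 && $1 ~ /iscsi/ { print $1 }'
}

if command -v lsmod >/dev/null 2>&1; then
    for m in $(lsmod | loaded_iscsi_modules); do
        # modinfo -n prints only the module's file name (path)
        printf '%s -> %s\n' "$m" "$(modinfo -n "$m" 2>/dev/null)"
    done
fi
```

If this prints anything before you insmod your freshly built modules, unload those modules first (rmmod) so that the in-kernel and out-of-tree versions do not get mixed.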


Mike Christie

Apr 2, 2007, 12:37:58 PM

to open-...@googlegroups.com

Dominik Klein wrote:
> Hello
>
> I use open-iscsi-2.0-754 with iscsitarget-0.4.14 on openSuSE 10.2
> (kernel 2.6.18) and experience the following:

You are not running the target and initiator on the same box are you? If
so what NIC are you using?

>
> Initiator can connect just fine, basic operation seems to work. I can
> create and change files, etc.
>
> Now, for basic performance analysis, I mount the disc on the initiator
> and run this command a couple of times:
> dd if=/dev/zero of=largefile bs=1024k count=2000
>
> After a couple of times, the session is stalled. I can still ping the
> initiator, but not ssh into it. Not even a tty is directly accessible
> (just black screen, no keypress is recognized). After resetting the
> machine, there are no messages in syslog that something went wrong and a
> tcpdump running while I reproduced this behaviour did not show anything
> suspicious.
>
> This is reproducible. What can I do about it?

Recompile open-iscsi with

make clean
make DEBUG_SCSI=1 DEBUG_TCP=1

We might get lucky and see something in the trace. I would also try to
do some of the things Ming listed.

Dominik Klein

Apr 3, 2007, 3:57:33 AM

to open-...@googlegroups.com

>> I use open-iscsi-2.0-754 with iscsitarget-0.4.14 on openSuSE 10.2
>> (kernel 2.6.18) and experience the following:
>
> You are not running the target and initiator on the same box are you? If
> so what NIC are you using?

No, I am not running the initiator and target on one box.

But anyway, I use Intel Primergy RX300 and Dell 82541GI/PI Gigabit
Ethernet Cards.

> Recompile open-iscsi with
>
> make clean
> make DEBUG_SCSI=1 DEBUG_TCP=1
>
> We might get lucky and see something in the trace. I would also try to
> do some of the things Ming listed.

I did this, but it does not seem to produce any additional output.
I checked the ttys and all files in /var/log.
Where is the debug output supposed to appear?

I did some more testing with additional kernels:
2.6.18.2-34-default (shipped with openSuSE) has its own iSCSI modules,
and they seem to work fine. I can run the dd command mentioned above
more than 15 times without error, whereas my other kernels fail after
about 5 runs at most.

As using iscsi with xen is my actual goal, I also tested this on:
2.6.16.33-xen (xen 3.0.4 src)
2.6.16.38-xen (xen 3.0.4-testing)
2.6.18-xen (xen unstable)

They all show the problem described above (open-iscsi runs in dom0).

So if anybody has an idea to spare, or advice on how best to use
open-iscsi with Xen, feel free to share.

Regards
Dominik

Dominik Klein

Apr 3, 2007, 5:52:40 AM

to open-...@googlegroups.com

>> Recompile open-iscsi with
>>
>> make clean
>> make DEBUG_SCSI=1 DEBUG_TCP=1
>>
>> We might get lucky and see something in the trace. I would also try to
>> do some of the things Ming listed.
>
> I did this, but it does not seem to produce any additional output.
> I checked the ttys and all files in /var/log.
> Where is the debug output supposed to appear?

Okay, it turns out that both
make DEBUG_SCSI=1 DEBUG_TCP=1
and
make DEBUG_SCSI=1 DEBUG_TCP=1 install (not just make install)
were necessary.
I have attached the output as a compressed file.

Here's what I did:

linux:~ # uname -a
Linux ACD-xen01 2.6.16.33-xen #1 SMP Tue Apr 3 10:29:14 CEST 2007 i686
i686 i386 GNU/Linux
linux:/mnt # df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       9.9G  7.4G  2.0G  80% /
udev            257M  204K  256M   1% /dev
/dev/sda4        22G  6.9G   14G  35% /home
/dev/sdb1        20G  173M   19G   1% /mnt
linux:/mnt # ls -la
total 24
drwxr-xr-x  3 root root  4096 Apr  3 11:36 ./
drwxr-xr-x 22 root root  4096 Apr  3 10:44 ../
drwx------  2 root root 16384 Mar 30 16:27 lost+found/
linux:/mnt # logger dk teststart
linux:/mnt # while :
> do
> dd if=/dev/zero of=largefile bs=1024k count=2000
> sleep 5
> logger dk loop done
> done

The log files end at the point where the machine actually "stalled": it
was not reachable via ssh, all ttys were dead, and no keypress was
recognized. As I wrote earlier, it was still pingable.
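
The same loop as a small script, in case someone wants to rerun the test. This is just a sketch: run_write_test and its parameters are made up for convenience, and the iteration number is logged so that the last syslog entry shows how far the test got before the stall.

```shell
#!/bin/sh
# Sketch of the test loop as a reusable function. The function name and
# its arguments are hypothetical; dd and logger are used as in the loop above.
run_write_test() {
    target=$1          # file to write on the iSCSI-backed filesystem
    count_mb=$2        # number of 1 MiB blocks per run
    rounds=$3          # how many runs to do
    pause=${4:-5}      # seconds to sleep between runs (default 5)
    i=1
    while [ "$i" -le "$rounds" ]; do
        # Stop at the first write error instead of looping forever
        dd if=/dev/zero of="$target" bs=1024k count="$count_mb" 2>/dev/null || return 1
        sleep "$pause"
        # Log the run number; ignore failures if logger is unavailable
        logger "dk loop $i done" 2>/dev/null || true
        i=$((i + 1))
    done
}

# Example matching the loop above, limited to 100 runs:
# run_write_test /mnt/largefile 2000 100
```

Unlike the endless while loop, this stops after a fixed number of runs or at the first dd error, which makes it easier to compare kernels by how many runs they survive.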

Regards
Dominik

debug.tar.gz

Dominik Klein

Apr 5, 2007, 8:47:12 AM

to open-...@googlegroups.com, xen-...@lists.xensource.com

I think I have found the reason for this:

The setup runs just fine until I set the xen dom0 to only use one of the
four CPUs in my machine (actually 2 HT CPUs).

So with
(dom0-cpus 0)
in /etc/xen/xend-config.sxp
this works. The while-loop actually ran fine for 2 days straight.

With
(dom0-cpus 1)
it crashes as described within a few minutes.
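
Note that in xend-config.sxp, (dom0-cpus 0) means dom0 takes all available CPUs, while (dom0-cpus 1) limits it to a single one. A quick way to confirm what dom0 actually got after the reboot is sketched below (count_cpus is a made-up helper name):

```shell
#!/bin/sh
# Sketch: confirm how many CPUs this domain sees after changing dom0-cpus.
# count_cpus is a hypothetical helper; it reads /proc/cpuinfo-style text
# on stdin and counts the "processor" entries.
count_cpus() {
    grep -c '^processor'
}

if [ -r /proc/cpuinfo ]; then
    echo "CPUs visible in this domain: $(count_cpus < /proc/cpuinfo)"
fi

# On the Xen host, "xm vcpu-list Domain-0" shows the same from xend's side.
```

Running this in dom0 with each setting makes it clear which of the two cases (all CPUs or a single CPU) a given test run belongs to.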

I will cc this to the xen-list.
Full thread here:
http://groups.google.com/group/open-iscsi/browse_thread/thread/495b17fa2ab52e3c/8acf2cc82a384646?lnk=gst&q=crashes&rnum=8

I'll be happy to supply additional information when needed.

Regards
Dominik

Mike Christie

Apr 6, 2007, 2:35:50 PM

to open-...@googlegroups.com, xen-...@lists.xensource.com

Dominik Klein wrote:
> I think I have found the reason for this:
>
> The setup runs just fine until I set the xen dom0 to only use one of the
> four CPUs in my machine (actually 2 HT CPUs).
>
> So with
> (dom0-cpus 0)
> in /etc/xen/xend-config.sxp
> this works. The while-loop actually ran fine for 2 days straight.
>
> With
> (dom0-cpus 1)
> it crashes as described within a few minutes.
>

Shoot, maybe this is a locking bug in the iscsi code. What version of
xen are you running? I will try to set it up here and recreate the problem.

Thanks for the debugging.

Dominik Klein

Apr 7, 2007, 2:58:54 AM

to open-...@googlegroups.com

I am running xen 3.0.4 with kernel 2.6.16.33

Regards
Dominik

Mike Christie

Apr 20, 2007, 3:54:45 PM

to open-...@googlegroups.com

Just so you know, this is next on my list.

Mike Christie

Apr 20, 2007, 6:31:10 PM

to open-...@googlegroups.com

Just so we can make sure it is not the same lockup I just fixed, could
you try the attached patch with svn 779?

fix-skb-pad.patch

Dominik Klein

Apr 23, 2007, 2:39:18 AM

to open-...@googlegroups.com

Hi

I had to adjust paths in the patch file. As I use 2.6.16, I changed the
path to the files to open-iscsi/kernel/2.6.16-2.6.19/<filename>

I tried to apply the patch, but it did not work. Here's what I did and got:

# tar xzf open-iscsi-2.0-754.tar.gz
# mv open-iscsi-2.0-754 open-iscsi
# patch -p0 < fix-skb-pad.patch
patching file open-iscsi/kernel/2.6.16-2.6.19/iscsi_tcp.c
Hunk #1 succeeded at 893 (offset -2 lines).
Hunk #2 succeeded at 938 (offset -2 lines).
Hunk #3 FAILED at 949.
1 out of 3 hunks FAILED -- saving rejects to file
open-iscsi/kernel/2.6.16-2.6.19/iscsi_tcp.c.rej
patching file open-iscsi/kernel/2.6.16-2.6.19/iscsi_tcp.h

Regards
Dominik

Dominik Klein

Apr 23, 2007, 4:40:05 AM

to open-...@googlegroups.com

I was able to apply the patch, and I think you meant for it to be
applied to the SVN code. I'll try that now and report back later.

Regards
Dominik

Dominik Klein

Apr 23, 2007, 5:07:25 AM

to open-...@googlegroups.com

> I was able to apply the patch and think you meant to apply it on the SVN
> code. I'll try that now and report later.

So I got the latest SVN code, applied the patch, and compiled and
installed it. After rebooting the machine with one CPU in dom0, the test
described earlier in this thread led to the same result: the machine
hangs after a couple of minutes.

It still works fine with all CPUs available in dom0, but that's not what
a lot of Xen users want.

So it does not seem that your earlier patch fixed this issue.

Regards
Dominik

Dominik Klein

Apr 24, 2007, 2:14:01 AM

to open-...@googlegroups.com

So I gave Xen 3.0.5rc2 a try. This uses kernel 2.6.18.

I installed the open-iscsi SVN code (as of yesterday morning, Apr 23)
and applied the patch you suggested.

I rebooted with one CPU in dom0. Here's what it does:

# mount /dev/sdd /mnt
# cd /mnt/tmp
# while :; do dd if=/dev/zero of=largefile bs=1024k count=2000 && logger
dk one done || break; done
...
<dd runs fine a couple of times>
...
dd: writing 'largefile': Read-only file system

/var/log/messages:
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd):
ext3_free_blocks_sb: bit already cleared for block 513133
Apr 23 16:21:53 ACD-xen01 kernel: Aborting journal on device sdd.
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd):
ext3_free_blocks_sb: bit already cleared for block 513134
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd):
ext3_free_blocks_sb: bit already cleared for block 513135
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in
ext3_free_blocks_sb: Journal has aborted
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in
ext3_free_blocks_sb: Journal has aborted
< message is repeated like a hundred times>
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in
ext3_reserve_inode_write: Journal has aborted
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in
ext3_reserve_inode_write: Journal has aborted
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in
ext3_orphan_del: Journal has aborted
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd) in
ext3_truncate: Journal has aborted
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing
b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing
b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing
b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing
b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing
b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing
b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing
b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing
b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing
b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: __journal_remove_journal_head: freeing
b_committed_data
Apr 23 16:21:53 ACD-xen01 kernel: ext3_abort called.
Apr 23 16:21:53 ACD-xen01 kernel: EXT3-fs error (device sdd):
ext3_journal_start_sb: Detected aborted journal
Apr 23 16:21:53 ACD-xen01 kernel: Remounting filesystem read-only

This does not crash the entire machine, but it is reproducible. I ran it
3 times in a row with the same result.

This also happens with all physical CPUs available in dom0.

On the target side (iscsitarget-0.4.14), I see nothing in the logs.

If you need any more info, I'll be pleased to help.

Regards
Dominik

Dominik Klein

Apr 24, 2007, 3:41:03 AM

to open-...@googlegroups.com

I did a little more testing here.

With Xen 3.0.5rc2, this also happens with open-iscsi-2.0-754.
It makes no difference whether dom0 has 1 or all CPUs.

Mike Christie

Apr 24, 2007, 3:51:57 AM

to open-...@googlegroups.com

Do you see those errors with a freshly formatted filesystem? Did you see
any iSCSI errors? Something like a connection error?

One other question I forgot: do the problems you are having occur when
the iSCSI initiator is running in domU or in dom0?

Dominik Klein

Apr 24, 2007, 7:21:31 AM

to open-...@googlegroups.com

> Do you see those errors with a freshly formatted filesystem?

I just set up a new ext3 partition and re-tested.

Current setup uses xen 3.0.5rc2, Kernel 2.6.18, open-iscsi-2.0-754 (no
svn, no patch)

With one *AND* with all physical CPUs, the test ran fine for about
2 hours each. I will run the test for some more time and let you know
how it works out.

Maybe this is not an open-iscsi issue, but Xen-related?

I cannot say for certain whether the filesystem was clean before I first
ran the tests with Xen 3.0.5, so the problems may have been due to that.

> Did you see
> any iscsi errors? Something like a connection error?

No.

> One other question I forgot was are the problems you are having occurring
> when you have the iscsi initiator running in domU or dom0?

The initiator has always run in dom0.
