BUG when logging into Solaris build 60 target

jmacie...@gmail.com

unread,

Apr 1, 2007, 8:15:09 PM4/1/07

to open-iscsi

Back when I first reported problems using the open-iscsi initiator
with the Solaris build 45 target ("everything seems to be working
except block device" on 2006-08-24), Mike suggested rescanning the bus
to solve the block device problem.

The next issue I encountered was I/O errors. Mike concluded that
"Either the initiator or target is messing up the r2tsn." I posted my
problem to the OpenSolaris storage list and Rick McNeal responded that
he "failed to set the R2TSN field in the R2T PDU which is set from the
target to the initiator. Previous versions of the Linux initiator did
not flag this as an error, nor does the Solaris initiator." He said
the bug would be fixed in build 55 or 56.

I'm running SolarisExpress Community Release build 60 for the target
and open-iscsi-2.0-754 for the initiator now, and I'm hitting a BUG
immediately upon login. See http://people.msoe.edu/~maciejej/pages/images/open_iscsi_754_2.6.20.3-1_BUG.jpg
(sorry about the huge pic; I don't have a null modem cable and
networking dies too soon to get anything over SSH). My 2.6.20.3 kernel
has some patches for reiser4 (hence the -1 in the version), but
nothing that should affect open-iscsi.

Mike Christie

unread,

Apr 2, 2007, 12:42:30 PM4/2/07

to open-...@googlegroups.com

Ok, Could you send a ethereal trace and/or recompile open-iscsi with:

make DEBUG_SCSI=1 DEBUG_TCP=1

? We are getting some input from the target we did not expect. Are you
using data or header digests?

jmacie...@gmail.com

unread,

Apr 5, 2007, 3:51:09 AM4/5/07

to open-iscsi

On Apr 2, 11:42 am, Mike Christie <micha...@cs.wisc.edu> wrote:
>
> Ok, Could you send a ethereal trace and/or recompile open-iscsi with:
>
> make DEBUG_SCSI=1 DEBUG_TCP=1
>
> ? We are getting some input from the target we did not expect. Are you
> using data or header digests?

It doesn't look like I caught anything with the debug options. The
only relevant message is "scsi1 : iSCSI Initiator over TCP/IP", then I
get the same panic previously posted.

The dump is available at http://people.msoe.edu/~maciejej/iscsi/20070405/iscsi_login.tcpdump.gz.

I haven't changed anything in my configuration since my previous
attempts with build 45. I don't know if I'm using data or header
digests.

Mike Christie

unread,

Apr 5, 2007, 4:20:26 AM4/5/07

to open-...@googlegroups.com

jmacie...@gmail.com wrote:
> On Apr 2, 11:42 am, Mike Christie <micha...@cs.wisc.edu> wrote:
>> Ok, Could you send a ethereal trace and/or recompile open-iscsi with:
>>
>> make DEBUG_SCSI=1 DEBUG_TCP=1
>>
>> ? We are getting some input from the target we did not expect. Are you
>> using data or header digests?
>
> It doesn't look like I caught anything with the debug options. The
> only relevant message is "scsi1 : iSCSI Initiator over TCP/IP", then I
> get the same panic previously posted.
>

Did you do a make make DEBUG_SCSI=1 DEBUG_TCP=1 install as well as the
make? Those debug options have to give you more than the above message.

frankf

unread,

Apr 9, 2007, 4:35:32 AM4/9/07

to open-iscsi

Hi Mike,

I have got the same problem as the original poster. Using Solaris
Express Build [59|60|61] as target and open-iscsi leads to kernel
panic on the initiator machine. I have tested with several open-iscsi
versions (2.0-695, 2.0-754, 1.0-654, and svn 779) to no avail. If you
are intrested, I can give you access to the target and the initiator
server. Even the version used by RHEL5 kernel panics.

Here is the panic message on my test with RHEL5 Xen:

------------[ cut here ]------------
kernel BUG at drivers/scsi/iscsi_tcp.c:959!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /class/iscsi_connection/connection0:0/exp_statsn
Modules linked in: bridge netloop netbk blktap blkbk autofs4 hidp
rfcomm l2cap bluetooth sunrpc iscsi_tcp libiscsi scsi_transport_iscsi
ip_conntrack_netbios_ns ipt_REJECT iptable_filter ip_tables xt_state
ip_conntrack nfnetlink xt_tcpudp ip6table_filter ip6_tables x_tables
ipv6 video sbs i2c_ec button battery asus_acpi ac parport_pc lp
parport floppy sg ide_cd i82875p_edac edac_mc cdrom e1000 i2c_i801
i2c_core pcspkr serio_raw dm_snapshot dm_zero dm_mirror dm_mod
ata_piix libata sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci_hcd
CPU: 0
EIP: 0061:[<ee4c7804>] Not tainted VLI
EFLAGS: 00010216 (2.6.18-1.2747.el5xen #1)
EIP is at iscsi_tcp_data_recv+0x10f2/0x123b [iscsi_tcp]
eax: 00000170 ebx: ec66496c ecx: ecbeb2e0 edx: ebaa46c0
esi: ebaa46c0 edi: ebaa46c0 ebp: ebaa46c0 esp: c0766d2c
ds: 007b es: 007b ss: 0069
Process kjournald (pid: 375, ti=c0766000 task=c0c92550
task.ti=c0c83000)
Stack: 0000016d ec691800 c0fa7d54 c0fa7d68 ecd10080 c0766dec 00000000
00000000
ec66496c ebaa46c0 c0fa7d40 eb817480 eba170ac ecd2d700 00000020
00000000
c05cc44d 00000000 ec691800 c05ca134 80000000 00000000 00000000
eb817480
Call Trace:
[<c05d20e0>] tcp_read_sock+0x61/0x165
[<ee4c7cd4>] iscsi_tcp_data_ready+0x32/0x3f [iscsi_tcp]
[<c05d91e3>] tcp_rcv_established+0x534/0x7d9
[<c05de630>] tcp_v4_do_rcv+0x28/0x2ce
[<c05e0bf8>] tcp_v4_rcv+0x8fc/0x944
[<c05c826a>] ip_local_deliver+0x16e/0x224
[<c05c80bf>] ip_rcv+0x44c/0x489
[<c05aca93>] netif_receive_skb+0x30c/0x384
[<c05ae3d0>] process_backlog+0x97/0xee
[<c05ae57f>] net_rx_action+0x96/0x185
[<c0420e3e>] __do_softirq+0x5e/0xc3
[<c0406cef>] do_softirq+0x59/0xc2
=======================
Code: c7 46 44 00 00 00 00 eb 0e 8b 7c 24 24 c7 87 80 00 00 00 00 00
00 00 8b 54 24 24 8b 42 58 2b 44 24 1c 3b 84 24 d8 00 00 00 76 08 <0f>
0b bf 03 bd 80 4c ee 3b 84 24 d8 00 00 00 0f 85 5a ef ff ff
EIP: [<ee4c7804>] iscsi_tcp_data_recv+0x10f2/0x123b [iscsi_tcp] SS:ESP
0069:c0766d2c
<0>Kernel panic - not syncing: Fatal exception in interrupt
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Regards,

Frank

On 5 Apr., 10:20, Mike Christie <micha...@cs.wisc.edu> wrote:

Mike Christie

unread,

Apr 9, 2007, 1:36:01 PM4/9/07

to open-...@googlegroups.com

frankf wrote:
> Hi Mike,
>
> I have got the same problem as the original poster. Using Solaris
> Express Build [59|60|61] as target and open-iscsi leads to kernel
> panic on the initiator machine. I have tested with several open-iscsi
> versions (2.0-695, 2.0-754, 1.0-654, and svn 779) to no avail. If you
> are intrested, I can give you access to the target and the initiator
> server. Even the version used by RHEL5 kernel panics.
>

Just so other posters know, I am going to try and install a open solaris
box and put it on the testing list. I have not used solaris for some
time so if anyone knows problems I might have with some version of
solaris (BeleniX vs Solaris Express, Developer Edition vs whatever else
is here http://www.opensolaris.org/os/downloads/ and different versions
of those) let me know.

frankf

unread,

Apr 9, 2007, 2:15:41 PM4/9/07

to open-iscsi

Hi Mike,

if you choose Solaris Express (Community Edition) I made the
experience that you should use "Solaris Interactive Text (Console
session)" to install. A fine document describing the installation and
some steps after is given here:
www.opensolaris.org/os/community/os_user_groups/czosug/czosug6_simple_install.pdf
I would recommend to use Solaris Express Community Edition. There are
bi-weekley updates released with a defined upgrade mechanism
(liveUpgrade). Solaris Express is very close to Solaris 10, whereas
Nexenta OS is more like "Gnu Linux with solaris kernel aka GNU
Solaris.
The Developer edition is like the Community Edition but includes the
sun developer tools and - as far as I know - has a longer release
schedule (3-6months!?).

> is herehttp://www.opensolaris.org/os/downloads/and different versions

frank.fi...@googlemail.com

unread,

Apr 6, 2007, 3:29:12 PM4/6/07

to open-iscsi

Hi Mike,

I have an similar problem:

Target: Solaris Express b60, iscsi volume on zfs (zvol)
Initiator: open-iscsi-1.1-654 on Xen 3.0.3(kernel 2.6.16.29-xen) with
CentOS 4.4 personality

Symtpom: Initiator server restarts after issuing iscsadm -m node -r
0x44771b -l
Details: No details, initiator server restarts after the command. no
console ouput yet (server is located in datacenter, serial console
windows was not open)

What can I do to help to track down the problem?

Regards,

Frank

On Apr 5, 10:20 am, Mike Christie <micha...@cs.wisc.edu> wrote:

frank.fi...@googlemail.com

unread,

Apr 6, 2007, 3:37:25 PM4/6/07

to open-iscsi

Hi Mike,

I have an similar problem:

Target: Solaris Express b60, iscsi volume on zfs (zvol)
Initiator: open-iscsi-1.1-654 on Xen 3.0.3(kernel 2.6.16.29-xen) with
CentOS 4.4 personality

Symtpom: Initiator server restarts after issuing iscsadm -m node -r
0x44771b -l
Details: No details, initiator server restarts after the command. no
console ouput yet (server is located in datacenter, serial console
windows was not open)

What can I do to help to track down the problem?

Regards,

Frank

On Apr 5, 10:20 am, Mike Christie <micha...@cs.wisc.edu> wrote:

frank.fi...@googlemail.com

unread,

Apr 6, 2007, 4:25:09 PM4/6/07

to open-iscsi

Hi Mike,

I have an similar problem:

Target: Solaris Express b60, iscsi volume on zfs (zvol)
Initiator: open-iscsi-1.1-654 on Xen 3.0.3(kernel 2.6.16.29-xen) with
CentOS 4.4 personality

Symtpom: Initiator server restarts after issuing iscsadm -m node -r
0x44771b -l
Details: No details, initiator server restarts after the command. no
console ouput yet (server is located in datacenter, serial console
windows was not open)

What can I do to help to track down the problem?

Regards,

Frank

On Apr 5, 10:20 am, Mike Christie <micha...@cs.wisc.edu> wrote:

jmacie...@gmail.com

unread,

Apr 11, 2007, 7:41:13 PM4/11/07

to open-iscsi

> is herehttp://www.opensolaris.org/os/downloads/and different versions

> of those) let me know.

Mike,

Sorry I couldn't be more helpful with my bug report. I'm still not
sure why I wasn't seeing the debug messages.

Anyway, it's good to hear you're planning to install Solaris. This
should be easy to reproduce.

Jake

Mike Christie

unread,

Apr 12, 2007, 12:44:42 PM4/12/07

to open-...@googlegroups.com

Don't worry about it. Frank just got the info for me and has been
helping me get solaris installed.

Mike Christie

unread,

Apr 16, 2007, 8:00:50 PM4/16/07

to open-...@googlegroups.com

I have been running tests with the target set up in vmware and using the
current open-iscsi with a kernel.org kernel and some FC kernels and I
did not hit the problem.

The log that you sent me with debugging on also didshowed that the
variables we BUG out on above are actually set so that that should not
hit the BUG (maybe as you noted [part of the log got cut overrun and and
I am only looking at PDU 1 when there was a PDU 2 that BUGd), but the
wireshark trace you sent did not match with the kernel log (kernel log
said we did not get data with the login response but the trace showed we
did, which could occur if a itt got messed up but I am not sure if that
is the case yet.

I am going to try and set up solaris on real hardware and see what happens.

Oh yeah, what type of box are you running the target on? Is it a sparc
or x86/x86_64 box?

jmacie...@gmail.com

unread,

Apr 16, 2007, 10:02:47 PM4/16/07

to open-iscsi

> Oh yeah, what type of box are you running the target on? Is it a sparc
> or x86/x86_64 box?

I'm running amd64 for both initiator and target.

frankf

unread,

Apr 17, 2007, 9:05:55 AM4/17/07

to open-iscsi

Hi Mike,

my Hardware is x86_64(Supermicro X7DBE, Xeon 5130 Processor and 4 GB
FB RAM) on the server/target-side and either P4 and Xeon5130 on the
initiator side but using linux x86_32. The network ist GBit, e1000 NIC
on each side, no jumbo frames enabled. switch ports and cableing is
tripple checked and working.

Mike Christie

unread,

Apr 20, 2007, 3:34:39 PM4/20/07

to open-...@googlegroups.com

frankf wrote:
> Hi Mike,
>
> my Hardware is x86_64(Supermicro X7DBE, Xeon 5130 Processor and 4 GB
> FB RAM) on the server/target-side and either P4 and Xeon5130 on the
> initiator side but using linux x86_32. The network ist GBit, e1000 NIC
> on each side, no jumbo frames enabled. switch ports and cableing is
> tripple checked and working.
>

Hey, thanks to Frank letting me use his box, I found the problem. We are
getting three skbs (linux network buffers). The iscsi header. The iscsi
data buffer. Then the iscsi padding for the buffer. The initiator was
not handling the padding coming in a seperate skb from the data
correctly. I did a initial patch, but it is ugly. I should have
something for people to test in a bit. Thanks Frank!

Mike Christie

unread,

Apr 20, 2007, 3:52:42 PM4/20/07

to open-...@googlegroups.com

I did not do a thorough regression test yet, but I did a simple patch.
Give it a try. It is for against svn 779.

fix-skb-pad.patch

jmacie...@gmail.com

unread,

Apr 21, 2007, 1:44:39 PM4/21/07

to open-iscsi

> I did not do a thorough regression test yet, but I did a simple patch.
> Give it a try. It is for against svn 779.
>

> fix-skb-pad.patch
> 3KDownload

It works for me. Thank you.

Olaf Kirch

unread,

Apr 23, 2007, 4:27:00 AM4/23/07

to open-...@googlegroups.com, Mike Christie

On Friday 20 April 2007 21:34, Mike Christie wrote:
> Hey, thanks to Frank letting me use his box, I found the problem. We are
> getting three skbs (linux network buffers). The iscsi header. The iscsi
> data buffer. Then the iscsi padding for the buffer. The initiator was
> not handling the padding coming in a seperate skb from the data
> correctly. I did a initial patch, but it is ugly. I should have
> something for people to test in a bit. Thanks Frank!

Doesn't this still make assumptions about how TCP segments the
incoming data? To me, it looks like the first hunk of the patch still
assumes that padding doesn't straddle a TCP segment boundary.
And I think the way the receive side of things is currently set up,
there's still a lot of assumptions like these. For instance, it seems
as if the connection is failed when the PDU header + AHS cross
a segment boundary.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
ok...@lst.de | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

Mike Christie

unread,

Apr 23, 2007, 12:59:58 PM4/23/07

to open-...@googlegroups.com

Olaf Kirch wrote:
> On Friday 20 April 2007 21:34, Mike Christie wrote:
>> Hey, thanks to Frank letting me use his box, I found the problem. We are
>> getting three skbs (linux network buffers). The iscsi header. The iscsi
>> data buffer. Then the iscsi padding for the buffer. The initiator was
>> not handling the padding coming in a seperate skb from the data
>> correctly. I did a initial patch, but it is ugly. I should have
>> something for people to test in a bit. Thanks Frank!
>
> Doesn't this still make assumptions about how TCP segments the
> incoming data? To me, it looks like the first hunk of the patch still
> assumes that padding doesn't straddle a TCP segment boundary.

You would have to quote the code for me. I am not sure what you mean.
The first part of the patch is the digest handling of the padding code?

> And I think the way the receive side of things is currently set up,
> there's still a lot of assumptions like these. For instance, it seems
> as if the connection is failed when the PDU header + AHS cross
> a segment boundary.
>

We do not really support AHS yet. The OSD guys were fixing that I
believe, but what are the other places? Digest handling ones?

Olaf Kirch

unread,

Apr 24, 2007, 5:07:40 AM4/24/07

to open-...@googlegroups.com, Mike Christie

On Monday 23 April 2007 18:59, Mike Christie wrote:
> You would have to quote the code for me. I am not sure what you mean.
> The first part of the patch is the digest handling of the padding code?

My mistake - ignore that comment. The patch is correct.

Mike Christie

unread,

Apr 24, 2007, 11:43:37 AM4/24/07

to open-...@googlegroups.com

Olaf Kirch wrote:
> On Monday 23 April 2007 18:59, Mike Christie wrote:
>> You would have to quote the code for me. I am not sure what you mean.
>> The first part of the patch is the digest handling of the padding code?
>
> My mistake - ignore that comment. The patch is correct.
>

I think that is good proof that the code is crazt to understand :)
Thanks for reviewing the patch and extra thanks for fixing it up! :)

Reply all

Reply to author

Forward