Frequent Connection Errors with Dell Equallogic RAID

688 views
Skip to first unread message

Evan Broder

unread,
Dec 8, 2008, 8:48:22 AM12/8/08
to open-iscsi, x...@mit.edu
A group I work with is currently using a Dell Equallogic RAID with
four servers on a dedicated storage network. We've been regularly
experiencing connection errors:

Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
connection1:0: iscsi: detected conn error (1011)
Dec 8 00:50:37 aperture-science iscsid: Kernel reported iSCSI
connection 1:0 error (1011) state (3)
Dec 8 00:50:39 aperture-science iscsid: Login authentication failed
with target iqn.2001-05.com.equallogic:
0-8a0906-2b6e7d402-891497db5ca48925-xvm-volume-1
Dec 8 00:50:39 aperture-science kernel: [1010624.597349] iscsi: host
reset succeeded
Dec 8 00:50:40 aperture-science iscsid: connection1:0 is operational
after recovery (1 attempts)

These errors always occur about 40 seconds after a multiple of 5
minutes after the hour. Other than that, we've seen no pattern in when
they occur, or their cause. We've tried running some scripts designed
to heavily utilize storage devices, but that doesn't seem to trigger
it. These errors occur on all four of our servers, but not at the same
time.

The RAID is being used as a physical volume for an LVM volume
group. The servers are being used for hosting Xen virtual machines,
which use LVs on the RAID as their disk images. Shortly before the
connection errors are logged, all disk I/O from the virtual machines
hangs completely, causing the VMs to basically become non-responsive
to non-trivial interaction.

Our four servers are running the stock Ubuntu Hardy 2.6.24 kernel. I
don't see an explicit version number in any of the iSCSI kernel
source, but drivers/scsi/scsi_transport_iscsi.c contains:
> > #define ISCSI_TRANSPORT_VERSION "2.0-724"

The userspace utilities are 2.0.865.

Does anyone know how to stop these errors? Is there more diagnostic
information we could provide? We're way out of our league in terms of
debugging this.

- Evan

Pasi Kärkkäinen

unread,
Dec 8, 2008, 10:52:37 AM12/8/08
to open-...@googlegroups.com, x...@mit.edu

Hmm.. wondering if those are related to automatic connection loadbalancing
on Equallogic arrays.

Maybe check with iscsiadm if the connected interface (on the EQL array)
changes after those errors:

iscsiadm -m session -P3

And check the 'Current Portal' and 'Persistent Portal' values..

Persistent Portal should be your EQL group IP address, and Current Portal
should be whatever interface you're happening to use atm..

-- Pasi

Mike Christie

unread,
Dec 8, 2008, 11:00:46 AM12/8/08
to open-...@googlegroups.com, x...@mit.edu
Evan Broder wrote:

>>> #define ISCSI_TRANSPORT_VERSION "2.0-724"
>
> The userspace utilities are 2.0.865.
>

Are these the Ubuntu tools or did you get them from open-iscsi.org? If
they are ubuntu ones is there more version info on the ubuntu package?
What is the iscsi package named in ubuntu btw?


Is there any other error messages in the logs other than this:

Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
connection1:0: iscsi: detected conn error (1011)
Dec 8 00:50:37 aperture-science iscsid: Kernel reported iSCSI
connection 1:0 error (1011) state (3)


Is there something about a ping or nop timing out or a logout? If not
then look in the target logs for problems around that time.

Evan Broder

unread,
Dec 8, 2008, 1:49:50 PM12/8/08
to Mike Christie, open-...@googlegroups.com, x...@mit.edu
Mike Christie wrote:
> Evan Broder wrote:
>
>>>> #define ISCSI_TRANSPORT_VERSION "2.0-724"
>>
>> The userspace utilities are 2.0.865.
>>
>
> Are these the Ubuntu tools or did you get them from open-iscsi.org? If
> they are ubuntu ones is there more version info on the ubuntu package?
> What is the iscsi package named in ubuntu btw?

We are using both the kernel modules and userspace utilities from
Ubuntu. The kernel modules are included with the stock kernels; the
userspace is in the open-iscsi package.

The userspace is patched some from the upstream sources - see
http://archive.ubuntu.com/ubuntu/pool/main/o/open-iscsi/open-iscsi_2.0.865-1ubuntu3.diff.gz
- I'm not entirely clear on what the patches do, though

> Is there any other error messages in the logs other than this:
>
> Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
> connection1:0: iscsi: detected conn error (1011)
> Dec 8 00:50:37 aperture-science iscsid: Kernel reported iSCSI
> connection 1:0 error (1011) state (3)
>
>
> Is there something about a ping or nop timing out or a logout? If not
> then look in the target logs for problems around that time.

Every time we see one of these connection failures, we get the series of
5 errors I included in the first e-mail. Nothing about nops, pings, or
logouts. The target logs just say

> iSCSI session to target '10.5.128.129:3260,
iqn.2001-05.com.equallogic:0-8a0906-2b6e7d402-891497db5ca48925-xvm-volume-1'
from initiator '10.5.128.16:53712, iqn
1993-08.org.debian:01:bce099ff4d44' was closed.
> iSCSI initiator connection failure.
> Connection was closed by peer.

- Evan

Evan Broder

unread,
Dec 8, 2008, 2:05:49 PM12/8/08
to open-...@googlegroups.com, x...@mit.edu

We currently only have a single interface enabled, although we have
plans to add more in the future.

Looking at the RAID configuration, we have the "Group IP address" set to
10.5.128.128, and the member NIC's IP address set to 10.5.128.129. The
persistent portal os 10.5.128.128 on all four servers, and the current
portal is 10.5.128.129 on all servers. It doesn't change after one of
the connection errors.

- Evan

Mike Christie

unread,
Dec 8, 2008, 2:54:23 PM12/8/08
to open-...@googlegroups.com, x...@mit.edu
Evan Broder wrote:
> A group I work with is currently using a Dell Equallogic RAID with
> four servers on a dedicated storage network. We've been regularly
> experiencing connection errors:
>
> Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
> connection1:0: iscsi: detected conn error (1011)
> Dec 8 00:50:37 aperture-science iscsid: Kernel reported iSCSI
> connection 1:0 error (1011) state (3)
> Dec 8 00:50:39 aperture-science iscsid: Login authentication failed
> with target iqn.2001-05.com.equallogic:

Are you guys using CHAP at this time?

Evan Broder

unread,
Dec 8, 2008, 2:57:44 PM12/8/08
to Mike Christie, open-...@googlegroups.com, x...@mit.edu
Mike Christie wrote:
> Evan Broder wrote:
>> A group I work with is currently using a Dell Equallogic RAID with
>> four servers on a dedicated storage network. We've been regularly
>> experiencing connection errors:
>>
>> Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
>> connection1:0: iscsi: detected conn error (1011)
>> Dec 8 00:50:37 aperture-science iscsid: Kernel reported iSCSI
>> connection 1:0 error (1011) state (3)
>> Dec 8 00:50:39 aperture-science iscsid: Login authentication failed
>> with target iqn.2001-05.com.equallogic:
>
> Are you guys using CHAP at this time?

Yeah. Our storage network is completely isolated, so we thought about
trying to disable CHAP, but we couldn't find an option to turn it off in
the Equallogic config.

- Evan

Mike Christie

unread,
Dec 8, 2008, 3:02:42 PM12/8/08
to open-...@googlegroups.com, x...@mit.edu

It is ok that it is on. I was just asking because it is strange that
when we try to log back in, we get a authentication error the first tim
but it looks like we retry again and it works ok, or we misread the
status and think we are in the middle of a redirect and drop the error
and log back in ok.

Pasi Kärkkäinen

unread,
Dec 8, 2008, 3:11:38 PM12/8/08
to open-...@googlegroups.com, Mike Christie, x...@mit.edu

In your volume properties, add access with just the IP address. Remove any
username access entries.

I'm using EQL volumes without CHAP.

-- Pasi

Evan Broder

unread,
Dec 8, 2008, 3:52:18 PM12/8/08
to Paul Koning, open-...@googlegroups.com, x...@mit.edu
Paul Koning wrote:
> You might check the event logs on the array to see if there are any
> messages there around the same time.
>
> paul
>
>
There isn't anything. There's a message for the disconnect, and a
message for the reconnect.

- Evan

Evan Broder

unread,
Dec 8, 2008, 7:33:07 PM12/8/08
to Pasi Kärkkäinen, open-...@googlegroups.com, Mike Christie, x...@mit.edu

Hmm, ok. So it's possible to turn off CHAP, but that doesn't really seem
like it's the issue here; from my layman's reading of the logs, the
authentication failures come off as a side-effect of mis-matched state
between the initiator and the target, does it not?

- Evan

Mike Christie

unread,
Dec 8, 2008, 8:51:26 PM12/8/08
to open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Evan Broder wrote:
> Pasi Kärkkäinen wrote:
>> On Mon, Dec 08, 2008 at 02:57:44PM -0500, Evan Broder wrote:
>>
>>> Yeah. Our storage network is completely isolated, so we thought about
>>> trying to disable CHAP, but we couldn't find an option to turn it off in
>>> the Equallogic config.
>>>
>>>
>> In your volume properties, add access with just the IP address. Remove any
>> username access entries.
>>
>> I'm using EQL volumes without CHAP.
>>
>> -- Pasi
>>
>
> Hmm, ok. So it's possible to turn off CHAP, but that doesn't really seem
> like it's the issue here; from my layman's reading of the logs, the

Yeah, like I said in the other thread, we should be fine with it on.

> authentication failures come off as a side-effect of mis-matched state
> between the initiator and the target, does it not?
>

This should not really happen. It has not in the past with these
targets. When the initiator starts a login/reconnect, the target will
figure out that we are trying to relogin and it will do the right thing
(this is part of the the iscsi spec). We should not get that error.

Is the only log error that bit about:

Evan Broder

unread,
Dec 8, 2008, 8:56:37 PM12/8/08
to Mike Christie, open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Mike Christie wrote:
> Is the only log error that bit about:
>
> iqn.2001-05.com.equallogic:0-8a0906-2b6e7d402-891497db5ca48925-xvm-volume-1'
>
> from initiator '10.5.128.16:53712, iqn
> 1993-08.org.debian:01:bce099ff4d44' was closed.
Yeah, that's the only thing that ever shows up. It's immediately
followed by a successful reconnection.

- Evan

Mike Christie

unread,
Dec 8, 2008, 9:05:55 PM12/8/08
to open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu

What firmware do you have, and did yo just start seeing these with a
target firmware upgrade or a iscsi/ubuntu upgrade?


Have you contacted the Equallogic guys? If not I will ping them if they
do not reply. They might be busy or have not seen the thread here.


If you can play around on the box, it would be helpful to run the
open-iscsi tarball release. Build it with

make DEBUG_TCP=1 make DEBUG_SCSI=1

then do

make DEBUG_TCP=1 make DEBUG_SCSI=1 install

This will overwrite the existing kernel modules and tools with the
open-iscsi ones, so do not this if you cannot mess around with the box
and reinstall the ubuntu kernel and tools.

This will dump out a lot of info in /var/log/messages, but we are just
looking for the part before you see these:

Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
connection1:0: iscsi: detected conn error (1011)

You might see something about a TCP_CLOSE.

Mike Christie

unread,
Dec 8, 2008, 9:07:46 PM12/8/08
to open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Mike Christie wrote:
>
> If you can play around on the box, it would be helpful to run the
> open-iscsi tarball release. Build it with

Oh yeah that is here:
http://www.open-iscsi.org/bits/open-iscsi-2.0-870.1.tar.gz

Donald Williams

unread,
Dec 9, 2008, 12:34:29 AM12/9/08
to open-...@googlegroups.com
Hello, 

 I would strongly suggest using the code version Mike mentioned.  I use ubuntu 8.04/8.10 with that code without issues w/EQL arrays. 

Running the older transport kernel module has caused NOOP errors.  The initiator sends out NOOPs w/different SN numbers than what the array is expecting.   Upgrading the kernel module has resolved this at quite a few customers. 

 If this doesn't resolve the issue,  please open a case with EQL.  877.887.7337.   They can review the logs and compare it to when the errors occur. 

 Don 

Ulrich Windl

unread,
Dec 10, 2008, 6:45:04 AM12/10/08
to open-...@googlegroups.com
On 8 Dec 2008 at 17:52, Pasi Kärkkäinen wrote:

> iscsiadm -m session -P3
>

Hi,

being curious: My version if iscsiadm (version 2.0-754) doesn't know about the "-
P3". What ist it expected to do?

Regards,
Ulrich

Pasi Kärkkäinen

unread,
Dec 10, 2008, 7:45:24 AM12/10/08
to open-...@googlegroups.com

It just prints more information.. details.

-- Pasi

zhaoyuqiang

unread,
Dec 10, 2008, 11:21:54 AM12/10/08
to open-...@googlegroups.com, bro...@mit.edu
Hi,
Someone solved the issue like this with running opensuse10.3. you can try download source code from http://www.open-iscsi.org/bits/open-iscsi-2.0-865.13.tar.gz and complie for Ubuntu, it's like scsi_transport_iscsi 2.0-724 has some uncompatible with Equallogic in iscsi nohup response??  The vision 2.0-865.13.tar.gz which I find the scsi_transport_iscsi version is 2.0-865.  
Ubuntu
2008/12/8 Evan Broder <bro...@mit.edu>

Mike Christie

unread,
Dec 10, 2008, 11:38:36 AM12/10/08
to open-...@googlegroups.com

In older versions you might be able to do

iscsiadm -m session -i

This was used to dump all the iscsi info we had for debugging.

In later versions we stablized the interface and you can pass it how
much info you want (a number from 1 to 3), and what session you want
info on or if no session is given it will print them all.

Evan Broder

unread,
Dec 10, 2008, 12:00:55 PM12/10/08
to zhaoyuqiang, open-...@googlegroups.com
zhaoyuqiang wrote:
> Hi,
> Someone solved the issue like this with running opensuse10.3. you can
> try download source code from
> http://www.open-iscsi.org/bits/open-iscsi-2.0-865.13.tar.gz
> <http://www.open-iscsi.org/bits/open-iscsi-2.0-865.13.tar.gz> and
> complie for Ubuntu, it's like scsi_transport_iscsi 2.0-724 has
> some uncompatible with Equallogic in iscsi nohup
> response?? The vision 2.0-865.13.tar.gz which I find the
> scsi_transport_iscsi version is 2.0-865.
> Ubuntu

Donald Williams wrote:
> Hello,
>
> I would strongly suggest using the code version Mike mentioned. I
use ubuntu 8.04/8.10 with that code without issues w/EQL arrays.
>
> Running the older transport kernel module has caused NOOP errors. The
initiator sends out NOOPs w/different SN numbers than what the array is
expecting. Upgrading the kernel module has resolved this at quite a
few customers.


I'm working now on moving the VMs off of one of my servers so I can do
some testing. I'll hopefully be able to get back by the end of the week
or so.

Paul Koning

unread,
Dec 8, 2008, 3:49:18 PM12/8/08
to open-...@googlegroups.com, x...@mit.edu
>>>>> "Evan" == Evan Broder <bro...@MIT.EDU> writes:

Evan> Looking at the RAID configuration, we have the "Group IP
Evan> address" set to 10.5.128.128, and the member NIC's IP address
Evan> set to 10.5.128.129. The persistent portal os 10.5.128.128 on
Evan> all four servers, and the current portal is 10.5.128.129 on all
Evan> servers. It doesn't change after one of the connection errors.

That suggests the load balancer is not involved here.

Evan Broder

unread,
Dec 11, 2008, 8:19:24 AM12/11/08
to Mike Christie, open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Mike Christie wrote:
> What firmware do you have, and did yo just start seeing these with a
> target firmware upgrade or a iscsi/ubuntu upgrade?

I can't find it, but I believe we are running version 4.0.1 of the
Equallogic firmware, but we were originally running 3.3.1 and saw
problems with that version as well.

We originally saw these problems on Debian Etch, and they persisted
across our upgrade to Ubuntu Hardy.

> Have you contacted the Equallogic guys? If not I will ping them if
> they do not reply. They might be busy or have not seen the thread here.

We do have a support contract with them and were working on the issue
before. We had thought the problem went away when we moved from Etch to
Hardy. We're planning to get in touch with them again soon, but I'm not
the point of contact on that, so I've asked one of my project's
co-maintainers to do that.

> If you can play around on the box, it would be helpful to run the
> open-iscsi tarball release. Build it with
>
> make DEBUG_TCP=1 make DEBUG_SCSI=1
>
> then do
>
> make DEBUG_TCP=1 make DEBUG_SCSI=1 install
>
> This will overwrite the existing kernel modules and tools with the
> open-iscsi ones, so do not this if you cannot mess around with the box
> and reinstall the ubuntu kernel and tools.
>
> This will dump out a lot of info in /var/log/messages, but we are just
> looking for the part before you see these:
>
> Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
> connection1:0: iscsi: detected conn error (1011)
>
> You might see something about a TCP_CLOSE.

Ok - I've moved all of the virtual machines off of one of our servers,
and it's currently running the latest tarball release of both the
kernelspace and userspace pieces with both debug options enabled.

Our biggest problem when we were trying to diagnose this with Equallogic
was that we didn't have a reliable way of generating the errors. They
don't seem to happen when there's no iSCSI traffic, but they also didn't
occur when we ran a program that generates rapid, random reads and writes.

I'd rather not put production VMs on this server while we're testing. Do
you have any theories on how we could trigger these connection errors,
if they're still occurring?

- Evan

Mike Christie

unread,
Dec 11, 2008, 11:37:09 AM12/11/08
to open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Evan Broder wrote:
>> If you can play around on the box, it would be helpful to run the
>> open-iscsi tarball release. Build it with
>>
>> make DEBUG_TCP=1 make DEBUG_SCSI=1
>>
>> then do
>>
>> make DEBUG_TCP=1 make DEBUG_SCSI=1 install
>>
>> This will overwrite the existing kernel modules and tools with the
>> open-iscsi ones, so do not this if you cannot mess around with the box
>> and reinstall the ubuntu kernel and tools.
>>
>> This will dump out a lot of info in /var/log/messages, but we are just
>> looking for the part before you see these:
>>
>> Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
>> connection1:0: iscsi: detected conn error (1011)
>>
>> You might see something about a TCP_CLOSE.
>
> Ok - I've moved all of the virtual machines off of one of our servers,
> and it's currently running the latest tarball release of both the
> kernelspace and userspace pieces with both debug options enabled.
>
> Our biggest problem when we were trying to diagnose this with Equallogic
> was that we didn't have a reliable way of generating the errors. They
> don't seem to happen when there's no iSCSI traffic, but they also didn't
> occur when we ran a program that generates rapid, random reads and writes.
>
> I'd rather not put production VMs on this server while we're testing. Do
> you have any theories on how we could trigger these connection errors,
> if they're still occurring?
>

Not with just the above info. I t could be happening for a variety of
reasons.

It looks like the target is either disconnecting us (this is why I was
looking for a TCP_CLOSE in the debug output) because we goofed on some
iscsi protocol issue but normally I would have thought the target would
log some message.

Or for some reason the network is just getting disconnected or hitting
some problem and it is throwing an error when we try to send a packet.
For this with the newer code in that tarball we would throw a new error
code and so the "iscsi: detected conn error" output would have a
different value than 1011.

Evan Broder

unread,
Dec 11, 2008, 5:13:21 PM12/11/08
to open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Mike Christie wrote:
> Not with just the above info. I t could be happening for a variety of
> reasons.
>
> It looks like the target is either disconnecting us (this is why I was
> looking for a TCP_CLOSE in the debug output) because we goofed on some
> iscsi protocol issue but normally I would have thought the target would
> log some message.
>
> Or for some reason the network is just getting disconnected or hitting
> some problem and it is throwing an error when we try to send a packet.
> For this with the newer code in that tarball we would throw a new error
> code and so the "iscsi: detected conn error" output would have a
> different value than 1011.

Ok - I'm running a few of my VMs, and setup some simple monitoring to
let me know if an error gets logged. I'll get back to the list if/when I
find anything.

- Evan

Evan Broder

unread,
Dec 12, 2008, 2:37:40 PM12/12/08
to open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Evan Broder wrote:
> Ok - I'm running a few of my VMs, and setup some simple monitoring to
> let me know if an error gets logged. I'll get back to the list if/when I
> find anything.
>
> - Evan

I'm no longer seeing the "detected conn error", but I am still seeing
the symptoms of the problems: on a multiple of 5 minutes past the hour,
VMs become non-responsive for 30-60 seconds. The logs with all of the
debugging options turned on are a bit...overwhelming, but I've posted
the logs from an incident today. The incident happened at 14:05; I've
included the logs from 14:04-14:06 at
http://web.mit.edu/broder/Public/iscsi/citadel-station-2008-12-12-14-05.bz2

- Evan

Mike Christie

unread,
Dec 12, 2008, 6:52:20 PM12/12/08
to open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Evan Broder wrote:
> Evan Broder wrote:
>> Ok - I'm running a few of my VMs, and setup some simple monitoring to
>> let me know if an error gets logged. I'll get back to the list if/when I
>> find anything.
>>
>> - Evan
>
> I'm no longer seeing the "detected conn error", but I am still seeing
> the symptoms of the problems: on a multiple of 5 minutes past the hour,
> VMs become non-responsive for 30-60 seconds. The logs with all of the

Are you doing iscsi root?

How do you know they become unresponsive? Do you have a ssh session and
at this time you cannot contact the vm or do you have a console on the
vm and at that time you cannot run any commands?

> debugging options turned on are a bit...overwhelming, but I've posted
> the logs from an incident today. The incident happened at 14:05; I've
> included the logs from 14:04-14:06 at
> http://web.mit.edu/broder/Public/iscsi/citadel-station-2008-12-12-14-05.bz2
>


It looks from Dec 12 14:05:09 citadel-station kernel: [107789.172478]
iscsi: done [sc ffff88012d96ca40 res 0 itt 0xf]

to

Dec 12 14:05:38 citadel-station kernel: [107938.402859] iscsi: aborting
[sc ffff8801287c91c0 itt 0x7c]

The only IO we see is the target sending us a nop (iscsi ping) and us
responding. At 14:05:38 a scsi command times out and so the scsi layer
fires up its error handling which sends an abort that fails then sends a
lu reset which unjams us and the initiator then resends the commands it
thought was lost. Note that if the lu reset failed, we would have
eventually seen the conn error message you saw before so we probably are
hitting the right code paths.

The initiator thinks these two commands:
Dec 12 14:05:38 citadel-station kernel: [107938.405876] iscsi: failing
in progress sc ffff8801287c91c0 itt 0x7c
Dec 12 14:05:38 citadel-station kernel: [107938.405881] iscsi: failing
in progress sc ffff8801274b6440 itt 0x5c

did not complete. But I am betting the target thinks it completed it
since the target seems fine and is still pinging us.


The iscsi logs show we did not get the command response. So either the
target droppped a packet or the network dropped something or the
initiator dropped it in the network processing.

Could you run this again but get a ethereal/wireshark trace at the same
time? That way we can see if the network layer is getting the packet.


Could you also run with this patch? In the open-iscsi source do

patch -p1 -i path-to-patch/remove-undef-tcp.patch

then rebuild with

make DEBUG_TCP=1 DEBUG_SCSI=1
make DEBUG_TCP=1 DEBUG_SCSI=1 install


One other question. Is the iscsi initiator running in the vm or in the
host? Is this xen or something else like vmware? Could you take the
ethereal/wireshark trace from wherever iscsi is running so we can see
what its network layer is getting? If it is running in a guest and you
can get a ethereal/wireshark trace from both the guest and host that
would be nice too.

remove-undef-tcp.patch

Evan Broder

unread,
Dec 12, 2008, 9:41:18 PM12/12/08
to Mike Christie, open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Mike Christie wrote:
> Are you doing iscsi root?

iscsid is running as root - is that what you mean?

> How do you know they become unresponsive? Do you have a ssh session
> and at this time you cannot contact the vm or do you have a console on
> the vm and at that time you cannot run any commands?

ssh sessions hang, SMTP connections hang. It's all network-based
traffic; I haven't tried interacting with the console.

Note that the host itself doesn't hang when one of these failures occur,
just the VMs using iSCSI for disk.

> The iscsi logs show we did not get the command response. So either the
> target droppped a packet or the network dropped something or the
> initiator dropped it in the network processing.
>
> Could you run this again but get a ethereal/wireshark trace at the
> same time? That way we can see if the network layer is getting the packet.

I'm trying to configure something to capture network traffic, but right
now I'm having a hard time getting a functional tshark. Right now, it
looks like it's capturing accurate data, but borking the TCP/UDP
checksums. The next time I have a failure, I'll forward the dump along
to see if it looks useful.

Incidentally, to what extent is user data exposed in the dump? I've
asked some friends to let me run their VMs on citadel-station to try and
increase the probability of an incident occurring, but they presumably
wouldn't appreciate me sending out logs if data from their VM's disk
images can be recovered.

> Could you also run with this patch? In the open-iscsi source do
>
> patch -p1 -i path-to-patch/remove-undef-tcp.patch
>
> then rebuild with
>
> make DEBUG_TCP=1 DEBUG_SCSI=1
> make DEBUG_TCP=1 DEBUG_SCSI=1 install

Huh? Where do I find that patch?

> One other question. Is the iscsi initiator running in the vm or in the
> host? Is this xen or something else like vmware? Could you take the
> ethereal/wireshark trace from wherever iscsi is running so we can see
> what its network layer is getting? If it is running in a guest and you
> can get a ethereal/wireshark trace from both the guest and host that
> would be nice too.

The iscsi initiator is running on the Xen host. We then use clustering
LVM to get LVs on the target that we use as the disks for the VMs.

- Evan

Mike Christie

unread,
Dec 13, 2008, 2:52:37 AM12/13/08
to Evan Broder, open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Evan Broder wrote:
>
> Incidentally, to what extent is user data exposed in the dump? I've
> asked some friends to let me run their VMs on citadel-station to try and
> increase the probability of an incident occurring, but they presumably
> wouldn't appreciate me sending out logs if data from their VM's disk
> images can be recovered.

I think wiresharp/ethereal will spit out the data in the packets by
default and I am not sure of a way to ask it to drop the data and just
send the packet info.

>
>> Could you also run with this patch? In the open-iscsi source do
>>
>> patch -p1 -i path-to-patch/remove-undef-tcp.patch
>>
>> then rebuild with
>>
>> make DEBUG_TCP=1 DEBUG_SCSI=1
>> make DEBUG_TCP=1 DEBUG_SCSI=1 install
>
> Huh? Where do I find that patch?
>

Did you see it attached in the mail? I am attaching it again.

remove-undef-tcp.patch

Evan Broder

unread,
Dec 18, 2008, 1:08:14 PM12/18/08
to Mike Christie, open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Mike Christie wrote:
> I think wiresharp/ethereal will spit out the data in the packets by
> default and I am not sure of a way to ask it to drop the data and just
> send the packet info.

Ok - sorry for taking a while in getting this collected. I have a
tcpdump from one of the failures at
<http://web.mit.edu/broder/Public/iscsi/iscsi_traffic-2008-12-18-11-20.pcap.bz2>.
The failure should occur about halfway through the log. The
corresponding syslog is at
<http://web.mit.edu/broder/Public/iscsi/syslog-2008-12-18-11-20.bz2>


>>> Could you also run with this patch? In the open-iscsi source do
>>>
>>> patch -p1 -i path-to-patch/remove-undef-tcp.patch
>>>
>>> then rebuild with
>>>
>>> make DEBUG_TCP=1 DEBUG_SCSI=1
>>> make DEBUG_TCP=1 DEBUG_SCSI=1 install

I'll hopefully have a log with this information in the next few days.

- Evan

Evan Broder

unread,
Dec 20, 2008, 10:18:53 PM12/20/08
to Evan Broder, Mike Christie, open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Evan Broder wrote:
>>>> Could you also run with this patch? In the open-iscsi source do
>>>>
>>>> patch -p1 -i path-to-patch/remove-undef-tcp.patch
>>>>
>>>> then rebuild with
>>>>
>>>> make DEBUG_TCP=1 DEBUG_SCSI=1
>>>> make DEBUG_TCP=1 DEBUG_SCSI=1 install
>>>>
> I'll hopefully have a log with this information in the next few days.
>
> - Evan
>

Just got another round of failures with the new debugging patch in.
Traffic dump at
<http://web.mit.edu/broder/Public/iscsi/iscsi_traffic-2008-12-20-22-00.pcap.bz2>,
syslog (including the kernel log) at
<http://web.mit.edu/broder/Public/iscsi/syslog-2008-12-20-22-00.bz2>

- Evan

Evan Broder

unread,
Dec 30, 2008, 10:41:17 PM12/30/08
to Evan Broder, Mike Christie, open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu
Evan Broder wrote:
> Just got another round of failures with the new debugging patch in.
> Traffic dump at
> <http://web.mit.edu/broder/Public/iscsi/iscsi_traffic-2008-12-20-22-00.pcap.bz2>,
> syslog (including the kernel log) at
> <http://web.mit.edu/broder/Public/iscsi/syslog-2008-12-20-22-00.bz2>
>
> - Evan
>

I hate to press the issue, but has anyone had a chance to look at these?

- Evan

Mike Christie

unread,
Jan 1, 2009, 2:40:39 AM1/1/09
to Evan Broder, open-...@googlegroups.com, Pasi Kärkkäinen, x...@mit.edu

I have been out since the 17th and will be back on the 5th. I will check
them out when I am back. Sorry for the delay.

Mike Christie

unread,
Jan 5, 2009, 11:37:50 AM1/5/09
to open-...@googlegroups.com, Evan Broder, Pasi Kärkkäinen, x...@mit.edu

I am looking at the updated logs and pcap now. I should have something
to report later today.

Reply all
Reply to author
Forward
0 new messages