Hmm.. wondering if those are related to automatic connection load balancing
on Equallogic arrays.
Maybe check with iscsiadm if the connected interface (on the EQL array)
changes after those errors:
iscsiadm -m session -P3
And check the 'Current Portal' and 'Persistent Portal' values.
Persistent Portal should be your EQL group IP address, and Current Portal
should be whichever interface you happen to be using at the moment.
-- Pasi
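As a quick check, the two portal fields can be filtered straight out of that output. This is only a sketch: the here-doc below is an assumed sample of what `iscsiadm -m session -P3` prints (field names as in 2.0-86x-era iscsiadm, values made up to match this thread), so the filter can be tried without a live session.

```shell
# On a live host you would pipe the real command instead:
#   iscsiadm -m session -P3 | grep -E 'Current Portal|Persistent Portal'
# The here-doc is an assumed stand-in for that output.
grep -E 'Current Portal|Persistent Portal' <<'EOF'
Target: iqn.2001-05.com.equallogic:0-8a0906-2b6e7d402-891497db5ca48925-xvm-volume-1
    Current Portal: 10.5.128.129:3260,1
    Persistent Portal: 10.5.128.128:3260,1
EOF
```

If the Persistent Portal stops pointing at the group IP, or the Current Portal moves between member NICs right after an error, that would point at the load balancer being involved.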
>>> #define ISCSI_TRANSPORT_VERSION "2.0-724"
>
> The userspace utilities are 2.0.865.
>
Are these the Ubuntu tools or did you get them from open-iscsi.org? If
they are the Ubuntu ones, is there more version info on the Ubuntu package?
What is the iscsi package named in Ubuntu, btw?
Are there any other error messages in the logs besides these:
Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
connection1:0: iscsi: detected conn error (1011)
Dec 8 00:50:37 aperture-science iscsid: Kernel reported iSCSI
connection 1:0 error (1011) state (3)
Is there something about a ping or nop timing out or a logout? If not
then look in the target logs for problems around that time.
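One quick way to scan for those is grepping the initiator's syslog for the usual suspects. This is only a sketch: the patterns are guesses at relevant strings, and the here-doc (built from the two lines quoted above) stands in for the real log file so the filter can be tried directly.

```shell
# On the host you would run something like:
#   grep -iE 'conn error|nop|ping timeout|logout' /var/log/syslog
# Here the two log lines quoted in this thread stand in for the file;
# only the kernel "conn error" line matches these patterns.
grep -iE 'conn error|nop|ping timeout|logout' <<'EOF'
Dec 8 00:50:36 aperture-science kernel: [1010621.595904] connection1:0: iscsi: detected conn error (1011)
Dec 8 00:50:37 aperture-science iscsid: Kernel reported iSCSI connection 1:0 error (1011) state (3)
EOF
```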
We are using both the kernel modules and userspace utilities from
Ubuntu. The kernel modules are included with the stock kernels; the
userspace is in the open-iscsi package.
The userspace is patched some from the upstream sources - see
http://archive.ubuntu.com/ubuntu/pool/main/o/open-iscsi/open-iscsi_2.0.865-1ubuntu3.diff.gz
- I'm not entirely clear on what the patches do, though
> Are there any other error messages in the logs besides these:
>
> Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
> connection1:0: iscsi: detected conn error (1011)
> Dec 8 00:50:37 aperture-science iscsid: Kernel reported iSCSI
> connection 1:0 error (1011) state (3)
>
>
> Is there something about a ping or nop timing out or a logout? If not
> then look in the target logs for problems around that time.
Every time we see one of these connection failures, we get the series of
5 errors I included in the first e-mail. Nothing about nops, pings, or
logouts. The target logs just say
> iSCSI session to target '10.5.128.129:3260,
> iqn.2001-05.com.equallogic:0-8a0906-2b6e7d402-891497db5ca48925-xvm-volume-1'
> from initiator '10.5.128.16:53712,
> iqn.1993-08.org.debian:01:bce099ff4d44' was closed.
> iSCSI initiator connection failure.
> Connection was closed by peer.
- Evan
We currently only have a single interface enabled, although we have
plans to add more in the future.
Looking at the RAID configuration, we have the "Group IP address" set to
10.5.128.128, and the member NIC's IP address set to 10.5.128.129. The
persistent portal is 10.5.128.128 on all four servers, and the current
portal is 10.5.128.129 on all servers. It doesn't change after one of
the connection errors.
- Evan
Are you guys using CHAP at this time?
Yeah. Our storage network is completely isolated, so we thought about
trying to disable CHAP, but we couldn't find an option to turn it off in
the Equallogic config.
- Evan
It is ok that it is on. I was just asking because it is strange that
when we try to log back in, we get an authentication error the first time,
but it looks like we either retry again and it works ok, or we misread the
status, think we are in the middle of a redirect, drop the error,
and log back in ok.
In your volume properties, add access with just the IP address. Remove any
username access entries.
I'm using EQL volumes without CHAP.
-- Pasi
- Evan
Hmm, ok. So it's possible to turn off CHAP, but that doesn't really seem
like it's the issue here; from my layman's reading of the logs, the
authentication failures come off as a side-effect of mismatched state
between the initiator and the target, do they not?
- Evan
Yeah, like I said in the other thread, we should be fine with it on.
> authentication failures come off as a side-effect of mismatched state
> between the initiator and the target, do they not?
>
This should not really happen. It has not in the past with these
targets. When the initiator starts a login/reconnect, the target will
figure out that we are trying to relogin and it will do the right thing
(this is part of the iscsi spec). We should not get that error.
Is the only log error that bit about the conn error (1011)?
- Evan
What firmware do you have, and did you just start seeing these with a
target firmware upgrade or an iscsi/ubuntu upgrade?
Have you contacted the Equallogic guys? If not I will ping them if they
do not reply. They might be busy or have not seen the thread here.
If you can play around on the box, it would be helpful to run the
open-iscsi tarball release. Build it with
make DEBUG_TCP=1 DEBUG_SCSI=1
then do
make DEBUG_TCP=1 DEBUG_SCSI=1 install
This will overwrite the existing kernel modules and tools with the
open-iscsi ones, so do not do this if you cannot mess around with the box
and reinstall the ubuntu kernel and tools.
This will dump out a lot of info in /var/log/messages, but we are just
looking for the part before you see these:
Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
connection1:0: iscsi: detected conn error (1011)
You might see something about a TCP_CLOSE.
Oh yeah that is here:
http://www.open-iscsi.org/bits/open-iscsi-2.0-870.1.tar.gz
> iscsiadm -m session -P3
>
Hi,
being curious: my version of iscsiadm (version 2.0-754) doesn't know about
the "-P3". What is it expected to do?
Regards,
Ulrich
It just prints more information and details.
-- Pasi
In older versions you might be able to do
iscsiadm -m session -i
This was used to dump all the iscsi info we had for debugging.
In later versions we stabilized the interface: you can pass it how
much info you want (a number from 1 to 3) and which session you want
info on; if no session is given it will print them all.
Donald Williams wrote:
> Hello,
>
> I would strongly suggest using the code version Mike mentioned. I
use ubuntu 8.04/8.10 with that code without issues w/EQL arrays.
>
> Running the older transport kernel module has caused NOOP errors. The
initiator sends out NOOPs w/different SN numbers than what the array is
expecting. Upgrading the kernel module has resolved this at quite a
few customers.
I'm working now on moving the VMs off of one of my servers so I can do
some testing. I'll hopefully be able to get back by the end of the week
or so.
Evan> Looking at the RAID configuration, we have the "Group IP
Evan> address" set to 10.5.128.128, and the member NIC's IP address
Evan> set to 10.5.128.129. The persistent portal is 10.5.128.128 on
Evan> all four servers, and the current portal is 10.5.128.129 on all
Evan> servers. It doesn't change after one of the connection errors.
That suggests the load balancer is not involved here.
I can't find it, but I believe we are running version 4.0.1 of the
Equallogic firmware, but we were originally running 3.3.1 and saw
problems with that version as well.
We originally saw these problems on Debian Etch, and they persisted
across our upgrade to Ubuntu Hardy.
> Have you contacted the Equallogic guys? If not I will ping them if
> they do not reply. They might be busy or have not seen the thread here.
We do have a support contract with them and were working on the issue
before. We had thought the problem went away when we moved from Etch to
Hardy. We're planning to get in touch with them again soon, but I'm not
the point of contact on that, so I've asked one of my project's
co-maintainers to do that.
> If you can play around on the box, it would be helpful to run the
> open-iscsi tarball release. Build it with
>
> make DEBUG_TCP=1 DEBUG_SCSI=1
>
> then do
>
> make DEBUG_TCP=1 DEBUG_SCSI=1 install
>
> This will overwrite the existing kernel modules and tools with the
> open-iscsi ones, so do not do this if you cannot mess around with the box
> and reinstall the ubuntu kernel and tools.
>
> This will dump out a lot of info in /var/log/messages, but we are just
> looking for the part before you see these:
>
> Dec 8 00:50:36 aperture-science kernel: [1010621.595904]
> connection1:0: iscsi: detected conn error (1011)
>
> You might see something about a TCP_CLOSE.
Ok - I've moved all of the virtual machines off of one of our servers,
and it's currently running the latest tarball release of both the
kernelspace and userspace pieces with both debug options enabled.
Our biggest problem when we were trying to diagnose this with Equallogic
was that we didn't have a reliable way of generating the errors. They
don't seem to happen when there's no iSCSI traffic, but they also didn't
occur when we ran a program that generates rapid, random reads and writes.
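For what it's worth, a crude stand-in for that kind of read/write churn can be sketched with dd. Everything here is made up for illustration: the /tmp path is a safe placeholder (on the real setup the file would live on the iSCSI-backed LV), the passes are sequential rather than truly random, and a dedicated tool like fio would give far more control.

```shell
# Crude write/read churn with dd; only a sketch.
# /tmp/iscsi-load.bin is a stand-in path - point it at the
# iSCSI-backed volume on a real run.
f=/tmp/iscsi-load.bin
rm -f "$f"
for pass in 1 2 3 4; do
    dd if=/dev/urandom of="$f" bs=64k count=16 2>/dev/null  # 1 MiB write pass
    dd if="$f" of=/dev/null bs=64k 2>/dev/null              # read it back
done
ls -l "$f"
```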
I'd rather not put production VMs on this server while we're testing. Do
you have any theories on how we could trigger these connection errors,
if they're still occurring?
- Evan
Not with just the above info. It could be happening for a variety of
reasons.
It looks like either the target is disconnecting us (this is why I was
looking for a TCP_CLOSE in the debug output) because we goofed on some
iscsi protocol issue - though normally I would have thought the target
would log some message.
Or, for some reason, the network is just getting disconnected or hitting
some problem, and we are throwing an error when we try to send a packet.
In that case, with the newer code in that tarball, we would throw a new
error code, so the "iscsi: detected conn error" output would have a
different value than 1011.
Ok - I'm running a few of my VMs, and set up some simple monitoring to
let me know if an error gets logged. I'll get back to the list if/when I
find anything.
- Evan
I'm no longer seeing the "detected conn error", but I am still seeing
the symptoms of the problems: on a multiple of 5 minutes past the hour,
VMs become non-responsive for 30-60 seconds. The logs with all of the
debugging options turned on are a bit...overwhelming, but I've posted
the logs from an incident today. The incident happened at 14:05; I've
included the logs from 14:04-14:06 at
http://web.mit.edu/broder/Public/iscsi/citadel-station-2008-12-12-14-05.bz2
- Evan
Are you doing iscsi root?
How do you know they become unresponsive? Do you have a ssh session and
at this time you cannot contact the vm or do you have a console on the
vm and at that time you cannot run any commands?
> debugging options turned on are a bit...overwhelming, but I've posted
> the logs from an incident today. The incident happened at 14:05; I've
> included the logs from 14:04-14:06 at
> http://web.mit.edu/broder/Public/iscsi/citadel-station-2008-12-12-14-05.bz2
>
It looks like, from
Dec 12 14:05:09 citadel-station kernel: [107789.172478] iscsi: done
[sc ffff88012d96ca40 res 0 itt 0xf]
to
Dec 12 14:05:38 citadel-station kernel: [107938.402859] iscsi: aborting
[sc ffff8801287c91c0 itt 0x7c]
the only IO we see is the target sending us a nop (iscsi ping) and us
responding. At 14:05:38 a scsi command times out, so the scsi layer
fires up its error handling, which sends an abort that fails, then sends a
lu reset which unjams us; the initiator then resends the commands it
thought were lost. Note that if the lu reset had failed, we would
eventually have seen the conn error message you saw before, so we
probably are hitting the right code paths.
The initiator thinks these two commands:
Dec 12 14:05:38 citadel-station kernel: [107938.405876] iscsi: failing
in progress sc ffff8801287c91c0 itt 0x7c
Dec 12 14:05:38 citadel-station kernel: [107938.405881] iscsi: failing
in progress sc ffff8801274b6440 itt 0x5c
did not complete. But I am betting the target thinks it completed them,
since the target seems fine and is still pinging us.
The iscsi logs show we did not get the command response. So either the
target dropped a packet, or the network dropped something, or the
initiator dropped it in its network processing.
Could you run this again but get an ethereal/wireshark trace at the same
time? That way we can see if the network layer is getting the packet.
Could you also run with this patch? In the open-iscsi source do
patch -p1 -i path-to-patch/remove-undef-tcp.patch
then rebuild with
make DEBUG_TCP=1 DEBUG_SCSI=1
make DEBUG_TCP=1 DEBUG_SCSI=1 install
One other question. Is the iscsi initiator running in the vm or in the
host? Is this xen or something else like vmware? Could you take the
ethereal/wireshark trace from wherever iscsi is running so we can see
what its network layer is getting? If it is running in a guest and you
can get an ethereal/wireshark trace from both the guest and host, that
would be nice too.
iscsid is running as root - is that what you mean?
> How do you know they become unresponsive? Do you have a ssh session
> and at this time you cannot contact the vm or do you have a console on
> the vm and at that time you cannot run any commands?
ssh sessions hang, SMTP connections hang. It's all network-based
traffic; I haven't tried interacting with the console.
Note that the host itself doesn't hang when one of these failures occurs,
just the VMs using iSCSI for disk.
> The iscsi logs show we did not get the command response. So either the
> target droppped a packet or the network dropped something or the
> initiator dropped it in the network processing.
>
> Could you run this again but get a ethereal/wireshark trace at the
> same time? That way we can see if the network layer is getting the packet.
I'm trying to configure something to capture network traffic, but right
now I'm having a hard time getting a functional tshark. At the moment it
looks like it's capturing accurate data, but borking the TCP/UDP
checksums. The next time I have a failure, I'll forward the dump along
to see if it looks useful.
Incidentally, to what extent is user data exposed in the dump? I've
asked some friends to let me run their VMs on citadel-station to try and
increase the probability of an incident occurring, but they presumably
wouldn't appreciate me sending out logs if data from their VM's disk
images can be recovered.
> Could you also run with this patch? In the open-iscsi source do
>
> patch -p1 -i path-to-patch/remove-undef-tcp.patch
>
> then rebuild with
>
> make DEBUG_TCP=1 DEBUG_SCSI=1
> make DEBUG_TCP=1 DEBUG_SCSI=1 install
Huh? Where do I find that patch?
> One other question. Is the iscsi initiator running in the vm or in the
> host? Is this xen or something else like vmware? Could you take the
> ethereal/wireshark trace from wherever iscsi is running so we can see
> what its network layer is getting? If it is running in a guest and you
> can get a ethereal/wireshark trace from both the guest and host that
> would be nice too.
The iscsi initiator is running on the Xen host. We then use clustering
LVM to get LVs on the target that we use as the disks for the VMs.
- Evan
I think wireshark/ethereal will spit out the data in the packets by
default, and I am not sure of a way to ask it to drop the data and just
send the packet info.
>
>> Could you also run with this patch? In the open-iscsi source do
>>
>> patch -p1 -i path-to-patch/remove-undef-tcp.patch
>>
>> then rebuild with
>>
>> make DEBUG_TCP=1 DEBUG_SCSI=1
>> make DEBUG_TCP=1 DEBUG_SCSI=1 install
>
> Huh? Where do I find that patch?
>
Did you see it attached in the mail? I am attaching it again.
Ok - sorry for taking a while in getting this collected. I have a
tcpdump from one of the failures at
<http://web.mit.edu/broder/Public/iscsi/iscsi_traffic-2008-12-18-11-20.pcap.bz2>.
The failure should occur about halfway through the log. The
corresponding syslog is at
<http://web.mit.edu/broder/Public/iscsi/syslog-2008-12-18-11-20.bz2>
>>> Could you also run with this patch? In the open-iscsi source do
>>>
>>> patch -p1 -i path-to-patch/remove-undef-tcp.patch
>>>
>>> then rebuild with
>>>
>>> make DEBUG_TCP=1 DEBUG_SCSI=1
>>> make DEBUG_TCP=1 DEBUG_SCSI=1 install
I'll hopefully have a log with this information in the next few days.
- Evan
Just got another round of failures with the new debugging patch in.
Traffic dump at
<http://web.mit.edu/broder/Public/iscsi/iscsi_traffic-2008-12-20-22-00.pcap.bz2>,
syslog (including the kernel log) at
<http://web.mit.edu/broder/Public/iscsi/syslog-2008-12-20-22-00.bz2>
- Evan
I hate to press the issue, but has anyone had a chance to look at these?
- Evan
I have been out since the 17th and will be back on the 5th. I will check
them out when I am back. Sorry for the delay.
I am looking at the updated logs and pcap now. I should have something
to report later today.