iscsi diagnosis help

134 views
Skip to first unread message

Hoot, Joseph

unread,
Nov 16, 2009, 3:41:27 PM11/16/09
to open-...@googlegroups.com
Hi all,

I'm trying to understand what I'm seeing in my /var/log/messages. Here's what I have:

Nov 13 10:49:47 oim6102506 kernel: connection5:0: ping timeout of 10 secs expired, last rx 191838122, last ping 191839372, now 191841872
Nov 13 10:49:47 oim6102506 kernel: connection5:0: detected conn error (1011)
Nov 13 10:49:47 oim6102506 iscsid: Kernel reported iSCSI connection 5:0 error (1011) state (3)
Nov 13 10:49:50 oim6102506 iscsid: Login authentication failed with target iqn.2001-05.com.equallogic:0-8a0906-e7d1dea02-786272c42554aef2-ovm-2-lun03
Nov 13 10:49:52 oim6102506 iscsid: connection5:0 is operational after recovery (1 attempts)

the first line, what is "connection5:0"? is that referenced from iscsiadm somewhere? I only ask because I'm seeing iscsid messages and kernel messages. I also have dm-multipath running, which usually shows up as dm-multipath or something like that. I understand that iscsid is the process that is logging in and out. But is the "kernel:" message just an iscsi modules that is loaded into the kernel, which is why it is being logged as "kernel:"?

Maybe one of these modules?
iscsi_tcp 19785 46
libiscsi_tcp 21829 1 iscsi_tcp
libiscsi2 41285 3 ib_iser,iscsi_tcp,libiscsi_tcp
scsi_transport_iscsi2 37197 5 ib_iser,iscsi_tcp,libiscsi2
scsi_transport_iscsi 6085 1 scsi_transport_iscsi2

I'm just trying to make sure that all of my timeout values line up properly so that, for example, OCFS2 filesystem doesn't timeout prior to iSCSI. I'd rather verify that iSCSI reconnections are attempted every 10 seconds and then dm-multipath has 30 seconds to timeout and then OCFS2 has 60 seconds before it times out (or something along those lines). So if I understand the logging a bit more, it'll help me.

Thanks
Joe


===========================
Joseph R. Hoot
Lead System Programmer/Analyst
(w) 716-878-4832
(c) 716-759-HOOT
joe....@itec.suny.edu
GPG KEY: 7145F633
===========================

Mike Christie

unread,
Nov 16, 2009, 7:18:09 PM11/16/09
to open-...@googlegroups.com
Hoot, Joseph wrote:
> Hi all,
>
> I'm trying to understand what I'm seeing in my /var/log/messages. Here's what I have:
>
> Nov 13 10:49:47 oim6102506 kernel: connection5:0: ping timeout of 10 secs expired, last rx 191838122, last ping 191839372, now 191841872
> Nov 13 10:49:47 oim6102506 kernel: connection5:0: detected conn error (1011)
> Nov 13 10:49:47 oim6102506 iscsid: Kernel reported iSCSI connection 5:0 error (1011) state (3)
> Nov 13 10:49:50 oim6102506 iscsid: Login authentication failed with target iqn.2001-05.com.equallogic:0-8a0906-e7d1dea02-786272c42554aef2-ovm-2-lun03
> Nov 13 10:49:52 oim6102506 iscsid: connection5:0 is operational after recovery (1 attempts)
>
> the first line, what is "connection5:0"? is that referenced from iscsiadm somewhere? I only ask because I'm seeing iscsid messages and kernel messages. I also have dm-multipath running, which usually shows up as dm-multipath or something like that. I understand that iscsid is the process that is logging in and out. But is the "kernel:" message just an iscsi modules that is loaded into the kernel, which is why it is being logged as "kernel:"?
>

It is the session id and connection id.

connection$SESSION_ID:$CONNECTION_ID

If you run iscsiadm -m session -P 1 or -P 3

You will see

#iscsiadm -m session -P 1
Target: iqn.1992-08.com.netapp:sn.33615311
Current Portal: 10.15.85.19:3260,3
Persistent Portal: 10.15.85.19:3260,3
Iface Transport: tcp
Iface IPaddress: 10.11.14.37
Iface HWaddress: default
Iface Netdev: default
SID: 7
iSCSI Connection State: LOGGED IN
Internal iscsid Session State: NO CHANGE


Session number is the SID value.

If you run
iscsiadm -m session
tcp [2] 10.15.84.19:3260,2 iqn.1992-08.com.netapp:sn.33615311

the session number/SID is the value in brackets.


If you run iscsiadm in session mode (iscsiadm -m session) then you can
use the -R argument and pass in a SID to do an opertaion like

iscsiadm -m session -R 2 --rescan

would rescan that session.

Connection number is currently always zero.


For the second question, iscsid handles login and logout, and error
handling, and the kernel basically passes iscsi packets around.


Nov 13 10:49:47 oim6102506 kernel: connection5:0: ping timeout of 10
secs expired, last rx 191838122, last ping 191839372, now 191841872


so here the iscsi kernel code sends a iscsi ping/nop every noop_interval
seconds, and if we do not get a response withing noop_timeout seconds it
will fire off a connection error.



Nov 13 10:49:47 oim6102506 kernel: connection5:0: detected conn error
(1011)


Here is the kernel code notifying userspace of the problem.


Nov 13 10:49:47 oim6102506 iscsid: Kernel reported iSCSI connection 5:0
error (1011) state (3)


And there iscsid is accepting the error (probably no need for the error
to be logged twice).


Nov 13 10:49:50 oim6102506 iscsid: Login authentication failed with target


And then here iscsid handled the error by killing the tcp/ip connection,
reconnection the tcp/ip connection, and then re-logging into the iscsi
target. But for some reason we could not log back in right away.


iqn.2001-05.com.equallogic:0-8a0906-e7d1dea02-786272c42554aef2-ovm-2-lun03
Nov 13 10:49:52 oim6102506 iscsid: connection5:0 is operational after
recovery (1 attempts)

But it looks like we tried again and we got back in.



> Maybe one of these modules?
> iscsi_tcp 19785 46
> libiscsi_tcp 21829 1 iscsi_tcp
> libiscsi2 41285 3 ib_iser,iscsi_tcp,libiscsi_tcp
> scsi_transport_iscsi2 37197 5 ib_iser,iscsi_tcp,libiscsi2
> scsi_transport_iscsi 6085 1 scsi_transport_iscsi2
>
> I'm just trying to make sure that all of my timeout values line up properly so that, for example, OCFS2 filesystem doesn't timeout prior to iSCSI. I'd rather verify that iSCSI reconnections are attempted every 10 seconds and then dm-multipath has 30 seconds to timeout and then OCFS2 has 60 seconds before it times out (or something along those lines). So if I understand the logging a bit more, it'll help me.
>
> Thanks
> Joe
>
>
> ===========================
> Joseph R. Hoot
> Lead System Programmer/Analyst
> (w) 716-878-4832
> (c) 716-759-HOOT
> joe....@itec.suny.edu
> GPG KEY: 7145F633
> ===========================
>
>
> --~--~---------~--~----~------------~-------~--~----~
> You received this message because you are subscribed to the Google Groups "open-iscsi" group.
> To post to this group, send email to open-...@googlegroups.com
> To unsubscribe from this group, send email to open-iscsi+...@googlegroups.com
> For more options, visit this group at http://groups.google.com/group/open-iscsi
> -~----------~----~----~----~------~----~------~--~---
>

Hoot, Joseph

unread,
Nov 16, 2009, 8:19:33 PM11/16/09
to open-...@googlegroups.com
thanks. That helps. So I know that with the EqualLogic targets, there is a "Group IP" which, I believe, responds with an iscsi login_redirect.

1) Could the "Login authentication failed" message be the response because of a login redirect messages from the EQL redirect?

and then my next question is more for curiosity sake:

2) Are there plans in the future to have more than one connection per session? and I guess in addition to that, would that mean multiple connections to a single volume over the same nic?
> --
>
> You received this message because you are subscribed to the Google Groups "open-iscsi" group.
> To post to this group, send email to open-...@googlegroups.com.
> To unsubscribe from this group, send email to open-iscsi+...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=.

Hoot, Joseph

unread,
Nov 16, 2009, 9:39:00 PM11/16/09
to open-...@googlegroups.com

On Nov 16, 2009, at 8:19 PM, Hoot, Joseph wrote:

> thanks. That helps. So I know that with the EqualLogic targets, there is a "Group IP" which, I believe, responds with an iscsi login_redirect.
>
> 1) Could the "Login authentication failed" message be the response because of a login redirect messages from the EQL redirect?
>
> and then my next question is more for curiosity sake:
>
> 2) Are there plans in the future to have more than one connection per session? and I guess in addition to that, would that mean multiple connections to a single volume over the same nic?
>
>


Also Mike, I'm seeing one or two of these every 30-40 minutes if I slam our EqualLogic with roughly 7-15k IOPS (reads and writes) non stop on 3 volumes. In this type of scenario, would you expect to see timeouts like this once in awhile? If so, do you think increasing my NOOP timeouts would assist so we don't get these? maybe set it to 15 seconds instead of 10?

Pasi Kärkkäinen

unread,
Nov 17, 2009, 4:50:42 AM11/17/09
to open-...@googlegroups.com
On Mon, Nov 16, 2009 at 09:39:00PM -0500, Hoot, Joseph wrote:
>
> On Nov 16, 2009, at 8:19 PM, Hoot, Joseph wrote:
>
> > thanks. That helps. So I know that with the EqualLogic targets, there is a "Group IP" which, I believe, responds with an iscsi login_redirect.
> >
> > 1) Could the "Login authentication failed" message be the response because of a login redirect messages from the EQL redirect?
> >
> > and then my next question is more for curiosity sake:
> >
> > 2) Are there plans in the future to have more than one connection per session? and I guess in addition to that, would that mean multiple connections to a single volume over the same nic?
> >
> >
>
>
> Also Mike, I'm seeing one or two of these every 30-40 minutes if I slam our EqualLogic with roughly 7-15k IOPS (reads and writes) non stop on 3 volumes. In this type of scenario, would you expect to see timeouts like this once in awhile? If so, do you think increasing my NOOP timeouts would assist so we don't get these? maybe set it to 15 seconds instead of 10?
>

Equallogic does active loadbalancing (redirects) during operation..
dunno about the errors though.

-- Pasi

Yangkook Kim

unread,
Nov 17, 2009, 12:26:34 PM11/17/09
to open-...@googlegroups.com
> 2) Are there plans in the future to have more than one connection per session?

I dont't think so. If you want to know the reason, read the thread
titled "MC/S support
" in open-iscsi mailing list.

Kim

2009/11/17, Pasi Kärkkäinen <pa...@iki.fi>:

Mike Christie

unread,
Nov 17, 2009, 7:22:05 PM11/17/09
to open-...@googlegroups.com
Hoot, Joseph wrote:
> thanks. That helps. So I know that with the EqualLogic targets, there is a "Group IP" which, I believe, responds with an iscsi login_redirect.
>
> 1) Could the "Login authentication failed" message be the response because of a login redirect messages from the EQL redirect?
>

It could be, but then when we retry we end up handling the redirect ok.
I do not know why the first redirect would fail.

Take a wireshark trace or run iscsid by hand

iscsid -d 8

send the log output.

That will let us know where we failed, but it would not tell us why.
Normally EQL targets would leave something in their logs about why.


> and then my next question is more for curiosity sake:
>
> 2) Are there plans in the future to have more than one connection per session? and I guess in addition to that, would that mean multiple connections to a single volume over the same nic?
>

No plans for MC/s. You can do multiple sessions to the same volume
though. You can have multiple sessions over the same nic or over
difference nics or some combo.

Mike Christie

unread,
Nov 17, 2009, 7:23:31 PM11/17/09
to open-...@googlegroups.com
Hoot, Joseph wrote:
> On Nov 16, 2009, at 8:19 PM, Hoot, Joseph wrote:
>
>> thanks. That helps. So I know that with the EqualLogic targets, there is a "Group IP" which, I believe, responds with an iscsi login_redirect.
>>
>> 1) Could the "Login authentication failed" message be the response because of a login redirect messages from the EQL redirect?
>>
>> and then my next question is more for curiosity sake:
>>
>> 2) Are there plans in the future to have more than one connection per session? and I guess in addition to that, would that mean multiple connections to a single volume over the same nic?
>>
>>
>
>
> Also Mike, I'm seeing one or two of these every 30-40 minutes if I slam our EqualLogic with roughly 7-15k IOPS (reads and writes) non stop on 3 volumes. In this type of scenario, would you expect to see timeouts like this once in awhile? If so, do you think increasing my NOOP timeouts would assist so we don't get these? maybe set it to 15 seconds instead of 10?
>

It might be a bug.

What version of open-iscsi are you using? What kernel? Is it a distro or
kernel.org one? And are you using the open-iscsi kernel modules that
come with a open-iscsi.org tarball or the kernel modules that come with
your kernel?

Mike Christie

unread,
Nov 17, 2009, 7:27:29 PM11/17/09
to open-...@googlegroups.com
Pasi K�rkk�inen wrote:
> On Mon, Nov 16, 2009 at 09:39:00PM -0500, Hoot, Joseph wrote:
>> On Nov 16, 2009, at 8:19 PM, Hoot, Joseph wrote:
>>
>>> thanks. That helps. So I know that with the EqualLogic targets, there is a "Group IP" which, I believe, responds with an iscsi login_redirect.
>>>
>>> 1) Could the "Login authentication failed" message be the response because of a login redirect messages from the EQL redirect?
>>>
>>> and then my next question is more for curiosity sake:
>>>
>>> 2) Are there plans in the future to have more than one connection per session? and I guess in addition to that, would that mean multiple connections to a single volume over the same nic?
>>>
>>>
>>
>> Also Mike, I'm seeing one or two of these every 30-40 minutes if I slam our EqualLogic with roughly 7-15k IOPS (reads and writes) non stop on 3 volumes. In this type of scenario, would you expect to see timeouts like this once in awhile? If so, do you think increasing my NOOP timeouts would assist so we don't get these? maybe set it to 15 seconds instead of 10?
>>
>
> Equallogic does active loadbalancing (redirects) during operation..
> dunno about the errors though.
>

Oh yeah, forgot about that. Thanks Pasi!

Joseph, look in the EQL target logs for something about the EQL box
doing load balancing. I think normally we handle the load balancing more
gracefully, but we might be messing up. I think if EQL was load
balancing in the open-iscsi logs we would see something about getting a
async iscsi pdu from the target that asks us to logout. Then when we
relogin the target would redirect us to the optimal path.

Hoot, Joseph

unread,
Nov 17, 2009, 7:42:46 PM11/17/09
to open-...@googlegroups.com, Donald_Williams@Dell.com> <Donald_Williams@Dell.com, Wallace, William
more INLINE below...

On Nov 17, 2009, at 7:27 PM, Mike Christie wrote:
There are two things that the EQL does, I believe-- one thing is async logout, the other is login_redirect. Unfortunately, from the EQL syslog side we don't see any errors related to this. It's my understanding, however, that when a login is initially attempted to the EQL, it hits the "group ip" or an alias'd IP sitting on a real nic. The group IP looks at all the interfaces on the EQL and decides, based on some algorithm, which EQL nic the session should connect to. It then sends the initiator that made the request a login_redirect, which I thought is basically a "logout and reconnect" pdu. It would say, for example, "you're can't log into the group IP, however, you can log into this IP (a real nic) that it would prefer you be logged into."

I'm thinking that the "failed login" is actually the result of that attempt to log into the group IP and it sending a login redirect pdu back to it.

Don, does this seem like normal EQL traffic to an OiS initiator?

Mike Christie

unread,
Nov 17, 2009, 9:51:32 PM11/17/09
to open-...@googlegroups.com, Donald_Williams@Dell.com> <Donald_Williams@Dell.com, Wallace, William
Hoot, Joseph wrote:
> more INLINE below...
>
> On Nov 17, 2009, at 7:27 PM, Mike Christie wrote:
>
>> Pasi K�rkk�inen wrote:
>>> On Mon, Nov 16, 2009 at 09:39:00PM -0500, Hoot, Joseph wrote:
>>>> On Nov 16, 2009, at 8:19 PM, Hoot, Joseph wrote:
>>>>
>>>>> thanks. That helps. So I know that with the EqualLogic targets, there is a "Group IP" which, I believe, responds with an iscsi login_redirect.
>>>>>
>>>>> 1) Could the "Login authentication failed" message be the response because of a login redirect messages from the EQL redirect?
>>>>>
>>>>> and then my next question is more for curiosity sake:
>>>>>
>>>>> 2) Are there plans in the future to have more than one connection per session? and I guess in addition to that, would that mean multiple connections to a single volume over the same nic?
>>>>>
>>>>>
>>>> Also Mike, I'm seeing one or two of these every 30-40 minutes if I slam our EqualLogic with roughly 7-15k IOPS (reads and writes) non stop on 3 volumes. In this type of scenario, would you expect to see timeouts like this once in awhile? If so, do you think increasing my NOOP timeouts would assist so we don't get these? maybe set it to 15 seconds instead of 10?
>>>>
>>> Equallogic does active loadbalancing (redirects) during operation..
>>> dunno about the errors though.
>>>
>> Oh yeah, forgot about that. Thanks Pasi!
>>
>> Joseph, look in the EQL target logs for something about the EQL box
>> doing load balancing. I think normally we handle the load balancing more
>> gracefully, but we might be messing up. I think if EQL was load
>> balancing in the open-iscsi logs we would see something about getting a
>> async iscsi pdu from the target that asks us to logout. Then when we
>> relogin the target would redirect us to the optimal path.
>
>
> There are two things that the EQL does, I believe-- one thing is async logout, the other is login_redirect. Unfortunately, from the EQL syslog side we don't see any errors related to this. It's my understanding, however, that when a login is initially attempted to the EQL, it hits the "group ip" or an alias'd IP sitting on a real nic. The group IP looks at all the interfaces on the EQL and decides, based on some algorithm, which EQL nic the session should connect to. It then sends the initiator that made the request a login_redirect, which I thought is basically a "logout and reconnect" pdu. It would say, for example, "you're can't log into the group IP, however, you can log into this IP (a real nic) that it would prefer you be logged into."
>
> I'm thinking that the "failed login" is actually the result of that attempt to log into the group IP and it sending a login redirect pdu back to it.
>

If the target was load balancing us it would:

- Send a async logout pdu.
- We then send a logout pdu.
- When we get the logout response pdu we kill the tcp ip connection
- We then create a new tcp connection
- We then log in to the portal that was passed into iscsiadm/iscsid (the
one in the DB that you see when you run iscsiadm -m node, which is
probably what you call the group IP). For this process we send a login
pdu. It then sends a login response pdu with the login redirect
response. In this response we also get the new IP to log into.
- We see that response and kill the tcp connection, and create a new tcp
connection to the portal we are being redirected to.
- We then log into the portal we were redirected to. We again do this by
sending a login pdu. This time the login response pdu should be ok and
we are done.


We do not know which login pdu failed, right now. You would need a
wireshark trace or iscsid debugging. iscsid could hit the "Login
authentication failed" path for either of the login pdus sent.

You should not see that message normally even when we are being
redirected. We do the login redirect login when we initially log into
the target (like when you do iscsiadm -m node -l or service iscsi
start), and if you look in your logs you should not see a login failed
message for that. If you do it might be a clue.

Mike Christie

unread,
Nov 17, 2009, 10:06:10 PM11/17/09
to open-...@googlegroups.com, Donald_Williams@Dell.com> <Donald_Williams@Dell.com, Wallace, William
Oh yeah, I meant to also say that this is pretty much the same process
that happens we do the first login, and if we have to relogin because of
a connection problem like the nop/ping timeout. The only difference in
those cases is that we do not get the async logout and we do not do a
logout by sending a logout pdu. We start at the killing tcp ip
connection step.

So even if we are not getting load balanced we would be in the same
place in the open-iscsi code when we are getting the login failed errors.


To get back on track solving why we get the nop timeouts then if we are
not seeing load balancing messages or async logout messages, it could
be the open-iscsi bug I mentioned in the other mail. If you can send the
open-iscsi and kernel info I asked for in the other mail, we can start
down that path.

Hoot, Joseph

unread,
Nov 18, 2009, 7:15:57 AM11/18/09
to open-...@googlegroups.com
Sure. Its the OVM 2.2 environment that I talked with you about a couple of days ago. Here is the info:

[root@oim6102506 ~]# uname -r
2.6.18-128.2.1.4.9.el5xen
[root@oim6102506 ~]# rpm -qa | grep iscsi
iscsi-initiator-utils-6.2.0.871-0.7.el5

I have a tcpdump that I had sent to Don WIlliams. I'll pull that up shortly.

Mike Christie

unread,
Nov 18, 2009, 7:45:07 PM11/18/09
to Guru Anbalagane, open-iscsi
ccing Guru Anbalagane.

Hoot, Joseph wrote:
> Sure. Its the OVM 2.2 environment that I talked with you about a couple of days ago. Here is the info:
>

Doh! I forgot.

Guru, I am trying to make a patch for your Oracle VM kernel. I want to
port the patch I told you about
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4c48a82935f833d94fcf44c2b0c5d2922acfc77a;hp=d1acfae514425d680912907c6554852f1e258551
and this one:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d1acfae514425d680912907c6554852f1e258551
if you do not have it (could not remember if you did or not).

I found the src rpm:
http://edelivery.oracle.com/EPD/Download/get_form?egroup_aru_number=11874896

The problem is that I keep getting RPM build errors when I try to build
the project:

rpm -ivh kernel-2.6.18-128.2.1.4.9.el5.src.rpm
rpmbuild -bp --target=noarch /usr/src/redhat/SPECS/kernel-2.6.spec

... a bunch of stuff then

make[1]: *** [nonint_oldconfig] Error 15
make: *** [nonint_oldconfig] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.94270 (%prep)


RPM build errors:
Bad exit status from /var/tmp/rpm-tmp.94270 (%prep)
(have also tried without the --target arg and with different archs and
always fails)

This is how I setup the RHEL kernel source rpm. Are the commands for the
Oracle VM kernel different?

Mike Christie

unread,
Nov 18, 2009, 8:15:41 PM11/18/09
to Guru Anbalagane, open-iscsi
Guru Anbalagane wrote:
> Hi Mike,
>
> Thanks.
> Can you please try --target=i686.
>

Thanks Guru.

> On the patches, yes, I will include it in our next VM kernel.


Joseph, attached is a patch that has a couple fixes from upstream in
that code path. If it does not fix your issue then I will have to send a
patch with some more debugging in it (the net layer is not waking us up
when send space is opening or it really is taking a long time to send
the data).

Patch is only compile tested.


> regards
> Guru
oracle-fix-ping.patch

Guru Anbalagane

unread,
Nov 18, 2009, 7:50:03 PM11/18/09
to Mike Christie, open-iscsi
Hi Mike,

Thanks.
Can you please try --target=i686.

On the patches, yes, I will include it in our next VM kernel.
regards
Guru
Mike Christie wrote:
Reply all
Reply to author
Forward
0 new messages