NOP-in NOP-out issues

Albert Pauw

unread,

Dec 31, 2005, 3:28:52 AM12/31/05

to open-iscsi

Comparing open-iscsi with core-iscsi I noticed that open-iscsi does not
do a regular NOP-out.

I don´t know if this is implemented yet, but I tried changing a few
parameters only to discover that open-iscsi (iscsid) went into a CPU
spin dive.

Here are the parameters I changed (assuming that the values are in
seconds):

iscsiadm -m node -r a0a7c1 --op update -n
node.conn[0].timeo.noop_out_interval -v 5
iscsiadm -m node -r a0a7c1 --op update -n
node.conn[0].timeo.noop_out_timeout -v 15

What I think I was settings was the interval timing of the nop-out ping
as 5 seconds and a timeout of 15 seconds. But iscsid started to eat 96%
of CPU seconds and no ping occurs as there was no network activity
after the first login.

Albert

Wang Zhenyu

unread,

Jan 3, 2006, 11:15:21 PM1/3/06

to open-...@googlegroups.com

Albert, I did quick work for support nop-out before. I just did one type of
nop-out, which has valid itt and no data padding, which requires target's nop-
in response.
As I remember for now, noop_out_interval might be larger than noop_out_timeout,
so you may exchange the setting and retry. I tested it with iet, it sends nop-out
and iet gives nop-in back.

This's a patch for fix conflict when tcp_state_change detect conn error.

thanks for testing, as it seems no one else tried it. :)

-zhen

---
--- usr/initiator.c.orig 2006-01-04 12:02:18.000000000 +0800
+++ usr/initiator.c 2006-01-04 12:03:19.000000000 +0800
@@ -548,6 +548,13 @@ __session_conn_cleanup(iscsi_conn_t *con
{
iscsi_session_t *session = conn->session;

+ if (conn->noop_out_interval) {
+ actor_delete(&conn->noop_out_timer);
+ actor_delete(&conn->noop_out_timeout_timer);
+ log_debug(3, "conn noop out timer %p stopped\n",
+ &conn->noop_out_timer);
+ }
+
if (ipc->destroy_conn(session->transport_handle, conn->handle,
conn->id)) {
log_error("can not safely destroy connection %d", conn->id);
@@ -638,13 +645,6 @@ __session_free(iscsi_session_t *session)
iscsi_io_disconnect(conn);
__session_conn_queue_flush(conn);

- if (conn->noop_out_interval) {
- actor_delete(&conn->noop_out_timer);
- actor_delete(&conn->noop_out_timeout_timer);
- log_debug(3, "conn noop out timer %p stopped\n",
- &conn->noop_out_timer);
- }
-
return __session_conn_cleanup(conn);
}

@@ -729,7 +729,6 @@ __conn_noop_out_timeout(void *data)
iscsi_session_t *session = conn->session;

log_debug(3, "noop out rsp timeout, closing conn...\n");
- actor_delete(&conn->noop_out_timer);
/* XXX: error handle */
__conn_error_handle(session, conn);
}

nop_out_timer.patch

Albert Pauw

unread,

Jan 4, 2006, 1:08:35 AM1/4/06

to open-iscsi

> Albert, I did quick work for support nop-out before. I just did one type of
> nop-out, which has valid itt and no data padding, which requires target's nop-
> in response.
> As I remember for now, noop_out_interval might be larger than noop_out_timeout,
> so you may exchange the setting and retry. I tested it with iet, it sends nop-out
> and iet gives nop-in back.
>
> This's a patch for fix conflict when tcp_state_change detect conn error.
>
> thanks for testing, as it seems no one else tried it. :)
>

Thanks Zhen,

I will try to set noop_out_interval > noop_out_timeout and see what it
does.
If it won't work I will send you the ethereal trace of the core-iscsi
nop-out nop-in with the Wasabi target (as that works fine). But first I
try your suggestion.

Do I understand you correctly that this patch will fix the situation
when noop_out_interval < noop_out_timeout ?

Albert

P.S. And Happy NewYear to you and all the others at this list!

Wang Zhenyu

unread,

Jan 4, 2006, 3:49:18 AM1/4/06

to open-...@googlegroups.com

Here's update patch, pls try it and ignore the last one.

---
--- usr/initiator.c.orig 2006-01-04 12:02:18.000000000 +0800

+++ usr/initiator.c 2006-01-04 16:45:50.000000000 +0800

@@ -739,10 +738,13 @@ __conn_noop_out(void *data)
{
iscsi_conn_t *conn = (iscsi_conn_t*)data;
__send_nopout(conn);
- if (conn->noop_out_timeout_timer.state != ACTOR_WAITING)
+ if (conn->noop_out_timeout_timer.state == ACTOR_NOTSCHEDULED) {
actor_timer(&conn->noop_out_timeout_timer,
conn->noop_out_timeout*1000,
__conn_noop_out_timeout, conn);
+ log_debug(3, "noop out timeout timer %p start\n",
+ &conn->noop_out_timeout_timer);
+ }
actor_timer(&conn->noop_out_timer, conn->noop_out_interval*1000,
__conn_noop_out, data);
}
@@ -964,6 +966,8 @@ __session_conn_recv_pdu(queue_item_t *it
actor_timer(&conn->noop_out_timer,
conn->noop_out_interval*1000,
__conn_noop_out, conn);
+ actor_new(&conn->noop_out_timeout_timer,
+ __conn_noop_out_timeout, conn);
log_debug(3, "noop out timer %p start\n",
&conn->noop_out_timer);
}

nop-out.patch

Albert Pauw

unread,

Jan 4, 2006, 12:08:41 PM1/4/06

to open-iscsi

No good, I am afraid. I set timeo.noop_out_interval to 15 and
noop_out_timeout -v 5, but I got exactly one ping (Nop Out - Nop In).
After that iscsid started to burn CPU cycles.

Albert

Wang Zhenyu

unread,

Jan 4, 2006, 9:35:47 PM1/4/06

to open-...@googlegroups.com

strange. could you send out iscsid log? any message in dmesg?

Albert Pauw

unread,

Jan 5, 2006, 12:57:05 AM1/5/06

to open-iscsi

I can have a look, but since it is quite unuseable I can only watch and
jot the messages down on paper.
Btw the target is a Wasabi target, not iscsitarget. Once I set the
parameters back to 0 it is fine again (after a reboot).
Even loggin out the target doesn't work (or is so slow that it doesn't
work).

Albert

Albert Pauw

unread,

Jan 5, 2006, 1:38:32 PM1/5/06

to open-iscsi

I can't get much output from the debug mode of iscsid.

But here's what I found. Of the two instances of iscsid started, the
second one starts to burn cycles, but only after the interval parameter
of seconds, not sooner. The first instance runs fine. And only one
Nop-Out <-> Nop-In is generated, nothing more. I can do a logout before
the timeout interval expires, the scsi cache is flushed, no problem.
After the timeout paramter expires the logout command just hangs.

Hope this gives a bit of a clue?

Albert

Albert Pauw

unread,

Jan 5, 2006, 2:27:39 PM1/5/06

to open-iscsi

For what it's worth. I noticed the following (using ethereal) with
core-iscsi, which does the Nop-Out Nop-In correctly:

Both the initiator and the target are sending Nops, so I see the
following:

initiator: Nop Out, response target: Nop In

target: Nop In, response initiator: Nop Out

initiator: Nop Out, response target: Nop In

.... continue ....

(target is Wasabi target).

Albert

William Studenmund

unread,

Jan 5, 2006, 3:46:05 PM1/5/06

to open-...@googlegroups.com

On Jan 5, 2006, at 11:27 AM, Albert Pauw wrote:

> For what it's worth. I noticed the following (using ethereal) with
> core-iscsi, which does the Nop-Out Nop-In correctly:
>
> Both the initiator and the target are sending Nops, so I see the
> following:
>
> initiator: Nop Out, response target: Nop In
>
> target: Nop In, response initiator: Nop Out
>
> target: Nop In, response initiator: Nop Out

Something is wrong in the NOP-Out or NOP-In processing. This target
only sends periodic NOP-INs to solicit a NOP-Out when there are
unacknowledged status events on the connection; the last received
ExpStatSN is not what the target thinks it should be. The fact that you
see more than one indicates that the NOP-Outs from the initiator did
not move ExpStatSN as needed.

Double check that ExpStatSN gets incremented for solicited NOP-INs.
This is in the fine print in RFC 3720:

10.19.2. StatSN

The StatSN field will always contain the next StatSN. However, when
the Initiator Task Tag is set to 0xffffffff, StatSN for the
connection is not advanced after this PDU is sent.

Thus if ITT != 0xffffffff, ExpStatSN needs to be incremented.

Take care,

Bill

PGP.sig

Albert Pauw

unread,

Jan 5, 2006, 4:18:23 PM1/5/06

to open-iscsi

To clear things up, this means that something is not completely right
in the Nops processing of core-iscsi.

Bill, Thanks I will post your message also in the core-iscsi mailing
list.

Albert

Wang Zhenyu

unread,

Jan 6, 2006, 3:31:53 AM1/6/06

to open-...@googlegroups.com

maybe iscsid -f will give you more clue, actually I also hit this problem.
actor.c is corrupt in some way, and noop_out callback never got called.
I set noop interval to 20, and timeout is 10. Target is Adtron, iet works fine.

Here's some log message (I changed some words in log):
...
Jun 1 16:29:57 atca3 iscsid: thread b7f15530 wait some more, current 158 diff 8
Jun 1 16:29:58 atca3 iscsid: thread b7f15530 wait some more, current 161 diff 5
Jun 1 16:29:59 atca3 iscsid: thread b7f15530 wait some more, current 164 diff 2
Jun 1 16:30:00 atca3 iscsid: thread b7f15530 was scheduled at 46, tt sched 120, curtime 167 q_forw 0xb7f15530 &pend_list 0x8062e88
Jun 1 16:30:00 atca3 iscsid: thread b7f15530 now in actor_list
Jun 1 16:30:00 atca3 iscsid: thread b7f15530 was scheduled at 46, tt sched 120, curtime 167 q_forw 0xb7f15530 &pend_list 0x8062e88
Jun 1 16:30:00 atca3 iscsid: thread b7f15530 now in actor_list
Jun 1 16:30:00 atca3 iscsid: thread b7f15530 was scheduled at 46, tt sched 120, curtime 167 q_forw 0xb7f15530 &pend_list 0x8062e88
(loop forever in actor_check()'s pend_list...)
...

You can see that q_forw is corrupt, that is wrongly still point to thread after
a remque(). In iet case, it points to &pend_list. I don't know why, hope
others have some ideas. I also wrote some simple code to test insque/remque
without problem. What might cause this?

Albert Pauw

unread,

Jan 6, 2006, 1:23:33 PM1/6/06

to open-iscsi

I actually made a type in my previous test. Yes, this is exactly the
behaviour I have.

Albert

Wang Zhenyu

unread,

Jan 12, 2006, 11:20:55 PM1/12/06

to open-...@googlegroups.com

On 2006.01.06 10:23:33 +0000, Albert Pauw wrote:
>
> I actually made a type in my previous test. Yes, this is exactly the
> behaviour I have.
>

Now I know I made a silly fault in nop-out patch, in CHAP login case,
nop-out timer got added twice. Patch below fixes it. I tested it with Adtron
hw target, no problem.

Albert, pls test it. Thanks for finding this issue!

zhen

---
--- usr/initiator.c.orig 2006-01-12 23:27:40.000000000 +0800
+++ usr/initiator.c 2006-01-13 12:13:34.000000000 +0800
@@ -546,6 +546,13 @@ __session_conn_cleanup(iscsi_conn_t *con

{
iscsi_session_t *session = conn->session;

+ if (conn->noop_out_interval) {
+ actor_delete(&conn->noop_out_timer);
+ actor_delete(&conn->noop_out_timeout_timer);
+ log_debug(3, "conn noop out timer %p stopped\n",
+ &conn->noop_out_timer);
+ }
+
if (ipc->destroy_conn(session->transport_handle, conn->handle,
conn->id)) {
log_error("can not safely destroy connection %d", conn->id);

@@ -636,13 +643,6 @@ __session_free(iscsi_session_t *session)

iscsi_io_disconnect(conn);
__session_conn_queue_flush(conn);

- if (conn->noop_out_interval) {
- actor_delete(&conn->noop_out_timer);
- actor_delete(&conn->noop_out_timeout_timer);
- log_debug(3, "conn noop out timer %p stopped\n",
- &conn->noop_out_timer);
- }
-
return __session_conn_cleanup(conn);
}

@@ -727,7 +727,6 @@ __conn_noop_out_timeout(void *data)

iscsi_session_t *session = conn->session;

log_debug(3, "noop out rsp timeout, closing conn...\n");
- actor_delete(&conn->noop_out_timer);
/* XXX: error handle */
__conn_error_handle(session, conn);
}

@@ -737,10 +736,13 @@ __conn_noop_out(void *data)

{
iscsi_conn_t *conn = (iscsi_conn_t*)data;
__send_nopout(conn);
- if (conn->noop_out_timeout_timer.state != ACTOR_WAITING)
+ if (conn->noop_out_timeout_timer.state == ACTOR_NOTSCHEDULED) {
actor_timer(&conn->noop_out_timeout_timer,
conn->noop_out_timeout*1000,
__conn_noop_out_timeout, conn);
+ log_debug(3, "noop out timeout timer %p start\n",
+ &conn->noop_out_timeout_timer);
+ }
actor_timer(&conn->noop_out_timer, conn->noop_out_interval*1000,
__conn_noop_out, data);
}

@@ -952,14 +954,16 @@ __session_conn_recv_pdu(queue_item_t *it

session->r_stage = R_STAGE_NO_CHANGE;
}
- }
- /* noop_out */
- if (conn->noop_out_interval) {
- actor_timer(&conn->noop_out_timer,
- conn->noop_out_interval*1000,
- __conn_noop_out, conn);
- log_debug(3, "noop out timer %p start\n",
- &conn->noop_out_timer);
+ /* noop_out */
+ if (conn->noop_out_interval) {
+ actor_timer(&conn->noop_out_timer,
+ conn->noop_out_interval*1000,
+ __conn_noop_out, conn);

+ actor_new(&conn->noop_out_timeout_timer,
+ __conn_noop_out_timeout, conn);

+ log_debug(3, "noop out timer %p start\n",
+ &conn->noop_out_timer);
+ }
}
} else if (conn->state == STATE_LOGGED_IN) {
struct iscsi_hdr hdr;

noop_out_fix.patch

Albert Pauw

unread,

Jan 12, 2006, 11:32:22 PM1/12/06

to open-iscsi

Just did a quick check on r450, the last one to work with, and this
works fine now!

Thanks,

Albert

Mike Christie

unread,

Jan 14, 2006, 12:59:55 AM1/14/06

to open-...@googlegroups.com

Wang Zhenyu wrote:
> On 2006.01.06 10:23:33 +0000, Albert Pauw wrote:
>
>>I actually made a type in my previous test. Yes, this is exactly the
>>behaviour I have.
>>
>
>
> Now I know I made a silly fault in nop-out patch, in CHAP login case,
> nop-out timer got added twice. Patch below fixes it. I tested it with Adtron
> hw target, no problem.
>
> Albert, pls test it. Thanks for finding this issue!
>

The patch seems to send me into recovery, and sometimes I will get stuck
in some sort of loop and I have to kill iscsid becuase iscsid takes
over. I will try to debug and see what it is. This is with a Netapp target.

Wang Zhenyu

unread,

Jan 15, 2006, 10:58:06 PM1/15/06

to open-...@googlegroups.com

Are you sure your iscsid run with that patch? Could you send iscsid's log so
I can also look into it.

Reply all

Reply to author

Forward