I don´t know if this is implemented yet, but I tried changing a few
parameters only to discover that open-iscsi (iscsid) went into a CPU
spin dive.
Here are the parameters I changed (assuming that the values are in
seconds):
iscsiadm -m node -r a0a7c1 --op update -n
node.conn[0].timeo.noop_out_interval -v 5
iscsiadm -m node -r a0a7c1 --op update -n
node.conn[0].timeo.noop_out_timeout -v 15
What I think I was settings was the interval timing of the nop-out ping
as 5 seconds and a timeout of 15 seconds. But iscsid started to eat 96%
of CPU seconds and no ping occurs as there was no network activity
after the first login.
Albert
Albert, I did quick work for support nop-out before. I just did one type of
nop-out, which has valid itt and no data padding, which requires target's nop-
in response.
As I remember for now, noop_out_interval might be larger than noop_out_timeout,
so you may exchange the setting and retry. I tested it with iet, it sends nop-out
and iet gives nop-in back.
This's a patch for fix conflict when tcp_state_change detect conn error.
thanks for testing, as it seems no one else tried it. :)
-zhen
---
--- usr/initiator.c.orig 2006-01-04 12:02:18.000000000 +0800
+++ usr/initiator.c 2006-01-04 12:03:19.000000000 +0800
@@ -548,6 +548,13 @@ __session_conn_cleanup(iscsi_conn_t *con
{
iscsi_session_t *session = conn->session;
+ if (conn->noop_out_interval) {
+ actor_delete(&conn->noop_out_timer);
+ actor_delete(&conn->noop_out_timeout_timer);
+ log_debug(3, "conn noop out timer %p stopped\n",
+ &conn->noop_out_timer);
+ }
+
if (ipc->destroy_conn(session->transport_handle, conn->handle,
conn->id)) {
log_error("can not safely destroy connection %d", conn->id);
@@ -638,13 +645,6 @@ __session_free(iscsi_session_t *session)
iscsi_io_disconnect(conn);
__session_conn_queue_flush(conn);
- if (conn->noop_out_interval) {
- actor_delete(&conn->noop_out_timer);
- actor_delete(&conn->noop_out_timeout_timer);
- log_debug(3, "conn noop out timer %p stopped\n",
- &conn->noop_out_timer);
- }
-
return __session_conn_cleanup(conn);
}
@@ -729,7 +729,6 @@ __conn_noop_out_timeout(void *data)
iscsi_session_t *session = conn->session;
log_debug(3, "noop out rsp timeout, closing conn...\n");
- actor_delete(&conn->noop_out_timer);
/* XXX: error handle */
__conn_error_handle(session, conn);
}
Thanks Zhen,
I will try to set noop_out_interval > noop_out_timeout and see what it
does.
If it won't work I will send you the ethereal trace of the core-iscsi
nop-out nop-in with the Wasabi target (as that works fine). But first I
try your suggestion.
Do I understand you correctly that this patch will fix the situation
when noop_out_interval < noop_out_timeout ?
Albert
P.S. And Happy NewYear to you and all the others at this list!
Here's update patch, pls try it and ignore the last one.
---
--- usr/initiator.c.orig 2006-01-04 12:02:18.000000000 +0800
+++ usr/initiator.c 2006-01-04 16:45:50.000000000 +0800
@@ -739,10 +738,13 @@ __conn_noop_out(void *data)
{
iscsi_conn_t *conn = (iscsi_conn_t*)data;
__send_nopout(conn);
- if (conn->noop_out_timeout_timer.state != ACTOR_WAITING)
+ if (conn->noop_out_timeout_timer.state == ACTOR_NOTSCHEDULED) {
actor_timer(&conn->noop_out_timeout_timer,
conn->noop_out_timeout*1000,
__conn_noop_out_timeout, conn);
+ log_debug(3, "noop out timeout timer %p start\n",
+ &conn->noop_out_timeout_timer);
+ }
actor_timer(&conn->noop_out_timer, conn->noop_out_interval*1000,
__conn_noop_out, data);
}
@@ -964,6 +966,8 @@ __session_conn_recv_pdu(queue_item_t *it
actor_timer(&conn->noop_out_timer,
conn->noop_out_interval*1000,
__conn_noop_out, conn);
+ actor_new(&conn->noop_out_timeout_timer,
+ __conn_noop_out_timeout, conn);
log_debug(3, "noop out timer %p start\n",
&conn->noop_out_timer);
}
Albert
strange. could you send out iscsid log? any message in dmesg?
Albert
But here's what I found. Of the two instances of iscsid started, the
second one starts to burn cycles, but only after the interval parameter
of seconds, not sooner. The first instance runs fine. And only one
Nop-Out <-> Nop-In is generated, nothing more. I can do a logout before
the timeout interval expires, the scsi cache is flushed, no problem.
After the timeout paramter expires the logout command just hangs.
Hope this gives a bit of a clue?
Albert
Both the initiator and the target are sending Nops, so I see the
following:
initiator: Nop Out, response target: Nop In
target: Nop In, response initiator: Nop Out
target: Nop In, response initiator: Nop Out
initiator: Nop Out, response target: Nop In
.... continue ....
(target is Wasabi target).
Albert
> For what it's worth. I noticed the following (using ethereal) with
> core-iscsi, which does the Nop-Out Nop-In correctly:
>
> Both the initiator and the target are sending Nops, so I see the
> following:
>
> initiator: Nop Out, response target: Nop In
>
> target: Nop In, response initiator: Nop Out
>
> target: Nop In, response initiator: Nop Out
Something is wrong in the NOP-Out or NOP-In processing. This target
only sends periodic NOP-INs to solicit a NOP-Out when there are
unacknowledged status events on the connection; the last received
ExpStatSN is not what the target thinks it should be. The fact that you
see more than one indicates that the NOP-Outs from the initiator did
not move ExpStatSN as needed.
Double check that ExpStatSN gets incremented for solicited NOP-INs.
This is in the fine print in RFC 3720:
10.19.2. StatSN
The StatSN field will always contain the next StatSN. However, when
the Initiator Task Tag is set to 0xffffffff, StatSN for the
connection is not advanced after this PDU is sent.
Thus if ITT != 0xffffffff, ExpStatSN needs to be incremented.
Take care,
Bill
Bill, Thanks I will post your message also in the core-iscsi mailing
list.
Albert
maybe iscsid -f will give you more clue, actually I also hit this problem.
actor.c is corrupt in some way, and noop_out callback never got called.
I set noop interval to 20, and timeout is 10. Target is Adtron, iet works fine.
Here's some log message (I changed some words in log):
...
Jun 1 16:29:57 atca3 iscsid: thread b7f15530 wait some more, current 158 diff 8
Jun 1 16:29:58 atca3 iscsid: thread b7f15530 wait some more, current 161 diff 5
Jun 1 16:29:59 atca3 iscsid: thread b7f15530 wait some more, current 164 diff 2
Jun 1 16:30:00 atca3 iscsid: thread b7f15530 was scheduled at 46, tt sched 120, curtime 167 q_forw 0xb7f15530 &pend_list 0x8062e88
Jun 1 16:30:00 atca3 iscsid: thread b7f15530 now in actor_list
Jun 1 16:30:00 atca3 iscsid: thread b7f15530 was scheduled at 46, tt sched 120, curtime 167 q_forw 0xb7f15530 &pend_list 0x8062e88
Jun 1 16:30:00 atca3 iscsid: thread b7f15530 now in actor_list
Jun 1 16:30:00 atca3 iscsid: thread b7f15530 was scheduled at 46, tt sched 120, curtime 167 q_forw 0xb7f15530 &pend_list 0x8062e88
(loop forever in actor_check()'s pend_list...)
...
You can see that q_forw is corrupt, that is wrongly still point to thread after
a remque(). In iet case, it points to &pend_list. I don't know why, hope
others have some ideas. I also wrote some simple code to test insque/remque
without problem. What might cause this?
Albert
Now I know I made a silly fault in nop-out patch, in CHAP login case,
nop-out timer got added twice. Patch below fixes it. I tested it with Adtron
hw target, no problem.
Albert, pls test it. Thanks for finding this issue!
zhen
---
--- usr/initiator.c.orig 2006-01-12 23:27:40.000000000 +0800
+++ usr/initiator.c 2006-01-13 12:13:34.000000000 +0800
@@ -546,6 +546,13 @@ __session_conn_cleanup(iscsi_conn_t *con
{
iscsi_session_t *session = conn->session;
+ if (conn->noop_out_interval) {
+ actor_delete(&conn->noop_out_timer);
+ actor_delete(&conn->noop_out_timeout_timer);
+ log_debug(3, "conn noop out timer %p stopped\n",
+ &conn->noop_out_timer);
+ }
+
if (ipc->destroy_conn(session->transport_handle, conn->handle,
conn->id)) {
log_error("can not safely destroy connection %d", conn->id);
@@ -636,13 +643,6 @@ __session_free(iscsi_session_t *session)
iscsi_io_disconnect(conn);
__session_conn_queue_flush(conn);
- if (conn->noop_out_interval) {
- actor_delete(&conn->noop_out_timer);
- actor_delete(&conn->noop_out_timeout_timer);
- log_debug(3, "conn noop out timer %p stopped\n",
- &conn->noop_out_timer);
- }
-
return __session_conn_cleanup(conn);
}
@@ -727,7 +727,6 @@ __conn_noop_out_timeout(void *data)
iscsi_session_t *session = conn->session;
log_debug(3, "noop out rsp timeout, closing conn...\n");
- actor_delete(&conn->noop_out_timer);
/* XXX: error handle */
__conn_error_handle(session, conn);
}
@@ -737,10 +736,13 @@ __conn_noop_out(void *data)
{
iscsi_conn_t *conn = (iscsi_conn_t*)data;
__send_nopout(conn);
- if (conn->noop_out_timeout_timer.state != ACTOR_WAITING)
+ if (conn->noop_out_timeout_timer.state == ACTOR_NOTSCHEDULED) {
actor_timer(&conn->noop_out_timeout_timer,
conn->noop_out_timeout*1000,
__conn_noop_out_timeout, conn);
+ log_debug(3, "noop out timeout timer %p start\n",
+ &conn->noop_out_timeout_timer);
+ }
actor_timer(&conn->noop_out_timer, conn->noop_out_interval*1000,
__conn_noop_out, data);
}
@@ -952,14 +954,16 @@ __session_conn_recv_pdu(queue_item_t *it
session->r_stage = R_STAGE_NO_CHANGE;
}
- }
- /* noop_out */
- if (conn->noop_out_interval) {
- actor_timer(&conn->noop_out_timer,
- conn->noop_out_interval*1000,
- __conn_noop_out, conn);
- log_debug(3, "noop out timer %p start\n",
- &conn->noop_out_timer);
+ /* noop_out */
+ if (conn->noop_out_interval) {
+ actor_timer(&conn->noop_out_timer,
+ conn->noop_out_interval*1000,
+ __conn_noop_out, conn);
+ actor_new(&conn->noop_out_timeout_timer,
+ __conn_noop_out_timeout, conn);
+ log_debug(3, "noop out timer %p start\n",
+ &conn->noop_out_timer);
+ }
}
} else if (conn->state == STATE_LOGGED_IN) {
struct iscsi_hdr hdr;
Thanks,
Albert
The patch seems to send me into recovery, and sometimes I will get stuck
in some sort of loop and I have to kill iscsid becuase iscsid takes
over. I will try to debug and see what it is. This is with a Netapp target.
Are you sure your iscsid run with that patch? Could you send iscsid's log so
I can also look into it.