Is there anything more to the log? Is there anything from iscsid?
Something about not being able to connect/reconnect to the target?
If you just see that, then it means there was some connection problem.
We do not know exactly what it was, but we disconnected the connection,
then tried to reconnect. We tried to reconnect for 400 seconds but could
not, so at that point we mark the session as bad and start to fail IO
until we can log back in.
It is normally due to a problem in the network if the target is ok.
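The timeline described above can be modelled in a few lines. This is only a sketch: the function name and the outcome labels are made up for illustration, and the 400 second default simply mirrors the value in the log message from this thread.

```python
def io_outcome(outage_secs, recovery_timeout_secs=400):
    """Classify what happens to IO for an outage of the given length.

    Hypothetical model of the behaviour described in this thread, not
    open-iscsi code: IO is held while the initiator retries the login and
    is failed once the recovery timeout expires.
    """
    if outage_secs < 0:
        raise ValueError("outage_secs must be non-negative")
    if outage_secs == 0:
        return "no interruption"
    if outage_secs < recovery_timeout_secs:
        return "IO paused, then resumed after relogin"
    return "session marked bad, IO failed until relogin"


for secs in (0, 30, 399, 400, 900):
    print(secs, "->", io_outcome(secs))
```

An outage shorter than the timeout only delays IO; anything longer surfaces as IO errors to the upper layers.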
On 07/13/2010 09:33 PM, Sean S wrote:
> Nothing else in the log from iscsid. No mention of a failed reconnect,
> although the only log I'm really able to access post failure is dmesg.
> Since I'm running a root iscsi, I couldn't get to /var/log/messages
> which maybe was a little more verbose? What sort of network problems
Yeah, by default the iscsid messages go there. iscsid should be spitting
out a cannot connect $some_error_value_or_string that would help tell us
why we cannot reach the target anymore.
> might cause this? The "network" in this situation is a simple gigE
> switch with about 3 or 4 systems on it. The target and initiator are
> on the same subnet, nothing fancy. Is there some additional debug
> you'd recommend turning on? Any tips or tricks when running with a
> root iscsi drive?
Not that I can think of at the iscsi layer.
>
> Curiously, if I physically disconnect the ethernet from the initiator
> while running, all I/O access is correctly paused without returning
> I/O errors. If I then reconnect before the 400s is up things go back to
> normal. I don't, however, see the "detected conn error (1011)" message
> in this situation. Not sure if that really means anything.
You should see the conn error 1011 message if
1. you have nops on and they time out, which causes us to log that error.
2. the network layer figures out there is a problem and notifies us. It
is possible that you pull a cable and plug it back in before the network
throws an error.
3. there is an iscsi driver or protocol error. In this case we should relogin quickly.
Hi!
I cannot answer your question, but that brings up something I wanted to talk about. My apologies if something like this already exists; I am not aware of it:
In HP-UX 11.31 you can print "scan times" per device (i.e. LUN). Here's an example for a true FC-SAN:
Class     I  H/W Path                                       ms_scan_time
==============================================================================
lunpath   3  0/3/1/0.0x50001fe1500c1f28.0x0                 0 min 0 sec 13 ms
lunpath  24  0/3/1/0.0x50001fe1500c1f28.0x4001000000000000  0 min 0 sec 88 ms
lunpath  73  0/3/1/0.0x50001fe1500c1f28.0x4002000000000000  0 min 0 sec 88 ms
lunpath  25  0/3/1/0.0x50001fe1500c1f28.0x4003000000000000  0 min 0 sec 88 ms
lunpath  74  0/3/1/0.0x50001fe1500c1f28.0x4009000000000000  0 min 0 sec 88 ms
lunpath  26  0/3/1/0.0x50001fe1500c1f28.0x4033000000000000  0 min 0 sec 88 ms
lunpath  88  0/3/1/0.0x50001fe1500c1f28.0x4037000000000000  0 min 0 sec 88 ms
lunpath  79  0/3/1/0.0x50001fe1500c1f28.0x403d000000000000  0 min 0 sec 91 ms
lunpath  27  0/3/1/0.0x50001fe1500c1f28.0x4047000000000000  0 min 0 sec 91 ms
[...]
lunpath  63  0/7/1/0.0x500308c001d83803.0x4001000000000000  0 min 0 sec 11 ms
lunpath  64  0/7/1/0.0x500308c001d83803.0x4002000000000000  0 min 0 sec 11 ms
lunpath  65  0/7/1/0.0x500308c001d83803.0x4003000000000000  0 min 0 sec 11 ms
lunpath  66  0/7/1/0.0x500308c001d83803.0x4004000000000000  0 min 0 sec 536 ms
If Linux/open-iscsi had something similar, one could periodically watch the times to find bottlenecks. AFAIK, the "scan time" in HP-UX is the round-trip delay for querying a LUN or a controller (a target?).
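To illustrate the kind of periodic watching meant here, the following sketch parses a listing in the format shown above and reports paths whose scan time exceeds a threshold. The function names and the 200 ms threshold are assumptions chosen for the example; nothing like this ships with HP-UX or open-iscsi as far as this thread indicates.

```python
import re

# Matches the trailing "M min S sec N ms" column of the listing above.
SCAN_TIME_RE = re.compile(r"(\d+) min (\d+) sec (\d+) ms\s*$")


def parse_scan_line(line):
    """Return (hw_path, scan_time_ms) for one 'lunpath' line, or None."""
    fields = line.split()
    match = SCAN_TIME_RE.search(line)
    if len(fields) < 3 or fields[0] != "lunpath" or match is None:
        return None
    minutes, seconds, millis = (int(g) for g in match.groups())
    return fields[2], (minutes * 60 + seconds) * 1000 + millis


def slow_paths(lines, threshold_ms=200):
    """Yield (hw_path, ms) for every path slower than threshold_ms."""
    for line in lines:
        parsed = parse_scan_line(line)
        if parsed is not None and parsed[1] > threshold_ms:
            yield parsed


sample = [
    "lunpath 65 0/7/1/0.0x500308c001d83803.0x4003000000000000 0 min 0 sec 11 ms",
    "lunpath 66 0/7/1/0.0x500308c001d83803.0x4004000000000000 0 min 0 sec 536 ms",
]
for path, ms in slow_paths(sample):
    print(path, ms)
```

With the two sample lines, only the path ending in 0x4004000000000000 (536 ms) is reported.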
Ulrich
Remember that syslogd can also write the log to a terminal or serial line. For SUSE Linux it's on tty10 (Ctrl+Alt+F10), but not very verbose. You could try to set it up similarly, with more verbosity.
[...]
Ulrich
Did you want to find bottlenecks in the network or between the initiator
and actual device or initiator and target?
Erez was adding some code that exports the iscsi nop/ping times.
The nop/ping we send has a header of 48 bytes and no data payload. It
does not have to do any disk/device IO. So this is nice for testing the
network.
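For reference, the 48 bytes are the iSCSI Basic Header Segment. A rough sketch of how a NOP-Out header with no payload is laid out, following the field order in RFC 3720, might look like this. The helper name is made up, and setting the immediate bit is an assumption about how the ping is sent, not something stated in this thread.

```python
import struct

# Basic Header Segment layout for a NOP-Out PDU as laid out in RFC 3720:
# opcode, flags, 2 reserved bytes, TotalAHSLength, 3-byte DataSegmentLength,
# 8-byte LUN, ITT, TTT, CmdSN, ExpStatSN, 16 reserved bytes.
NOP_OUT_BHS = struct.Struct(">BB2xB3sQIIII16x")

OPCODE_NOP_OUT = 0x00
IMMEDIATE_BIT = 0x40
FINAL_BIT = 0x80


def build_nop_out(itt, cmd_sn, exp_stat_sn, ttt=0xFFFFFFFF):
    """Build a ping-style NOP-Out header with no data segment."""
    return NOP_OUT_BHS.pack(
        OPCODE_NOP_OUT | IMMEDIATE_BIT,  # assumption: sent as immediate PDU
        FINAL_BIT,
        0,                 # TotalAHSLength: no additional header segments
        b"\x00\x00\x00",   # DataSegmentLength: no payload
        0,                 # LUN
        itt,               # initiator task tag
        ttt,               # 0xffffffff: not a reply to a target NOP-In
        cmd_sn,
        exp_stat_sn,
    )


header = build_nop_out(itt=1, cmd_sn=10, exp_stat_sn=20)
print(len(header))
```

Because the PDU carries no data segment, its round-trip time reflects the network and the target's iSCSI stack rather than any disk behind it.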
Each scsi command has a timeout (see /sys/block/sdX/device/timeout). The
above dump shows that a scsi command is timing out. This causes the scsi
layer to have the driver, iscsi_tcp in this case, try to abort the
command. It looks like the abort timed out too, and so the iscsi layer
decided to escalate the eh and failed the iscsi session/connection.
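As a small illustration, the per-device command timeout can be read from the sysfs file mentioned above. The function name and the `sysfs_root` parameter are assumptions added so the sketch can be pointed at a test directory; the device name "sda" is only an example.

```python
from pathlib import Path


def read_scsi_timeout(device, sysfs_root="/sys"):
    """Return the SCSI command timeout in seconds for a device such as 'sda'."""
    path = Path(sysfs_root) / "block" / device / "device" / "timeout"
    return int(path.read_text().strip())


try:
    print("sda timeout:", read_scsi_timeout("sda"))
except (OSError, ValueError) as exc:
    # The device may not exist on this machine, or may not be a SCSI disk.
    print("could not read timeout:", exc)
```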
>
> session1: session recovery timed out after 400 secs
The iscsi layer tried to log back in for recovery/replacement timeout
seconds, but could not.
Did you see anything from iscsid about why it could not log in? iscsid
writes to /var/log/messages by default.
>
> sd 0:0:0:0: scsi: Device offlined - no ready after error recovery
>
Because the replacement/recovery timeout fired, the iscsi layer decided
it was time to give up and told the scsi layer the disks are not
recoverable, and so we get these messages:
> sd 0:0:0:0: scsi: Device offlined - no ready after error recovery
>
Does the session/connection ever re-login? You would see some message in
/var/log/messages about connection X:Y is operational after recovery (Z
attempts).
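A quick way to answer these questions from a saved log is to count the relevant messages. The sketch below matches only the message fragments quoted in this thread; the "connection1:0" prefixes in the sample are assumed for illustration and may not match the exact text on a given system.

```python
import re

# Message fragments taken from the log lines quoted in this thread.
PATTERNS = {
    "conn_error": re.compile(r"detected conn error \((\d+)\)"),
    "recovery_timed_out": re.compile(r"session recovery timed out after (\d+) secs"),
    "relogin_ok": re.compile(r"is operational after recovery \((\d+) attempts\)"),
}


def summarize(lines):
    """Count how often each recovery-related message appears."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# Sample lines; the prefixes are assumed, not copied from a real log.
sample_log = [
    " connection1:0: detected conn error (1011)",
    " session1: session recovery timed out after 400 secs",
    "sd 0:0:0:0: scsi: Device offlined - no ready after error recovery",
]
print(summarize(sample_log))
```

A log with conn errors and a recovery timeout but no "operational after recovery" line suggests the session never came back.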
On the target box check out /var/log/messages. Is the target even up
still? Did it segfault?
When using GRUB, use something like
[...]
#serial --unit=0 --speed=19200
terminal serial console
[...]
and add options "vga=normal console=tty0 console=ttyS0,19200" to the kernel command line, just like:
###Don't change this comment - YaST2 identifier: Original name: linux###
title SUSE Linux Enterprise Server 10 - 2.6.16.60-0.54.5 (smp)
root (hd0,0)
kernel /vmlinuz-2.6.16.60-0.54.5-smp root=/dev/system/root vga=normal console=tty0 console=ttyS0,19200 splash=silent showopts
initrd /initrd-2.6.16.60-0.54.5-smp
Ulrich
I forgot you are doing root on iscsi.
How did you get those other kernel messages? If you can just get the
iscsid log info that is sent after lines like this
[<f8bf0876>] iscsi_conn_failure+0x10/0x69 [libiscsi]
[<f9bf202d>] iscsi_eh_abort+0x2f1/0x406 [libiscsi]
but before this line
session1: session recovery timed out after 400 secs
it would be helpful. I am just looking for something about iscsid
segfaulting or hanging or not being able to connect.
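If the full log can be captured (for example over a serial console), pulling out just that window is easy to script. This is a sketch under the assumption that the captured log is plain text with one message per line; the function name is made up, and the iscsid line in the sample is a placeholder, not real iscsid output.

```python
START_MARKERS = ("iscsi_conn_failure", "iscsi_eh_abort")
END_MARKER = "session recovery timed out"


def lines_between(lines, start_markers=START_MARKERS, end_marker=END_MARKER):
    """Return the lines after the last stack-trace marker and before end_marker."""
    window = []
    collecting = False
    for line in lines:
        if any(marker in line for marker in start_markers):
            window = []          # restart after the most recent trace line
            collecting = True
        elif collecting and end_marker in line:
            return window
        elif collecting:
            window.append(line.rstrip("\n"))
    return window


captured = [
    "[<f8bf0876>] iscsi_conn_failure+0x10/0x69 [libiscsi]",
    "[<f9bf202d>] iscsi_eh_abort+0x2f1/0x406 [libiscsi]",
    "iscsid: (placeholder for whatever iscsid logged here)",
    " session1: session recovery timed out after 400 secs",
    "sd 0:0:0:0: scsi: Device offlined - no ready after error recovery",
]
for line in lines_between(captured):
    print(line)
```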
Which open-iscsi 871 release are you using: 871.1, .2 or .3?
>
>
>> On the target box check out /var/log/messages. Is the target even up
>> still? Did it segfault?
>
> I don't believe anything is going wrong on the target. The target
Do you see anything in /var/log/messages for the target though?
Something about aborts or lun resets completing?
dmesg just prints the kernel message buffer (/proc/kmsg), while syslog can capture messages from applications as well.
I have a sample for a syslog-ng configuration file:
destination d_tty_root { usertty("root"); };
destination d_console { file("/dev/ttyS0"); };
destination d_messages { file("/var/log/messages"); };
filter f_error {
level(alert .. err) and not match('S15.modem: initchat failed.');
};
filter f_kernel { level(alert .. err); };
filter f_auth { facility(auth, authpriv) and level(alert .. info); };
filter f_debug { level(alert .. debug); };
# send critical messages to logged root user and /var/log/messages
log {
source(s_intern);
source(s_dev_log);
source(s_kernel);
filter(f_error);
destination(d_tty_root);
destination(d_messages);
};
# save auth-related messages
log {
source(s_dev_log);
source(s_kernel);
filter(f_auth);
destination(d_messages);
};
### Just to get you started. The older syslog is less powerful, but easier to configure.
Maybe this is interesting for you:
# 6) To send message to remote syslogd server :
# destination d_udp { udp("<remote IP address>" port(514)); };
# Example to send syslogs to syslogd located at 10.0.0.1 :
# destination d_udp1 { udp("10.0.0.1" port(514)); };
Maybe this helps a bit.
Regards,
Ulrich
Samples for "sources" are missing, sorry:
source s_intern { internal(); };
source s_dev_log { unix-stream("/dev/log"); };
source s_kernel { file("/proc/kmsg"); };
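Once a remote destination like the commented d_udp example is in place, it helps to confirm that messages actually reach the remote syslogd before relying on it during a failure. A minimal sketch using Python's standard logging module follows; the function name and logger name are made up, and 514 is the UDP port from the example above.

```python
import logging
import logging.handlers


def make_remote_logger(host, port=514, name="iscsi-test"):
    """Return a logger that sends records to a remote syslogd over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler uses UDP by default when given a (host, port) tuple.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

Calling make_remote_logger("10.0.0.1").info("test message from the initiator") should then produce a line in the remote machine's log, provided its syslogd accepts network input. Since UDP is unacknowledged, a missing line does not tell you where the message was lost.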
Yeah, try:
http://kernel.org/pub/linux/kernel/people/mnc/open-iscsi/releases/open-iscsi-2.0-871.3.tar.gz
It has a fix for recovery. There was a problem where recovery hung for
several minutes.
If you are using the open-iscsi.org tarballs then it is a fun manual
process. You would basically have to read the Makefile and delete all
the programs it installed.
Do makefiles normally have an uninstall option? If so I can add one, or I
can add another script.
would usually be removing all binaries introduced in
make install
Regards,
- Kun
No: "make clean" would remove the compiled or otherwise generated objects to leave only valuable sources around. As "make install" usually copies generated objects elsewhere, "make clean" won't undo that (in the general case).
Ulrich
How about "make uninstall" then?