RHEL 5.2 and 5.3 - iSCSI Errors impacting database performance?


bigcatxjs

Mar 12, 2009, 7:42:30 AM
to open-iscsi, richard_...@hotmail.com
Hi,
This is my first post on this Forum, so apologies in advance if I have
missed something or not found an existing post that covers this topic.

Situation:
We have a number of hosts running RHEL 5.2 (x86_64) for our Oracle
database estate. A typical deployment comprises a DELL 1955 Blade with
RAID 1 local disks for the O/S, swap and binaries, and iSCSI-attached
SAN volumes for the Oracle database files and disk backups. The Blade
has four NICs; two are set up for "Public" traffic (192.168.**.**) from
our domain and the other two are set up on our SAN network
(172.16.***.**). The four NICs are mated to the Blade Chassis, which
has teamed NICs to our Domain switches and SAN switches. Our SAN is
DataCore SANmelody, using two SM server nodes that manage two DELL MD
1000 arrays (deployed as a mirror). Each MD 1000 contains three groups
of five disks configured as RAID 5, presented as three separate
storage groups.

Issue:
Since deploying this set-up last year, we repeatedly get errors in
the host logs and on the SM server nodes (see "Logs" below). I was
hoping that the latest RHEL 5.3 kernel improvements would address most
of these errors. I have deployed RHEL 5.3 (x86) onto one of our TEST
boxes, but continue to see errors.

Impact:
I suspect that the connection hang-ups/disk I/O retries are causing
cumulative database waits on some of our busier databases, resulting in
degraded performance. I am concerned that this situation will cause us
further issues when we build our planned Oracle 11g RAC (5-node)
system. Oracle RAC relies heavily on multiplexed voting and registry
disks (shared volumes) to maintain cohesion within the RAC cluster.
Slow disk I/O or time-outs can cause one or more database nodes to go
off-line (and thus force an auto-restart of the impacted host's Oracle
services).

LOGS:


From RHEL 5.2 x86_64 Host;

Kernel:
Linux MYHOST52.MYDOMAIN.com 2.6.18-92.el5 #1 SMP Fri May 23 23:40:43
EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

fstab:
/dev/VolGroup00/LogVol00  /          ext3    defaults        1 1
LABEL=/boot               /boot      ext3    defaults        1 2
tmpfs                     /dev/shm   tmpfs   defaults        0 0
devpts                    /dev/pts   devpts  gid=5,mode=620  0 0
sysfs                     /sys       sysfs   defaults        0 0
proc                      /proc      proc    defaults        0 0
/dev/VolGroup00/LogVol01  swap       swap    defaults        0 0
LABEL=data1               /U02       ext3    _netdev         0 0
LABEL=data2               /U03       ext3    _netdev         0 0
LABEL=data3               /U04       ext3    _netdev         0 0
LABEL=data4               /U05       ext3    _netdev         0 0
LABEL=data5               /U06       ext3    _netdev         0 0

iscsiadm:
iSCSI Transport Class version 2.0-724
iscsiadm version 2.0-868
Target: iqn.2000-08.com.datacore:sm2-3
Current Portal: 172.16.200.9:3260,1
Persistent Portal: 172.16.200.9:3260,1
**********
Interface:
**********
Iface Name: iface0
Iface Transport: tcp
Iface Initiatorname: iqn.1994-05.com.redhat:7fe2f44ea9de
Iface IPaddress: 172.16.200.39
Iface HWaddress: 00:14:22:0d:0a:fa
Iface Netdev: default
SID: 1
iSCSI Connection State: LOGGED IN
iSCSI Session State: Unknown
Internal iscsid Session State: NO CHANGE
************************
Negotiated iSCSI params:
************************
HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 131072
MaxXmitDataSegmentLength: 262144
FirstBurstLength: 0
MaxBurstLength: 1048576
ImmediateData: No
InitialR2T: Yes
MaxOutstandingR2T: 1
************************
Attached SCSI devices:
************************
Host Number: 1 State: running
scsi1 Channel 00 Id 0 Lun: 0
Attached scsi disk sdb State: running
scsi1 Channel 00 Id 0 Lun: 1
Attached scsi disk sde State: running
scsi1 Channel 00 Id 0 Lun: 2
Attached scsi disk sdf State: running
Target: iqn.2000-08.com.datacore:sm2-4
Current Portal: 172.16.200.10:3260,1
Persistent Portal: 172.16.200.10:3260,1
**********
Interface:
**********
Iface Name: iface2
Iface Transport: tcp
Iface Initiatorname: iqn.1994-05.com.redhat:7fe2f44ea9de
Iface IPaddress: 172.16.200.56
Iface HWaddress: 00:14:22:b1:d6:a6
Iface Netdev: default
SID: 2
iSCSI Connection State: LOGGED IN
iSCSI Session State: Unknown
Internal iscsid Session State: NO CHANGE
************************
Negotiated iSCSI params:
************************
HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 131072
MaxXmitDataSegmentLength: 262144
FirstBurstLength: 0
MaxBurstLength: 1048576
ImmediateData: No
InitialR2T: Yes
MaxOutstandingR2T: 1
************************
Attached SCSI devices:
************************
Host Number: 2 State: running
scsi2 Channel 00 Id 0 Lun: 0
Attached scsi disk sdc State: running
scsi2 Channel 00 Id 0 Lun: 1
Attached scsi disk sdd State: running

Log Errors;
Mar 12 09:30:48 MYHOST52 last message repeated 2 times
Mar 12 09:30:48 MYHOST52 iscsid: connection2:0 is operational after
recovery (1 attempts)
Mar 12 09:32:52 MYHOST52 kernel: ping timeout of 5 secs expired, last
rx 19592296349, last ping 19592301349, now 19592306349
Mar 12 09:32:52 MYHOST52 kernel: connection1:0: iscsi: detected conn
error (1011)
Mar 12 09:32:53 MYHOST52 iscsid: Kernel reported iSCSI connection 1:0
error (1011) state (3)
Mar 12 09:33:19 MYHOST52 iscsid: received iferror -38
Mar 12 09:33:19 MYHOST52 last message repeated 2 times
Mar 12 09:33:19 MYHOST52 iscsid: connection1:0 is operational after
recovery (2 attempts)
Mar 12 09:43:25 MYHOST52 kernel: ping timeout of 5 secs expired, last
rx 19592929091, last ping 19592934091, now 19592939091
Mar 12 09:43:25 MYHOST52 kernel: connection1:0: iscsi: detected conn
error (1011)
Mar 12 09:43:26 MYHOST52 iscsid: Kernel reported iSCSI connection 1:0
error (1011) state (3)
Mar 12 09:43:59 MYHOST52 iscsid: received iferror -38
Mar 12 09:43:59 MYHOST52 last message repeated 2 times
Mar 12 09:43:59 MYHOST52 iscsid: connection1:0 is operational after
recovery (3 attempts)
Mar 12 09:50:50 MYHOST52 kernel: connection2:0: iscsi: detected conn
error (1011)
Mar 12 09:50:50 MYHOST52 iscsid: Kernel reported iSCSI connection 2:0
error (1011) state (3)
Mar 12 09:50:53 MYHOST52 iscsid: received iferror -38
Mar 12 09:50:53 MYHOST52 last message repeated 2 times
Mar 12 09:50:53 MYHOST52 iscsid: connection2:0 is operational after
recovery (1 attempts)
Mar 12 09:54:06 MYHOST52 kernel: ping timeout of 5 secs expired, last
rx 19593570520, last ping 19593575520, now 19593580520
Mar 12 09:54:06 MYHOST52 kernel: connection1:0: iscsi: detected conn
error (1011)
Mar 12 09:54:07 MYHOST52 iscsid: Kernel reported iSCSI connection 1:0
error (1011) state (3)
Mar 12 09:54:34 MYHOST52 iscsid: received iferror -38
Mar 12 09:54:34 MYHOST52 last message repeated 2 times
Mar 12 09:54:34 MYHOST52 iscsid: connection1:0 is operational after
recovery (2 attempts)
Mar 12 10:00:54 MYHOST52 kernel: connection2:0: iscsi: detected conn
error (1011)
Mar 12 10:00:55 MYHOST52 iscsid: Kernel reported iSCSI connection 2:0
error (1011) state (3)
Mar 12 10:00:58 MYHOST52 iscsid: received iferror -38
Mar 12 10:00:58 MYHOST52 last message repeated 2 times
Mar 12 10:00:58 MYHOST52 iscsid: connection2:0 is operational after
recovery (1 attempts)

END


From RHEL 5.3 x86 Host;

Kernel:
Linux MYHOST53.MYDOMAIN.com 2.6.18-128.el5 #1 SMP Wed Jan 21 07:58:05
EST 2009 i686 i686 i386 GNU/Linux

fstab;
/dev/VolGroup00/LogVol00  /          ext3    defaults        1 1
LABEL=/boot               /boot      ext3    defaults        1 2
tmpfs                     /dev/shm   tmpfs   defaults        0 0
devpts                    /dev/pts   devpts  gid=5,mode=620  0 0
sysfs                     /sys       sysfs   defaults        0 0
proc                      /proc      proc    defaults        0 0
/dev/VolGroup00/LogVol01  swap       swap    defaults        0 0
/dev/sdc1                 /sandisk1  ext3    _netdev         0 0

iscsiadm;
iSCSI Transport Class version 2.0-724
iscsiadm version 2.0-868
Target: iqn.2000-08.com.datacore:sm2-3
Current Portal: 172.16.200.9:3260,1
Persistent Portal: 172.16.200.9:3260,1
**********
Interface:
**********
Iface Name: default
Iface Transport: tcp
Iface Initiatorname: iqn.2005-03.com.redhat:01.406e5fd710e2
Iface IPaddress: 172.16.200.69
Iface HWaddress: default
Iface Netdev: default
SID: 1
iSCSI Connection State: LOGGED IN
iSCSI Session State: Unknown
Internal iscsid Session State: NO CHANGE
************************
Negotiated iSCSI params:
************************
HeaderDigest: None
DataDigest: None
MaxRecvDataSegmentLength: 131072
MaxXmitDataSegmentLength: 262144
FirstBurstLength: 0
MaxBurstLength: 1048576
ImmediateData: No
InitialR2T: Yes
MaxOutstandingR2T: 1
************************
Attached SCSI devices:
************************
Host Number: 2 State: running
scsi2 Channel 00 Id 0 Lun: 0
Attached scsi disk sdc State: running

Log Errors;
Mar 11 18:12:03 MYHOST53 kernel: md: Autodetecting RAID arrays.
Mar 11 18:12:03 MYHOST53 kernel: md: autorun ...
Mar 11 18:12:03 MYHOST53 kernel: md: ... autorun DONE.
Mar 11 18:12:03 MYHOST53 kernel: device-mapper: multipath: version
1.0.5 loaded
Mar 11 18:12:03 MYHOST53 kernel: EXT3 FS on dm-0, internal journal
Mar 11 18:12:03 MYHOST53 kernel: kjournald starting. Commit interval
5 seconds
Mar 11 18:12:03 MYHOST53 kernel: EXT3 FS on sda1, internal journal
Mar 11 18:12:03 MYHOST53 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
Mar 11 18:12:03 MYHOST53 kernel: Adding 2031608k swap on /dev/
VolGroup00/LogVol01. Priority:-1 extents:1 across:2031608k
Mar 11 18:12:03 MYHOST53 kernel: IA-32 Microcode Update Driver: v1.14a
<tig...@veritas.com>
Mar 11 18:12:03 MYHOST53 kernel: microcode: CPU1 updated from revision
0x7 to 0xc, date = 04212005
Mar 11 18:12:03 MYHOST53 kernel: microcode: CPU0 updated from revision
0x7 to 0xc, date = 04212005
Mar 11 18:12:03 MYHOST53 kernel: Loading iSCSI transport class
v2.0-724.
Mar 11 18:12:03 MYHOST53 kernel: iscsi: registered transport (tcp)
Mar 11 18:12:03 MYHOST53 kernel: iscsi: registered transport (iser)
Mar 11 18:12:03 MYHOST53 kernel: ADDRCONF(NETDEV_UP): eth0: link is
not ready
Mar 11 18:12:03 MYHOST53 kernel: e1000: eth0: e1000_watchdog_task: NIC
Link is Up 1000 Mbps Full Duplex, Flow Control: RX
Mar 11 18:12:03 MYHOST53 kernel: ADDRCONF(NETDEV_CHANGE): eth0: link
becomes ready
Mar 11 18:12:03 MYHOST53 kernel: ADDRCONF(NETDEV_UP): eth1: link is
not ready
Mar 11 18:12:03 MYHOST53 kernel: e1000: eth1: e1000_watchdog_task: NIC
Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
Mar 11 18:12:03 MYHOST53 kernel: ADDRCONF(NETDEV_CHANGE): eth1: link
becomes ready
Mar 11 18:12:03 MYHOST53 kernel: scsi2 : iSCSI Initiator over TCP/IP
Mar 11 18:12:03 MYHOST53 kernel: Vendor: DataCore Model:
SANmelody Rev: DCS
Mar 11 18:12:03 MYHOST53 kernel: Type: Direct-
Access ANSI SCSI revision: 04
Mar 11 18:12:03 MYHOST53 kernel: SCSI device sdc: 41943040 512-byte
hdwr sectors (21475 MB)
Mar 11 18:12:03 MYHOST53 kernel: sdc: Write Protect is off
Mar 11 18:12:03 MYHOST53 kernel: SCSI device sdc: drive cache: write
back w/ FUA
Mar 11 18:12:03 MYHOST53 kernel: SCSI device sdc: 41943040 512-byte
hdwr sectors (21475 MB)
Mar 11 18:12:03 MYHOST53 kernel: sdc: Write Protect is off
Mar 11 18:12:03 MYHOST53 kernel: SCSI device sdc: drive cache: write
back w/ FUA
Mar 11 18:12:03 MYHOST53 kernel: sdc: sdc1
Mar 11 18:12:03 MYHOST53 kernel: sd 2:0:0:0: Attached scsi disk sdc
Mar 11 18:12:03 MYHOST53 kernel: sd 2:0:0:0: Attached scsi generic sg2
type 0
Mar 11 18:12:03 MYHOST53 rpc.statd[2160]: Version 1.0.9 Starting
Mar 11 18:12:03 MYHOST53 iscsid: received iferror -38
Mar 11 18:12:03 MYHOST53 last message repeated 2 times
Mar 11 18:12:03 MYHOST53 iscsid: connection1:0 is operational now
Mar 11 18:12:04 MYHOST53 kdump: kexec: loaded kdump kernel
Mar 11 18:12:04 MYHOST53 kdump: started up
Mar 11 18:12:04 MYHOST53 kernel: symev_rh_ES_5_2.6.18_53.el5_i686:
module license 'Proprietary' taints kernel.
Mar 11 18:12:04 MYHOST53 symev: loaded (symev-rh-ES-5-2.6.18-53.el5-
i686.ko)
Mar 11 18:12:04 MYHOST53 symap: loaded (symap-rh-ES-5-2.6.18-53.el5-
i686.ko)

END


Any help / suggestions gratefully received. I can change the config
of the RHEL 5.3 x86 host on demand, but not the RHEL 5.2 x86_64 host
(prod box).

Many thanks,
Rich.

Ulrich Windl

Mar 12, 2009, 10:56:55 AM
to open-...@googlegroups.com, richard_...@hotmail.com
Hi,

I haven't investigated, but I see similar short "offline periods" for iSCSI here.
For your situation I'd recommend moving to Fibre Channel technology for Oracle
databases. Just MHO...

Regards,
Ulrich

bigcatxjs

Mar 12, 2009, 11:07:41 AM
to open-iscsi
Thanks Ulrich,
Unfortunately, budgetary restrictions prevent us from moving to Fibre
Channel :>(

Rich.


Mike Christie

Mar 12, 2009, 1:53:47 PM
to open-...@googlegroups.com, richard_...@hotmail.com


bigcatxjs wrote:


For this RHEL 5.2 setup, does it make a difference if you do not use
ifaces and set up the box like in 5.3 below?
> Log Errors;
> Mar 12 09:30:48 MYHOST52 last message repeated 2 times
> Mar 12 09:30:48 MYHOST52 iscsid: connection2:0 is operational after
> recovery (1 attempts)
> Mar 12 09:32:52 MYHOST52 kernel: ping timeout of 5 secs expired, last
> rx 19592296349, last ping 19592301349, now 19592306349



There was a bug in 5.2 where the initiator would think it detected a
timeout when it did not. It is fixed in 5.3.

The messages can also occur when there really is a problem with the
network or if the target is bogged down.

At these times is there lots of disk IO? Is there anything in the target
logs?


I am also not sure how well some targets handle bonding plus ifaces. Is
iface* using a bonded interface?


Can you replicate this pretty easily? If you just login the session,
then let it sit (do not run the db or any disk IO), will you see the
ping timeout errors?

It might be helpful to run ethereal/wireshark while you run your test
then send the /var/log/messages and trace so I can check and see if the
ping is really timing out or not. For the test you only need one session
logged in (this will reduce log and trace info), and once you see the
first ping timeout error you can stop tracing/logging and send it.
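
For example, something along these lines should work for the capture (the
interface name and output file here are just placeholder assumptions - use
whichever NIC carries the iSCSI traffic; the portal addresses above show it
runs on the standard port 3260):

tcpdump -i eth1 -s 0 -w /tmp/iscsi-trace.pcap port 3260
# wait for the first "ping timeout" message in /var/log/messages, then
# stop the capture with Ctrl-C and send both files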



>
> From RHEL 5.3 x86 Host;
>

So the RHEL5.3 box is having troubles too? There is nothing in the log
below.

bigcatxjs

Mar 13, 2009, 6:01:17 AM
to open-iscsi
Thanks Mike,

> For this RHEL 5.2 setup, does it make a difference if you do not use
> ifaces and setup the box like in 5.3 below?
I have used bonded ifaces so that the I/O requests can be split across
multiple NICs (both server-side and on the DataCore SANmelody SM node
NICs). This split is achieved by ensuring that the volumes containing
Oracle DATA and INDEX datafiles route through one named iface, and that
the volumes used by Oracle for SYSTEM, BACKUP and REDO data/logs etc.
route through the other. We have seen a performance uplift by
maintaining this split despite the time-out issues. We have a W2K3
x86_64 STD Oracle host that runs on one iface - this is much slower
than the RHEL 5.2 x86_64 host even though the hardware is identical.
We also had RHEL 5.1 x86_64 Oracle hosts running on one iface - again,
these were noticeably slower than the bonded-ifaces approach. They have
since been upgraded to RHEL 5.2 with the multiple ifaces.
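
Roughly speaking, the per-NIC bindings are created with the standard
iscsiadm iface commands - a sketch only (the iface names and MAC addresses
are taken from the session output in my first post, but treat the exact
sequence as illustrative rather than our exact build steps):

iscsiadm -m iface -I iface0 -o new
iscsiadm -m iface -I iface0 -o update -n iface.hwaddress -v 00:14:22:0d:0a:fa
iscsiadm -m iface -I iface2 -o new
iscsiadm -m iface -I iface2 -o update -n iface.hwaddress -v 00:14:22:b1:d6:a6
# discover the targets through both interfaces, then log in
iscsiadm -m discovery -t sendtargets -p 172.16.200.9:3260 -I iface0 -I iface2
iscsiadm -m node -L all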

> There was a bug in 5.2 where the initiator would think it detected a
> timeout when it did not. It is fixed in 5.3.
Good - then I should expect to see fewer errors.

> The messages can also occur when there really is a problem with the
> network or if the target is bogged down.
We have spread the primary volumes across both SM nodes. The nodes
are W2K3 x86 (there is no x64 option for the DataCore software) DELL
2850s. There are two switches (one for SM1, one for SM2) that are
linked using teamed fibre (2 Gb/sec capacity), so I/O should route
evenly across both switches. The SM mirroring takes advantage of the
fibre. With the RHEL 5.2 host, you will note that both ifaces go to the
SM2 node, but utilise different NICs on that node. These volumes are
then mirrored to SM1 (except the BACKUP volume, which is a linear
volume). We know that the switches aren't congested, but we don't
accurately know whether SM1 or SM2 is congested. We only have a logical
spread of volumes presented across multiple NICs to at least try and
minimise congestion.
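
One thing we could do from the host side (just a sketch, not something we
have in place yet) is watch the service times on the SAN disks during busy
periods; consistently high await values on the iSCSI disks (sdb-sdf here)
would point at the target or the network rather than the host:

iostat -x 5
# from the sysstat package; watch the await/svctm columns for the
# iSCSI-attached disks during office hours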

> At these times is there lots of disk IO? Is there anything in the target
> logs?
It is fair to say that all these volumes take a heavy hit in terms of
I/O. Each host (excluding the RHEL 5.3 test host) runs two Oracle
databases, some of which have intra-database replication (Oracle
Streams) enabled. The issue on the RHEL 5.2 host occurs every 10
seconds or so during office hours when it is being utilised.

> So the RHEL5.3 box is having troubles too? There is nothing in the log
> below.
The error with the RHEL 5.3 host was as follows;

> Mar 11 18:12:03 MYHOST53 iscsid: received iferror -38
> Mar 11 18:12:03 MYHOST53 last message repeated 2 times
> Mar 11 18:12:03 MYHOST53 iscsid: connection1:0 is operational now

This looked similar to previous RHEL 5.2 errors.

> Can you replicate this pretty easily? If you just login the session,
> then let it sit (do not run the db or any disk IO), will you see the
> ping timeout errors?
I can test this with the RHEL 5.3 host. Unfortunately, it will be
difficult to down the RHEL 5.2 host's database services until we have
a scheduled outage window.

Today, there have been no further errors on RHEL 5.3 host :>).

> It might be helpful to run ethereal/wireshark while you run your test
> then send the /var/log/messages and trace so I can check and see if the
> ping is really timing out or not. For the test you only need one session
> logged in (this will reduce log and trace info), and once you see the
> first ping timeout error you can stop tracing/logging and send it.
Yes; there is also an Oracle I/O testing tool (Orion) that we could use.

I think that I will monitor the RHEL 5.3 host for any further errors.
If the incidence of errors is reduced, then this gives justification
for upgrading the RHEL 5.2 host to 5.3. Such an outage would also
provide me with an opportunity to perform the tests above.
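
A crude way to keep score while monitoring - counting the conn errors in
the syslog (default log location assumed):

grep -c "detected conn error (1011)" /var/log/messages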



Many thanks,
Richard.

END.


bigcatxjs

Mar 13, 2009, 8:11:39 AM
to open-iscsi
UPDATE: the RHEL 5.3 host is showing errors. No disk I/O to the SAN
volume (last I/O was Thursday 12th March);

Mar 13 10:38:49 MYHOST53 kernel: connection1:0: iscsi: detected conn
error (1011)
Mar 13 10:38:49 MYHOST53 iscsid: Kernel reported iSCSI connection 1:0
error (1011) state (3)
Mar 13 10:38:52 MYHOST53 iscsid: received iferror -38
Mar 13 10:38:52 MYHOST53 last message repeated 2 times
Mar 13 10:38:52 MYHOST53 iscsid: connection1:0 is operational after
recovery (1 attempts)
Mar 13 11:00:06 MYHOST53 kernel: connection1:0: iscsi: detected conn
error (1011)
Mar 13 11:00:06 MYHOST53 iscsid: Kernel reported iSCSI connection 1:0
error (1011) state (3)
Mar 13 11:00:09 MYHOST53 iscsid: received iferror -38
Mar 13 11:00:09 MYHOST53 last message repeated 2 times
Mar 13 11:00:09 MYHOST53 iscsid: connection1:0 is operational after
recovery (1 attempts)

Thanks, Rich.

END.


Mike Christie

Mar 13, 2009, 4:44:19 PM
to open-...@googlegroups.com
bigcatxjs wrote:
> UPDATE: RHEL 5.3 Host is showing errors. No Disk I/O to SAN volume
> (last I/O Thursday 12th March);
>

Is there anything in the log before this? Something about a ping or nop
timing out?

Mike Christie

Mar 13, 2009, 4:45:53 PM
to open-...@googlegroups.com
bigcatxjs wrote:
>> At these times is there lots of disk IO? Is there anything in the target
>> logs?
> It is fair to say that all these volumes take a heavy hit, in terms of
> I/O. Each host (excluding the RHEL 5.3. test host) run two Oracle
> databases, of which some have intra-database replication (Oracle
> Streams) enabled. The issue on the RHEL 5.2 host occures every 10
> secs or so during Office Hours when it is being utilised.

Do you mean every 10 seconds you see the conn error then conn operation
messages? That sounds like the nop bug in 5.2.

Are you using multipath over iscsi btw?


One thing you can try is turn pings off. In iscsid.conf set:

node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0

then log out of the session and rerun the discovery command, then log back
in; or you can log out and then do:

iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v 0
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v 0

then relogin.

If with these new settings you see host reset succeeded messages then we
will want to turn down the queueing limits.
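
For reference, those queueing limits are the node.session.cmds_max and
node.session.queue_depth settings in /etc/iscsi/iscsid.conf (defaults 128
and 32). The values below are only an illustration of what "turning them
down" might look like, not tested recommendations:

node.session.cmds_max = 32
node.session.queue_depth = 8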

bigcatxjs

Mar 17, 2009, 9:14:31 AM
to open-iscsi
Thanks Mike...

On Mar 13, 8:45 pm, Mike Christie <micha...@cs.wisc.edu> wrote:
> bigcatxjs wrote:
> >> At these times is there lots of disk IO? Is there anything in the target
> >> logs?
> > It is fair to say that all these volumes take a heavy hit, in terms of
> > I/O.  Each host (excluding the RHEL 5.3. test host) run two Oracle
> > databases, of which some have intra-database replication (Oracle
> > Streams) enabled.  The issue on the RHEL 5.2 host occures every 10
> > secs or so during Office Hours when it is being utilised.
>
> Do you mean every 10 seconds you see the conn error then conn operation
> messages? That sounds like the nop bug in 5.2.

Yes - this is occurring on the RHEL 5.2 host.

>
> Are you using multipath over iscsi btw?

No - we are not using multipath over iSCSI for our Linux hosts.

>
> One thing you can try is turn pings off. In iscsid.conf set:
>
> node.conn[0].timeo.noop_out_interval = 0
> node.conn[0].timeo.noop_out_timeout = 0
>
> then logout the session and rerun the discovery command then relogin or
> you can logout then do:
>
> iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_interval -v 0
> iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v 0
>
> then relogin.

I have applied these changes just now. Thanks. Received an error
logging back into iscsi;

Mar 17 12:40:47 MYHOST53 kernel: scsi5 : iSCSI Initiator over TCP/IP
Mar 17 12:40:47 MYHOST53 kernel: Vendor: DataCore Model:
SANmelody Rev: DCS
Mar 17 12:40:47 MYHOST53 kernel: Type: Direct-
Access ANSI SCSI revision: 04
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: 41943040 512-byte
hdwr sectors (21475 MB)
Mar 17 12:40:47 MYHOST53 kernel: sdd: Write Protect is off
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: drive cache: write
back w/ FUA
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: 41943040 512-byte
hdwr sectors (21475 MB)
Mar 17 12:40:47 MYHOST53 kernel: sdd: Write Protect is off
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: drive cache: write
back w/ FUA
Mar 17 12:40:47 MYHOST53 kernel: sdd: sdd1
Mar 17 12:40:47 MYHOST53 kernel: sd 5:0:0:0: Attached scsi disk sdd
Mar 17 12:40:47 MYHOST53 kernel: sd 5:0:0:0: Attached scsi generic sg2
type 0
Mar 17 12:40:47 MYHOST53 iscsid: received iferror -38
Mar 17 12:40:47 MYHOST53 last message repeated 2 times
Mar 17 12:40:47 MYHOST53 iscsid: connection4:0 is operational now

Here is the iscsid.conf file post-reconfig;

#
# Open-iSCSI default configuration.
# Could be located at /etc/iscsi/iscsid.conf or ~/.iscsid.conf
#
# Note: To set any of these values for a specific node/session run
# the iscsiadm --mode node --op command for the value. See the README
# and man page for iscsiadm for details on the --op command.
#

################
# iSNS settings
################
# Address of iSNS server
#isns.address = 192.168.*.* -- edited-out for security reasons
#isns.port = 3205

#############################
# NIC/HBA and driver settings
#############################
# open-iscsi can create a session and bind it to a NIC/HBA.
# To set this up see the example iface config file.

#*****************
# Startup settings
#*****************

# To request that the iscsi initd scripts startup a session set to "automatic".
# node.startup = automatic
#
# To manually startup the session set to "manual". The default is automatic.
node.startup = automatic

# *************
# CHAP Settings
# *************

# To enable CHAP authentication set node.session.auth.authmethod
# to CHAP. The default is None.
#node.session.auth.authmethod = CHAP

# To set a CHAP username and password for initiator
# authentication by the target(s), uncomment the following lines:
#node.session.auth.username = username
#node.session.auth.password = password

# To set a CHAP username and password for target(s)
# authentication by the initiator, uncomment the following lines:
#node.session.auth.username_in = username_in
#node.session.auth.password_in = password_in

# To enable CHAP authentication for a discovery session to the target
# set discovery.sendtargets.auth.authmethod to CHAP. The default is None.
#discovery.sendtargets.auth.authmethod = CHAP

# To set a discovery session CHAP username and password for the initiator
# authentication by the target(s), uncomment the following lines:
#discovery.sendtargets.auth.username = username
#discovery.sendtargets.auth.password = password

# To set a discovery session CHAP username and password for target(s)
# authentication by the initiator, uncomment the following lines:
#discovery.sendtargets.auth.username_in = username_in
#discovery.sendtargets.auth.password_in = password_in

# ********
# Timeouts
# ********
#
# See the iSCSI REAME's Advanced Configuration section for tips
# on setting timeouts when using multipath or doing root over iSCSI.
#
# To specify the length of time to wait for session re-establishment
# before failing SCSI commands back to the application when running
# the Linux SCSI Layer error handler, edit the line.
# The value is in seconds and the default is 120 seconds.
node.session.timeo.replacement_timeout = 120

# To specify the time to wait for login to complete, edit the line.
# The value is in seconds and the default is 15 seconds.
node.conn[0].timeo.login_timeout = 15

# To specify the time to wait for logout to complete, edit the line.
# The value is in seconds and the default is 15 seconds.
node.conn[0].timeo.logout_timeout = 15

# Time interval to wait for on connection before sending a ping.
node.conn[0].timeo.noop_out_interval = 0

# To specify the time to wait for a Nop-out response before failing
# the connection, edit this line. Failing the connection will
# cause IO to be failed back to the SCSI layer. If using dm-multipath
# this will cause the IO to be failed to the multipath layer.
node.conn[0].timeo.noop_out_timeout = 0

#******
# Retry
#******

# To specify the number of times iscsiadm should retry a login
# to the target when we first login, modify the following line.
# The default is 4. Valid values are any integer value. This only
# affects the initial login. Setting it to a high value can slow
# down the iscsi service startup. Setting it to a low value can
# cause a session to not get logged into, if there are disruptions
# during startup or if the network is not ready at that time.
node.session.initial_login_retry_max = 4

################################
# session and device queue depth
################################

# To control how many commands the session will queue set
# node.session.cmds_max to an integer between 2 and 2048 that is also
# a power of 2. The default is 128.
node.session.cmds_max = 128

# To control the device's queue depth set node.session.queue_depth
# to a value between 1 and 128. The default is 32.
node.session.queue_depth = 32

#***************
# iSCSI settings
#***************

# To enable R2T flow control (i.e., the initiator must wait for an R2T
# command before sending any data), uncomment the following line:
#
#node.session.iscsi.InitialR2T = Yes
#
# To disable R2T flow control (i.e., the initiator has an implied
# initial R2T of "FirstBurstLength" at offset 0), uncomment the following line:
#
# The defaults is No.
node.session.iscsi.InitialR2T = No

#
# To disable immediate data (i.e., the initiator does not send
# unsolicited data with the iSCSI command PDU), uncomment the following line:
#
#node.session.iscsi.ImmediateData = No
#
# To enable immediate data (i.e., the initiator sends unsolicited data
# with the iSCSI command packet), uncomment the following line:
#
# The default is Yes
node.session.iscsi.ImmediateData = Yes

# To specify the maximum number of unsolicited data bytes the initiator
# can send in an iSCSI PDU to a target, edit the following line.
#
# The value is the number of bytes in the range of 512 to (2^24-1) and
# the default is 262144
node.session.iscsi.FirstBurstLength = 262144

# To specify the maximum SCSI payload that the initiator will negotiate
# with the target for, edit the following line.
#
# The value is the number of bytes in the range of 512 to (2^24-1) and
# the default is 16776192
node.session.iscsi.MaxBurstLength = 16776192

# To specify the maximum number of data bytes the initiator can receive
# in an iSCSI PDU from a target, edit the following line.
#
# The value is the number of bytes in the range of 512 to (2^24-1) and
# the default is 131072
node.conn[0].iscsi.MaxRecvDataSegmentLength = 131072


# To specify the maximum number of data bytes the initiator can receive
# in an iSCSI PDU from a target during a discovery session, edit the
# following line.
#
# The value is the number of bytes in the range of 512 to (2^24-1) and
# the default is 32768
#
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768

# To allow the targets to control the setting of the digest checking,
# with the initiator requesting a preference of enabling the checking, uncomment
# the following lines (Data digests are not supported and on ppc/ppc64
# both header and data digests are not supported.):
#node.conn[0].iscsi.HeaderDigest = CRC32C,None
#
# To allow the targets to control the setting of the digest checking,
# with the initiator requesting a preference of disabling the checking,
# uncomment the following lines:
#node.conn[0].iscsi.HeaderDigest = None,CRC32C
#
# To enable CRC32C digest checking for the header and/or data part of
# iSCSI PDUs, uncomment the following lines:
#node.conn[0].iscsi.HeaderDigest = CRC32C
#
# To disable digest checking for the header and/or data part of
# iSCSI PDUs, uncomment the following lines:
#node.conn[0].iscsi.HeaderDigest = None
#
# The default is to never use DataDigests and to allow the target to control
# the setting of the HeaderDigest checking with the initiator requesting
# a preference of disabling the checking.

Many thanks,
Rich.

END.

Mike Christie

Mar 17, 2009, 1:06:23 PM
to open-...@googlegroups.com
bigcatxjs wrote:
> Thanks Mike...
>
> On Mar 13, 8:45 pm, Mike Christie <micha...@cs.wisc.edu> wrote:
>> bigcatxjs wrote:
>>>> At these times is there lots of disk IO? Is there anything in the target
>>>> logs?
>>> It is fair to say that all these volumes take a heavy hit, in terms of
>>> I/O. Each host (excluding the RHEL 5.3. test host) run two Oracle
>>> databases, of which some have intra-database replication (Oracle
>>> Streams) enabled. The issue on the RHEL 5.2 host occures every 10
>>> secs or so during Office Hours when it is being utilised.
>> Do you mean every 10 seconds you see the conn error then conn operation
>> messages? That sounds like the nop bug in 5.2.
>
> Yes - this is occuring on the RHEL 5.2 host.


Ok then upgrading to 5.3 should help.

>
> I have applied these changes just now. Thanks. Received an error
> logging back into iscsi;

You mean the iferror?

>
> Mar 17 12:40:47 MYHOST53 iscsid: received iferror -38

You can ignore this. It just means the userspace tools wanted to set a
value in the kernel but could not because the kernel did not support it.
The userspace tools should then do it in userspace instead. If it is
something that the tools cannot work around then it will fail the
operation.
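
For reference, -38 is simply the kernel's ENOSYS errno, "Function not
implemented", which matches the explanation above. On a Linux box with the
kernel headers installed you can confirm it with something like:

grep ENOSYS /usr/include/asm-generic/errno.h
# #define ENOSYS          38      /* Function not implemented */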

bigcatxjs

Mar 18, 2009, 1:23:09 PM
to open-iscsi
Hi,
We have encountered the error below. This is the first time I have
seen it;


Mar 17 12:40:47 MYHOST53 kernel: Vendor: DataCore Model:
SANmelody Rev: DCS
Mar 17 12:40:47 MYHOST53 kernel: Type: Direct-
Access ANSI SCSI revision: 04
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: 41943040 512-byte
hdwr sectors (21475 MB)
Mar 17 12:40:47 MYHOST53 kernel: sdd: Write Protect is off
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: drive cache: write
back w/ FUA
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: 41943040 512-byte
hdwr sectors (21475 MB)
Mar 17 12:40:47 MYHOST53 kernel: sdd: Write Protect is off
Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: drive cache: write
back w/ FUA
Mar 17 12:40:47 MYHOST53 kernel: sdd: sdd1
Mar 17 12:40:47 MYHOST53 kernel: sd 5:0:0:0: Attached scsi disk sdd
Mar 17 12:40:47 MYHOST53 kernel: sd 5:0:0:0: Attached scsi generic sg2
type 0
Mar 17 12:40:47 MYHOST53 iscsid: received iferror -38
Mar 17 18:21:39 MYHOST53 last message repeated 20 times
Mar 17 18:27:59 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
device
Mar 17 18:28:04 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
device
Mar 17 18:28:04 MYHOST53 kernel: journal_bmap: journal block not found
at offset 2616 on sdc1
Mar 17 18:28:04 MYHOST53 kernel: Aborting journal on device sdc1.
Mar 17 18:28:04 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
device
Mar 17 18:28:04 MYHOST53 kernel: Buffer I/O error on device sdc1,
logical block 1545
Mar 17 18:28:04 MYHOST53 kernel: lost page write due to I/O error on
sdc1
Mar 17 23:03:40 MYHOST53 kernel: connection4:0: iscsi: detected conn
error (1011)
Mar 17 23:03:41 MYHOST53 iscsid: Kernel reported iSCSI connection 4:0
error (1011) state (3)
Mar 17 23:03:44 MYHOST53 iscsid: received iferror -38
Mar 17 23:03:44 MYHOST53 last message repeated 2 times
Mar 17 23:03:44 MYHOST53 iscsid: connection4:0 is operational after
recovery (1 attempts)
Mar 17 23:46:17 MYHOST53 kernel: connection4:0: iscsi: detected conn
error (1011)
Mar 17 23:46:18 MYHOST53 iscsid: Kernel reported iSCSI connection 4:0
error (1011) state (3)
Mar 17 23:46:20 MYHOST53 iscsid: received iferror -38
Mar 17 23:46:20 MYHOST53 last message repeated 2 times
Mar 17 23:46:20 MYHOST53 iscsid: connection4:0 is operational after
recovery (1 attempts)
Mar 18 04:04:27 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
device
Mar 18 04:04:27 MYHOST53 kernel: EXT3-fs error (device sdc1):
ext3_find_entry: reading directory #2 offset 0
Mar 18 04:04:27 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
device
Mar 18 04:04:27 MYHOST53 kernel: Buffer I/O error on device sdc1,
logical block 0
Mar 18 04:04:27 MYHOST53 kernel: lost page write due to I/O error on
sdc1
Mar 18 04:04:27 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
device
Mar 18 04:04:27 MYHOST53 kernel: EXT3-fs error (device sdc1):
ext3_find_entry: reading directory #2 offset 0
Mar 18 04:04:27 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
device
Mar 18 04:04:27 MYHOST53 kernel: Buffer I/O error on device sdc1,
logical block 0
Mar 18 04:04:27 MYHOST53 kernel: lost page write due to I/O error on
sdc1
Mar 18 14:56:49 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
device
Mar 18 14:56:49 MYHOST53 kernel: ext3_abort called.
Mar 18 14:56:49 MYHOST53 kernel: EXT3-fs error (device sdc1):
ext3_journal_start_sb: Detected aborted journal

So quite a serious error. I'm assuming that it would not be anything
to do with the iSCSI time-out parameter changes we made previously...
the disk was not under any I/O stress at all when the error occurred.


Thanks,
Richard.


Mike Christie

Mar 18, 2009, 1:45:56 PM
to open-...@googlegroups.com
bigcatxjs wrote:
> Hi,
> We have encountered this error below. This is the first time I have
> seen this before;


This is with the noop settings set to 0 right? Was this the RHEL 5.3 or
5.2 setup?

Could you do

rpm -q iscsi-initiator-utils


>
>
> Mar 17 12:40:47 MYHOST53 kernel: Vendor: DataCore Model:
> SANmelody Rev: DCS
> Mar 17 12:40:47 MYHOST53 kernel: Type: Direct-
> Access ANSI SCSI revision: 04
> Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: 41943040 512-byte
> hdwr sectors (21475 MB)
> Mar 17 12:40:47 MYHOST53 kernel: sdd: Write Protect is off
> Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: drive cache: write
> back w/ FUA
> Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: 41943040 512-byte
> hdwr sectors (21475 MB)
> Mar 17 12:40:47 MYHOST53 kernel: sdd: Write Protect is off
> Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: drive cache: write
> back w/ FUA
> Mar 17 12:40:47 MYHOST53 kernel: sdd: sdd1
> Mar 17 12:40:47 MYHOST53 kernel: sd 5:0:0:0: Attached scsi disk sdd
> Mar 17 12:40:47 MYHOST53 kernel: sd 5:0:0:0: Attached scsi generic sg2
> type 0
> Mar 17 12:40:47 MYHOST53 iscsid: received iferror -38
> Mar 17 18:21:39 MYHOST53 last message repeated 20 times


> Mar 17 18:27:59 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
> device


It looks like one of the following is happening:

1. We were using RHEL 5.2 and the target logged us out or dropped the
session, and when we tried to log in we got what we thought was a fatal
error (but which may have been a transient error) from the target, so
iscsid destroyed the session. When this happens the devices will be
removed and IO to the device will be failed, like you see below with
the "rejecting I/O to dead device" messages.

In RHEL 5.3 this should be fixed. We will retry on the login error
instead of giving up right away.

2. someone ran a iscsiadm logout command.

3. iscsid bugged out and killed the session. I do not think this is
happening, because I see below that for session 4 (connection4:0) we
get an error and end up logging back in, so iscsid is up and running.


But if it is #1, it makes me think maybe the target is dropping the
session or logging us out. This would explain some nops timing out or
failing, or the conn failures in the other logs and below.

Was there anything in the target logs at this time? Maybe something
about a protocol error or something about rebalancing IO or was there
anything going on on the target like a firmware upgrade?

I am afraid I do not know much about these targets. I have never used
one. Have you made any requests to the DataCore people? Do you have a
support guy whose email address you can send me? Even a tech sales guy
there or something might be useful to try and find someone.

Does anyone know anyone there?

> I'm assuming that it would not be anything to do with the iSCSI
> time-out parameter changes we made previously.

Yeah, it should not. Turning them off, though, may have changed where
the problem was detected, and so we took a different error handling
path.

bigcatxjs

Mar 19, 2009, 6:31:46 AM
to open-iscsi
Thanks Mike...

On Mar 18, 5:45 pm, Mike Christie <micha...@cs.wisc.edu> wrote:
> bigcatxjs wrote:
> > Hi,
> > We have encountered this error below.  This is the first time I have
> > seen this before;
>
> This is with the noop settings set to 0 right? Was this the RHEL 5.3 or
> 5.2 setup?

It is our RHEL 5.3 host.

>
> Could you do
>
> rpm -q iscsi-initiator-utils

Sure...
- rpm -q iscsi-initiator-utils;

iscsi-initiator-utils-6.2.0.868-0.18.el5

> 2. someone ran a iscsiadm logout command.

Unlikely - I am the only person working with this host currently.

>
> 3. iscsid bugged out and killed the session. I do not think this happens
> because I see below for the session4 (connection4:0) we get an error and
> end up logging back in so iscsid is up and running.

Yes - iscsiadm -m session -P3 showed iSCSI as running. BUT the device
sdc1;

/dev/sdc1  /sandisk1  ext3  _netdev  0 0

disappeared! In /dev, sdc re-appeared as sdd!! So I needed to update
our fstab to

/dev/sdd1  /sandisk1  ext3  _netdev  0 0

... then remount the volume as /sandisk1, then log out and log back
into iSCSI.

On our prod boxes (such as the RHEL 5.2 box) we use labels.

>
> But if it is #1, it makes me think maybe the target is dropping the
> session or logging is out. This would explain some nops timing out or
> failing or the conn failures in the other logs and below.
>
> Was there anything in the target logs at this time? Maybe something
> about a protocol error or something about rebalancing IO or was there
> anything going on on the target like a firmware upgrade?

I have checked the logs on the SM node - unfortunately the logs are
circular so the history has already been overwritten (own-goal on my
part!). I checked again this morning and so far there are only
informational messages (no errors reported).

>
> I am afraid I do not know much about these targets. I have never used
> one. Have you made any requests to the data core people? Do you have a
> support guy that you can send me a email address for? Even a tech sales
> guy there or something might be useful to try and find someone.
>

We have support with DataCore Europe and have logged support bundles
with them in the past. I am looking to raise a new one shortly.

Thanks, Rich.

END.

Mike Christie

Mar 19, 2009, 1:10:51 PM
to open-...@googlegroups.com
bigcatxjs wrote:
>>> Mar 17 18:27:59 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
>>> device
>> It looks like one of the following is happening:
>>
>> 1. were using RHEL 5.2 and the target logged us out or dropped the
>> session and when we tried to login we got what we thought was a fatal
>> error (but may be a transient error) from the target so iscsid destroyed
>> the session. When this happens the devices will be removed and IO to the
>> device will get failed like you see below with the rejecting to dead device.
>>
>> In RHEL 5.3 this should be fixed. We will retry the login error instead
>> of giving up right away.
>>
>> 2. someone ran a iscsiadm logout command.
>
> Unlikely, I am the only person working with this host currently.
>
>> 3. iscsid bugged out and killed the session. I do not think this happens
>> because I see below for the session4 (connection4:0) we get an error and
>> end up logging back in so iscsid is up and running.
>
> Yes - iscsidm -m session -P3 showed ISCSI as running. BUT the device
> SDC1;
> /dev/sdc1 /sandisk1 ext3
> _netdev 0 0
>
> It disappeared! in /DEV the SDC re-appeared as SDD!! So I needed to
> update our FSTAB to
> /dev/sdd1 /sandisk1 ext3
> _netdev 0 0

Did this happen automatically? That is, the reject messages appeared
and then you saw sdc switch to sdd? Or did you run some iscsiadm
command after you saw the reject messages?

Did this happen at the time you sent the log output for?

In the log output you got sdd here:

> > Mar 17 12:40:47 MYHOST53 kernel: sdd: Write Protect is off
> > Mar 17 12:40:47 MYHOST53 kernel: SCSI device sdd: drive cache: write
> > back w/ FUA

Was this where sdd got added after it was sdc?


Was sdc this disk:


Mar 17 18:27:59 MYHOST53 kernel: scsi 2:0:0:0: rejecting I/O to dead
device

The thing is that 2:0:0:0 is on a completely different
host/session/connection than sdd.

In the original mail you only had one session running on the RHEL 5.3
box; is that still the case?

In general disks can get any name from restart to restart of the iscsi
service. So during one boot you will get sda and then when you reboot
that disk might now be sdb. You should use labels or udev names.

For iscsi you normally do not want to use sdX names because we do our
login and scanning asynchronously, so the sdX names are pretty random.

bigcatxjs

Mar 20, 2009, 12:58:10 PM
to open-iscsi
Thanks Mike. On our prod boxes we use labels, so I will implement the
same on the RHEL 5.3 host.

Incidentally, we have been looking at the SM2 server logs and we think
that we have identified a driver/NIC issue that could well be
impacting the RHEL 5.3 server and the RHEL 5.2 (prod) box. We have an
outage planned for next Monday night to replace the quad network card,
so I will report on how this progresses.

Thanks for all your help so far.

Richard.