Re: troubleshooting sporadic write problems from KVM

441 views
Skip to first unread message

Mike Christie

unread,
Mar 21, 2013, 4:26:36 PM3/21/13
to open-...@googlegroups.com, zarno...@gmail.com
On 03/15/2013 06:02 AM, zarno...@gmail.com wrote:
> I assume that I'm not hitting any of those timeouts, otherwise I should
> see something in debug output. Do any of those values affect the kernel
> part of iSCSI client ?

If you were hitting a iscsi issue then you should see something in
/var/log/messages from the sdX device and/or the iscsi layer before the
FS error messages. You would see something about a
DID_TRANSPORT_DISRUPTED or DID_ERROR or DID_TRANSPORT_FAILFAST. You
might also see something about the iscsi replacement/recovery timeout
expiring. The iscsiadm session state would also show not logged in.

>
> Do the counts (iorequest_cnt, iodone_cnt, ioerr_cnt) mean that every
> request sent out was successfully completed ? ( iorequest_cnt
> == iodone_cnt )

If they are equal then it means everything has been going smoothly and
the scsi eh has not had to run. If a command timesout and the scsi eh
runs then those values would be different. The iodone_cnt is not
incremented in that case.

> I would like to know if the response to write is an actual error
> response sent by the target or is it a (network) problem related to client.
>

ioerr_cnt = 0 indicates the iscsi layer has not failed any commands due
to something like connection error or due to the target/device sending
back a failure.

The thing is that it could be non-zero and we could have got check
conditions from the device that were just telling us that the device is
ready or something informative like that.



> Thanks in advance for any tips. I'm desperate enough to start
> tcpdump-ing the whole thing.. ;)
>
> Regards,
>
> Brano Zarnovican
>
> Host/Guest OS: Scientific Linux release 6.3
> Host kernel: 2.6.32-358.0.1.el6.x86_64
> Guest kernel: 2.6.32-279.9.1.el6.x86_64
> iSCSI client: iscsi-initiator-utils-6.2.0.872-41.el6.x86_64
>
> --
> You received this message because you are subscribed to the Google
> Groups "open-iscsi" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to open-iscsi+...@googlegroups.com.
> To post to this group, send email to open-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/open-iscsi?hl=en.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>

Fred Smit

unread,
Mar 22, 2013, 11:03:47 PM3/22/13
to open-...@googlegroups.com


On Friday, March 15, 2013 6:02:05 AM UTC-5, zarno...@gmail.com wrote:
Hi,

we are using KVMs with root backed by iSCSI LUNs mapped from Netapp. Occasionally, the device get write errors and it is remounted read-only

Mar 15 10:16:59 rb-vertica-hds2-devel dhclient[5053]: DHCPACK from 172.30.40.175 (xid=0x47a97e90)
Mar 15 10:17:00 rb-vertica-hds2-devel dhclient[5053]: bound to 172.30.40.92 -- renewal in 47 seconds.
Mar 15 10:17:03 rb-vertica-hds2-devel kernel: Buffer I/O error on device vda1, logical block 708624
Mar 15 10:17:03 rb-vertica-hds2-devel kernel: lost page write due to I/O error on vda1
..
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: Buffer I/O error on device vda1, logical block 903881
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: lost page write due to I/O error on vda1
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: Buffer I/O error on device vda1, logical block 1705084
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: lost page write due to I/O error on vda1
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: JBD2: Detected IO errors while flushing file data on vda1-8

When the problem happens there are NO errors in logs on compute node. I'm running 'iscsiadm -m session -P3' every 5s.
It shows no change in session or LUN state. I did also 'iscsid' with -d8 which also shows nothing around the time.

How do I identify where are these write errors coming from ?
* problem on virtio-blk ?? Not likely.
* iscsi client problem connecting to target
* actual write problem on target

Example KVM device definition

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source dev='/dev/disk/by-path/ip-172.30.128.3:3260-iscsi-iqn.1992-08.com.netapp:node.netapp02-lun-17'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>

iSCSI session has default configuration

iscsiadm -m node -T iqn.1992-08.com.netapp:node.netapp02
...
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.timeo.replacement_timeout = 120
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.err_timeo.host_reset_timeout = 60
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5

iscsiadm -m session -P3
...
Recovery Timeout: 120
Target Reset Timeout: 30
LUN Reset Timeout: 30
Abort Timeout: 15

This is the device which had IO errors few hours ago..
grep . /sys/block/sdk/device/*
grep: /sys/block/sdk/device/delete: Permission denied
/sys/block/sdk/device/device_blocked:0
/sys/block/sdk/device/dh_state:detached
/sys/block/sdk/device/evt_media_change:0
/sys/block/sdk/device/iocounterbits:32
/sys/block/sdk/device/iodone_cnt:0x29a
/sys/block/sdk/device/ioerr_cnt:0x0                             <-- error count ?
/sys/block/sdk/device/iorequest_cnt:0x29a
/sys/block/sdk/device/modalias:scsi:t-0x00
/sys/block/sdk/device/model:LUN             
/sys/block/sdk/device/queue_depth:32
/sys/block/sdk/device/queue_ramp_up_period:120000
/sys/block/sdk/device/queue_type:none
grep: /sys/block/sdk/device/rescan: Permission denied
/sys/block/sdk/device/rev:7360
/sys/block/sdk/device/scsi_level:5
/sys/block/sdk/device/state:running
/sys/block/sdk/device/timeout:30
/sys/block/sdk/device/type:0
/sys/block/sdk/device/uevent:DEVTYPE=scsi_device
/sys/block/sdk/device/uevent:DRIVER=sd
/sys/block/sdk/device/uevent:MODALIAS=scsi:t-0x00
/sys/block/sdk/device/vendor:NETAPP  

I assume that I'm not hitting any of those timeouts, otherwise I should see something in debug output. Do any of those values affect the kernel part of iSCSI client ?

Do the counts (iorequest_cnt, iodone_cnt, ioerr_cnt) mean that every request sent out was successfully completed ? ( iorequest_cnt == iodone_cnt )
I would like to know if the response to write is an actual error response sent by the target or is it a (network) problem related to client.

BranoZ

unread,
Mar 24, 2013, 4:27:31 AM3/24/13
to open-iscsi
On Mar 21, 10:26 pm, Mike Christie <micha...@cs.wisc.edu> wrote:
> On 03/15/2013 06:02 AM, zarnovi...@gmail.com wrote:
>
> > I assume that I'm not hitting any of those timeouts, otherwise I should
> > see something in debug output. Do any of those values affect the kernel
> > part of iSCSI client ?
>
> If you were hitting a iscsi issue then you should see something in
> /var/log/messages from the sdX device and/or the iscsi layer before the
> FS error messages. You would see something about a
> DID_TRANSPORT_DISRUPTED or DID_ERROR or DID_TRANSPORT_FAILFAST. You
> might also see something about the iscsi replacement/recovery timeout
> expiring. The iscsiadm session state would also show not logged in.

Mike, thanks for your answer.

I figured it out.. The problem was that the device was deleted while
VM was
still using it. In Openstack/Netapp, different volumes may get the
same LUN
id as they are connected and disconnected. In some cases Openstack
would
try to disconnect a volume which was already disconnected, supplying a
LUN id
that the volume *had* before, but used by someone else now..

That device delete was partly my own fault, I admit.

Every time it happened, the device was re-added soon after by --
rescan.
That's why there were no errors, no iSCSI session state changes.

Regards,

Brano Zarnovican
Reply all
Reply to author
Forward
0 new messages