Hi,
we are running KVM guests whose root disks are backed by iSCSI LUNs mapped from a NetApp filer. Occasionally the guest's disk gets write errors and the filesystem is remounted read-only:
Mar 15 10:16:59 rb-vertica-hds2-devel dhclient[5053]: DHCPACK from 172.30.40.175 (xid=0x47a97e90)
Mar 15 10:17:00 rb-vertica-hds2-devel dhclient[5053]: bound to 172.30.40.92 -- renewal in 47 seconds.
Mar 15 10:17:03 rb-vertica-hds2-devel kernel: Buffer I/O error on device vda1, logical block 708624
Mar 15 10:17:03 rb-vertica-hds2-devel kernel: lost page write due to I/O error on vda1
..
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: Buffer I/O error on device vda1, logical block 903881
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: lost page write due to I/O error on vda1
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: Buffer I/O error on device vda1, logical block 1705084
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: lost page write due to I/O error on vda1
Mar 15 10:17:32 rb-vertica-hds2-devel kernel: JBD2: Detected IO errors while flushing file data on vda1-8
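To line these guest-side errors up against whatever the compute node logs at the same second, it may help to pull the timestamps and block numbers out of the guest syslog. The following is only a minimal sketch: the function name is made up for illustration, and the regular expression assumes the exact syslog layout shown in the excerpt above.

```python
import re

# Assumed line layout (taken from the excerpt above):
# "Mar 15 10:17:03 <host> kernel: Buffer I/O error on device vda1, logical block 708624"
BUFFER_IO_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"Buffer I/O error on device (?P<dev>\S+), logical block (?P<block>\d+)"
)


def parse_buffer_io_errors(lines):
    """Return (timestamp, device, logical_block) for every 'Buffer I/O error' line.

    Hypothetical helper: lines that do not match the assumed layout are skipped.
    """
    events = []
    for line in lines:
        match = BUFFER_IO_ERROR.match(line)
        if match is None:
            continue
        events.append((match["stamp"], match["dev"], int(match["block"])))
    return events
```

Feeding it the guest's syslog lines gives a list of error times that can be searched for in the compute node's logs and in the filer's event log.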
When the problem happens there are NO errors in the logs on the compute node. I'm running 'iscsiadm -m session -P3' every 5s.
It shows no change in session or LUN state. I also ran 'iscsid' with -d8, which shows nothing around that time either.
How do I identify where these write errors are coming from?
* a problem in virtio-blk? Not likely.
* an iSCSI client problem connecting to the target
* an actual write problem on the target
Example KVM device definition:
<disk type='block' device='disk'>
<driver name='qemu' type='raw' cache='none'/>
<source dev='/dev/disk/by-path/ip-172.30.128.3:3260-iscsi-iqn.1992-08.com.netapp:node.netapp02-lun-17'/>
<target dev='vda' bus='virtio'/>
<alias name='virtio-disk0'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
</disk>
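To be sure which host block device sits behind the guest's vda, the disk definition can be read programmatically. This is a rough sketch using Python's standard xml.etree module; the function name is hypothetical, and it assumes a single `<disk>` element like the one above is passed in as a string.

```python
import xml.etree.ElementTree as ET


def disk_backing(disk_xml):
    """Return target name, bus, source device and cache mode of one <disk> element.

    Hypothetical helper: expects the <driver>, <source> and <target> children
    to be present, as in the definition shown above.
    """
    disk = ET.fromstring(disk_xml)
    target = disk.find("target")
    return {
        "target": target.get("dev"),
        "bus": target.get("bus"),
        "source": disk.find("source").get("dev"),
        "cache": disk.find("driver").get("cache"),
    }
```

On the compute node, calling `os.path.realpath()` on the returned `source` path should resolve the by-path symlink to the underlying /dev/sdX node, which is the device whose sysfs counters and kernel messages are worth watching.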
The iSCSI session has the default configuration:
iscsiadm -m node -T iqn.1992-08.com.netapp:node.netapp02
...
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.timeo.replacement_timeout = 120
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.err_timeo.host_reset_timeout = 60
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.auth_timeout = 45
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
...
Recovery Timeout: 120
Target Reset Timeout: 30
LUN Reset Timeout: 30
Abort Timeout: 15
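To compare these settings between compute nodes, or against a known-good record, the `key = value` lines printed by iscsiadm can be loaded into a dictionary. A minimal sketch follows; the function names are invented for illustration, and it assumes the output format shown above, with elided `...` lines and the summary lines simply skipped.

```python
def parse_node_settings(text):
    """Collect 'node.* = value' lines from iscsiadm node output into a dict."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        key = key.strip()
        # Only keep real settings; '...' and 'Recovery Timeout: 120' style lines
        # contain no ' = ' separator or do not start with 'node.'.
        if sep and key.startswith("node."):
            settings[key] = value.strip()
    return settings


def timeouts(settings):
    """Return only the timeout-related settings, converted to integers."""
    return {
        key: int(value)
        for key, value in settings.items()
        if "timeo" in key and value.isdigit()
    }
```

Diffing the resulting dictionaries from two hosts makes it easy to spot a node whose timeouts were changed from the defaults.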
This is the device which had I/O errors a few hours ago:
grep . /sys/block/sdk/device/*
grep: /sys/block/sdk/device/delete: Permission denied
/sys/block/sdk/device/device_blocked:0
/sys/block/sdk/device/dh_state:detached
/sys/block/sdk/device/evt_media_change:0
/sys/block/sdk/device/iocounterbits:32
/sys/block/sdk/device/iodone_cnt:0x29a
/sys/block/sdk/device/ioerr_cnt:0x0 <-- error count ?
/sys/block/sdk/device/iorequest_cnt:0x29a
/sys/block/sdk/device/modalias:scsi:t-0x00
/sys/block/sdk/device/model:LUN
/sys/block/sdk/device/queue_depth:32
/sys/block/sdk/device/queue_ramp_up_period:120000
/sys/block/sdk/device/queue_type:none
grep: /sys/block/sdk/device/rescan: Permission denied
/sys/block/sdk/device/rev:7360
/sys/block/sdk/device/scsi_level:5
/sys/block/sdk/device/state:running
/sys/block/sdk/device/timeout:30
/sys/block/sdk/device/type:0
/sys/block/sdk/device/uevent:DEVTYPE=scsi_device
/sys/block/sdk/device/uevent:DRIVER=sd
/sys/block/sdk/device/uevent:MODALIAS=scsi:t-0x00
/sys/block/sdk/device/vendor:NETAPP
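The `*_cnt` values above are hexadecimal. As far as I understand, `ioerr_cnt` counts commands that the host's SCSI layer completed with an error, but treat that reading as an assumption. A small sketch for decoding the counters from the `grep .` output (the function name is hypothetical):

```python
import os


def parse_io_counters(grep_output):
    """Decode the hexadecimal *_cnt attributes from 'grep . /sys/block/<dev>/device/*' output."""
    counters = {}
    for line in grep_output.splitlines():
        path, sep, value = line.partition(":")
        name = os.path.basename(path)
        fields = value.split()
        # Keep only counter attributes; ignore trailing notes after the value.
        if sep and name.endswith("_cnt") and fields:
            counters[name] = int(fields[0], 16)
    return counters


sample = """/sys/block/sdk/device/iodone_cnt:0x29a
/sys/block/sdk/device/ioerr_cnt:0x0
/sys/block/sdk/device/iorequest_cnt:0x29a"""
print(parse_io_counters(sample))
# prints {'iodone_cnt': 666, 'ioerr_cnt': 0, 'iorequest_cnt': 666}
```

For the output above this gives 666 requests issued, 666 completed and 0 errors. Sampling these counters periodically with a timestamp, and comparing against the times of the guest's errors, would show whether the host's SCSI layer ever registers a failed command on this device.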