SCSI GENERIC command queueing for block storage is unstable.

Mike Hayward

unread,

Mar 15, 2010, 2:39:06 PM3/15/10

to linux-...@vger.kernel.org, mvd...@gmail.com

After discovering that O_NONBLOCK reads and writes were actually
blocking calls, I attempted to use the SCSI generic driver for
nonblocking io. The good news is that it is nonblocking; the bad news
is that it is not dependable in any of the systems I have tested with.

Does anyone know if these defects have been fixed in later kernels?

1. When queueing, write can occassionally return errno 12 (ENOMEM, Cannot
allocate memory). This is documented in the SCSI GENERIC HOWTO,
however only for indirect io and it says extremely rare. I can cause
it easily within a few hours and it can return even for direct io when
no io's are queued and 80% of the ram is free or in buffer cache. The
fd polls as available for writing, but retrying never clears the error
and the fd is no longer usable. This is a complete show stopper.

Linux 2.6.22.1-32.fc6 #1 SMP Wed Aug 1 14:30:16 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

There were no errors logged by the driver. Here is the offending
sg_io_hdr, note all values are in hex:

interface_id S
dxfer_direction fffffffd (SG_DXFER_FROM_DEV)
cmd_len a (it's a READ 10)
mx_sb_len fc
iovec_count 0
dxfer_len 200000
dxferp 1c1f400
cmdp 896518
sbp 896528
timeout 20000
flags 1 (SG_FLAG_DIRECT_IO)
pack_id 0
usr_ptr 8964e0

2. Issuing a lot of concurrent io (READ_10 in this case) to SATA
drives causes the the commands to be aborted for no good reason.
I can reproducibly cause the problem within a few seconds on
known good hardware.

I issue 16 concurrent 64k reads to each of eight sata drives, but
it would probably blow with fewer. Perhaps this is a clue to what
is going on: even if just using one drive, driver_duration shows
the reads take far longer than normal (greater than 10ms) when
first starting to queue io, then the performance behaves more like
one would expect from a sata disk drive.

This problem is reproducible on multiple x86_64 kernels and logs
errors as the drive/bus appears to be reset. Older kernels are
less verbose about it and report with Fixed Sense instead of
descriptors, but the same ATA event is occuring.

Linux opt1 2.6.25-14.fc9.x86_64 #1 SMP Thu May 1 06:06:21 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

interface_id S dxfer_direction -3 cmd_len 10 mx_sb_len 252 iovec_count 0 dxfer_len 10000 dxferp 0x2161200
cmdp 22a9b58 sbp 22a9b68 timeout 20000 flags 1 pack_id 0 usr_ptr 0x22a9b20
CDB 28 00 00 04 c9 03 00 00 80 00 02 02 00 00 00 00
CDB READ_10 RELADR 0 FUA 0 DPO 0 LBA 0004c903 TRANSFERLEN 0080
CDB Control Vendor 0 NACA 0 LINK 0
status 2 masked_status 1 host_status 0 driver_status 8 resid 0 duration 243 info 1
SENSE FIXED RESPONSE CODE 72 VALID 0 SENSEKEY NO_SENSE
SENSE FIXED ILI 0 EOM 0 FILEMARK 0 INFO 00000000 ASENSELEN 0c
SENSE FIXED COMMANDINFO 000a8000 ADDSENSE 0000
SENSE FIXED FRUCODE 00
SKSData SKSV 0

ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata9.00: port_status 0x20080000
ata9.00: cmd c8/00:80:03:c9:04/00:00:00:00:00/e0 tag 0 dma 65536 in
res 50/00:00:82:c9:04/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
ata9.00: status: { DRDY }
ata9: soft resetting link
ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata9.00: configured for UDMA/133
ata9: EH complete
sd 8:0:0:0: [sdd] 625142448 512-byte hardware sectors (320073 MB)
sd 8:0:0:0: [sdd] Write Protect is off
sd 8:0:0:0: [sdd] Mode Sense: 00 3a 00 00
sd 8:0:0:0: [sdd] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

Linux opt2.loup.net 2.6.18-1.2798.fc6 #1 SMP Mon Oct 16 14:39:22 EDT 2006 x86_64 x86_64 x86_64 GNU/Linux

CDB READ_10 RELADR 0 FUA 0 DPO 0 LBA 000fb903 TRANSFERLEN 0080
CDB Control Vendor 0 NACA 0 LINK 0
status 2 masked_status 1 host_status 0 driver_status 8 resid 10000 duration 4 info 1
SENSE FIXED RESPONSE CODE 70 VALID 0 SENSEKEY ABORTED_COMMAND
SENSE FIXED ILI 0 EOM 0 FILEMARK 0 INFO 00000000 ASENSELEN 0a
SENSE FIXED COMMANDINFO 00000000 ADDSENSE 0000
SENSE FIXED FRUCODE 00
SKSData SKSV 0

ata5: no sense translation for status: 0x50
ata5: translated ATA stat/err 0x50/00 to SCSI SK/ASC/ASCQ 0xb/00/00
ata5: status=0x50 { DriveReady SeekComplete }

Linux opt3.loup.net 2.6.22.1-32.fc6 #1 SMP Wed Aug 1 14:30:16 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

CDB READ_10 RELADR 0 FUA 0 DPO 0 LBA 00031803 TRANSFERLEN 0080
CDB Control Vendor 0 NACA 0 LINK 0
status 2 masked_status 1 host_status 0 driver_status 8 resid 0 duration 243 info 1
SENSE FIXED RESPONSE CODE 72 VALID 0 SENSEKEY NO_SENSE
SENSE FIXED ILI 0 EOM 0 FILEMARK 0 INFO 00000000 ASENSELEN 0c
SENSE FIXED COMMANDINFO 000a8000 ADDSENSE 0000
SENSE FIXED FRUCODE 00
SKSData SKSV 0

ata9.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2
ata9.00: (port_status 0x20080000)
ata9.00: cmd c8/00:80:03:1d:1b/00:00:00:00:00/e0 tag 0 cdb 0x0 data 65536 in
res 50/00:00:82:1d:1b/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
ata9: soft resetting port
ata9: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata9.00: configured for UDMA/133
ata9: EH complete
sd 8:0:0:0: [sde] 625142448 512-byte hardware sectors (320073 MB)
sd 8:0:0:0: [sde] Write Protect is off
sd 8:0:0:0: [sde] Mode Sense: 00 3a 00 00
sd 8:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

The following aren't so much show stoppers, but they are defects
and/or can cause serious performance problems.

3. hdr.timeout is NOOP! Despite what HOWTO says, READ_10, WRITE_10
succeed even if hdr.driver_duration exceeds timeout.

4. If dxferlen is less than the READ_10 data transferlen, the SCSI
command sometimes reports success when it shouldn't succeed. This
is data corruption and makes debugging very difficult; don't ask me
how I know :-)

5. I've noticed after queueing the first command, subsequent commands
tend to complete in reverse order even though HOWTO indicates oldest
will be returned first. driver_duration seems to confirm reverse
order execution, particularly noticable when ios seem to take a long
time. This is a performance defect if it's what's really happening
since it isn't desirable behavior if, for example, ios are generally
ordered by lba and issued one at a time by kernel to a non queueing
block device.

Completion Command # hdr.driver_duration (us)
1 22 14979
2 20 14981
3 19 14982
4 17 14984
5 16 14985
6 15 14986
7 14 14988
8 12 14990
9 11 14991
10 10 14992

This reverse order behavior is trivial to reproduce even with 16
concurrent INQUIRY commands.

6. There appears to be no fair queueing mechanism for scsi commands
issued from multiple processes to separate devices over a shared
bus. For example, an io can easily be starved in queue for more
than half a minute with just four processes issuing 16 slow ios
at a time. Reducing number of ios at a time to 2 per process
decreases average and maximum latency to tolerable levels.

- Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Robert Hancock

unread,

Mar 15, 2010, 7:32:14 PM3/15/10

to Mike Hayward, linux-...@vger.kernel.org, mvd...@gmail.com

On 03/15/2010 12:25 PM, Mike Hayward wrote:
> After discovering that O_NONBLOCK reads and writes were actually
> blocking calls, I attempted to use the SCSI generic driver for
> nonblocking io. The good news is that it is nonblocking; the bad news
> is that it is not dependable in any of the systems I have tested with.
>
> Does anyone know if these defects have been fixed in later kernels?
>
> 1. When queueing, write can occassionally return errno 12 (ENOMEM, Cannot
> allocate memory). This is documented in the SCSI GENERIC HOWTO,
> however only for indirect io and it says extremely rare. I can cause
> it easily within a few hours and it can return even for direct io when
> no io's are queued and 80% of the ram is free or in buffer cache. The
> fd polls as available for writing, but retrying never clears the error
> and the fd is no longer usable. This is a complete show stopper.
>
> Linux 2.6.22.1-32.fc6 #1 SMP Wed Aug 1 14:30:16 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

First off, have you tested any of these problems against a newer kernel?

Mike Hayward

unread,

Mar 16, 2010, 11:23:33 PM3/16/10

to hanco...@gmail.com, linux-...@vger.kernel.org, mvd...@gmail.com

Hi Robert,

> > After discovering that O_NONBLOCK reads and writes were actually
> > blocking calls, I attempted to use the SCSI generic driver for
> > nonblocking io. The good news is that it is nonblocking; the bad news
> > is that it is not dependable in any of the systems I have tested with.
> >
> > Does anyone know if these defects have been fixed in later kernels?
> >
> > 1. When queueing, write can occassionally return errno 12 (ENOMEM, Cannot
> > allocate memory). This is documented in the SCSI GENERIC HOWTO,
> > however only for indirect io and it says extremely rare. I can cause
> > it easily within a few hours and it can return even for direct io when
> > no io's are queued and 80% of the ram is free or in buffer cache. The
> > fd polls as available for writing, but retrying never clears the error
> > and the fd is no longer usable. This is a complete show stopper.
> >
> > Linux 2.6.22.1-32.fc6 #1 SMP Wed Aug 1 14:30:16 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> First off, have you tested any of these problems against a newer kernel?

It appears all of the problems have existed at least from 2.6.18
through 2.6.32 since I'm seeing essentially the same behavior on those
plus the two intermediate kernel versions in the last mail, same
hardware. We also see the relevant ones on a recent ubuntu arm device:

Linux armage 2.6.22.18 #1 Thu Mar 19 14:46:22 IST 2009 armv5tejl GNU/Linux

I'm wondering if anyone has recently worked on anything that might be
related. The kernel is issuing error messages independent of a
specific system or kernel and it is very repeatable. The ENOMEM
problem (1) above is much less likely to occur than the other issues
and I have yet to cause it on 2.6.32; doesn't mean it isn't still
there if someone hasn't worked on it recently...

----------------------------------------------------------------------
Linux opt2.loup.net 2.6.32.9-70.fc12.x86_64 #1 SMP Wed Mar 3 04:40:41 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux

interface_id S dxfer_direction -3 cmd_len 10 mx_sb_len 252 iovec_count 0 dxfer_len 10000 dxferp 0x1263200 cmdp 1368cf8 sbp 1368d08 timeout 20000 flags 1 pack_id 0 usr_ptr 0x1368cc0
CDB 28 00 00 11 49 83 00 00 80 00 19 01 00 00 00 00
CDB OP 28 READ_10(10)
CDB READ_10 RELADR 0 FUA 0 DPO 0 LBA 00114983 TRANSFERLEN 0080

CDB Control Vendor 0 NACA 0 LINK 0

status 2 masked_status 1 host_status 0 driver_status 8 resid 0 duration 277 info 1

SENSE FIXED RESPONSE CODE 72 VALID 0 SENSEKEY NO_SENSE
SENSE FIXED ILI 0 EOM 0 FILEMARK 0 INFO 00000000 ASENSELEN 0c
SENSE FIXED COMMANDINFO 000a8000 ADDSENSE 0000
SENSE FIXED FRUCODE 00
SKSData SKSV 0

ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
ata4.00: port_status 0x20080000
ata4.00: failed command: READ DMA
ata4.00: cmd c8/00:80:83:49:11/00:00:00:00:00/e0 tag 0 dma 65536 in
res 50/00:00:02:4a:11/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
ata4.00: status: { DRDY }
ata4: hard resetting link
ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata4.00: configured for UDMA/133
ata4: EH complete
----------------------------------------------------------------------

I'd like to mention another issue I forgot in the last email:

7. The sg info result is supposed to return a nonzero result if it
transfers data DIRECT according to HOWTO but despite issuing aligned
dxferp (as you can see in the one that failed) and setting flags to
DIRECT, it never indicates a direct transfer occurred.

Am I like the only one really using sg driver queueing? From the
number of issues I've found right off the bat, it makes me think it
must be primarily used in blocking ioctl mode.

Are there formal regression tests somewhere that are normally run
against the kernel? I've never seen any in the kernel downloads
themselves. Do such regressions cover the scsi generic driver?

- Mike

Mike Hayward

unread,

Mar 16, 2010, 11:43:49 PM3/16/10

to hanco...@gmail.com, linux-...@vger.kernel.org, mvd...@gmail.com

Hi Robert,

> > After discovering that O_NONBLOCK reads and writes were actually
> > blocking calls, I attempted to use the SCSI generic driver for
> > nonblocking io. The good news is that it is nonblocking; the bad news
> > is that it is not dependable in any of the systems I have tested with.
> >
> > Does anyone know if these defects have been fixed in later kernels?
> >
> > 1. When queueing, write can occassionally return errno 12 (ENOMEM, Cannot
> > allocate memory). This is documented in the SCSI GENERIC HOWTO,
> > however only for indirect io and it says extremely rare. I can cause
> > it easily within a few hours and it can return even for direct io when
> > no io's are queued and 80% of the ram is free or in buffer cache. The
> > fd polls as available for writing, but retrying never clears the error
> > and the fd is no longer usable. This is a complete show stopper.
> >
> > Linux 2.6.22.1-32.fc6 #1 SMP Wed Aug 1 14:30:16 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> First off, have you tested any of these problems against a newer kernel?

Ok, #1 is also happening on 2.6.32.9-70.fc12.x86_64... this time on
several processes nearly simultaneously, so it appears it isn't just
tied to one fd. This time instead of ENOMEM though, I got errno ==
Invalid argument(16) when writing the sg_io_hdr.

interface_id S dxfer_direction fffffffe cmd_len a mx_sb_len fc
iovec_count 0 dxfer_len 200000 dxferp 0x7f85e53dd200 cmdp 1f30a38
sbp 1f30a48 timeout 20000 flags 1 pack_id 0 usr_ptr 0x1f30a00

There appears nothing wrong with the sg_io_hdr. If I can get through
bug #2, my app will normally run constant io for hours before blowing
and it is unlikely three daemons running against separate sg devices
would simultaneously hit this error unless something were wrong in the
sg driver itself.

- Mike