[SCSI PATCH] sd: max-retries becomes configurable

Jeff Garzik

unread,

Sep 24, 2012, 5:10:02 PM9/24/12

to

drivers/scsi/sd.c | 4 ++++
drivers/scsi/sd.h | 2 +-
2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 4df73e5..d15074b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -92,6 +92,10 @@ MODULE_ALIAS_SCSI_DEVICE(TYPE_DISK);
MODULE_ALIAS_SCSI_DEVICE(TYPE_MOD);
MODULE_ALIAS_SCSI_DEVICE(TYPE_RBC);

+static int sd_max_retries = 5;
+module_param_named(max_retries, sd_max_retries, int, 0644);
+MODULE_PARM_DESC(max_retries, "Maximum number of retries, before failing command (default 5)");
+
#if !defined(CONFIG_DEBUG_BLOCK_EXT_DEVT)
#define SD_MINORS 16
#else
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index f703f48..f8488fa 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -18,7 +18,7 @@
/*
* Number of allowed retries
*/
-#define SD_MAX_RETRIES 5
+#define SD_MAX_RETRIES sd_max_retries
#define SD_PASSTHROUGH_RETRIES 1
#define SD_MAX_MEDIUM_TIMEOUTS 2

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

James Bottomley

unread,

Sep 25, 2012, 12:10:02 AM9/25/12

to

On Mon, 2012-09-24 at 17:00 -0400, Jeff Garzik wrote:
>
> drivers/scsi/sd.c | 4 ++++
> drivers/scsi/sd.h | 2 +-
> 2 files changed, 5 insertions(+), 1 deletion(-)

I'm not opposed in principle to doing this (except that it should be a
sysfs parameter like all our other controls), but what's the reasoning
behind needing it changed?

James

Jeff Garzik

unread,

Sep 25, 2012, 1:30:01 AM9/25/12

to

On 09/25/2012 12:06 AM, James Bottomley wrote:
> On Mon, 2012-09-24 at 17:00 -0400, Jeff Garzik wrote:
>>
>> drivers/scsi/sd.c | 4 ++++
>> drivers/scsi/sd.h | 2 +-
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> I'm not opposed in principle to doing this (except that it should be a
> sysfs parameter like all our other controls), but what's the reasoning
> behind needing it changed?

<vendor hat on>

Periodically turns up as a useful field sledgehammer for solving
problems, until the real problem is found and fixed. Got tired of a
very similar patch manually bouncing around the "hey, pssst, this worked
for me" backchannel IT network.

</red hat>

Can you be more specific about sysfs location? A runtime-writable (via
sysfs!) module parameter for a module-wide default seemed appropriate.

Jeff

James Bottomley

unread,

Sep 25, 2012, 6:40:02 AM9/25/12

to

On Tue, 2012-09-25 at 01:21 -0400, Jeff Garzik wrote:
> On 09/25/2012 12:06 AM, James Bottomley wrote:
> > On Mon, 2012-09-24 at 17:00 -0400, Jeff Garzik wrote:
> >>
> >> drivers/scsi/sd.c | 4 ++++
> >> drivers/scsi/sd.h | 2 +-
> >> 2 files changed, 5 insertions(+), 1 deletion(-)
> >
> > I'm not opposed in principle to doing this (except that it should be a
> > sysfs parameter like all our other controls), but what's the reasoning
> > behind needing it changed?
>
> <vendor hat on>
>
> Periodically turns up as a useful field sledgehammer for solving
> problems, until the real problem is found and fixed. Got tired of a
> very similar patch manually bouncing around the "hey, pssst, this worked
> for me" backchannel IT network.
>
> </red hat>

I'm asking because the general consensus from the device guys is that we
should never retry unless the device or the transport tells us to (and
then we shouldn't count the retries). A long time ago we used to get
spurious command failures from retry exhaustion on QUEUE_FULL or BUSY,
but since we switched those to being purely timeout based, I thought the
problem had gone away and I'm curious to know what guise it resurfaced
in.

> Can you be more specific about sysfs location? A runtime-writable (via
> sysfs!) module parameter for a module-wide default seemed appropriate.

Well, if it's really important, the same thing should happen with
retries as happened with timeout (it became a request_queue property),
but it could be hacked as a struct scsi_disk one with a corresponding
entry in sd_dis_attrs.

James

Martin K. Petersen

unread,

Sep 26, 2012, 10:30:03 PM9/26/12

to

>>>>> "James" == James Bottomley <James.B...@HansenPartnership.com> writes:

James> On Mon, 2012-09-24 at 17:00 -0400, Jeff Garzik wrote:
>>
>> drivers/scsi/sd.c | 4 ++++ drivers/scsi/sd.h | 2 +- 2 files changed,
>> 5 insertions(+), 1 deletion(-)

James> I'm not opposed in principle to doing this (except that it should
James> be a sysfs parameter like all our other controls),

Now that we're in that department. I never got any feedback on the
following patch.

Hannes told me in person that he felt the eh_timeout belonged in
scsi_device and not in the request queue. Whereas I favored making it a
block layer tunable despite currently only being used by SCSI. Any
opinions?

--
Martin K. Petersen Oracle Linux Engineering

block/scsi: Allow request and error handling timeouts to be specified

Export rq_timeout in sysfs for block devices since it is a request_queue
property. Until now it has only been possible to explicitly set this for
SCSI class devices.

Also introduce eh_timeout which can be used for error handling
purposes. This was previously hardcoded to 10 seconds in the SCSI error
handling code. However, for some fast-fail scenarios it is necessary to
be able to tune this as it can take several iterations (bus device,
target, bus, controller) before we give up.

Signed-off-by: Martin K. Petersen <martin....@oracle.com>

diff --git a/Documentation/ABI/testing/sysfs-block b/Documentation/ABI/testing/sysfs-block
index c1eb41c..0ad8442 100644
--- a/Documentation/ABI/testing/sysfs-block
+++ b/Documentation/ABI/testing/sysfs-block
@@ -129,6 +129,29 @@ Description:
throughput is desired. If no optimal I/O size is
reported this file contains 0.

+What: /sys/block/<disk>/queue/rq_timeout_secs
+Date: June 2012
+Contact: Martin K. Petersen <martin....@oracle.com>
+Description:
+ This is the timeout in seconds for regular filesystem
+ I/O requests. If the disk device has not completed the
+ request in this many seconds the kernel will fail the
+ I/O and initiate the subsystem-specific error handling
+ process.
+
+What: /sys/block/<disk>/queue/eh_timeout_secs
+Date: June 2012
+Contact: Martin K. Petersen <martin....@oracle.com>
+Description:
+ If a device has timed out while processing regular
+ filesystem I/O the kernel will attempt to bring the
+ device back online. This typically involves an
+ escalating set of approaches (device reset, bus reset,
+ controller reset). eh_timeout_secs describes how many
+ seconds should be spent waiting for response after each
+ recovery attempt before moving on to the next step in
+ the error handling process.
+
What: /sys/block/<disk>/queue/nomerges
Date: January 2010
Contact:
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 565a678..7cc6066 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -87,6 +87,12 @@ void blk_queue_rq_timeout(struct request_queue *q, unsigned int timeout)
}
EXPORT_SYMBOL_GPL(blk_queue_rq_timeout);

+void blk_queue_eh_timeout(struct request_queue *q, unsigned int timeout)
+{
+ q->eh_timeout = timeout;
+}
+EXPORT_SYMBOL_GPL(blk_queue_eh_timeout);
+
void blk_queue_rq_timed_out(struct request_queue *q, rq_timed_out_fn *fn)
{
q->rq_timed_out_fn = fn;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 9628b29..7725c84 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -280,6 +280,42 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
return ret;
}

+static ssize_t
+queue_rq_timeout_store(struct request_queue *q, const char *page, size_t count)
+{
+ unsigned long rq_timeout;
+ ssize_t ret = queue_var_store(&rq_timeout, page, count);
+
+ q->rq_timeout = rq_timeout * HZ;
+
+ return ret;
+}
+
+static ssize_t queue_rq_timeout_show(struct request_queue *q, char *page)
+{
+ int rq_timeout = q->rq_timeout / HZ;
+
+ return queue_var_show(rq_timeout, (page));
+}
+
+static ssize_t
+queue_eh_timeout_store(struct request_queue *q, const char *page, size_t count)
+{
+ unsigned long eh_timeout;
+ ssize_t ret = queue_var_store(&eh_timeout, page, count);
+
+ q->eh_timeout = eh_timeout * HZ;
+
+ return ret;
+}
+
+static ssize_t queue_eh_timeout_show(struct request_queue *q, char *page)
+{
+ int eh_timeout = q->eh_timeout / HZ;
+
+ return queue_var_show(eh_timeout, (page));
+}
+
static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
.show = queue_requests_show,
@@ -394,6 +430,18 @@ static struct queue_sysfs_entry queue_random_entry = {
.store = queue_store_random,
};

+static struct queue_sysfs_entry queue_rq_timeout_entry = {
+ .attr = {.name = "rq_timeout_secs", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_rq_timeout_show,
+ .store = queue_rq_timeout_store,
+};
+
+static struct queue_sysfs_entry queue_eh_timeout_entry = {
+ .attr = {.name = "eh_timeout_secs", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_eh_timeout_show,
+ .store = queue_eh_timeout_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -416,6 +464,8 @@ static struct attribute *default_attrs[] = {
&queue_rq_affinity_entry.attr,
&queue_iostats_entry.attr,
&queue_random_entry.attr,
+ &queue_rq_timeout_entry.attr,
+ &queue_eh_timeout_entry.attr,
NULL,
};

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 4a6381c..22e5480 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -42,8 +42,6 @@

#include <trace/events/scsi.h>

-#define SENSE_TIMEOUT (10*HZ)
-
/*
* These should *probably* be handled by the host itself.
* Since it is allowed to sleep, it probably should.
@@ -852,7 +850,8 @@ static int scsi_send_eh_cmnd(struct scsi_cmnd *scmd, unsigned char *cmnd,
*/
static int scsi_request_sense(struct scsi_cmnd *scmd)
{
- return scsi_send_eh_cmnd(scmd, NULL, 0, SENSE_TIMEOUT, ~0);
+ return scsi_send_eh_cmnd(scmd, NULL, 0,
+ scmd->device->request_queue->eh_timeout, ~0);
}

/**
@@ -953,7 +952,8 @@ static int scsi_eh_tur(struct scsi_cmnd *scmd)
int retry_cnt = 1, rtn;

retry_tur:
- rtn = scsi_send_eh_cmnd(scmd, tur_command, 6, SENSE_TIMEOUT, 0);
+ rtn = scsi_send_eh_cmnd(scmd, tur_command, 6,
+ scmd->device->request_queue->eh_timeout, 0);

SCSI_LOG_ERROR_RECOVERY(3, printk("%s: scmd %p rtn %x\n",
__func__, scmd, rtn));
@@ -1065,12 +1065,13 @@ static int scsi_eh_abort_cmds(struct list_head *work_q,
static int scsi_eh_try_stu(struct scsi_cmnd *scmd)
{
static unsigned char stu_command[6] = {START_STOP, 0, 0, 0, 1, 0};
+ unsigned int tmo = scmd->device->request_queue->rq_timeout;

if (scmd->device->allow_restart) {
int i, rtn = NEEDS_RETRY;

for (i = 0; rtn == NEEDS_RETRY && i < 2; i++)
- rtn = scsi_send_eh_cmnd(scmd, stu_command, 6, scmd->device->request_queue->rq_timeout, 0);
+ rtn = scsi_send_eh_cmnd(scmd, stu_command, 6, tmo, 0);

if (rtn == SUCCESS)
return 0;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 4ba3719..3b89071 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1696,6 +1696,9 @@ struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
*/
blk_queue_dma_alignment(q, 0x03);

+ blk_queue_rq_timeout(q, SCSI_DEFAULT_RQ_TIMEOUT);
+ blk_queue_eh_timeout(q, SCSI_DEFAULT_EH_TIMEOUT);
+
return q;
}
EXPORT_SYMBOL(__scsi_alloc_queue);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 4e72a9d..6a52b0d 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -379,6 +379,7 @@ struct request_queue {
unsigned int in_flight[2];

unsigned int rq_timeout;
+ unsigned int eh_timeout;
struct timer_list timeout;
struct list_head timeout_list;

@@ -889,6 +890,7 @@ extern void blk_queue_update_dma_alignment(struct request_queue *, int);
extern void blk_queue_softirq_done(struct request_queue *, softirq_done_fn *);
extern void blk_queue_rq_timed_out(struct request_queue *, rq_timed_out_fn *);
extern void blk_queue_rq_timeout(struct request_queue *, unsigned int);
+extern void blk_queue_eh_timeout(struct request_queue *, unsigned int);
extern void blk_queue_flush(struct request_queue *q, unsigned int flush);
extern void blk_queue_flush_queueable(struct request_queue *q, bool queueable);
extern struct backing_dev_info *blk_get_backing_dev_info(struct block_device *bdev);
diff --git a/include/scsi/scsi.h b/include/scsi/scsi.h
index 66216c1..3a01dd1 100644
--- a/include/scsi/scsi.h
+++ b/include/scsi/scsi.h
@@ -10,9 +10,15 @@

#include <linux/types.h>
#include <linux/scatterlist.h>
+#include <linux/kernel.h>

struct scsi_cmnd;

+enum scsi_timeouts {
+ SCSI_DEFAULT_RQ_TIMEOUT = 30 * HZ,
+ SCSI_DEFAULT_EH_TIMEOUT = 10 * HZ,
+};
+
/*
* The maximum number of SG segments that we will put inside a
* scatterlist (unless chaining is used). Should ideally fit inside a

James Bottomley

unread,

Sep 27, 2012, 12:50:02 AM9/27/12

to

On Wed, 2012-09-26 at 22:20 -0400, Martin K. Petersen wrote:
> >>>>> "James" == James Bottomley <James.B...@HansenPartnership.com> writes:
>
> James> On Mon, 2012-09-24 at 17:00 -0400, Jeff Garzik wrote:
> >>
> >> drivers/scsi/sd.c | 4 ++++ drivers/scsi/sd.h | 2 +- 2 files changed,
> >> 5 insertions(+), 1 deletion(-)
>
> James> I'm not opposed in principle to doing this (except that it should
> James> be a sysfs parameter like all our other controls),
>
> Now that we're in that department. I never got any feedback on the
> following patch.
>
> Hannes told me in person that he felt the eh_timeout belonged in
> scsi_device and not in the request queue. Whereas I favored making it a
> block layer tunable despite currently only being used by SCSI. Any
> opinions?

request_queue makes more sense to me because there was once a plan to
move all our timeout processing to block. I think it got stalled
somewhere, but this would act as a reminder.

James

Jeff Garzik

unread,

Sep 27, 2012, 1:10:03 AM9/27/12

to

On 09/25/2012 06:38 AM, James Bottomley wrote:
> On Tue, 2012-09-25 at 01:21 -0400, Jeff Garzik wrote:
>> Can you be more specific about sysfs location? A runtime-writable (via
>> sysfs!) module parameter for a module-wide default seemed appropriate.
>
> Well, if it's really important, the same thing should happen with
> retries as happened with timeout (it became a request_queue property),
> but it could be hacked as a struct scsi_disk one with a corresponding
> entry in sd_dis_attrs.

Well, it is already a request property... but assigned at
initialization from sd-specific code. sd also passes this through
scmd->allowed to rq->retries.

It could become a request_queue property, but that seems like a hack as
it is simply passed right back into SCSI EH, for SCSI-specific disposition.

Jeff

Dan Williams

unread,

Sep 28, 2012, 2:40:01 PM9/28/12

to

On Wed, Sep 26, 2012 at 7:20 PM, Martin K. Petersen
<martin....@oracle.com> wrote:
>>>>>> "James" == James Bottomley <James.B...@HansenPartnership.com> writes:
>
> James> On Mon, 2012-09-24 at 17:00 -0400, Jeff Garzik wrote:
>>>
>>> drivers/scsi/sd.c | 4 ++++ drivers/scsi/sd.h | 2 +- 2 files changed,
>>> 5 insertions(+), 1 deletion(-)
>
> James> I'm not opposed in principle to doing this (except that it should
> James> be a sysfs parameter like all our other controls),
>
> Now that we're in that department. I never got any feedback on the
> following patch.
>
> Hannes told me in person that he felt the eh_timeout belonged in
> scsi_device and not in the request queue. Whereas I favored making it a
> block layer tunable despite currently only being used by SCSI. Any
> opinions?
>

I think request_queue since tuning these parameters would be useful
for libata/libsas usage and libata is still on track to move out from
under scsi ;-).

Hm, how to extend this for the ata eh case? I had a global hack to
make libata give up quicker [1], but I it would be better to have it
be per queue. What about another tunable to limit how deep into the
recovery escalation to go? For pure scsi it short circuits escalation
and for ata it limits tries ata_eh_reset(), and then the value of
eh_timeout_secs in the non-default case overrides those in
ata_eh_reset_timeouts..

--
Dan

[1]: "libata: reset once" http://marc.info/?l=linux-ide&m=134189240724489&w=2
...Dave found this does not work as advertised for libsas since we
don't end up applying force parameters to sas-ata links. My test case
apparently fell out of eh early for another reason.

Ric Wheeler

unread,

Oct 1, 2012, 3:50:02 AM10/1/12

to

On 09/25/2012 04:08 PM, James Bottomley wrote:
> On Tue, 2012-09-25 at 01:21 -0400, Jeff Garzik wrote:
>> On 09/25/2012 12:06 AM, James Bottomley wrote:
>>> On Mon, 2012-09-24 at 17:00 -0400, Jeff Garzik wrote:
>>>> drivers/scsi/sd.c | 4 ++++
>>>> drivers/scsi/sd.h | 2 +-
>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>> I'm not opposed in principle to doing this (except that it should be a
>>> sysfs parameter like all our other controls), but what's the reasoning
>>> behind needing it changed?
>> <vendor hat on>
>>
>> Periodically turns up as a useful field sledgehammer for solving
>> problems, until the real problem is found and fixed. Got tired of a
>> very similar patch manually bouncing around the "hey, pssst, this worked
>> for me" backchannel IT network.
>>
>> </red hat>
> I'm asking because the general consensus from the device guys is that we
> should never retry unless the device or the transport tells us to (and
> then we shouldn't count the retries). A long time ago we used to get
> spurious command failures from retry exhaustion on QUEUE_FULL or BUSY,
> but since we switched those to being purely timeout based, I thought the
> problem had gone away and I'm curious to know what guise it resurfaced
> in.

I think that is still very much a true statement. By the time normal disks
return an error, they have retried *many* times in firmware. There are some
exceptions of course - vibrations and so on might make this useful.

Back when my day job often involved recovering data from dead drives, we
actually normally wanted to cut retries down to zero since various part of the
stack retried for us so much that each bad sector had to be timed out multiple
times!

I don't object to making this a tunable, but we should default to not retrying.

Also would be very interesting to seeing if this actually is useful in the real
world, not just "word on the street" world :)

Ric