Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
---
block/cfq-iosched.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 918c7fd..01bb0f3 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1268,7 +1268,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return;
if (!cfqq->next_rq)
return;
-
+ if (blk_queue_nonrot(cfqd->queue))
+ return;
cfqq->p_root = &cfqd->prio_trees[cfqq->org_ioprio];
__cfqq = cfq_prio_tree_lookup(cfqd, cfqq->p_root,
blk_rq_pos(cfqq->next_rq), &parent, &p);
@@ -1689,7 +1690,7 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
struct cfq_queue *__cfqq;
sector_t sector = cfqd->last_position;
- if (RB_EMPTY_ROOT(root))
+ if (RB_EMPTY_ROOT(root) || blk_queue_nonrot(cfqd->queue))
return NULL;
/*
--
1.6.4.4
If the distance is zero, it may still make a big difference (at least
for writes). This check would be better as "ncq and doesn't suck", ala
blk_queue_nonrot(q) && tagged
like we do elsewhere.
--
Jens Axboe
For reads, though, even flash cards and netbook SSDs are completely
unaffected. I have done a few experiments on my available disks:
* http://dl.dropbox.com/u/3525644/service_time.png (I used the
program http://dl.dropbox.com/u/3525644/stride.c to get the graphs; a
sketch of that kind of measurement follows).
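For reference, a minimal sketch of that kind of stride measurement (a
hypothetical reconstruction, not the actual stride.c): it times O_DIRECT
reads issued at a fixed distance from each other; the device and stride are
arguments, and the stride should be a multiple of the 4 KiB block size.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK   4096
#define SAMPLES 256

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <device> <stride_bytes>\n", argv[0]);
		return 1;
	}
	long long stride = atoll(argv[2]);
	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, BLOCK, BLOCK))
		return 1;

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < SAMPLES; i++) {
		/* read one block, then jump ahead by the given distance */
		if (pread(fd, buf, BLOCK, (off_t)i * stride) != BLOCK) {
			perror("pread");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e3;
	printf("stride %lld: %.1f us per request\n", stride, us / SAMPLES);
	close(fd);
	return 0;
}

Running it for a range of strides gives a service-time-versus-distance
curve like the one in the linked graph.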
For distance 0, I think request merging will be more effective than
queue merging; moreover, I think the multi-thread trick to get a large
I/O depth is used for reads, not writes (where simply issuing buffered
writes already achieves a similar effect), so I think it is safe to
disable it for all non-rotational devices.
Thanks,
Corrado
Completely agree, it's writes that matter (as mentioned).
> For distance 0, I think request merging will be more effective than
> queue merging, moreover I think the multi-thread trick to have large
Definitely true, but we don't allow cross cfqq merges to begin with.
> I/O depth is used for reads, not writes (where simply issuing buffered
> writes already achieves a similar effect), so I think it is safe to
> disable it for all non-rotational devices.
That still leaves direct writes. Granted it's a problem with a huge
scope, but still.
Thanks,
Corrado
--
__________________________________________________________________________
dott. Corrado Zoccolo mailto:czoc...@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda
That sounds like a solution and avoids the merge/breakup pain for (by
far) most use cases.
> For writes, merging queues (and therefore requests) can probably help
> even the smart ssds.
Yes, but we are getting to the point of having to be more careful about
CPU cycles on SSDs. But let's do it; I'll be spending a good chunk of
time on that very shortly anyway.
--
Jens Axboe
Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
---
block/cfq-iosched.c | 20 +++++++++++---------
1 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 918c7fd..7da9391 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -108,9 +108,9 @@ struct cfq_queue {
struct rb_root sort_list;
/* if fifo isn't expired, next request to serve */
struct request *next_rq;
- /* requests queued in sort_list */
+ /* requests queued in sort_list, indexed by READ/WRITE */
int queued[2];
- /* currently allocated requests */
+ /* currently allocated requests, indexed by READ/WRITE */
int allocated[2];
/* fifo list of requests in sort_list */
struct list_head fifo;
@@ -1268,7 +1268,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return;
if (!cfqq->next_rq)
return;
-
+ if (blk_queue_nonrot(cfqd->queue) && !cfqq->queued[WRITE])
+ return;
cfqq->p_root = &cfqd->prio_trees[cfqq->org_ioprio];
__cfqq = cfq_prio_tree_lookup(cfqd, cfqq->p_root,
blk_rq_pos(cfqq->next_rq), &parent, &p);
@@ -1337,10 +1338,10 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- const int sync = rq_is_sync(rq);
+ const int rw = rq_data_dir(rq);
- BUG_ON(!cfqq->queued[sync]);
- cfqq->queued[sync]--;
+ BUG_ON(!cfqq->queued[rw]);
+ cfqq->queued[rw]--;
elv_rb_del(&cfqq->sort_list, rq);
@@ -1363,7 +1364,7 @@ static void cfq_add_rq_rb(struct request *rq)
struct cfq_data *cfqd = cfqq->cfqd;
struct request *__alias, *prev;
- cfqq->queued[rq_is_sync(rq)]++;
+ cfqq->queued[rq_data_dir(rq)]++;
/*
* looks a little odd, but the first insert might return an alias.
@@ -1393,7 +1394,7 @@ static void cfq_add_rq_rb(struct request *rq)
static void cfq_reposition_rq_rb(struct cfq_queue *cfqq, struct request *rq)
{
elv_rb_del(&cfqq->sort_list, rq);
- cfqq->queued[rq_is_sync(rq)]--;
+ cfqq->queued[rq_data_dir(rq)]--;
cfq_add_rq_rb(rq);
}
@@ -1689,7 +1690,8 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
struct cfq_queue *__cfqq;
sector_t sector = cfqd->last_position;
- if (RB_EMPTY_ROOT(root))
+ if (RB_EMPTY_ROOT(root) ||
+ (blk_queue_nonrot(cfqd->queue) && !cur_cfqq->queued[WRITE]))
return NULL;
/*
--
1.6.4.4
--
Hi Corrado,
What's the reason that reads don't benefit from merging queues and hence
merging requests and only writes do on SSD?
> Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
> ---
> block/cfq-iosched.c | 20 +++++++++++---------
> 1 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 918c7fd..7da9391 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -108,9 +108,9 @@ struct cfq_queue {
> struct rb_root sort_list;
> /* if fifo isn't expired, next request to serve */
> struct request *next_rq;
> - /* requests queued in sort_list */
> + /* requests queued in sort_list, indexed by READ/WRITE */
> int queued[2];
> - /* currently allocated requests */
> + /* currently allocated requests, indexed by READ/WRITE */
> int allocated[2];
Sometime back Jens had changed all READ/WRITE indexing to SYNC/ASYNC
indexing throughout IO schedulers and block layer. Personally I would
prefer to keep it that way and not have a mix of SYNC/ASYNC and READ/WRITE
indexing in code.
What are we gaining by this patch? Saving some CPU cycles by not merging
and splitting the read cfqq on SSD? Do you have any numbers on how much
the saving is? My knee-jerk reaction is that if the gains are not
significant, let's not do this optimization and let the code be simple.
> /* fifo list of requests in sort_list */
> struct list_head fifo;
> @@ -1268,7 +1268,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> return;
> if (!cfqq->next_rq)
> return;
> -
> + if (blk_queue_nonrot(cfqd->queue) && !cfqq->queued[WRITE])
> + return;
A 1-2 line comment here explaining why writes still benefit but reads
don't would help.
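Something along these lines, perhaps (wording only a suggestion, based on
Corrado's explanation later in the thread):

	/*
	 * On non-rotational devices, read performance does not depend on
	 * request proximity, so the prio tree (and the queue merging it
	 * enables) only pays off while the queue has WRITEs pending:
	 * small writes still hurt on cheap SSDs.
	 */
	if (blk_queue_nonrot(cfqd->queue) && !cfqq->queued[WRITE])
		return;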
On Mon, Jan 4, 2010 at 3:47 PM, Vivek Goyal <vgo...@redhat.com> wrote:
> On Wed, Dec 30, 2009 at 11:22:47PM +0100, Corrado Zoccolo wrote:
>> Non rotational devices' performances are not affected by
>> distance of read requests, so there is no point in having
>> overhead to merge such queues.
>> This doesn't apply to writes, so this patch changes the
>> queued[] field, to be indexed by READ/WRITE instead of
>> SYNC/ASYNC, and only compute proximity for queues with
>> WRITE requests.
>>
>
> Hi Corrado,
>
> What's the reason that reads don't benefit from merging queues and hence
> merging requests and only writes do on SSD?
On SSDs, reads are just limited by the maximum transfer rate, and
larger (i.e. merged) reads will just take proportionally longer.
>> Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
>> ---
>> block/cfq-iosched.c | 20 +++++++++++---------
>> 1 files changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 918c7fd..7da9391 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -108,9 +108,9 @@ struct cfq_queue {
>> struct rb_root sort_list;
>> /* if fifo isn't expired, next request to serve */
>> struct request *next_rq;
>> - /* requests queued in sort_list */
>> + /* requests queued in sort_list, indexed by READ/WRITE */
>> int queued[2];
>> - /* currently allocated requests */
>> + /* currently allocated requests, indexed by READ/WRITE */
>> int allocated[2];
>
> Sometime back Jens had changed all READ/WRITE indexing to SYNC/ASYNC
> indexing throughout IO schedulers and block layer.
Not completely. The allocated field (for which I fixed only the
comment) is still addressed as READ/WRITE.
> Personally I would
> prefer to keep it that way and not have a mix of SYNC/ASYNC and READ/WRITE
> indexing in code.
I think that, as long as it is documented, it should be fine.
> What are we gaining by this patch? Saving some CPU cycles by not merging
> and splitting the read cfqq on SSD?
Yes. We should save a lot of cycles by avoiding the rb tree management
needed for those operations.
Jens' position is that for fast SSDs, we need to save CPU cycles if we
want to perform well.
> Do you have any numbers on how much the saving is? My knee-jerk reaction
> is that if the gains are not significant, let's not do this optimization
> and let the code be simple.
I think we are actually simplifying the code, removing an optimization
(queue merging) when it is not needed.
When you want to reason about how the code performs on SSD, removing
the unknown of queue merging renders the problem easier.
>
>
>> /* fifo list of requests in sort_list */
>> struct list_head fifo;
>> @@ -1268,7 +1268,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> return;
>> if (!cfqq->next_rq)
>> return;
>> -
>> + if (blk_queue_nonrot(cfqd->queue) && !cfqq->queued[WRITE])
>> + return;
>
> A 1-2 line comment here explaining why writes still benefit but reads
> don't would help.
>
It's because low-end SSDs are penalized by small writes. I don't have
a high-end SSD to test with, but Jens is going to do more testing, and
he can then disable merging for writes as well if he sees an
improvement. Note that this is not the usual async write, but sync
write with aio, which I think is quite a niche case.
Thanks
Corrado
> Hi Vivek,
>
> On Mon, Jan 4, 2010 at 3:47 PM, Vivek Goyal <vgo...@redhat.com> wrote:
>> On Wed, Dec 30, 2009 at 11:22:47PM +0100, Corrado Zoccolo wrote:
>>> Non rotational devices' performances are not affected by
>>> distance of read requests, so there is no point in having
>>> overhead to merge such queues.
>>> This doesn't apply to writes, so this patch changes the
>>> queued[] field, to be indexed by READ/WRITE instead of
>>> SYNC/ASYNC, and only compute proximity for queues with
>>> WRITE requests.
>>>
>>
>> Hi Corrado,
>>
>> What's the reason that reads don't benefit from merging queues and hence
>> merging requests and only writes do on SSD?
>
> On SSDs, reads are just limited by the maximum transfer rate, and
> larger (i.e. merged) reads will just take proportionally longer.
This is simply not true. You can get more bandwidth from an SSD (I just
checked numbers for 2 vendors' devices) by issuing larger read requests,
no matter whether the access pattern is sequential or random.
Cheers,
Jeff
Thanks,
Corrado
In my simple testing of 4 fio threads doing direct sequential reads,
the throughput varies significantly if I vary bs from 4K to 128K.
bs=4K 65MB/s
bs=128K 228MB/s
Thanks
Vivek
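For reference, a rough single-threaded C sketch of this kind of test
(O_DIRECT sequential reads at a configurable block size; the 512 MiB total
and the lack of parallel threads are simplifications compared to the fio
job above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <device> <block_size>\n", argv[0]);
		return 1;
	}
	size_t bs = (size_t)atoll(argv[2]);        /* e.g. 4096 or 131072 */
	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, bs))
		return 1;

	const long long target = 512LL << 20;      /* read 512 MiB in total */
	long long total = 0;
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (total < target) {
		ssize_t r = read(fd, buf, bs);     /* sequential direct reads */
		if (r <= 0)
			break;
		total += r;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("bs=%zu: %.1f MB/s\n", bs, total / s / 1e6);
	close(fd);
	return 0;
}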
For Jeff, at least, the "dump" utility threads were kind of working in
lockstep for writes, and he gained significantly by merging these queues
together.
So the argument is that CPU overhead saving in this case is more substantial
as compared to gains made by lockstep read threads. I think we shall have to
have some numbers to justify that.
Thanks
Vivek
Actually, it was for reads.
> So the argument is that CPU overhead saving in this case is more substantial
> as compared to gains made by lockstep read threads. I think we shall have to
> have some numbers to justify that.
Agreed. Corrado, I know you don't have the hardware, so I'll give this
a run through the read-test2 program and see if it regresses at all.
Cheers,
Jeff
Thanks a lot,
Corrado
I ran the test program 50 times, and here are the results:
==> vanilla <==
Mean: 163.22728
Population Std. Dev.: 0.55401
==> patched <==
Mean: 162.91558
Population Std. Dev.: 1.08612
This looks acceptable to me.
Thanks Jeff, one thing comes to mind. With the recent changes we drive
deeper queue depths on SSDs with NCQ, and there are not many pending cfqqs
on the service tree unless the number of parallel threads exceeds the NCQ
depth (32). If that's the case, then I think we might not see a lot of
queue merging happening in this test case unless the dump utility is
creating more than 32 threads.
If time permits, it might also be interesting to run the same test with queue
depth 1 and see if SSDs without NCQ will suffer or not.
Thanks
Vivek
> Thanks Jeff, one thing comes to mind. With the recent changes we drive
> deeper queue depths on SSDs with NCQ, and there are not many pending cfqqs
> on the service tree unless the number of parallel threads exceeds the NCQ
> depth (32). If that's the case, then I think we might not see a lot of
> queue merging happening in this test case unless the dump utility is
> creating more than 32 threads.
>
> If time permits, it might also be interesting to run the same test with queue
> depth 1 and see if SSDs without NCQ will suffer or not.
Corrado, I think what Vivek is getting at is that you should check for
both blk_queue_nonrot and cfqd->hw_tag (like in cfq_arm_slice_timer).
Do you agree?
Cheers,
Jeff
Vivek is right that on non-NCQ SSDs a successful merge would increase
performance, but I still think that the likelihood of a merge is so low
that maintaining the RB-tree is superfluous. Usually those devices are
coupled with low-end CPUs, so skipping that code could be a win there
too. I'll run some tests on my netbook.
BTW, I'm looking at read-test2 right now. I see it doesn't use direct
I/O, so it also relies on the page cache. I think the page cache can
detect the hidden sequential pattern and thus send big readahead
requests to the device, making merging impossible (on my SSD, the
readahead size and the max hw request size match).
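If one wanted to keep buffered I/O but take readahead out of the picture,
something like this could presumably be used (a hypothetical tweak, not
part of read-test2):

#include <fcntl.h>

/* Open a file with kernel readahead disabled, so the interleaved access
 * pattern of the reader threads reaches the I/O scheduler as-is instead
 * of being absorbed into large readahead requests. */
int open_no_readahead(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd >= 0)
		posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	return fd;
}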
Thanks,
Corrado
find /sys/ -name rotational 2>/dev/null
find /sys/ -name rotational 2>/dev/null|xargs cat
All devices are reported as rotational for me, including ram, loop, and my
USB flash drives. Physical block size and optimal I/O size have invalid
values for all my USB flash drives.
I think it would be better to do a short performance test before mount; it
would provide all the necessary information for the I/O scheduler. We don't
need information about NCQ and rotational status; we need to predict how
much time a specific I/O operation will take in the current context.
PS: I'm not a native speaker.
Best regards,
Kirill Afonshin
Hi Corrado,
How does the idle time value relate to the flash card being slower for
writes? If the flash card is slow and we choose to idle on a queue
(because of direct writes), the idle time value does not even kick in.
We just continue to remain on the same cfqq and don't dispatch from the
next cfqq.
The idle time value will matter only if there was a delay on the CPU
side or on the workload side in issuing the next request after
completion of the previous one.
Thanks
Vivek
What workload do you have where the reader is thinking for more than 1ms?
To me one issue probably is that for sync queues we drive shallow (1-2)
queue depths and this can be an issue on high end storage where there
can be multiple disks behind the array and this sync queue is just
not keeping array fully utilized. Buffered sequential reads mitigate
this issue up to some extent as requests size is big.
Idling on the queue helps in providing differentiated service for higher
priority queue and also helps to get more out of disk on rotational media
with single disk. But I suspect that on big arrays, this idling on sync
queues and not driving deeper queue depths might hurt.
So if we had a way to detect that we have a big storage array underneath,
maybe we can get more throughput by not idling at all. But we will also
lose the service differentiation between the various ioprio queues. I guess
your patches for monitoring service times might be useful here.
> So the optimal choice would be to have two different idle times, one
> for switch between readers, and one when switching from readers to
> writers.
Sounds like read and write batches. With your workload type, we are already
doing it. Idle per service tree. At least it solves the problem for
sync-noidle queues where we don't idle between read queues but do idle
between read and buffered write (async queues).
In my testing so far, I have not encountered the workloads where readers
are thinking a lot. Think time has been very small.
Thanks
Vivek
>
> Idling on the queue helps in providing differentiated service for higher
> priority queue and also helps to get more out of disk on rotational media
> with single disk. But I suspect that on big arrays, this idling on sync
> queues and not driving deeper queue depths might hurt.
We should have some numbers to support that. In all the tests I saw,
setting slice_idle to 0 also causes a regression on decently sized arrays,
at least when the number of concurrent processes is big enough that 2 of
them will likely make requests to the same disk (and by the birthday
paradox, this can be quite a small number, even with very large arrays:
e.g. with 365-disk raids, 23 concurrent processes have a 50% probability
of colliding on the same disk at every single request sent).
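A quick back-of-the-envelope check of that birthday-paradox figure (an
illustrative helper, not from the thread):

#include <stdio.h>

int main(void)
{
	const double disks = 365.0;        /* disks in the array */
	double p_all_distinct = 1.0;

	for (int procs = 1; procs <= 30; procs++) {
		/* probability that the new process picks a disk nobody else did */
		p_all_distinct *= (disks - (procs - 1)) / disks;
		printf("%2d processes: P(two hit the same disk) = %.3f\n",
		       procs, 1.0 - p_all_distinct);
	}
	return 0;   /* crosses ~0.507 at 23 processes, matching the figure above */
}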
>
> So if we had a way to detect that we have a big storage array underneath,
> maybe we can get more throughput by not idling at all. But we will also
> lose the service differentiation between the various ioprio queues. I guess
> your patches for monitoring service times might be useful here.
It might, but we need to identify hardware on which not idling is
beneficial, measure its behaviour, and see which measurable parameter
can clearly distinguish it from other hardware where idling is
required. If we are speaking of RAIDs of rotational disks, seek time
(which I was measuring) is not a good parameter, because it can still
be high.
>
>> So the optimal choice would be to have two different idle times, one
>> for switch between readers, and one when switching from readers to
>> writers.
>
> Sounds like read and write batches. With your workload type, we are already
> doing it. Idle per service tree. At least it solves the problem for
> sync-noidle queues where we don't idle between read queues but do idle
> between read and buffered write (async queues).
>
In fact those changes improved my netbook boot time a lot, and I'm not
even using sreadahead. But if autotuning reduces the slice idle, then
I see again the huge penalty of small writes.
> In my testing so far, I have not encountered the workloads where readers
> are thinking a lot. Think time has been very small.
Sometimes real workloads have more variable think times than our
synthetic benchmarks.
>
> Thanks
> Vivek
>
Thanks,
Corrado
Ok, so booting on your netbook, where the write cost is high, is the case.
So in this particular case you prefer to delay writes a bit to reduce the
read latency, and writes can catch up a little later.
> >
> > To me one issue probably is that for sync queues we drive shallow (1-2)
> > queue depths and this can be an issue on high end storage where there
> > can be multiple disks behind the array and this sync queue is just
> > not keeping array fully utilized. Buffered sequential reads mitigate
> > this issue up to some extent as requests size is big.
> I think for sequential queues, you should tune your readahead to hit
> all the disks of the raid. In that case, idling makes sense, because
> all the disks will now be ready to serve the new request immediately.
>
> >
> > Idling on the queue helps in providing differentiated service for higher
> > priority queue and also helps to get more out of disk on rotational media
> > with single disk. But I suspect that on big arrays, this idling on sync
> > queues and not driving deeper queue depths might hurt.
> We should have some numbers to support that. In all the tests I saw,
> setting slice_idle to 0 also causes a regression on decently sized arrays,
> at least when the number of concurrent processes is big enough that 2 of
> them will likely make requests to the same disk (and by the birthday
> paradox, this can be quite a small number, even with very large arrays:
> e.g. with 365-disk raids, 23 concurrent processes have a 50% probability
> of colliding on the same disk at every single request sent).
I will do some tests and see if there are cases where driving shallower
depths hurts.
Vivek
I did some tests, and found a surprising thing.
Simply running the test script, the BW levels off at a high value,
regardless of whether queue merging in CFQ is enabled or disabled.
I suspected something odd was going on, so I modified the script to drop
caches before each run, and now I found that with queue merging it is 3
times faster than without, so on a non-NCQ SSD it is better to have
queue merging enabled after all.
I'm wondering why a full page cache can make such a large difference. My
disk is 4 times larger than the available RAM, so not clearing the cache
should give only about a 1/4 boost.
I have to do more tests to understand what's going on...
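For reference, the cache dropping between runs typically goes through
/proc/sys/vm/drop_caches; a minimal helper (requires root) might look like
this:

#include <stdio.h>
#include <unistd.h>

static int drop_caches(void)
{
	FILE *f;

	sync();   /* write dirty pages out first, so they can actually be dropped */
	f = fopen("/proc/sys/vm/drop_caches", "w");
	if (!f)
		return -1;
	fputs("3\n", f);   /* 3 = drop page cache plus dentries and inodes */
	return fclose(f);
}

int main(void)
{
	return drop_caches() ? 1 : 0;
}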
Thanks,
Corrado
Non-NCQ SSDs showed a regression in some special cases, so
they are ruled out by this patch.
This patch intentionally doesn't affect writes, so it changes
the queued[] field to be indexed by READ/WRITE instead of
SYNC/ASYNC, and only computes proximity for queues with
WRITE requests.
Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
---
block/cfq-iosched.c | 28 +++++++++++++++++-----------
1 files changed, 17 insertions(+), 11 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 918c7fd..3b7c60e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -108,9 +108,9 @@ struct cfq_queue {
struct rb_root sort_list;
/* if fifo isn't expired, next request to serve */
struct request *next_rq;
- /* requests queued in sort_list */
+ /* requests queued in sort_list, indexed by READ/WRITE */
int queued[2];
- /* currently allocated requests */
+ /* currently allocated requests, indexed by READ/WRITE */
int allocated[2];
/* fifo list of requests in sort_list */
struct list_head fifo;
@@ -436,6 +436,10 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
cic->cfqq[is_sync] = cfqq;
}
+static inline bool is_smart_ssd(struct cfq_data *cfqd)
+{
+ return blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag;
+}
/*
* We regard a request as SYNC, if it's either a read or has the SYNC bit
* set (in which case it could also be direct WRITE).
@@ -1268,7 +1272,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return;
if (!cfqq->next_rq)
return;
-
+ if (is_smart_ssd(cfqd) && !cfqq->queued[WRITE])
+ return;
cfqq->p_root = &cfqd->prio_trees[cfqq->org_ioprio];
__cfqq = cfq_prio_tree_lookup(cfqd, cfqq->p_root,
blk_rq_pos(cfqq->next_rq), &parent, &p);
@@ -1337,10 +1342,10 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- const int sync = rq_is_sync(rq);
+ const int rw = rq_data_dir(rq);
- BUG_ON(!cfqq->queued[sync]);
- cfqq->queued[sync]--;
+ BUG_ON(!cfqq->queued[rw]);
+ cfqq->queued[rw]--;
elv_rb_del(&cfqq->sort_list, rq);
@@ -1363,7 +1368,7 @@ static void cfq_add_rq_rb(struct request *rq)
struct cfq_data *cfqd = cfqq->cfqd;
struct request *__alias, *prev;
- cfqq->queued[rq_is_sync(rq)]++;
+ cfqq->queued[rq_data_dir(rq)]++;
/*
* looks a little odd, but the first insert might return an alias.
@@ -1393,7 +1398,7 @@ static void cfq_add_rq_rb(struct request *rq)
static void cfq_reposition_rq_rb(struct cfq_queue *cfqq, struct request *rq)
{
elv_rb_del(&cfqq->sort_list, rq);
- cfqq->queued[rq_is_sync(rq)]--;
+ cfqq->queued[rq_data_dir(rq)]--;
cfq_add_rq_rb(rq);
}
@@ -1689,7 +1694,8 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
struct cfq_queue *__cfqq;
sector_t sector = cfqd->last_position;
- if (RB_EMPTY_ROOT(root))
+ if (RB_EMPTY_ROOT(root) ||
+ (is_smart_ssd(cfqd) && !cur_cfqq->queued[WRITE]))
return NULL;
/*
@@ -1796,7 +1802,7 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
/* We do for queues that were marked with idle window flag. */
if (cfq_cfqq_idle_window(cfqq) &&
- !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag))
+ !is_smart_ssd(cfqd))
return true;
/*
@@ -1817,7 +1823,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
* for devices that support queuing, otherwise we still have a problem
* with sync vs async workloads.
*/
- if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
+ if (is_smart_ssd(cfqd))
return;
WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
That's not really true. Overhead always increases as the total number
of ATA commands issued increases.
Jeff
Jeff Moyer tested the patch on the workload that benefits most from
queue merging, and found that the patch improved performance.
So removing the CPU overhead helps much more than the marginal gain
given by merging on this hardware.
Corrado
It's not always going to be true. On SATA the command overhead is fairly
low, but on other hardware that may not be the case. Unless you are CPU
bound by your IO device, merging will always be beneficial. I'm a
little behind on emails after my vacation; Jeff, what numbers did you
generate, and on what hardware?
--
Jens Axboe
...and on what workload? "the workload that benefits most from queue
merging" is highly subjective, and likely does not cover most workloads
SSDs will see in the field.
Jeff
That, too. The queue merging is not exactly cheap, so perhaps we can
work on making that work better as well. I've got some new hardware in
the bag that'll do IOPS in the millions range, so I'll throw some tests
at it too once I get it cabled up.
--
Jens Axboe
As you said, most workloads don't benefit from queue merging. On those,
the patch just removes overhead.
Thanks,
Corrado
Hi Corrado,
In Jeff's test case of running read-test2, I am not even sure whether any
merging between the queues took place, since on an NCQ SSD we are driving
deeper queue depths, and unless read-test2 is creating more than 32
threads, there might not be any merging taking place at all.
We also don't have any data/numbers on what kind of CPU savings this
patch brings in.
Vivek
Following is what Jeff had posted.
==> vanilla <==
Mean: 163.22728
Population Std. Dev.: 0.55401
==> patched <==
Mean: 162.91558
Population Std. Dev.: 1.08612
I see that with the patched kernel (your patches), the "Mean" BW of the 50
runs has gone down slightly. So where is the improvement in terms of BW?
(Are you referring to the higher standard deviation, i.e. that some of the
runs observed higher BW, and concluding something from that?)
Vivek
>
> Thanks,
> Corrado
Jeff's test was modeled after real use cases: widely used, legacy
programs like dump.
Since we have often said that splitting the sequential stream into
multiple threads was not the correct approach, and we introduced the
change in the kernel just to support those programs (not to encourage
writing more of that kind), we can assume that if they do not drive
deeper queues, no one will. So the overhead is just overhead, and will
never give any benefit.
I therefore want to remove it, since for SSDs it matters.
>
> We also don't have any data/numbers on what kind of CPU savings this
> patch brings in.
Jeff's test showed larger bandwidth with merge disabled, so it implies
some saving is present.
Thanks,
Corrado
Two things come to mind.
- Even if dump/read-test2 is not driving deeper queue depths, other
competing programs might be, which can give the dump program a queue
merging opportunity on an NCQ SSD.
- If we are thinking that on an NCQ SSD there is practically no queue
merging opportunity, then there is no point in keeping it enabled for
WRITES either?
Vivek
Correct. We need a better test to determine if those merges can happen.
>
> - If we are thinking that on an NCQ SSD there is practically no queue
> merging opportunity, then there is no point in keeping it enabled for
> WRITES either?
Probably. Needs test here, too.
Corrado
Sorry, I remembered the numbers wrongly; they were the opposite.
Corrado