Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
---
block/cfq-iosched.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 918c7fd..01bb0f3 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1268,7 +1268,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return;
if (!cfqq->next_rq)
return;
-
+ if (blk_queue_nonrot(cfqd->queue))
+ return;
cfqq->p_root = &cfqd->prio_trees[cfqq->org_ioprio];
__cfqq = cfq_prio_tree_lookup(cfqd, cfqq->p_root,
blk_rq_pos(cfqq->next_rq), &parent, &p);
@@ -1689,7 +1690,7 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
struct cfq_queue *__cfqq;
sector_t sector = cfqd->last_position;
- if (RB_EMPTY_ROOT(root))
+ if (RB_EMPTY_ROOT(root) || blk_queue_nonrot(cfqd->queue))
return NULL;
/*
--
1.6.4.4
If the distance is zero, it may still make a big difference (at least
for writes). This check would be better as "ncq and doesn't suck", ala
blk_queue_nonrot(q) && tagged
like we do elsewhere.
--
Jens Axboe
For reads, though, even flash cards and netbook SSDs are completely
unaffected. I have done a few experiments on my available disks:
* http://dl.dropbox.com/u/3525644/service_time.png (I used the
program http://dl.dropbox.com/u/3525644/stride.c to get the graphs; a
sketch of that kind of measurement follows).
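For reference, a minimal sketch of that kind of stride measurement (a
hypothetical reconstruction, not the actual stride.c): it times O_DIRECT
reads issued at a fixed distance from each other; the device and stride are
arguments, and the stride should be a multiple of the 4 KiB block size.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK   4096
#define SAMPLES 256

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <device> <stride_bytes>\n", argv[0]);
		return 1;
	}
	long long stride = atoll(argv[2]);
	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, BLOCK, BLOCK))
		return 1;

	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < SAMPLES; i++) {
		/* read one block, then jump ahead by the given distance */
		if (pread(fd, buf, BLOCK, (off_t)i * stride) != BLOCK) {
			perror("pread");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double us = (t1.tv_sec - t0.tv_sec) * 1e6 +
		    (t1.tv_nsec - t0.tv_nsec) / 1e3;
	printf("stride %lld: %.1f us per request\n", stride, us / SAMPLES);
	close(fd);
	return 0;
}

Running it for a range of strides gives a service-time-versus-distance
curve like the one in the linked graph.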
For distance 0, I think request merging will be more effective than
queue merging; moreover, I think the multi-thread trick to get a large
I/O depth is used for reads, not writes (where simply issuing buffered
writes already achieves a similar effect), so I think it is safe to
disable it for all non-rotational devices.
Thanks,
Corrado
Completely agree, it's writes that matter (as mentioned).
> For distance 0, I think request merging will be more effective than
> queue merging, moreover I think the multi-thread trick to have large
Definitely true, but we don't allow cross cfqq merges to begin with.
> I/O depth is used for reads, not writes (where simply issuing buffered
> writes already achieves a similar effect), so I think it is safe to
> disable it for all non-rotational devices.
That still leaves direct writes. Granted it's a problem with a huge
scope, but still.
Thanks,
Corrado
--
__________________________________________________________________________
dott. Corrado Zoccolo mailto:czoc...@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda
That sounds like a solution and avoids the merge/breakup pain for (by
far) most use cases.
> For writes, merging queues (and therefore requests) can probably help
> even the smart ssds.
Yes, but we are getting to the point of having to be more careful about
CPU cycles on SSDs. But let's do it; I'll be spending a good chunk of
time on that very shortly anyway.
--
Jens Axboe
Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
---
block/cfq-iosched.c | 20 +++++++++++---------
1 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 918c7fd..7da9391 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -108,9 +108,9 @@ struct cfq_queue {
struct rb_root sort_list;
/* if fifo isn't expired, next request to serve */
struct request *next_rq;
- /* requests queued in sort_list */
+ /* requests queued in sort_list, indexed by READ/WRITE */
int queued[2];
- /* currently allocated requests */
+ /* currently allocated requests, indexed by READ/WRITE */
int allocated[2];
/* fifo list of requests in sort_list */
struct list_head fifo;
@@ -1268,7 +1268,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return;
if (!cfqq->next_rq)
return;
-
+ if (blk_queue_nonrot(cfqd->queue) && !cfqq->queued[WRITE])
+ return;
cfqq->p_root = &cfqd->prio_trees[cfqq->org_ioprio];
__cfqq = cfq_prio_tree_lookup(cfqd, cfqq->p_root,
blk_rq_pos(cfqq->next_rq), &parent, &p);
@@ -1337,10 +1338,10 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- const int sync = rq_is_sync(rq);
+ const int rw = rq_data_dir(rq);
- BUG_ON(!cfqq->queued[sync]);
- cfqq->queued[sync]--;
+ BUG_ON(!cfqq->queued[rw]);
+ cfqq->queued[rw]--;
elv_rb_del(&cfqq->sort_list, rq);
@@ -1363,7 +1364,7 @@ static void cfq_add_rq_rb(struct request *rq)
struct cfq_data *cfqd = cfqq->cfqd;
struct request *__alias, *prev;
- cfqq->queued[rq_is_sync(rq)]++;
+ cfqq->queued[rq_data_dir(rq)]++;
/*
* looks a little odd, but the first insert might return an alias.
@@ -1393,7 +1394,7 @@ static void cfq_add_rq_rb(struct request *rq)
static void cfq_reposition_rq_rb(struct cfq_queue *cfqq, struct request *rq)
{
elv_rb_del(&cfqq->sort_list, rq);
- cfqq->queued[rq_is_sync(rq)]--;
+ cfqq->queued[rq_data_dir(rq)]--;
cfq_add_rq_rb(rq);
}
@@ -1689,7 +1690,8 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
struct cfq_queue *__cfqq;
sector_t sector = cfqd->last_position;
- if (RB_EMPTY_ROOT(root))
+ if (RB_EMPTY_ROOT(root) ||
+ (blk_queue_nonrot(cfqd->queue) && !cur_cfqq->queued[WRITE]))
return NULL;
/*
--
1.6.4.4
--
Hi Corrado,
What's the reason that reads don't benefit from merging queues and hence
merging requests and only writes do on SSD?
> Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
> ---
> block/cfq-iosched.c | 20 +++++++++++---------
> 1 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index 918c7fd..7da9391 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -108,9 +108,9 @@ struct cfq_queue {
> struct rb_root sort_list;
> /* if fifo isn't expired, next request to serve */
> struct request *next_rq;
> - /* requests queued in sort_list */
> + /* requests queued in sort_list, indexed by READ/WRITE */
> int queued[2];
> - /* currently allocated requests */
> + /* currently allocated requests, indexed by READ/WRITE */
> int allocated[2];
Sometime back Jens had changed all READ/WRITE indexing to SYNC/ASYNC
indexing throughout IO schedulers and block layer. Personally I would
prefer to keep it that way and not have a mix of SYNC/ASYNC and READ/WRITE
indexing in code.
What are we gaining by this patch? Saving some CPU cycles by not merging
and splitting the read cfqq on SSD? Do you have any numbers on how much
the saving is? My knee-jerk reaction is that if the gains are not
significant, let's not do this optimization and let the code be simple.
> /* fifo list of requests in sort_list */
> struct list_head fifo;
> @@ -1268,7 +1268,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
> return;
> if (!cfqq->next_rq)
> return;
> -
> + if (blk_queue_nonrot(cfqd->queue) && !cfqq->queued[WRITE])
> + return;
A 1-2 line comment here explaining why writes still benefit but reads
don't would help.
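Something along these lines, perhaps (wording only a suggestion, based on
Corrado's explanation later in the thread):

	/*
	 * On non-rotational devices, read performance does not depend on
	 * request proximity, so the prio tree (and the queue merging it
	 * enables) only pays off while the queue has WRITEs pending:
	 * small writes still hurt on cheap SSDs.
	 */
	if (blk_queue_nonrot(cfqd->queue) && !cfqq->queued[WRITE])
		return;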
On Mon, Jan 4, 2010 at 3:47 PM, Vivek Goyal <vgo...@redhat.com> wrote:
> On Wed, Dec 30, 2009 at 11:22:47PM +0100, Corrado Zoccolo wrote:
>> Non rotational devices' performances are not affected by
>> distance of read requests, so there is no point in having
>> overhead to merge such queues.
>> This doesn't apply to writes, so this patch changes the
>> queued[] field, to be indexed by READ/WRITE instead of
>> SYNC/ASYNC, and only compute proximity for queues with
>> WRITE requests.
>>
>
> Hi Corrado,
>
> What's the reason that reads don't benefit from merging queues and hence
> merging requests and only writes do on SSD?
On SSDs, reads are just limited by the maximum transfer rate, and
larger (i.e. merged) reads will just take proportionally longer.
>> Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
>> ---
>> block/cfq-iosched.c | 20 +++++++++++---------
>> 1 files changed, 11 insertions(+), 9 deletions(-)
>>
>> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
>> index 918c7fd..7da9391 100644
>> --- a/block/cfq-iosched.c
>> +++ b/block/cfq-iosched.c
>> @@ -108,9 +108,9 @@ struct cfq_queue {
>> struct rb_root sort_list;
>> /* if fifo isn't expired, next request to serve */
>> struct request *next_rq;
>> - /* requests queued in sort_list */
>> + /* requests queued in sort_list, indexed by READ/WRITE */
>> int queued[2];
>> - /* currently allocated requests */
>> + /* currently allocated requests, indexed by READ/WRITE */
>> int allocated[2];
>
> Sometime back Jens had changed all READ/WRITE indexing to SYNC/ASYNC
> indexing throughout IO schedulers and block layer.
Not completely. The allocated field (for which I fixed only the
comment) is still addressed as READ/WRITE.
> Personally I would
> prefer to keep it that way and not have a mix of SYNC/ASYNC and READ/WRITE
> indexing in code.
I think that, as long as it is documented, it should be fine.
> What are we gaining by this patch? Saving some CPU cycles by not merging
> and splitting the read cfqq on SSD?
Yes. We should save a lot of cycles by avoiding the rb tree management
needed for those operations.
Jens' position is that for fast SSDs, we need to save CPU cycles if we
want to perform well.
> Do you have any numbers on how much the saving is? My knee-jerk reaction
> is that if the gains are not significant, let's not do this optimization
> and let the code be simple.
I think we are actually simplifying the code, removing an optimization
(queue merging) when it is not needed.
When you want to reason about how the code performs on SSD, removing
the unknown of queue merging renders the problem easier.
>
>
>> /* fifo list of requests in sort_list */
>> struct list_head fifo;
>> @@ -1268,7 +1268,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>> return;
>> if (!cfqq->next_rq)
>> return;
>> -
>> + if (blk_queue_nonrot(cfqd->queue) && !cfqq->queued[WRITE])
>> + return;
>
> A 1-2 line comment here explaining why writes still benefit but reads
> don't would help.
>
It's because low-end SSDs are penalized by small writes. I don't have
a high-end SSD to test with, but Jens is going to do more testing, and
he can then disable merging for writes as well if he sees an
improvement. Note that this is not the usual async write, but sync
write with aio, which I think is quite a niche case.
Thanks
Corrado
> Hi Vivek,
>
> On Mon, Jan 4, 2010 at 3:47 PM, Vivek Goyal <vgo...@redhat.com> wrote:
>> On Wed, Dec 30, 2009 at 11:22:47PM +0100, Corrado Zoccolo wrote:
>>> Non rotational devices' performances are not affected by
>>> distance of read requests, so there is no point in having
>>> overhead to merge such queues.
>>> This doesn't apply to writes, so this patch changes the
>>> queued[] field, to be indexed by READ/WRITE instead of
>>> SYNC/ASYNC, and only compute proximity for queues with
>>> WRITE requests.
>>>
>>
>> Hi Corrado,
>>
>> What's the reason that reads don't benefit from merging queues and hence
>> merging requests and only writes do on SSD?
>
> On SSDs, reads are just limited by the maximum transfer rate, and
> larger (i.e. merged) reads will just take proportionally longer.
This is simply not true. You can get more bandwidth from an SSD (I just
checked numbers for 2 vendors' devices) by issuing larger read requests,
no matter whether the access pattern is sequential or random.
Cheers,
Jeff
Thanks,
Corrado
In my simple testing of 4 fio threads doing direct sequential reads,
the throughput varies significantly if I vary bs from 4K to 128K.
bs=4K 65MB/s
bs=128K 228MB/s
Thanks
Vivek
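For reference, a rough single-threaded C sketch of this kind of test
(O_DIRECT sequential reads at a configurable block size; the 512 MiB total
and the lack of parallel threads are simplifications compared to the fio
job above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 3) {
		fprintf(stderr, "usage: %s <device> <block_size>\n", argv[0]);
		return 1;
	}
	size_t bs = (size_t)atoll(argv[2]);        /* e.g. 4096 or 131072 */
	int fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	void *buf;
	if (posix_memalign(&buf, 4096, bs))
		return 1;

	const long long target = 512LL << 20;      /* read 512 MiB in total */
	long long total = 0;
	struct timespec t0, t1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	while (total < target) {
		ssize_t r = read(fd, buf, bs);     /* sequential direct reads */
		if (r <= 0)
			break;
		total += r;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("bs=%zu: %.1f MB/s\n", bs, total / s / 1e6);
	close(fd);
	return 0;
}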
For Jeff, at least, the "dump" utility threads were kind of working in
lockstep for writes, and he gained significantly by merging these queues
together.
So the argument is that CPU overhead saving in this case is more substantial
as compared to gains made by lockstep read threads. I think we shall have to
have some numbers to justify that.
Thanks
Vivek
Actually, it was for reads.
> So the argument is that CPU overhead saving in this case is more substantial
> as compared to gains made by lockstep read threads. I think we shall have to
> have some numbers to justify that.
Agreed. Corrado, I know you don't have the hardware, so I'll give this
a run through the read-test2 program and see if it regresses at all.
Cheers,
Jeff
Thanks a lot,
Corrado
I ran the test program 50 times, and here are the results:
==> vanilla <==
Mean: 163.22728
Population Std. Dev.: 0.55401
==> patched <==
Mean: 162.91558
Population Std. Dev.: 1.08612
This looks acceptable to me.
Thanks Jeff, one thing comes to mind. With the recent changes we drive
deeper queue depths on SSDs with NCQ, and there are not many pending cfqqs
on the service tree unless the number of parallel threads exceeds the NCQ
depth (32). If that's the case, then I think we might not see a lot of
queue merging happening in this test case unless the dump utility is
creating more than 32 threads.
If time permits, it might also be interesting to run the same test with queue
depth 1 and see if SSDs without NCQ will suffer or not.
Thanks
Vivek
> Thanks Jeff, one thing comes to mind. With the recent changes we drive
> deeper queue depths on SSDs with NCQ, and there are not many pending cfqqs
> on the service tree unless the number of parallel threads exceeds the NCQ
> depth (32). If that's the case, then I think we might not see a lot of
> queue merging happening in this test case unless the dump utility is
> creating more than 32 threads.
>
> If time permits, it might also be interesting to run the same test with queue
> depth 1 and see if SSDs without NCQ will suffer or not.
Corrado, I think what Vivek is getting at is that you should check for
both blk_queue_nonrot and cfqd->hw_tag (like in cfq_arm_slice_timer).
Do you agree?
Cheers,
Jeff
Vivek is right that on non-NCQ SSDs a successful merge would increase
performance, but I still think that the likelihood of a merge is so low
that maintaining the RB-tree is superfluous. Usually those devices are
coupled with low-end CPUs, so skipping that code could be a win there
too. I'll run some tests on my netbook.
BTW, I'm looking at read-test2 right now. I see it doesn't use direct
I/O, so it also relies on the page cache. I think the page cache can
detect the hidden sequential pattern and thus send big readahead
requests to the device, making merging impossible (on my SSD, the
readahead size and the max hw request size match).
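If one wanted to keep buffered I/O but take readahead out of the picture,
something like this could presumably be used (a hypothetical tweak, not
part of read-test2):

#include <fcntl.h>

/* Open a file with kernel readahead disabled, so the interleaved access
 * pattern of the reader threads reaches the I/O scheduler as-is instead
 * of being absorbed into large readahead requests. */
int open_no_readahead(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd >= 0)
		posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
	return fd;
}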
Thanks,
Corrado
find /sys/ -name rotational 2>/dev/null
find /sys/ -name rotational 2>/dev/null|xargs cat
All devices are reported as rotational for me, including ram, loop, and my
USB flash drives. Physical block size and optimal I/O size have invalid
values for all my USB flash drives.
I think it would be better to do a short performance test before mount; it
would provide all the necessary information for the I/O scheduler. We don't
need information about NCQ and rotational status; we need to predict how
much time a specific I/O operation will take in the current context.
PS: I'm not a native speaker.
Best regards,
Kirill Afonshin
Hi Corrado,
How does the idle time value relate to the flash card being slower for
writes? If the flash card is slow and we choose to idle on a queue
(because of direct writes), the idle time value does not even kick in.
We just continue to remain on the same cfqq and don't dispatch from the
next cfqq.
The idle time value will matter only if there was a delay on the CPU
side or on the workload side in issuing the next request after
completion of the previous one.
Thanks
Vivek
What workload do you have where the reader is thinking for more than 1ms?
To me one issue probably is that for sync queues we drive shallow (1-2)
queue depths and this can be an issue on high end storage where there
can be multiple disks behind the array and this sync queue is just
not keeping array fully utilized. Buffered sequential reads mitigate
this issue up to some extent as requests size is big.
Idling on the queue helps in providing differentiated service for higher
priority queue and also helps to get more out of disk on rotational media
with single disk. But I suspect that on big arrays, this idling on sync
queues and not driving deeper queue depths might hurt.
So if we had a way to detect that we have a big storage array underneath,
maybe we can get more throughput by not idling at all. But we will also
lose the service differentiation between the various ioprio queues. I guess
your patches for monitoring service times might be useful here.
> So the optimal choice would be to have two different idle times, one
> for switch between readers, and one when switching from readers to
> writers.
Sounds like read and write batches. With your workload type, we are already
doing it. Idle per service tree. At least it solves the problem for
sync-noidle queues where we don't idle between read queues but do idle
between read and buffered write (async queues).
In my testing so far, I have not encountered the workloads where readers
are thinking a lot. Think time has been very small.
Thanks
Vivek
>
> Idling on the queue helps in providing differentiated service for higher
> priority queue and also helps to get more out of disk on rotational media
> with single disk. But I suspect that on big arrays, this idling on sync
> queues and not driving deeper queue depths might hurt.
We should have some numbers to support that. In all the tests I saw,
setting slice_idle to 0 also causes a regression on decently sized arrays,
at least when the number of concurrent processes is big enough that 2 of
them will likely make requests to the same disk (and by the birthday
paradox, this can be quite a small number, even with very large arrays:
e.g. with 365-disk raids, 23 concurrent processes have a 50% probability
of colliding on the same disk at every single request sent).
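A quick back-of-the-envelope check of that birthday-paradox figure (an
illustrative helper, not from the thread):

#include <stdio.h>

int main(void)
{
	const double disks = 365.0;        /* disks in the array */
	double p_all_distinct = 1.0;

	for (int procs = 1; procs <= 30; procs++) {
		/* probability that the new process picks a disk nobody else did */
		p_all_distinct *= (disks - (procs - 1)) / disks;
		printf("%2d processes: P(two hit the same disk) = %.3f\n",
		       procs, 1.0 - p_all_distinct);
	}
	return 0;   /* crosses ~0.507 at 23 processes, matching the figure above */
}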
>
> So if we had a way to detect that we have a big storage array underneath,
> maybe we can get more throughput by not idling at all. But we will also
> lose the service differentiation between the various ioprio queues. I guess
> your patches for monitoring service times might be useful here.
It might, but we need to identify hardware on which not idling is
beneficial, measure its behaviour, and see which measurable parameter
can clearly distinguish it from other hardware where idling is
required. If we are speaking of RAIDs of rotational disks, seek time
(which I was measuring) is not a good parameter, because it can still
be high.
>
>> So the optimal choice would be to have two different idle times, one
>> for switch between readers, and one when switching from readers to
>> writers.
>
> Sounds like read and write batches. With your workload type, we are already
> doing it. Idle per service tree. At least it solves the problem for
> sync-noidle queues where we don't idle between read queues but do idle
> between read and buffered write (async queues).
>
In fact those changes improved my netbook boot time a lot, and I'm not
even using sreadahead. But if autotuning reduces the slice idle, then
I see again the huge penalty of small writes.
> In my testing so far, I have not encountered the workloads where readers
> are thinking a lot. Think time has been very small.
Sometimes real workloads have more variable think times than our
synthetic benchmarks.
>
> Thanks
> Vivek
>
Thanks,
Corrado
Ok, so booting on your netbook, where the write cost is high, is the case.
So in this particular case you prefer to delay writes a bit to reduce the
read latency, and writes can catch up a little later.
> >
> > To me one issue probably is that for sync queues we drive shallow (1-2)
> > queue depths and this can be an issue on high end storage where there
> > can be multiple disks behind the array and this sync queue is just
> > not keeping array fully utilized. Buffered sequential reads mitigate
> > this issue up to some extent as requests size is big.
> I think for sequential queues, you should tune your readahead to hit
> all the disks of the raid. In that case, idling makes sense, because
> all the disks will now be ready to serve the new request immediately.
>
> >
> > Idling on the queue helps in providing differentiated service for higher
> > priority queue and also helps to get more out of disk on rotational media
> > with single disk. But I suspect that on big arrays, this idling on sync
> > queues and not driving deeper queue depths might hurt.
> We should have some numbers to support that. In all the tests I saw,
> setting slice_idle to 0 also causes a regression on decently sized arrays,
> at least when the number of concurrent processes is big enough that 2 of
> them will likely make requests to the same disk (and by the birthday
> paradox, this can be quite a small number, even with very large arrays:
> e.g. with 365-disk raids, 23 concurrent processes have a 50% probability
> of colliding on the same disk at every single request sent).
I will do some tests and see if there are cases where driving shallower
depths hurts.
Vivek
I did some tests, and found a surprising thing.
Simply running the test script, the BW levels off at a high value,
regardless of whether queue merging in CFQ is enabled or disabled.
I suspected something odd was going on, so I modified the script to drop
caches before each run, and now I found that with queue merging it is 3
times faster than without, so on a non-NCQ SSD it is better to have
queue merging enabled after all.
I'm wondering why a full page cache can make such a large difference. My
disk is 4 times larger than the available RAM, so not clearing the cache
should give only about a 1/4 boost.
I have to do more tests to understand what's going on...
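For reference, the cache dropping between runs typically goes through
/proc/sys/vm/drop_caches; a minimal helper (requires root) might look like
this:

#include <stdio.h>
#include <unistd.h>

static int drop_caches(void)
{
	FILE *f;

	sync();   /* write dirty pages out first, so they can actually be dropped */
	f = fopen("/proc/sys/vm/drop_caches", "w");
	if (!f)
		return -1;
	fputs("3\n", f);   /* 3 = drop page cache plus dentries and inodes */
	return fclose(f);
}

int main(void)
{
	return drop_caches() ? 1 : 0;
}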
Thanks,
Corrado
Non-NCQ SSDs showed a regression in some special cases, so
they are ruled out by this patch.
This patch intentionally doesn't affect writes, so it changes
the queued[] field to be indexed by READ/WRITE instead of
SYNC/ASYNC, and only computes proximity for queues with
WRITE requests.
Signed-off-by: Corrado Zoccolo <czoc...@gmail.com>
---
block/cfq-iosched.c | 28 +++++++++++++++++-----------
1 files changed, 17 insertions(+), 11 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 918c7fd..3b7c60e 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -108,9 +108,9 @@ struct cfq_queue {
struct rb_root sort_list;
/* if fifo isn't expired, next request to serve */
struct request *next_rq;
- /* requests queued in sort_list */
+ /* requests queued in sort_list, indexed by READ/WRITE */
int queued[2];
- /* currently allocated requests */
+ /* currently allocated requests, indexed by READ/WRITE */
int allocated[2];
/* fifo list of requests in sort_list */
struct list_head fifo;
@@ -436,6 +436,10 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
cic->cfqq[is_sync] = cfqq;
}
+static inline bool is_smart_ssd(struct cfq_data *cfqd)
+{
+ return blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag;
+}
/*
* We regard a request as SYNC, if it's either a read or has the SYNC bit
* set (in which case it could also be direct WRITE).
@@ -1268,7 +1272,8 @@ static void cfq_prio_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
return;
if (!cfqq->next_rq)
return;
-
+ if (is_smart_ssd(cfqd) && !cfqq->queued[WRITE])
+ return;
cfqq->p_root = &cfqd->prio_trees[cfqq->org_ioprio];
__cfqq = cfq_prio_tree_lookup(cfqd, cfqq->p_root,
blk_rq_pos(cfqq->next_rq), &parent, &p);
@@ -1337,10 +1342,10 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
static void cfq_del_rq_rb(struct request *rq)
{
struct cfq_queue *cfqq = RQ_CFQQ(rq);
- const int sync = rq_is_sync(rq);
+ const int rw = rq_data_dir(rq);
- BUG_ON(!cfqq->queued[sync]);
- cfqq->queued[sync]--;
+ BUG_ON(!cfqq->queued[rw]);
+ cfqq->queued[rw]--;
elv_rb_del(&cfqq->sort_list, rq);
@@ -1363,7 +1368,7 @@ static void cfq_add_rq_rb(struct request *rq)
struct cfq_data *cfqd = cfqq->cfqd;
struct request *__alias, *prev;
- cfqq->queued[rq_is_sync(rq)]++;
+ cfqq->queued[rq_data_dir(rq)]++;
/*
* looks a little odd, but the first insert might return an alias.
@@ -1393,7 +1398,7 @@ static void cfq_add_rq_rb(struct request *rq)
static void cfq_reposition_rq_rb(struct cfq_queue *cfqq, struct request *rq)
{
elv_rb_del(&cfqq->sort_list, rq);
- cfqq->queued[rq_is_sync(rq)]--;
+ cfqq->queued[rq_data_dir(rq)]--;
cfq_add_rq_rb(rq);
}
@@ -1689,7 +1694,8 @@ static struct cfq_queue *cfqq_close(struct cfq_data *cfqd,
struct cfq_queue *__cfqq;
sector_t sector = cfqd->last_position;
- if (RB_EMPTY_ROOT(root))
+ if (RB_EMPTY_ROOT(root) ||
+ (is_smart_ssd(cfqd) && !cur_cfqq->queued[WRITE]))
return NULL;
/*
@@ -1796,7 +1802,7 @@ static bool cfq_should_idle(struct cfq_data *cfqd, struct cfq_queue *cfqq)
/* We do for queues that were marked with idle window flag. */
if (cfq_cfqq_idle_window(cfqq) &&
- !(blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag))
+ !is_smart_ssd(cfqd))
return true;
/*
@@ -1817,7 +1823,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
* for devices that support queuing, otherwise we still have a problem
* with sync vs async workloads.
*/
- if (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag)
+ if (is_smart_ssd(cfqd))
return;
WARN_ON(!RB_EMPTY_ROOT(&cfqq->sort_list));
That's not really true. Overhead always increases as the total number
of ATA commands issued increases.
Jeff
Jeff Moyer tested the patch on the workload that benefits most from
queue merging, and found that the patch improved performance.
So removing the CPU overhead helps much more than the marginal gain
given by merging on this hardware.
Corrado
It's not always going to be true. On SATA the command overhead is fairly
low, but on other hardware that may not be the case. Unless you are CPU
bound by your IO device, merging will always be beneficial. I'm a
little behind on emails after my vacation; Jeff, what numbers did you
generate, and on what hardware?
--
Jens Axboe
...and on what workload? "the workload that benefits most from queue
merging" is highly subjective, and likely does not cover most workloads
SSDs will see in the field.
Jeff
That, too. The queue merging is not exactly cheap, so perhaps we can
work on making that work better as well. I've got some new hardware in
the bag that'll do IOPS in the millions range, so I'll throw some tests
at it too once I get it cabled up.
--
Jens Axboe
As you said, most workloads don't benefit from queue merging. On those,
the patch just removes overhead.
Thanks,
Corrado
Hi Corrado,
In Jeff's test case of running read-test2, I am not even sure whether any
merging between the queues took place, since on an NCQ SSD we are driving
deeper queue depths, and unless read-test2 is creating more than 32
threads, there might not be any merging taking place at all.
We also don't have any data/numbers on what kind of CPU savings this
patch brings in.
Vivek
Following is what Jeff had posted.
==> vanilla <==
Mean: 163.22728
Population Std. Dev.: 0.55401
==> patched <==
Mean: 162.91558
Population Std. Dev.: 1.08612
I see that with the patched kernel (your patches), the "Mean" BW of the 50
runs has gone down slightly. So where is the improvement in terms of BW?
(Are you referring to the higher standard deviation, i.e. that some of the
runs observed higher BW, and concluding something from that?)
Vivek
>
> Thanks,
> Corrado
Jeff's test was modeled after real use cases: widely used, legacy
programs like dump.
Since we have often said that splitting the sequential stream into
multiple threads was not the correct approach, and we introduced the
change in the kernel just to support those programs (not to encourage
writing more of that kind), we can assume that if they do not drive
deeper queues, no one will. So the overhead is just overhead, and will
never give any benefit.
I therefore want to remove it, since for SSDs it matters.
>
> We also don't have any data/numbers on what kind of CPU savings this
> patch brings in.
Jeff's test showed larger bandwidth with merge disabled, so it implies
some saving is present.
Thanks,
Corrado
Two things come to mind.
- Even if dump/read-test2 is not driving deeper queue depths, other
competing programs might be, which can give the dump program a queue
merging opportunity on an NCQ SSD.
- If we are thinking that on an NCQ SSD there is practically no queue
merging opportunity, then there is no point in keeping it enabled for
WRITES either?
Vivek
Correct. We need a better test to determine if those merges can happen.
>
> - If we are thinking that on an NCQ SSD there is practically no queue
> merging opportunity, then there is no point in keeping it enabled for
> WRITES either?
Probably. Needs test here, too.
Corrado
Sorry, I remembered the numbers wrongly; they were the opposite.
Corrado