The test scenario: one JBOD has 12 disks and every disk has 2 partitions. We create
8 1-GB files per partition and start 8 processes doing random reads on the 8 files
in each partition, so there are 8*24 = 192 processes in total. The randread block
size is 64K.
We found the regression on 2 machines. One machine has 8GB of memory and the other
has 6GB.
Bisecting is very unstable: several patches are involved rather than just one.
1) commit 8e550632cccae34e265cb066691945515eaa7fb5
Author: Corrado Zoccolo <czoc...@gmail.com>
Date: Thu Nov 26 10:02:58 2009 +0100
cfq-iosched: fix corner cases in idling logic
This patch introduces a bit less than 20% of the regression. When I revert just the
section below, that part of the regression disappears. This shows that this part is
stable and not affected by the other patches.
@@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
return;
/*
- * still requests with the driver, don't idle
+ * still active requests from this queue, don't idle
*/
- if (rq_in_driver(cfqd))
+ if (cfqq->dispatched)
return;
2) What about the other 20%~30% of the regression? It's complicated. My bisect plus
Li Shaohua's investigation located 3 patches:
df5fe3e8e13883f58dc97489076bbcc150789a21,
b3b6d0408c953524f979468562e7e210d8634150,
5db5d64277bf390056b1a87d0bb288c8b8553f96.
tiobench also shows a regression, and Li Shaohua located the same patches. See
http://lkml.indiana.edu/hypermail/linux/kernel/0912.2/03355.html.
Shaohua worked on patches to fix the tiobench regression. However, his patches
don't help the fio randread 64k regression.
I redid the bisect manually and eventually located the patch below:
commit 718eee0579b802aabe3bafacf09d0a9b0830f1dd
Author: Corrado Zoccolo <czoc...@gmail.com>
Date: Mon Oct 26 22:45:29 2009 +0100
cfq-iosched: fairness for sync no-idle queues
The patch is fairly big. After many tries, I found that the section below is the key.
@@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
+ (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
enable_idle = 0;
That hunk deletes the check on !cfqd->cfq_latency, so enable_idle becomes 0 more
often.
I wrote a test patch that simply works around the 3 original patches related to the
tiobench regression, plus a patch that adds the !cfqd->cfq_latency check back. With
both applied, the entire fio randread 64k regression disappears.
Then, instead of working around the 3 original patches, I applied Shaohua's 2 patches
and added the !cfqd->cfq_latency check while also reverting the section mentioned in 1).
The result still shows more than a 20% regression, so Shaohua's patches don't improve
the fio randread 64k case.
fio_mmap_randread_4k shows about a 10% improvement instead of a regression. I verified
that my patch plus the debug patch have no impact on this improvement.
randwrite 64k has about a 25% regression. My method restores its performance as well.
I worked out a patch that adds the !cfqd->cfq_latency check back in
function cfq_update_idle_window.
In addition, as for item 1), could we just revert that section in cfq_arm_slice_timer?
As Shaohua's patches don't work for this regression, we may need to keep looking for
better approaches. I will check it next week.
---
With kernel 2.6.33-rc1, fio randread 64k shows a regression of more than 40%. I
located the patch below.
commit 718eee0579b802aabe3bafacf09d0a9b0830f1dd
Author: Corrado Zoccolo <czoc...@gmail.com>
Date: Mon Oct 26 22:45:29 2009 +0100
cfq-iosched: fairness for sync no-idle queues
It introduces more than 20% of the regression. The reason is that function
cfq_update_idle_window no longer checks cfqd->cfq_latency, so enable_idle becomes 0
more often. The patch below, against 2.6.33-rc1, adds the check back.
Signed-off-by: Zhang Yanmin <yanmin...@linux.intel.com>
---
diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_rand64k/block/cfq-iosched.c
--- linux-2.6.33_rc1/block/cfq-iosched.c 2009-12-23 14:12:03.000000000 +0800
+++ linux-2.6.33_rc1_rand64k/block/cfq-iosched.c 2009-12-31 16:26:32.000000000 +0800
@@ -3064,8 +3064,8 @@ cfq_update_idle_window(struct cfq_data *
cfq_mark_cfqq_deep(cfqq);
if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
- && CFQQ_SEEKY(cfqq)))
+ (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
+ sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
enable_idle = 0;
else if (sample_valid(cic->ttime_samples)) {
if (cic->ttime_mean > cfqd->cfq_slice_idle)
--
Can you compare the performance also with 2.6.31?
I think I understand what causes your problem.
2.6.32, with default settings, handled even random readers as
sequential ones to provide fairness. This has benefits on single disks
and JBODs, but causes harm on raids.
For 2.6.33, we changed the way in which this is handled, restoring the
enable_idle = 0 for seeky queues as it was in 2.6.31:
@@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd,
struct cfq_queue *cfqq,
enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
- (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
+ (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
enable_idle = 0;
(compare with 2.6.31:
if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
(cfqd->hw_tag && CIC_SEEKY(cic)))
enable_idle = 0;
excluding the sample_valid check, it should be equivalent for you (I
assume you have NCQ disks))
and we provide fairness for them by servicing all seeky queues
together, and then idling before switching to other ones.
The mmap 64k randreader will have a large seek_mean, resulting in
being marked seeky, but will send 16 * 4k sequential requests one
after the other, so alternating between those seeky queues will cause
harm.
I'm working on a new way to compute the seekiness of queues that should
fix your issue by correctly identifying those queues as non-seeky (in my
view, a queue should be considered seeky only if it submits more than 1
seeky request per 8 sequential ones).
>
> The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions. Create
> 8 1-GB files per partition and start 8 processes to do rand read on the 8 files
> per partitions. There are 8*24 processes totally. randread block size is 64K.
>
> We found the regression on 2 machines. One machine has 8GB memory and the other has
> 6GB.
>
> Bisect is very unstable. The related patches are many instead of just one.
>
>
> 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
> Author: Corrado Zoccolo <czoc...@gmail.com>
> Date: Thu Nov 26 10:02:58 2009 +0100
>
> cfq-iosched: fix corner cases in idling logic
>
>
> This patch introduces about less than 20% regression. I just reverted below section
> and this part regression disappear. It shows this regression is stable and not impacted
> by other patches.
>
> @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
> return;
>
> /*
> - * still requests with the driver, don't idle
> + * still active requests from this queue, don't idle
> */
> - if (rq_in_driver(cfqd))
> + if (cfqq->dispatched)
> return;
>
This shouldn't affect you if all queues are marked as idle. Does just
your patch:
> - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
> - && CFQQ_SEEKY(cfqq)))
> + (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
> + sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
fix most of the regression without touching arm_slice_timer?
I guess
> 5db5d64277bf390056b1a87d0bb288c8b8553f96.
will still introduce a 10% regression, but this is needed to improve
latency, and you can just disable low_latency to avoid it.
Thanks,
Corrado
> Can you compare the performance also with 2.6.31?
We did. We run the Linux Kernel Performance Tracking project and run many benchmarks
whenever an RC kernel is released.
The 2.6.31 result is quite similar to the 2.6.32 one, but the 2.6.30 result is about
8% better than 2.6.31.
> I think I understand what causes your problem.
> 2.6.32, with default settings, handled even random readers as
> sequential ones to provide fairness. This has benefits on single disks
> and JBODs, but causes harm on raids.
I didn't test RAID, as the machine with the hardware RAID HBA has crashed. But when
hardware RAID is enabled in the HBA, we mostly use the noop I/O scheduler anyway.
> For 2.6.33, we changed the way in which this is handled, restoring the
> enable_idle = 0 for seeky queues as it was in 2.6.31:
> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd,
> struct cfq_queue *cfqq,
> enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> - (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
> + (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
> enable_idle = 0;
> (compare with 2.6.31:
> if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> (cfqd->hw_tag && CIC_SEEKY(cic)))
> enable_idle = 0;
> excluding the sample_valid check, it should be equivalent for you (I
> assume you have NCQ disks))
> and we provide fairness for them by servicing all seeky queues
> together, and then idling before switching to other ones.
As for function cfq_update_idle_window, you are right. But since
2.6.32, CFQ has merged many patches, and those patches affect each other.
Although 5 patches are related to the regression, the line above is quite
independent: reverting it consistently improves the result by about
20%.
> >
> This shouldn't affect you if all queues are marked as idle.
Do you mean using the ionice command to mark it as the idle class? I didn't try that.
> Does just
> your patch:
> > - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
> > - && CFQQ_SEEKY(cfqq)))
> > + (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
> > + sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
> fix most of the regression without touching arm_slice_timer?
No. To fix the regression completely, I need to apply the patch above plus
a debug patch. The debug patch just works around the 3 patches reported in
Shaohua's tiobench regression report. Without the debug patch, the regression
isn't resolved.
Below is the debug patch.
diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_randread64k/block/cfq-iosched.c
--- linux-2.6.33_rc1/block/cfq-iosched.c 2009-12-23 14:12:03.000000000 +0800
+++ linux-2.6.33_rc1_randread64k/block/cfq-iosched.c 2009-12-30 17:12:28.000000000 +0800
@@ -592,6 +592,9 @@ cfq_set_prio_slice(struct cfq_data *cfqd
cfqq->slice_start = jiffies;
cfqq->slice_end = jiffies + slice;
cfqq->allocated_slice = slice;
+/*YMZHANG*/
+ cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+
cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
}
@@ -1836,7 +1839,8 @@ static void cfq_arm_slice_timer(struct c
/*
* still active requests from this queue, don't idle
*/
- if (cfqq->dispatched)
+ //if (cfqq->dispatched)
+ if (rq_in_driver(cfqd))
return;
/*
@@ -1941,6 +1945,9 @@ static void cfq_setup_merge(struct cfq_q
new_cfqq = __cfqq;
}
+ /* YMZHANG debug */
+ return;
+
process_refs = cfqq_process_refs(cfqq);
/*
* If the process for the cfqq has gone away, there is no
>
> I guess
> > 5db5d64277bf390056b1a87d0bb288c8b8553f96.
> will still introduce a 10% regression, but this is needed to improve
> latency, and you can just disable low_latency to avoid it.
You are right. I did a quick test. With my patch plus the 2 reverts, keeping
5db5d64, the regression is about 20%.
But low_latency=0 doesn't work as we imagined. With my patch plus the 2 reverts,
keeping 5db5d64 and setting low_latency=0, the regression is still there. One
reason is that my patch has no effect when low_latency=0.
>
> Thanks,
> Corrado
I attach the fio job file for your reference.
I got a cold and will continue to work on it next week.
Yanmin
[job0]
startdelay=0
rw=randread
filename=testfile1
[job1]
startdelay=0
rw=randread
filename=testfile2
[job2]
startdelay=0
rw=randread
filename=testfile3
[job3]
startdelay=0
rw=randread
filename=testfile4
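Since the job file is script-generated, the fixed settings presumably live in a [global] section that was trimmed here. The sketch below is an assumed reconstruction: rw=randread, bs=64k, and the 1-GB file size are confirmed by the thread, ioengine=mmap is suggested by later messages, and the rest are guesses.

```ini
; assumed [global] section -- not the original script's output
[global]
ioengine=mmap
rw=randread
bs=64k
size=1g
runtime=60
direct=0
```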
The attached patches, in particular 0005 (which applies on top of the
for-linus branch of Jens' tree,
git://git.kernel.dk/linux-2.6-block.git), fix the regression on this
simplified workload.
>
>> >
>> This shouldn't affect you if all queues are marked as idle.
> Do you mean to use command ionice to mark it as idle class? I didn't try it.
No. I meant forcing enable_idle = 1, as you were almost doing with
your patch, when cfq_latency was set.
With my above patch, this should not be needed any more, since the
queues should be seen as sequential.
>
>> Does just
>> your patch:
>> > - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
>> > - && CFQQ_SEEKY(cfqq)))
>> > + (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
>> > + sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
>> fix most of the regression without touching arm_slice_timer?
> No. If to fix the regression completely, I need apply above patch plus
> a debug patch. The debug patch is to just work around the 3 patches report by
> Shaohua's tiobench regression report. Without the debug patch, the regression
> isn't resolved.
Jens already merged one of Shaohua's patches, which may fix the problem
with queue combining.
> Below is the debug patch.
> diff -Nraup linux-2.6.33_rc1/block/cfq-iosched.c linux-2.6.33_rc1_randread64k/block/cfq-iosched.c
> --- linux-2.6.33_rc1/block/cfq-iosched.c 2009-12-23 14:12:03.000000000 +0800
> +++ linux-2.6.33_rc1_randread64k/block/cfq-iosched.c 2009-12-30 17:12:28.000000000 +0800
> @@ -592,6 +592,9 @@ cfq_set_prio_slice(struct cfq_data *cfqd
> cfqq->slice_start = jiffies;
> cfqq->slice_end = jiffies + slice;
> cfqq->allocated_slice = slice;
> +/*YMZHANG*/
> + cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
> +
This is disabled, on a vanilla 2.6.33 kernel, by setting low_latency = 0
> cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
> }
>
> @@ -1836,7 +1839,8 @@ static void cfq_arm_slice_timer(struct c
> /*
> * still active requests from this queue, don't idle
> */
> - if (cfqq->dispatched)
> + //if (cfqq->dispatched)
> + if (rq_in_driver(cfqd))
> return;
>
> /*
> @@ -1941,6 +1945,9 @@ static void cfq_setup_merge(struct cfq_q
> new_cfqq = __cfqq;
> }
>
> + /* YMZHANG debug */
> + return;
> +
This should be partially addressed by Shaohua's patch merged in Jens' tree.
But note that your 8 processes can randomly start doing I/O on the
same file, so merging those queues is sometimes reasonable.
The patch to split them quickly was still not merged, though, so you
will still see some regression due to this. In my simplified job file,
I removed the randomness to make sure this cannot happen.
> process_refs = cfqq_process_refs(cfqq);
> /*
> * If the process for the cfqq has gone away, there is no
>
>
>>
>> I guess
>> > 5db5d64277bf390056b1a87d0bb288c8b8553f96.
>> will still introduce a 10% regression, but this is needed to improve
>> latency, and you can just disable low_latency to avoid it.
> You are right. I did a quick testing. If my patch + revert 2 patches and keep
> 5db5d64, the regression is about 20%.
>
> But low_latency=0 doesn't work like what we imagined. If patch + revert 2 patches
> and keep 5db5d64 while set low_latency=0, the regression is still there. One
> reason is my patch doesn't work when low_latency=0.
Right. You can try with my patch, instead, that doesn't depend on
low_latency, and set it to 0 to remove this performance degradation.
My results:
2.6.32.2:
READ: io=146688KB, aggrb=2442KB/s, minb=602KB/s, maxb=639KB/s,
mint=60019msec, maxt=60067msec
2.6.33 - jens:
READ: io=128512KB, aggrb=2140KB/s, minb=526KB/s, maxb=569KB/s,
mint=60004msec, maxt=60032msec
2.6.33 - jens + my patches :
READ: io=143232KB, aggrb=2384KB/s, minb=595KB/s, maxb=624KB/s,
mint=60003msec, maxt=60072msec
2.6.33 - jens + my patches + low_lat = 0:
READ: io=145216KB, aggrb=2416KB/s, minb=596KB/s, maxb=632KB/s,
mint=60027msec, maxt=60087msec
>>
>> Thanks,
>> Corrado
> I attach the fio job file for your reference.
>
> I got a cold and will continue to work on it next week.
>
> Yanmin
>
Thanks,
Corrado
As we have about 40 fio sub-cases, we use a script to create the fio job files from
a specific parameter list, so there are some superfluous parameters.
Another point is that we need stable results.
> with lot of
> things going on.
> Let's keep this for last.
Ok. But changes like yours mostly reduce the regression.
I didn't download the tree. I tested the 3 attached patches against 2.6.33-rc1, and
the regression isn't resolved.
>
> >
> >> >
> >> This shouldn't affect you if all queues are marked as idle.
> > Do you mean to use command ionice to mark it as idle class? I didn't try it.
> No. I meant forcing enable_idle = 1, as you were almost doing with
> your patch, when cfq_latency was set.
> With my above patch, this should not be needed any more, since the
> queues should be seen as sequential.
>
> >
> >> Does just
> >> your patch:
> >> > - (!cfq_cfqq_deep(cfqq) && sample_valid(cfqq->seek_samples)
> >> > - && CFQQ_SEEKY(cfqq)))
> >> > + (!cfqd->cfq_latency && !cfq_cfqq_deep(cfqq) &&
> >> > + sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
> >> fix most of the regression without touching arm_slice_timer?
> > No. If to fix the regression completely, I need apply above patch plus
> > a debug patch. The debug patch is to just work around the 3 patches report by
> > Shaohua's tiobench regression report. Without the debug patch, the regression
> > isn't resolved.
>
> Jens already merged one of Shaohua's patches, that may fix the problem
> with queue combining.
I did another test: applying my debug patch plus the low_latency patch, but using
Shaohua's 2 patches (improved merge and split), the regression disappears.
Another factor is that I start 8 processes per partition and every disk has 2
partitions, so there are 16 processes per disk. With another JBOD, I use one
partition per disk, and the regression is only 8%.
From this point of view, could CFQ avoid merging request queues that access different
partitions? As you know, it's unusual for a process to access files across partitions,
but the I/O scheduler sits at a low layer that doesn't know about partitions.
Can you quantify whether there is an improvement, though?
Please also include Shaohua's patches.
I'd like to see the comparison between (always with low_latency set to 0):
plain 2.6.33
plain 2.6.33 + Shaohua's
plain 2.6.33 + Shaohua's + my patch
plain 2.6.33 + Shaohua's + my patch + rq_in_driver vs dispatched patch.
With half the processes, time slices are longer, and the disk cache
can do a better job when servicing interleaved sequential requests.
>
> >From this point, can CFQ do not merge request queues which access different partitions?
(puzzled: I didn't write this, and can't find a message in the thread
with this question.)
> As you know, it's unusual that a process accesses files across partitions. io scheduler
> is at low layer which doesn't know partition.
CFQ bases its decisions on the distance between requests, and requests going to
different partitions will have a much larger distance, so the associated
queues are more likely to be marked as seeky.
> Could you generate the same script, but with each process accessing
> only one of the files, instead of choosing it at random?
Ok. The new test starts 8 processes per partition and every process works on just
one file.
Because of company policy, I can only post percentages instead of real numbers.
> Please also include Shaohua's patches.
> I'd like to see the comparison between (always with low_latency set to 0):
> plain 2.6.33
> plain 2.6.33 + Shaohua's
> plain 2.6.33 + Shaohua's + my patch
> plain 2.6.33 + Shaohua's + my patch + rq_in_driver vs dispatched patch.
1) low_latency=0
   2.6.32 kernel                                0
   2.6.33-rc1                               -0.33
   2.6.33-rc1_shaohua                       -0.33
   2.6.33-rc1+corrado                        0.03
   2.6.33-rc1_corrado+shaohua                0.02
   2.6.33-rc1_corrado+shaohua+rq_in_driver   0.01
2) low_latency=1
   2.6.32 kernel                                0
   2.6.33-rc1                               -0.45
   2.6.33-rc1+corrado                       -0.24
   2.6.33-rc1_corrado+shaohua               -0.23
   2.6.33-rc1_corrado+shaohua+rq_in_driver  -0.23
When low_latency=1, we get the best number with kernel 2.6.32. Compared
with the low_latency=0 result, it is about 4% better.
My email client is evolution and sometimes it adds > unexpectedly.
> > As you know, it's unusual that a process accesses files across partitions. io scheduler
> > is at low layer which doesn't know partition.
> CFQ bases decision on distance between requests, and requests going to
> different partitions will have much higher distance. So the associated
> queues will be more likely marked as seeky.
Right. Thanks for your explanation.
> 2) low_latency=1
> 2.6.32 kernel 0
> 2.6.33-rc1 -0.45
> 2.6.33-rc1+corrado -0.24
> 2.6.33-rc1_corrado+shaohua -0.23
> 2.6.33-rc1_corrado+shaohua+rq_in_driver -0.23
The results are as expected. With each process working on a separate
file, Shaohua's patches do not noticeably influence the result.
Interestingly, even rq_in_driver doesn't improve things in this case, so
maybe its effect is somewhat connected to queue merging.
The remaining -23% is due to timeslice shrinking, which is done to
reduce max latency when there are too many processes doing I/O, at the
expense of throughput. It is a documented change, and the suggested
way to favor throughput over latency is to set low_latency = 0.
>
>
> When low_latency=1, we get the biggest number with kernel 2.6.32.
> Comparing with low_latency=0's result, the prior one is about 4% better.
Ok, so 2.6.33 + corrado (with low_latency = 0) is comparable with the
fastest 2.6.32, so we can consider the first part of the problem
solved.
For the queue merging issue, maybe Jeff has some improvements w.r.t
shaohua's approach.
Thanks,
Corrado
We saw that cfqq->dispatched worked fine when there was no queue
merging happening, so it must be something concerning merging,
probably dispatched is not accurate when we set up for a merging, but
the merging was not yet done.
>
> We saw that cfqq->dispatched worked fine when there was no queue
> merging happening, so it must be something concerning merging,
> probably dispatched is not accurate when we set up for a merging, but
> the merging was not yet done.
Thanks,
Corrado
It's tough to say. Is there any chance I could get some blktrace data
for the run?
Cheers,
Jeff
A performance improvement from replacing cfqq->dispatched with
rq_in_driver() is really strange. It would mean we do even less idling
on the cfqq, hence faster cfqq switching, which should mean more seeks
(for this test case) and reduced throughput. This is just the opposite
of your approach of treating a random-read mmap queue as sync, where we
idle on the queue.
Thanks
Vivek
Thanks,
Corrado
--
__________________________________________________________________________
dott. Corrado Zoccolo mailto:czoc...@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
Thanks,
Shaohua
I thought there was merging and/or unmerging activity. You don't
mention that here.
I'll see if I can reproduce it.
Cheers,
Jeff
Just a data point: I ran 8 fio mmap jobs (bs=64K, direct=1, size=2G,
runtime=30) with the vanilla kernel (2.6.33-rc4) and with a modified kernel
that replaces cfqq->dispatched with rq_in_driver(cfqd).
I did not see any significant throughput improvement, but I did see max_clat
halved in the modified kernel.
Vanilla kernel
==============
read bw: 3701KB/s
max clat: 401050 us
Number of times idle timer was armed: 20980
Number of cfqq expired/switched: 6377
cfqq merge operations: 0
Modified kernel (rq_in_driver(cfqd))
===================================
read bw: 3645KB/s
max clat: 800515 us
Number of times idle timer was armed: 2875
Number of cfqq expired/switched: 17750
cfqq merge operations: 0
This kind of confirms that rq_in_driver(cfqd) reduces the number of
times we idle on queues and makes queue switching faster. That also
explains the reduced max clat.
If that's the case, then it should also have increased the number of seeks
(at least on Yanmin's JBOD setup) and reduced throughput. But instead the
reverse seems to be happening in his setup.
Yanmin, as Jeff mentioned, if you can capture blktraces of the vanilla and
modified kernels and upload them somewhere for us to look at, it might help.
Thanks
Vivek