I took the time to remeasure tiobench results on a recent kernel. The short
conclusion is that the performance regression I reported a few months ago
is still there. The machine is a 2-CPU Intel box with 2 GB of RAM and a plain
SATA drive. tiobench sequential write performance numbers with 16 threads:

2.6.29:                     AVG        STDERR
  37.80 38.54 39.48   ->   38.606667  0.687475

2.6.32-rc5:
  37.36 36.41 36.61   ->   36.793333  0.408928

So about 5% regression. The regression happened sometime between 2.6.29 and
2.6.30 and has stayed the same since then... With the deadline scheduler,
there's no regression. Shouldn't we do something about it?
Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR
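
The summary figures in the message above can be rechecked with a short
sketch like the one below. It is only an illustration: the helper names
(summarize, regression_percent) are made up for this example. Recomputing
from the three runs suggests that the column labelled STDERR matches the
population standard deviation rather than the standard error of the mean,
so the sketch prints both.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

# Throughput figures copied from the message above (three runs per kernel).
runs = {
    "2.6.29": [37.80, 38.54, 39.48],
    "2.6.32-rc5": [37.36, 36.41, 36.61],
}

def summarize(samples):
    """Return mean, population std dev, and standard error of the mean."""
    n = len(samples)
    return {
        "mean": mean(samples),
        "pop_stddev": pstdev(samples),
        "std_error": stdev(samples) / sqrt(n),
    }

def regression_percent(old, new):
    """Relative drop from old to new, in percent (positive means slower)."""
    return (old - new) / old * 100.0

summary = {kernel: summarize(samples) for kernel, samples in runs.items()}
drop = regression_percent(summary["2.6.29"]["mean"],
                          summary["2.6.32-rc5"]["mean"])

for kernel, s in summary.items():
    print(f"{kernel}: mean={s['mean']:.6f} "
          f"pop_stddev={s['pop_stddev']:.6f} std_error={s['std_error']:.6f}")
print(f"regression: {drop:.1f}%")
```

With only three runs per kernel the spread estimate is rough, which is
worth keeping in mind when reading a difference of a few percent.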
Background:
http://lkml.org/lkml/2009/5/28/415
Thanks for bringing this up again. I'll try to make some time to look
into it if others don't beat me to it.
Cheers,
Jeff
Sorry it took so long, but I've been flat out lately. I ran some
numbers against 2.6.29 and 2.6.32-rc5, both with low_latency set to 0
and to 1. Here are the results (average of two runs):
                                                     | rlat             | rrlat            | wlat             | rwlat
kernel     | Thr | read  | randr   | write  | randw  | avg, max         | avg, max         | avg, max         | avg, max
--------------------------------------------------------------------------------------------------------------------------------
2.6.29     |   8 | 72.95 |  20.06  | 269.66 | 231.59 |   6.625, 1683.66 |  23.241, 1547.97 |   1.761,  698.10 |   0.720,  443.64
           |  16 | 72.33 |  20.03  | 278.85 | 228.81 |  13.643, 2499.77 |  46.575, 1717.10 |   3.304, 1149.29 |   1.011,  140.30
--------------------------------------------------------------------------------------------------------------------------------
2.6.32-rc5 |   8 | 86.58 |  19.80  | 198.82 | 205.06 |   5.694,  977.26 |  22.559,  870.16 |   2.359,  693.88 |   0.530,   24.32
           |  16 | 86.82 |  21.10  | 199.00 | 212.02 |  11.010, 1958.78 |  40.195, 1662.35 |   4.679, 1351.27 |   1.007,   25.36
--------------------------------------------------------------------------------------------------------------------------------
2.6.32-rc5 |   8 | 87.65 | 117.65  | 298.27 | 212.35 |   5.615,  984.89 |   4.060,   97.39 |   1.535,  311.14 |   0.534,   24.29
low_lat=0  |  16 | 95.60 | 119.95* | 302.48 | 213.27 |  10.263, 1750.19 |  13.899, 1006.21 |   3.221,  734.22 |   1.062,   40.40
--------------------------------------------------------------------------------------------------------------------------------
Legend:
rlat - read latency
rrlat - random read latency
wlat - write latency
rwlat - random write latency
* - the two runs reported vastly different numbers: 67.53 and 172.46
So, as you can see, if we turn off the low_latency tunable, we get
better numbers across the board with the exception of random writes.
It's also interesting to note that the latencies reported by tiobench
are more favorable with low_latency set to 0, which is
counter-intuitive.
So, now it seems we don't have a regression in sequential read
bandwidth, but we do have a regression in random read bandwidth (though
the random write latencies look better). So, I'll look into that, as it
is almost 10%, which is significant.
Cheers,
Jeff
Sorry, I don't see a 10% regression in random read from your numbers.
I see a larger one in sequential write for low_latency=1 (this was
the regression Jan reported in the original message), but not for
low_latency=0. And a 10% regression in random writes, that is not
completely fixed even by disabling low_latency.
I guess your seemingly counter-intuitive results for low_latency are
due to the uncommon hardware (low_latency was intended mainly for
desktop-class disks). Luckily, the patches queued for 2.6.33 already
address this low_latency misbehaviour.
Thanks,
Corrado.
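
The percentages being debated here can be recomputed directly from Jeff's
table. The sketch below is a rough check, not part of the original
exchange. It assumes the unlabelled 2.6.32-rc5 rows are the low_latency=1
case, as the replies imply, and the dictionary layout is invented for the
example.

```python
# Throughput values copied from Jeff's table: (2.6.29, 2.6.32-rc5).
# Assumption: the unlabelled 2.6.32-rc5 rows were run with low_latency=1.
# Keys are (workload, threads); the layout is illustrative only.
throughput = {
    ("write", 8): (269.66, 198.82),
    ("write", 16): (278.85, 199.00),
    ("randr", 8): (20.06, 19.80),
    ("randr", 16): (20.03, 21.10),
    ("randw", 8): (231.59, 205.06),
    ("randw", 16): (228.81, 212.02),
}

def percent_change(old, new):
    """Signed change from old to new in percent (negative means slower)."""
    return (new - old) / old * 100.0

changes = {key: percent_change(*pair) for key, pair in throughput.items()}

for (workload, threads), pct in changes.items():
    print(f"{workload:>5} {threads:>2} threads: {pct:+.1f}%")
```

Read this way, sequential write drops by more than a quarter, random write
by roughly 7 to 11 percent, and random read moves by only a few percent in
either direction, which is consistent with Corrado's reading of the table.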
> Sorry it took so long, but I've been flat out lately. I ran some
> numbers against 2.6.29 and 2.6.32-rc5, both with low_latency set to 0
> and to 1. Here are the results (average of two runs):
I modified the tiobench script to do a drop_caches between runs so I
could stop fiddling around with the numbers myself. Extra credit goes
to anyone who hacks it up to report standard deviation.
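
A minimal sketch of that idea, dropping caches before every run and
reporting the spread, might look like the following. It is not the modified
tiobench script. The run_once callable is hypothetical (it stands for
whatever launches the benchmark and returns one throughput figure), and
writing to /proc/sys/vm/drop_caches requires root.

```python
import os
from statistics import mean, pstdev

DROP_CACHES = "/proc/sys/vm/drop_caches"  # Linux procfs knob; needs root

def drop_caches(path=DROP_CACHES):
    """Flush dirty data, then drop page cache, dentries and inodes."""
    os.sync()
    with open(path, "w") as f:
        f.write("3\n")

def repeat_benchmark(run_once, runs=3, before_each=drop_caches):
    """Call run_once() `runs` times, calling before_each() ahead of each run.

    run_once is a hypothetical callable returning one throughput figure,
    for example by launching tiobench and parsing its output.
    """
    results = []
    for _ in range(runs):
        before_each()
        results.append(run_once())
    return {
        "runs": results,
        "mean": mean(results),
        "pop_stddev": pstdev(results),
    }
```

Passing the cache-dropping step in as a parameter keeps the loop testable
without root, since a no-op can be substituted for it.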
Anyway, here are the latest results, average of 3 runs each for 2.6.29
and 2.6.32-rc6 with low_latency set to 0. Note that there was a fix in
CFQ that would result in properly preempting the active queue for
metadata I/O.
                                                    | rlat             | rrlat            | wlat             | rwlat
kernel     | Thr | read  | randr  | write  | randw  | avg, max         | avg, max         | avg, max         | avg, max
-------------------------------------------------------------------------------------------------------------------------------
2.6.29     |   8 | 66.43 |  20.52 | 296.32 | 214.17 |  22.330, 3106.47 |  70.026, 2804.02 |   4.817, 2406.65 |   1.420,  349.44
           |  16 | 63.28 |  20.45 | 322.65 | 212.77 |  46.457, 5779.14 | 137.455, 4982.75 |   8.378, 5408.60 |   2.764,  425.79
-------------------------------------------------------------------------------------------------------------------------------
2.6.32-rc6 |   8 | 87.66 | 115.22 | 324.19 | 222.18 |  16.677, 3065.81 |  11.834,  194.18 |   4.261, 1212.86 |   1.577,  103.20
low_lat=0  |  16 | 94.06 |  49.65 | 327.06 | 214.74 |  30.318, 5468.20 |  50.947, 1725.15 |   8.271, 1522.95 |   3.064,   89.16
-------------------------------------------------------------------------------------------------------------------------------
Given those numbers, everything looks ok from a regression perspective.
More investigation should be done for the random read numbers (given
that they fluctuate quite a bit), but that's purely an enhancement at
this point in time.
Just to be sure, I'll kick off 10 runs and make sure the averages fall
out the same way. If you don't hear from me, though, assume this
regression is fixed. The key is to set low_latency to 0 for this
benchmark. We should probably add notes about when to switch off
low_latency to the io scheduler documentation. Jens, would you mind
doing that?
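
For reference, switching the tunable off for a benchmark run could be
scripted roughly as below. This is a sketch under assumptions: it relies on
the usual sysfs layout for CFQ tunables (queue/iosched/low_latency under
the block device), it needs root, and the function names are invented for
the example.

```python
from pathlib import Path

def active_scheduler(dev, sysfs_root="/sys/block"):
    """Return the scheduler marked with [brackets] in queue/scheduler."""
    text = (Path(sysfs_root) / dev / "queue" / "scheduler").read_text()
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None

def set_low_latency(dev, enabled, sysfs_root="/sys/block"):
    """Write 0 or 1 to CFQ's low_latency tunable for one block device."""
    scheduler = active_scheduler(dev, sysfs_root)
    if scheduler != "cfq":
        # The tunable only exists while CFQ is the active scheduler.
        raise RuntimeError(f"{dev} uses {scheduler!r}, not cfq")
    tunable = Path(sysfs_root) / dev / "queue" / "iosched" / "low_latency"
    tunable.write_text("1\n" if enabled else "0\n")
    return tunable
```

A call such as set_low_latency("sda", False) would then correspond to the
low_lat=0 rows in the table, with "sda" standing in for the device under
test.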
Jeff, Jens,
do you think we should try to do more auto-tuning of cfq parameters?
Looking at those numbers for SANs, I think we are being suboptimal in
some cases.
E.g. sequential read throughput is lower than random read.
In those cases, converting all sync queues into sync-noidle queues (as
defined in for-2.6.33) should allow better aggregate throughput when there
are multiple sequential readers, as in those tiobench tests.
I also think that current slice_idle and slice_sync values are good
for devices with 8ms seek time, but they are too high for non-NCQ
flash devices, where "seek" penalty is under 1ms, and we still prefer
idling.
If we agree on this, should the measurement part (I'm thinking to
measure things like seek time, throughput, etc...) be added to the
common elevator code, or done inside cfq?
If we want to put it in the common code, maybe we can also remove the
duplication of NCQ detection, by publishing the NCQ flag from elevator
to the io-schedulers.
Thanks,
Corrado
> Jeff, Jens,
> do you think we should try to do more auto-tuning of cfq parameters?
> Looking at those numbers for SANs, I think we are being suboptimal in
> some cases.
> E.g. sequential read throughput is lower than random read.
I investigated this further, and this was due to a problem in the
benchmark. It was being run with only 500 samples for random I/O and
65536 samples for sequential. After fixing this, we see random I/O is
slower than sequential, as expected.
> I also think that current slice_idle and slice_sync values are good
> for devices with 8ms seek time, but they are too high for non-NCQ
> flash devices, where "seek" penalty is under 1ms, and we still prefer
> idling.
Do you have numbers to back that up? If not, throw a fio job file over
the fence and I'll test it on one such device.
> If we agree on this, should the measurement part (I'm thinking to
> measure things like seek time, throughput, etc...) be added to the
> common elevator code, or done inside cfq?
Well, if it's something that is of interest to others, then pushing it
up a layer makes sense. If only CFQ is going to use it, keep it there.
Cheers,
Jeff
>> If we agree on this, should the measurement part (I'm thinking to
>> measure things like seek time, throughput, etc...) be added to the
>> common elevator code, or done inside cfq?
>
> Well, if it's something that is of interest to others, then pushing it
> up a layer makes sense. If only CFQ is going to use it, keep it there.
If the direction is to have only one intelligent I/O scheduler, as the
removal of anticipatory indicates, then it is the latter. I don't
think noop or deadline will ever make any use of them.
But it could still be useful for reporting performance as seen by the
kernel, after the page cache.
Thanks
Corrado
> Sadly, I don't see the improvement you can see :(. The numbers are the
> same even with low_latency set to 0:
> 2.6.32-rc5 low_latency = 0:
> 37.39 36.43 36.51 -> 36.776667 0.434920
> But my testing environment is a plain SATA drive so that probably
> explains the difference...
I just retested (10 runs for each kernel) on a SATA disk with no NCQ
support and I could not see a difference. I'll try to dig up a disk
that supports NCQ. Is that what you're using for testing?
Cheers,
Jeff
               2.6.29    2.6.32-rc6,low_latency=0
-------------------------------------------------
Average:       34.6648   34.4475
Pop.Std.Dev.:  0.55523   0.21981
>                2.6.29    2.6.32-rc6,low_latency=0
> -------------------------------------------------
> Average:       34.6648   34.4475
> Pop.Std.Dev.:  0.55523   0.21981
Hmm, strange. Miklos Szeredi tried tiobench on his machine and he also
saw the regression. I'll try to think what could make the difference.
Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR
> I don't think I am. How do I find out?
Good question. ;-) I grep for NCQ in the dmesg output and make sure the
reported depth is greater than 0/32. There may be a better way, though.
>>                2.6.29    2.6.32-rc6,low_latency=0
>> -------------------------------------------------
>> Average:       34.6648   34.4475
>> Pop.Std.Dev.:  0.55523   0.21981
> Hmm, strange. Miklos Szeredi tried tiobench on his machine and he also
> saw the regression. I'll try to think what could make the difference.
OK, I'll try again.
Cheers,
Jeff
cat /sys/block/<dev>/device/queue_depth
:-)
--
Jens Axboe
Yeah, only works for storage that plugs into the SCSI stack.
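
Both checks mentioned above (grepping dmesg for NCQ and reading queue_depth
from sysfs) can be combined in a small helper. The sketch below is an
illustration only: the function names are invented, the regular expression
assumes the "NCQ (depth N/M)" wording that libata prints at boot, and the
sysfs file is only present for devices that go through the SCSI stack, as
noted above.

```python
import re
from pathlib import Path

# Assumed libata boot message format, e.g. "... LBA48 NCQ (depth 31/32)".
NCQ_RE = re.compile(r"NCQ \(depth (\d+)/(\d+)\)")

def ncq_depth_from_dmesg(line):
    """Return (depth, max_depth) parsed from one dmesg line, or None."""
    match = NCQ_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

def queue_depth_from_sysfs(dev, sysfs_root="/sys/block"):
    """Return the integer in <dev>/device/queue_depth, or None if absent."""
    path = Path(sysfs_root) / dev / "device" / "queue_depth"
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None

# Illustrative sample line, not taken from any machine in this thread.
sample = "ata1.00: 488397168 sectors, multi 16: LBA48 NCQ (depth 31/32)"
print(ncq_depth_from_dmesg(sample))
print(queue_depth_from_sysfs("sda"))
```

A first number of 0 in the dmesg depth, or a sysfs queue depth of 1, would
suggest that command queuing is not actually in use on that disk.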
What I thought might explain why I'm seeing the drop and you are not is
the size of RAM or the number of CPUs relative to the tiobench file size or
the number of threads. I'm running on a machine with 2 GB of RAM, using a
4 GB file size. The machine has 2 cores and I'm using 16 tiobench threads.
I'm now rerunning the tests with various numbers of threads to see how big
a difference it makes.
Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR
2.6.32-rc7:
Threads  Avg        Stddev
      1  41.580000  0.403072
      2  39.163333  0.374641
      4  39.483333  0.400111
      8  38.560000  0.106145
     16  37.966667  0.098770
     32  36.476667  0.032998
So apparently the difference between 2.6.29 and 2.6.32-rc7 increases as
the number of threads rises. How many threads have you been running with
on the SATA drive, and what machine is it?
I'm now running a test with a larger file size (8 GB instead of 4 GB) to
see what difference it makes.
I've been running with both 8 and 16 threads. The machine has 4 CPUs
and 4GB of RAM. I've been testing with an 8GB file size.
Cheers,
Jeff
Other details may be relevant, e.g. the file system on which the file
is located, whether the caches are dropped before starting each run,
and so on.
Corrado
2.6.32-rc7:
Threads  Avg        Stddev
      1  41.860000  0.063770
      2  39.196667  0.012472
      4  39.426667  0.162138
      8  37.550000  0.040825
     16  37.710000  0.096264
     32  35.680000  0.109848
BTW: I always run the test on a fresh ext3 filesystem in data=ordered mode
with barrier=1.
Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR