On Mon, 8 May 2017, 'Paolo Valente' via bfq-iosched wrote:
>
> > On 8 May 2017, at 17:54, peetp...@gmail.com wrote:
> >
> > It seems that phoronix has published some BFQ benchmarks:
> >
> > https://www.phoronix.com/scan.php?page=article&item=linux-412-io&num=1
> >
> > I am sure that those tests are mostly synthetic and don't measure latency, but still, the overall I/O performance seems quite poor.
> > Is this some kind of regression from porting BFQ to the blk-mq scheduler framework?
> >
>
> The regressions that are common to all blk-mq schedulers are almost
> certainly due to problems in the blk-mq framework itself. I haven't
> seen any such problem with my devices, so I guess it is something
> device-specific. The other regressions are due to the fact that, on
> one side, all tests are run with the default BFQ configuration, geared
> towards responsiveness and low latency for soft real-time
> applications, while, on the opposite side, all those tests are
> throughput-centric (even those reporting time as a figure of merit).
> That configuration deliberately sacrifices throughput when needed for
> application- and system-level latency. With a throughput-centric,
> synthetic workload, the only visible result is a loss of performance
> on the throughput-related figures of merit under test.
>
> Thank you very much for your report, which made me realize that I
> have to work much harder on showing people how to reconfigure BFQ,
> very easily, if they don't want to use it for the use cases for which
> it has been fine-tuned over the years, but only for typical
> server-like workloads.
Since you mentioned tuning, I thought I would share the script below, which
we use to configure BFQ. Since we use bcache, the SSD is tuned for low
latency and the spinning disk for higher throughput (low_latency=0). The
comment at the end of each line gives the default value.
This tunes many queue attributes across the entire block-device stack, so
it isn't just BFQ---but the script works very well for our workload, which
runs lots of virtual machines.
I would be curious to hear what others think of these settings and whether
they help anyone else.
Related to this thread: we noticed that the autotuned max_budget=0 (at
least in BFQ v8r5) is too low for good throughput on our simple hardware
(6x 3TB 7.2k RAID5 + 4x Samsung 850 Pro RAID10, all in 64k stripes). By
setting max_budget to our stride width (64*5*1024/blocksize) on the
spinning disks and to 16MB on our SSDs, throughput was noticeably better
without sacrificing latency (at least not noticeably).
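For concreteness, here is how that arithmetic works out (a sketch; the 4096-byte block size is an assumption, since the script below queries it per device with `blockdev --getbsz`):

```shell
#!/bin/bash
# Stripe geometry from the script: 4 stripes x 5 data disks x 64k stripe unit.
STRIPES=4; DATA_DISKS=5; STRIPE_UNIT_KB=64
BLOCK_SIZE=4096   # assumed block size; really read via `blockdev --getbsz`

# max_budget for the spinning RAID5: whole stripe widths, in blocks.
HDD_BUDGET=$(( STRIPES * DATA_DISKS * STRIPE_UNIT_KB * 1024 / BLOCK_SIZE ))
echo "HDD max_budget: $HDD_BUDGET"   # 320 with a 4096-byte block size

# max_budget for the SSDs: a flat 16MB, also in blocks.
SSD_BUDGET=$(( 16 * 1024 * 1024 / BLOCK_SIZE ))
echo "SSD max_budget: $SSD_BUDGET"   # 4096 with a 4096-byte block size
```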
==========================================================================
#!/bin/bash
# Can this be written as a udev rule?
# See page 19+:
#   https://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf
for i in /sys/block/*; do
DEVPATH=/dev/$(basename "$i" | tr '!' /)
# Assume etherd (AoE) isn't rotational, let the target decide what to do:
if [[ "$i" =~ etherd ]]; then
echo 0 > $i/queue/rotational
fi
# Some values from:
#   https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt
# If the scheduler != 'none', then it is a normal hardware-like disk:
if [ -e $i/queue/scheduler ] && [ "`cat $i/queue/scheduler`" != none ]; then
# Set 64k read-ahead by default for disks
echo 64 > $i/queue/read_ahead_kb 2> /dev/null
### These metrics work quite well with the BFQ scheduler:
echo bfq > $i/queue/scheduler
if [ "`cat $i/queue/rotational`" = 1 ]; then
#### rotational disk:
echo "Updating $i (rotational)":
### /queue/
echo 1024 > $i/queue/max_sectors_kb # default 1024
echo 512 > $i/queue/nr_requests # default 128
### /queue/iosched/
## Factors (scalers)
# maximum factor by which the weight of a weight-raised queue is multiplied.
# echo 30 > $i/queue/iosched/wr_coeff # 30
## Times
# I like these to be prime to minimize alignment.
# timeout_* should divide into fifo_expire_* at least once.
echo 131 > $i/queue/iosched/timeout_sync # default 125 ms
echo 263 > $i/queue/iosched/fifo_expire_sync # default 124 ms
echo 353 > $i/queue/iosched/fifo_expire_async # default 248 ms
# wr_* are BFQ "weight-raising" knobs:
# maximum duration of a weight-raising period (jiffies).
# echo 3357 > $i/queue/iosched/wr_max_time # 3357 jiffies
# minimum idle period after which weight-raising may be reactivated for a queue (in jiffies).
# echo 2000 > $i/queue/iosched/wr_min_idle_time # 2000 jiffies
# minimum period between request arrivals after which weight-raising
# may be reactivated for an already busy async queue (jiffies).
# echo 2000 > $i/queue/iosched/wr_min_inter_arr_async
## Sectors
# max_budget: convert to bytes:
#echo 0 > $i/queue/iosched/max_budget # 0=autotune, n sectors
# 4x 5-disk 64k strides, units in sectors:
echo $((4*5*64*1024 / `blockdev --getbsz $DEVPATH`)) \
> $i/queue/iosched/max_budget
#echo $((2048*1024 / `blockdev --getbsz $DEVPATH`)) \
# > $i/queue/iosched/max_budget
## Bytes
# back_seek_max (KBytes) default is 16MB (16384)
# Set to ~1/16" of back-travel given ~100GB/sq-in areal density (72GB).
# (will average over spindle count):
echo $((72*1024*1024)) > $i/queue/iosched/back_seek_max # default 16384
## Knobs (value meaning varies)
echo 0 > $i/queue/iosched/slice_idle # default 8 in bfq v7r8
echo 0 > $i/queue/iosched/low_latency # 1, bool
echo 1 > $i/queue/add_random # default 1
# Use full (CPU expensive) merges for rotating disks:
echo 0 > $i/queue/nomerges # default 0, full merge
else
#### non-rotational disk:
echo "Updating $i (SSD)":
### /queue/
echo 1024 > $i/queue/max_sectors_kb # default 1024
echo 256 > $i/queue/nr_requests # default 128
### /queue/iosched/
## Times
# I like these to be prime to minimize alignment.
# timeout_* should divide into fifo_expire_* at least once.
#echo 37 > $i/queue/iosched/timeout_sync # default 125 ms
#echo 83 > $i/queue/iosched/fifo_expire_sync # default 124 ms
#echo 167 > $i/queue/iosched/fifo_expire_async # default 248 ms
echo 131 > $i/queue/iosched/timeout_sync # default 125 ms
echo 263 > $i/queue/iosched/fifo_expire_sync # default 124 ms
echo 353 > $i/queue/iosched/fifo_expire_async # default 248 ms
## Sectors
# max_budget: convert to bytes:
#echo 0 > $i/queue/iosched/max_budget # 0=autotune, n sectors
# 16MB SSD Budget
echo $((16*1024*1024 / `blockdev --getbsz $DEVPATH`)) \
> $i/queue/iosched/max_budget # 16MB
## Bytes
# back_seek_max (KBytes) default is 16MB (16384)
# A 256GB back-seek is fine; seeks are essentially free on an SSD.
echo $((256*1024*1024)) > $i/queue/iosched/back_seek_max
## Knobs (unitless)
echo 0 > $i/queue/iosched/slice_idle # default 8 in bfq v7r8
echo 1 > $i/queue/iosched/low_latency # default 1, bool
echo 1 > $i/queue/add_random # default 1
# Use simple merging only for SSDs:
echo 1 > $i/queue/nomerges # default 0
fi
else
# These must not be normal disks, because their scheduler is 'none'.
# More than likely they are device-mapper targets or some other
# non-queue block devices (drbd, zram, bcache, dm, etc.).
# Skip the entropy overhead for non-queue blockdevs:
echo 0 > $i/queue/add_random
# This enables the user to disable the lookup logic involved with IO
# merging requests in the block layer. By default (0) all merges are
# enabled. When set to 1 only simple one-hit merges will be tried. When
# set to 2 no merge algorithms will be tried (including one-hit or more
# complex tree/hash lookups).
if [[ "$i" =~ bcache ]]; then
# Merge before hitting bcache since we are getting close
# to the disks.
echo 1 > $i/queue/nomerges
# Treat bcache as rotational:
echo 1 > $i/queue/rotational
else
echo 2 > $i/queue/nomerges
echo 0 > $i/queue/rotational
fi
# If this option is '1', the block layer will migrate request completions to the
# cpu "group" that originally submitted the request. For some workloads this
# provides a significant reduction in CPU cycles due to caching effects.
#
# For storage configurations that need to maximize distribution of completion
# processing setting this option to '2' forces the completion to run on the
# requesting cpu (bypassing the "group" aggregation logic).
echo 1 > $i/queue/rq_affinity
# Don't worry about IO counters on virtual disks:
echo 0 > $i/queue/iostats
echo 64 > $i/queue/read_ahead_kb
echo 64 > $i/queue/max_sectors_kb
fi
echo "$DEVPATH: rotational: `cat $i/queue/rotational` sched='`cat $i/queue/scheduler`' "
done
==========================================================================
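On the udev question in the script's header: a rule along these lines might cover the simple per-device attributes (an untested sketch; the rules-file name is made up, and the attribute assignments mirror the rotational branch above):

```
# /etc/udev/rules.d/60-bfq-tuning.rules (hypothetical file name)
# Select BFQ and bump nr_requests whenever a rotational SCSI disk appears.
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd?", \
  ATTR{queue/rotational}=="1", \
  ATTR{queue/scheduler}="bfq", \
  ATTR{queue/nr_requests}="512"
```

The iosched/* knobs are trickier from udev, since they only exist after the scheduler has been set; the script above avoids that ordering problem.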
--
Eric Wheeler