Linux 4.12 BFQ I/O Benchmarks (Phoronix)

peetp...@gmail.com

May 8, 2017, 11:54:31 AM
to bfq-iosched
It seems that phoronix has published some BFQ benchmarks:

https://www.phoronix.com/scan.php?page=article&item=linux-412-io&num=1

I am sure those tests are rather synthetic and don't measure latency, but still, the overall I/O performance seems quite poor.
Is this some kind of regression introduced while porting BFQ to the blk-mq scheduler framework?

~Peet

Paolo Valente

May 8, 2017, 1:52:41 PM
to bfq-i...@googlegroups.com
The regressions that are common to all blk-mq schedulers are most
certainly due to problems in the blk-mq framework. I haven't seen any
such problem with my devices, so I guess it is something
device-specific. The other regressions are due to the fact that, on
one side, all tests are run with the default BFQ configuration, geared
towards responsiveness and low latency for soft real-time
applications, while, on the opposite side, all those tests are
throughput-centric (even those reporting time as a figure of merit).
That configuration does sacrifice throughput, when needed, for
application- and system-level latency. With a throughput-centric,
synthetic workload, the only concrete result is a loss of performance
on the throughput-related figures of merit under test.

Thank you very much for your report, which made me realize that I have
to work much harder on informing people on how to reconfigure BFQ,
which is very easy to do, if they don't want to use BFQ for the use cases
it has been fine-tuned for over these years, but only for typical
server-like workloads.

Let me start right now: if you are concerned only about throughput,
then set the low_latency parameter to 0. This switches off all extra
low-latency mechanisms. If this is not enough on your flash-based
device, then set slice_idle to 0 too.
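
For example, on a device whose queue directory is /sys/block/sda/queue
(sda here is just a placeholder for whatever disk you use), and with bfq
already selected as the scheduler, that amounts to:

  # switch off the extra low-latency heuristics (throughput-only tuning)
  echo 0 > /sys/block/sda/queue/iosched/low_latency
  # on flash-based devices, also disable idling
  echo 0 > /sys/block/sda/queue/iosched/slice_idle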

I hope I have provided you with enough information. I'll try to post a
comment on that article too.

Thanks,
Paolo


Eric Wheeler

May 11, 2017, 2:10:42 PM
to 'Paolo Valente' via bfq-iosched, Kai Krakow
On Mon, 8 May 2017, 'Paolo Valente' via bfq-iosched wrote:
>
> [...]
>
> Thank you very much for your report, which made me realize that I have
> to work much harder on informing people on how to reconfigure BFQ,
> which is very easy to do, if they don't want to use BFQ for the use cases
> it has been fine-tuned for over these years, but only for typical
> server-like workloads.

Since you mentioned tuning, I thought I would share the script below, which
we use to configure BFQ. Since we use bcache, the SSD is tuned for low
latency and the spinning disk is configured for higher throughput
(low_latency=0). The comment at the end of each line specifies the default.

This tunes many of the queue attributes across the entire blockdev stack, so
it isn't just BFQ---but the script works very well for our use case with lots
of virtual machines.

I would be curious to know what others think of these settings and whether
or not this helps.

Related to this thread: we noticed that the autotuned max_budget
(max_budget=0, at least in bfq v8r5) is too low for good throughput on our
simple hardware (6x 3TB 7.2k RPM RAID5 + 4x Samsung 850 Pro RAID10, all with
64k stripes). By setting max_budget to our stride width (64*5*1024/blocksize)
for the array, and to 16MB for our SSDs, throughput was noticeably better
without sacrificing latency (at least not noticeably).
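
To make that arithmetic concrete (a rough sketch; it assumes `blockdev
--getbsz` reports the common 4096-byte block size on these devices):

  # one stride width: 64k stripe * 5 data disks = 320 KiB
  echo $((64*5*1024 / 4096))        # -> 80
  # the script below actually budgets 4 stride widths for the array:
  echo $((4*5*64*1024 / 4096))      # -> 320
  # 16MB budget for the SSDs:
  echo $((16*1024*1024 / 4096))     # -> 4096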


==========================================================================
#!/bin/bash

# Can this be written as a udev rule?
# see page 19+: https://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf

for i in /sys/block/*; do
    DEVPATH=/dev/`basename $i`
    DEVPATH=`echo $DEVPATH | tr '!' /`

    # Assume etherd (AoE) isn't rotational, let the target decide what to do:
    if [[ "$i" =~ etherd ]]; then
        echo 0 > $i/queue/rotational
    fi

    # Some values from https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt

    # If the scheduler != 'none', then it is a normal hardware-like disk:
    if [ -e $i/queue/scheduler ] && [ "`cat $i/queue/scheduler`" != none ]; then
        # Set 64k read-ahead by default for disks
        echo 64 > $i/queue/read_ahead_kb 2> /dev/null

        ### These settings work quite well with the BFQ scheduler:
        echo bfq > $i/queue/scheduler
        if [ "`cat $i/queue/rotational`" = 1 ]; then
            #### rotational disk:
            echo "Updating $i (rotational)":

            ### /queue/
            echo 1024 > $i/queue/max_sectors_kb           # default 1024
            echo 512  > $i/queue/nr_requests              # default 128

            ### /queue/iosched/

            ## Factors (scalers)
            # maximum factor by which the weight of a weight-raised queue is multiplied.
            # echo 30 > $i/queue/iosched/wr_coeff         # default 30


            ## Times
            # I like these to be prime to minimize alignment.
            # timeout_* should divide into fifo_expire_* at least once.
            echo 131 > $i/queue/iosched/timeout_sync      # default 125 ms

            echo 263 > $i/queue/iosched/fifo_expire_sync  # default 124 ms
            echo 353 > $i/queue/iosched/fifo_expire_async # default 248 ms


            # wr_* are BFQ "weight-raising" knobs:
            # maximum duration of a weight-raising period (jiffies).
            # echo 3357 > $i/queue/iosched/wr_max_time            # default 3357 jiffies

            # minimum idle period after which weight-raising may be reactivated for a queue (jiffies).
            # echo 2000 > $i/queue/iosched/wr_min_idle_time       # default 2000 jiffies

            # minimum period between request arrivals after which weight-raising
            # may be reactivated for an already busy queue (jiffies).
            # echo 2000 > $i/queue/iosched/wr_min_inter_arr_async # default 2000 jiffies


            ## Sectors

            # max_budget: convert from bytes:
            #echo 0 > $i/queue/iosched/max_budget         # 0 = autotune, n sectors
            # 4x 5-disk 64k strides, units in sectors:
            echo $((4*5*64*1024 / `blockdev --getbsz $DEVPATH`)) \
                > $i/queue/iosched/max_budget
            #echo $((2048*1024 / `blockdev --getbsz $DEVPATH`)) \
            #    > $i/queue/iosched/max_budget

            ## Bytes

            # back_seek_max (KBytes), default is 16MB (16384).
            # Set to ~1/16" backtravel given 100GB/sq-in areal density (72gb)
            # (will average over spindle count):
            echo $((72*1024*1024)) > $i/queue/iosched/back_seek_max # default 16384


            ## Knobs (value meaning varies)
            echo 0 > $i/queue/iosched/slice_idle          # default 8 in bfq v7r8
            echo 0 > $i/queue/iosched/low_latency         # default 1, bool
            echo 1 > $i/queue/add_random                  # default 1

            # Use full (CPU-expensive) merges for rotating disks:
            echo 0 > $i/queue/nomerges                    # default 0, full merge


        else
            #### non-rotational disk:
            echo "Updating $i (SSD)":

            ### /queue/
            echo 1024 > $i/queue/max_sectors_kb            # default 1024
            echo 256  > $i/queue/nr_requests               # default 128

            ### /queue/iosched/

            ## Times
            # I like these to be prime to minimize alignment.
            # timeout_* should divide into fifo_expire_* at least once.
            #echo 37  > $i/queue/iosched/timeout_sync      # default 125 ms
            #echo 83  > $i/queue/iosched/fifo_expire_sync  # default 124 ms
            #echo 167 > $i/queue/iosched/fifo_expire_async # default 248 ms

            echo 131 > $i/queue/iosched/timeout_sync       # default 125 ms

            echo 263 > $i/queue/iosched/fifo_expire_sync   # default 124 ms
            echo 353 > $i/queue/iosched/fifo_expire_async  # default 248 ms

            ## Sectors

            # max_budget: convert from bytes:
            #echo 0 > $i/queue/iosched/max_budget          # 0 = autotune, n sectors
            # 16MB SSD budget:
            echo $((16*1024*1024 / `blockdev --getbsz $DEVPATH`)) \
                > $i/queue/iosched/max_budget              # 16MB

            ## Bytes

            # back_seek_max (KBytes), default is 16MB (16384).
            # A 256GB back-seek is fine; we're an SSD!
            echo $((256*1024*1024)) > $i/queue/iosched/back_seek_max

            ## Knobs (unitless)
            echo 0 > $i/queue/iosched/slice_idle           # default 8 in bfq v7r8
            echo 1 > $i/queue/iosched/low_latency          # default 1, bool
            echo 1 > $i/queue/add_random                   # default 1

            # Use simple merging only for SSDs:
            echo 1 > $i/queue/nomerges                     # default 0

        fi

    else
        # These must not be normal disks because their scheduler is 'none'.
        # More than likely they are devicemapper targets or some other
        # non-queue block device (drbd, zram, bcache, dm, etc.).

        # Skip the entropy overhead for non-queue blockdevs:
        echo 0 > $i/queue/add_random


        # This enables the user to disable the lookup logic involved with IO
        # merging requests in the block layer. By default (0) all merges are
        # enabled. When set to 1 only simple one-hit merges will be tried. When
        # set to 2 no merge algorithms will be tried (including one-hit or more
        # complex tree/hash lookups).
        if [[ "$i" =~ bcache ]]; then
            # Merge before hitting bcache since we are getting close
            # to the disks.
            echo 1 > $i/queue/nomerges

            # Treat bcache as rotational:
            echo 1 > $i/queue/rotational
        else
            echo 2 > $i/queue/nomerges
            echo 0 > $i/queue/rotational
        fi

        # If this option is '1', the block layer will migrate request completions to the
        # cpu "group" that originally submitted the request. For some workloads this
        # provides a significant reduction in CPU cycles due to caching effects.
        #
        # For storage configurations that need to maximize distribution of completion
        # processing, setting this option to '2' forces the completion to run on the
        # requesting cpu (bypassing the "group" aggregation logic).
        echo 1 > $i/queue/rq_affinity

        # Don't worry about IO counters on virtual disks:
        echo 0 > $i/queue/iostats

        echo 64 > $i/queue/read_ahead_kb

        echo 64 > $i/queue/max_sectors_kb
    fi


    echo "$DEVPATH: rotational: `cat $i/queue/rotational` sched='`cat $i/queue/scheduler`'"


done
==========================================================================

--
Eric Wheeler