
[RFC 2/3] block: FIOPS ioscheduler core


Shaohua Li

Jan 4, 2012, 1:50:01 AM
Attachment: fiops-iosch.patch

Shaohua Li

Jan 4, 2012, 1:50:02 AM
Attachment: fiops-rw-scale.patch

Shaohua Li

Jan 4, 2012, 1:50:02 AM
Attachment: separate-cfq-ioc.patch

Shaohua Li

Jan 4, 2012, 1:50:02 AM
An IOPS based I/O scheduler

Flash-based storage has several characteristics that differ from rotating disks:
1. No I/O seek.
2. Read and write I/O costs usually differ significantly.
3. The time a request takes depends on the request size.
4. High throughput and IOPS, low latency.

The CFQ iosched does well for rotating disks, for example fair dispatching and
idling for sequential reads. It also has optimizations for flash-based storage
(for item 1 above), but overall it isn't designed for it: it's a slice-based
algorithm. Since per-request cost on flash-based storage is very low, and
drives with large queue_depth are now common, which makes dispatch cost even
lower, CFQ's jiffy-based slice accounting doesn't work well. CFQ also doesn't
consider items 2 and 3 above.
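To make the jiffy-granularity problem concrete, here is a back-of-the-envelope sketch (hypothetical numbers for illustration, not measurements from this thread): at HZ=250 a jiffy is 4 ms, so a drive sustaining tens of thousands of IOPS completes hundreds of requests per jiffy, and per-request time accounting at jiffy resolution rounds to zero.

```c
/* Illustrative arithmetic (made-up numbers): how many requests complete
 * within one jiffy?  When this is large, jiffy-based slice accounting
 * cannot attribute time to individual requests; when it is below 1,
 * each request spans multiple jiffies and the accounting works. */
static unsigned int requests_per_jiffy(unsigned int iops, unsigned int hz)
{
	/* one jiffy is 1/hz seconds, so roughly iops/hz requests finish in it */
	return iops / hz;
}
```

At 50k IOPS and HZ=250 this gives 200 requests per jiffy; a rotating disk doing ~200 IOPS gives 0, i.e. requests there are long enough for time-based accounting.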

The FIOPS (Fair IOPS) ioscheduler tries to fill these gaps. It's IOPS based,
so it only targets drives without I/O seek. It's quite similar to CFQ, but
the dispatch decision is made according to IOPS instead of time slices.

The algorithm is simple. The drive has a service tree, and each task lives in
the tree. The key into the tree is called vios (virtual I/O). Every request
has a vios, which is calculated according to its ioprio, request size and so
on. A task's vios is the sum of the vios of all requests it dispatches. FIOPS
always selects the task with the minimum vios in the service tree and lets
that task dispatch a request. The dispatched request's vios is then added to
the task's vios and the task is repositioned in the service tree.
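The selection loop described above can be sketched as a toy model (hypothetical code, not from the patch: the real scheduler keys an rbtree and tracks its leftmost node, rather than scanning an array):

```c
/* Toy model of FIOPS accounting: each task carries a vios key; the
 * scheduler repeatedly picks the pending task with minimum vios and
 * charges it the cost of the request it dispatches. */
#define VIOS_SCALE (1 << 10)

struct toy_task {
	unsigned long long vios; /* accumulated virtual I/O cost */
	int pending;             /* requests still queued */
};

/* pick the task with minimum vios among those with pending requests;
 * returns -1 if none (stands in for fiops_select_ioc) */
static int toy_select(struct toy_task *t, int n)
{
	int i, best = -1;

	for (i = 0; i < n; i++) {
		if (!t[i].pending)
			continue;
		if (best < 0 || t[i].vios < t[best].vios)
			best = i;
	}
	return best;
}

/* dispatch one request from the chosen task and charge its vios;
 * in the real scheduler the task is then repositioned in the tree */
static void toy_dispatch(struct toy_task *t, int idx,
			 unsigned long long request_vios)
{
	t[idx].pending--;
	t[idx].vios += request_vios;
}
```

Because the charge moves a task's key to the right, two tasks issuing equal-cost requests alternate, which is exactly the fairness property the cover letter describes.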

The series is organized as:
Patch 1: separate out CFQ's io context management code. FIOPS will use it too.
Patch 2: the core FIOPS.
Patch 3: request read/write vios scaling. This demonstrates how the vios scales.

To keep the code simple and easy to review, some of the scaling code isn't
included here, and some isn't implemented yet.

TODO:
1. ioprio support (have patch already)
2. request size vios scale
3. cgroup support
4. tracing support
5. automatically select default iosched according to QUEUE_FLAG_NONROT.
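For patch 3 and TODO item 2, a rough sketch of how read/write and request-size scaling might combine (all ratios and the helper below are made up for illustration; the posted core patch's fiops_scaled_vios() simply returns VIOS_SCALE):

```c
/* Hypothetical vios scaling sketch: scale the base cost by a
 * per-direction ratio and by request size.  READ_SCALE/WRITE_SCALE are
 * invented numbers, assuming writes cost ~2.5x reads on some drive. */
#define VIOS_SCALE_SHIFT 10
#define VIOS_SCALE (1 << VIOS_SCALE_SHIFT)

#define READ_SCALE  4
#define WRITE_SCALE 10

static unsigned long long toy_scaled_vios(int is_write, unsigned int bytes)
{
	unsigned long long vios = VIOS_SCALE;

	/* direction scaling: reads keep the base cost */
	vios = vios * (is_write ? WRITE_SCALE : READ_SCALE) / READ_SCALE;

	/* size scaling: charge proportionally beyond one 4KB page */
	if (bytes > 4096)
		vios = vios * bytes / 4096;

	return vios;
}
```

With these assumed ratios, a 4KB read costs 1024 vios, a 4KB write 2560, and an 8KB read 2048, so larger and more expensive requests push a task further right in the service tree.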

Comments and suggestions are welcome!

Thanks,
Shaohua
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Dave Chinner

Jan 4, 2012, 2:20:02 AM
Benchmark results?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Namhyung Kim

Jan 4, 2012, 3:40:02 AM
Hi,

A couple of nitpicks below.


On 2012-01-04 3:53 PM, Shaohua Li wrote:
> CFQ's io context management creates a per-device io context for each task.
> It's quite generic. Separate it from CFQ, and use it for fiops I/O scheduler.
>
> Signed-off-by: Shaohua Li<shaoh...@intel.com>
> ---

[snip]

> +int ioc_builder_init(struct ioc_builder *builder)
> +{
> + if (!builder->alloc_ioc || !builder->free_ioc)
> + return -ENOMEM;
> +
> + builder->ioc_count = alloc_percpu(unsigned long);
> + if (!builder->ioc_count)
> + return -ENOMEM;
> +
> + builder->ioc_gone = NULL;
> + spin_lock_init(&builder->ioc_gone_lock);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL(ioc_builder_init);
> +
> +void io_context_builder_exit(struct ioc_builder *builder)

It'd be better to use 'ioc_builder_exit' as the name, for consistency, IMHO.


> +{
> + DECLARE_COMPLETION_ONSTACK(all_gone);
> +
> + builder->ioc_gone =&all_gone;
> + /* ioc_gone's update must be visible before reading ioc_count */
> + smp_wmb();
> +
> + /*
> + * this also protects us from entering cfq_slab_kill() with
> + * pending RCU callbacks
> + */
> + if (elv_ioc_count_read(*builder->ioc_count))
> + wait_for_completion(&all_gone);
> +
> + free_percpu(builder->ioc_count);
> +}
> +EXPORT_SYMBOL(io_context_builder_exit);

[snip]

> +static void queue_data_cic_free_rcu(struct rcu_head *head)
> +{
> + struct dev_io_context *cic;
> + struct ioc_builder *builder;
> +
> + cic = container_of(head, struct dev_io_context, rcu_head);
> + builder = cic->builder;
> +
> + builder->free_ioc(builder, cic);
> + elv_ioc_count_dec(*builder->ioc_count);
> +
> + if (builder->ioc_gone) {
> + /*
> + * CFQ scheduler is exiting, grab exit lock and check

s/CFQ/IO/ ?

Thanks.
Namhyung Kim

Shaohua Li

Jan 5, 2012, 1:40:02 AM
I don't have data yet. The patches are still at an early stage; I want
to focus on the basic idea first.

Thanks,
Shaohua

Shaohua Li

Jan 6, 2012, 12:00:02 AM
Since you asked: I tested on a 4-socket machine with 12 X25-M SSDs in JBOD;
the filesystem is ext4.

workload                        % change, fiops vs. cfq
fio_sync_read_4k -2
fio_mediaplay_64k 0
fio_mediaplay_128k 0
fio_mediaplay_rr_64k 0
fio_sync_read_rr_4k 0
fio_sync_write_128k 0
fio_sync_write_64k -1
fio_sync_write_4k -2
fio_sync_write_64k_create 0
fio_sync_write_rr_64k_create 0
fio_sync_write_128k_create 0
fio_aio_randread_4k -4
fio_aio_randread_64k 0
fio_aio_randwrite_4k 1
fio_aio_randwrite_64k 0
fio_aio_randrw_4k -1
fio_aio_randrw_64k 0
fio_tpch 9
fio_tpcc 0
fio_mmap_randread_4k -1
fio_mmap_randread_64k 1
fio_mmap_randread_1k -8
fio_mmap_randwrite_4k 35
fio_mmap_randwrite_64k 22
fio_mmap_randwrite_1k 28
fio_mmap_randwrite_4k_halfbusy 24
fio_mmap_randrw_4k 23
fio_mmap_randrw_64k 4
fio_mmap_randrw_1k 22
fio_mmap_randrw_4k_halfbusy 35
fio_mmap_sync_read_4k 0
fio_mmap_sync_read_64k -1
fio_mmap_sync_read_128k -1
fio_mmap_sync_read_rr_64k 5
fio_mmap_sync_read_rr_4k 3

fio_mmap_randread_1k has a regression against 3.2-rc7 but no regression
against the 3.2-rc6 kernel; I'm still checking why. FIOPS shows improvement
for mixed read/write workloads, which CFQ is known to handle poorly.

Namjae Jeon

Jan 6, 2012, 1:10:01 AM
> ===================================================================
> --- linux.orig/block/Kconfig.iosched    2011-12-28 09:42:18.000000000 +0800
> +++ linux/block/Kconfig.iosched 2012-01-04 13:58:35.000000000 +0800
> @@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED
>        ---help---
>          Enable group IO scheduling in CFQ.
>
> +config IOSCHED_FIOPS
> +       tristate "IOPS based I/O scheduler"
> +       default y
> +       ---help---
> +         This is an IOPS based I/O scheduler. It will try to distribute
> +          IOPS equally among all processes in the system. It's mainly for
> +          Flash based storage.
> +
>  choice
>        prompt "Default I/O scheduler"
>        default DEFAULT_CFQ
Hi,
Is the code below also needed?

choice
	prompt "Default I/O scheduler"
	default DEFAULT_CFQ
	help
	  Select the I/O scheduler which will be used by default for all
	  block devices.

	config DEFAULT_DEADLINE
		bool "Deadline" if IOSCHED_DEADLINE=y

	config DEFAULT_CFQ
		bool "CFQ" if IOSCHED_CFQ=y

+	config DEFAULT_FIOPS
+		bool "FIOPS" if IOSCHED_FIOPS=y

	config DEFAULT_NOOP
		bool "No-op"

config DEFAULT_IOSCHED
	string
	default "deadline" if DEFAULT_DEADLINE
	default "cfq" if DEFAULT_CFQ
	default "noop" if DEFAULT_NOOP
+	default "fiops" if DEFAULT_FIOPS

Namhyung Kim

Jan 6, 2012, 4:20:01 AM
On 2012-01-06 2:12 PM, Shaohua Li wrote:
> On Thu, 2012-01-05 at 14:50 +0800, Shaohua Li wrote:
>> On Wed, 2012-01-04 at 18:19 +1100, Dave Chinner wrote:
>>> On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
>>>> An IOPS based I/O scheduler
>>>>
>>>> Flash based storage has some different characteristics against rotate disk.
>>>> 1. no I/O seek.
>>>> 2. read and write I/O cost usually is much different.
>>>> 3. Time which a request takes depends on request size.
>>>> 4. High throughput and IOPS, low latency.
>>>>
>>>> CFQ iosched does well for rotate disk, for example fair dispatching, idle
>>>> for sequential read. It also has optimization for flash based storage (for
>>>> item 1 above), but overall it's not designed for flash based storage. It's
>>>> a slice based algorithm. Since flash based storage request cost is very
>>>> low, and drive has big queue_depth is quite popular now which makes
>>>> dispatching cost even lower, CFQ's slice accounting (jiffy based)
>>>> doesn't work well. CFQ doesn't consider above item 2& 3.
Hi,

Looks promising. :) Anyway, what was your configuration for the test? Did
you use vios scaling based on I/O direction and/or ioprio?

Thanks,
Namhyung Kim

Zhu Yanhai

Jan 6, 2012, 4:50:02 AM
Hi,

2012/1/4 Shaohua Li <shaoh...@intel.com>:
Actually, most RAID cards we have ever seen report the wrong value for
this flag, including LSI's 1078E, 9260/9265, HBA cards, etc. So a human
may need to set it up correctly to make this adaptive code work.
Also, there is a much more complicated scenario ahead :) hybrid storage.
E.g. one or two SSDs mixed with several SAS disks, or one expensive
FusionIO ioDrive mixed with several SAS disks, powered by Facebook's
Flashcache. AFAIK there is no I/O bandwidth control solution for such
setups, yet they are really very popular these days.
I'm going to run some tests with this code against real workloads this
month; hopefully we can get more ideas, and of course the feedback data.

--
Thanks,
Zhu Yanhai
Taobao Inc.

Jan Kara

Jan 6, 2012, 9:40:02 AM
Nice, but this is only about throughput, isn't it? Part of the reason why
CFQ takes a throughput hit is that it prefers sync I/O (i.e. reads and
synchronous writes) over other I/O. Does your scheduler do anything like
that? Could you, for example, compare the latency of reads while running
heavy background writes between CFQ and your scheduler? Loads like this
were the original motivation for CFQ, I believe.

Honza
--
Jan Kara <ja...@suse.cz>
SUSE Labs, CR
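The comparison Jan asks for could be set up with a fio job along these lines (a hypothetical job file; the device path, sizes and runtime are placeholders, not from the thread): run heavy sequential writers while sampling random-read completion latency, once per scheduler.

```ini
[global]
filename=/dev/sdX
direct=1
ioengine=libaio
runtime=60
time_based

[bg-writer]
rw=write
bs=1M
iodepth=32

[latency-reader]
rw=randread
bs=4k
iodepth=1
```

Comparing the reader's completion-latency percentiles under cfq and fiops would show whether fiops' lack of sync preference costs read latency under writeback.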

Zhu Yanhai

Jan 6, 2012, 8:10:02 PM
2012/1/4 Shaohua Li <shaoh...@intel.com>:
> FIOPS (Fair IOPS) is an IOPS-based ioscheduler, so it only targets drives
> without I/O seek. It's quite similar to CFQ, but the dispatch decision is
> made according to IOPS instead of time slices.
>
> The algorithm is simple. Drive has a service tree, and each task lives in
> the tree. The key into the tree is called vios (virtual I/O). Every request
> has vios, which is calculated according to its ioprio, request size and so
> on. Task's vios is the sum of vios of all requests it dispatches. FIOPS
> always selects task with minimum vios in the service tree and let the task
> dispatch request. The dispatched request's vios is then added to the task's
> vios and the task is repositioned in the service tree.
>
> Unlike CFQ, FIOPS doesn't have separate sync/async queues, because with I/O
> less writeback, usually a task can only dispatch either sync or async requests.
> Bias read or write request can still be done with read/write scale.
>
> One issue is that if the workload iodepth is lower than the drive's
> queue_depth, a task's IOPS share might not strictly follow its priority,
> request size and so on. In that case the drive actually has idle capacity.
> Solving the problem would require idling the drive, which would hurt
> performance. I believe CFQ isn't completely fair between tasks in such a
> case either.
>
> Signed-off-by: Shaohua Li <shaoh...@intel.com>
> ---
>  block/Kconfig.iosched |    8
>  block/Makefile        |    1
>  block/blk-ioc.c       |    2
>  block/blk.h           |    2
>  block/fiops-iosched.c |  659 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 670 insertions(+), 2 deletions(-)
>
> Index: linux/block/Makefile
> ===================================================================
> --- linux.orig/block/Makefile   2011-12-28 09:42:18.000000000 +0800
> +++ linux/block/Makefile        2011-12-28 09:42:35.000000000 +0800
> @@ -14,6 +14,7 @@ obj-$(CONFIG_BLK_DEV_THROTTLING)      += blk-
>  obj-$(CONFIG_IOSCHED_NOOP)     += noop-iosched.o
>  obj-$(CONFIG_IOSCHED_DEADLINE) += deadline-iosched.o
>  obj-$(CONFIG_IOSCHED_CFQ)      += cfq-iosched.o
> +obj-$(CONFIG_IOSCHED_FIOPS)    += fiops-iosched.o
>
>  obj-$(CONFIG_BLOCK_COMPAT)     += compat_ioctl.o
>  obj-$(CONFIG_BLK_DEV_INTEGRITY)        += blk-integrity.o
> Index: linux/block/fiops-iosched.c
> ===================================================================
> --- /dev/null   1970-01-01 00:00:00.000000000 +0000
> +++ linux/block/fiops-iosched.c 2012-01-04 14:01:26.000000000 +0800
> @@ -0,0 +1,659 @@
> +/*
> + * IOPS based IO scheduler. Based on CFQ.
> + *  Copyright (C) 2003 Jens Axboe <ax...@kernel.dk>
> + *  Shaohua Li <sh...@kernel.org>
> + */
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/blkdev.h>
> +#include <linux/elevator.h>
> +#include <linux/jiffies.h>
> +#include <linux/rbtree.h>
> +#include <linux/ioprio.h>
> +#include "blk.h"
> +
> +#define VIOS_SCALE_SHIFT 10
> +#define VIOS_SCALE (1 << VIOS_SCALE_SHIFT)
> +
> +struct fiops_rb_root {
> +       struct rb_root rb;
> +       struct rb_node *left;
> +       unsigned count;
> +
> +       u64 min_vios;
> +};
> +#define FIOPS_RB_ROOT  (struct fiops_rb_root) { .rb = RB_ROOT}
> +
> +struct fiops_data {
> +       struct queue_data qdata;
> +
> +       struct fiops_rb_root service_tree;
> +
> +       unsigned int busy_queues;
> +
> +       struct work_struct unplug_work;
> +};
> +
> +struct fiops_ioc {
> +       struct dev_io_context dev_ioc;
> +
> +       unsigned int flags;
> +       struct fiops_data *fiopsd;
> +       struct rb_node rb_node;
> +       u64 vios; /* key in service_tree */
> +       struct fiops_rb_root *service_tree;
> +
> +       struct rb_root sort_list;
> +       struct list_head fifo;
> +
> +       pid_t pid;
> +};
> +
> +static struct kmem_cache *fiops_ioc_pool;
> +static struct ioc_builder ioc_builder;
> +#define queue_data_to_fiopsd(ptr) container_of(ptr, struct fiops_data, qdata)
> +#define dev_ioc_to_fiops_ioc(ptr) container_of(ptr, struct fiops_ioc, dev_ioc)
> +#define ioc_service_tree(ioc) (&((ioc)->fiopsd->service_tree))
> +
> +#define RQ_CIC(rq)             \
> +       ((struct fiops_ioc *) (rq)->elevator_private[0])
> +enum ioc_state_flags {
> +       FIOPS_IOC_FLAG_on_rr = 0,       /* on round-robin busy list */
> +};
> +
> +#define FIOPS_IOC_FNS(name)                                            \
> +static inline void fiops_mark_ioc_##name(struct fiops_ioc *ioc)        \
> +{                                                                      \
> +       ioc->flags |= (1 << FIOPS_IOC_FLAG_##name);                     \
> +}                                                                      \
> +static inline void fiops_clear_ioc_##name(struct fiops_ioc *ioc)       \
> +{                                                                      \
> +       ioc->flags &= ~(1 << FIOPS_IOC_FLAG_##name);                    \
> +}                                                                      \
> +static inline int fiops_ioc_##name(const struct fiops_ioc *ioc)        \
> +{                                                                      \
> +       return ((ioc)->flags & (1 << FIOPS_IOC_FLAG_##name)) != 0;      \
> +}
> +
> +FIOPS_IOC_FNS(on_rr);
> +#undef FIOPS_IOC_FNS
> +
> +/*
> + * The below is leftmost cache rbtree addon
> + */
> +static struct fiops_ioc *fiops_rb_first(struct fiops_rb_root *root)
> +{
> +       /* Service tree is empty */
> +       if (!root->count)
> +               return NULL;
> +
> +       if (!root->left)
> +               root->left = rb_first(&root->rb);
> +
> +       if (root->left)
> +               return rb_entry(root->left, struct fiops_ioc, rb_node);
> +
> +       return NULL;
> +}
> +
> +static void rb_erase_init(struct rb_node *n, struct rb_root *root)
> +{
> +       rb_erase(n, root);
> +       RB_CLEAR_NODE(n);
> +}
> +
> +static void fiops_rb_erase(struct rb_node *n, struct fiops_rb_root *root)
> +{
> +       if (root->left == n)
> +               root->left = NULL;
> +       rb_erase_init(n, &root->rb);
> +       --root->count;
> +}
> +
> +static inline u64 max_vios(u64 min_vios, u64 vios)
> +{
> +       s64 delta = (s64)(vios - min_vios);
> +       if (delta > 0)
> +               min_vios = vios;
> +
> +       return min_vios;
> +}
> +
> +static void fiops_update_min_vios(struct fiops_rb_root *service_tree)
> +{
> +       struct fiops_ioc *ioc;
> +
> +       ioc = fiops_rb_first(service_tree);
> +       if (!ioc)
> +               return;
> +       service_tree->min_vios = max_vios(service_tree->min_vios, ioc->vios);
This is tracking the max vios here, not the min one. Should be min_vios()?
> +}
> +
> +/*
> + * The fiopsd->service_trees holds all pending fiops_ioc's that have
> + * requests waiting to be processed. It is sorted in the order that
> + * we will service the queues.
> + */
> +static void fiops_service_tree_add(struct fiops_data *fiopsd,
> +       struct fiops_ioc *ioc)
> +{
> +       struct rb_node **p, *parent;
> +       struct fiops_ioc *__ioc;
> +       struct fiops_rb_root *service_tree = ioc_service_tree(ioc);
> +       u64 vios;
> +       int left;
> +
> +       /* New added IOC */
> +       if (RB_EMPTY_NODE(&ioc->rb_node))
> +               vios = max_vios(service_tree->min_vios, ioc->vios);
> +       else {
> +               vios = ioc->vios;
> +               /* ioc->service_tree might not equal to service_tree */
> +               fiops_rb_erase(&ioc->rb_node, ioc->service_tree);
> +               ioc->service_tree = NULL;
> +       }
> +
> +       left = 1;
> +       parent = NULL;
> +       ioc->service_tree = service_tree;
> +       p = &service_tree->rb.rb_node;
> +       while (*p) {
> +               struct rb_node **n;
> +
> +               parent = *p;
> +               __ioc = rb_entry(parent, struct fiops_ioc, rb_node);
> +
> +               /*
> +                * sort by key, that represents service time.
> +                */
> +               if (vios <  __ioc->vios)
> +                       n = &(*p)->rb_left;
> +               else {
> +                       n = &(*p)->rb_right;
> +                       left = 0;
> +               }
> +
> +               p = n;
> +       }
> +
> +       if (left)
> +               service_tree->left = &ioc->rb_node;
> +
> +       ioc->vios = vios;
> +       rb_link_node(&ioc->rb_node, parent, p);
> +       rb_insert_color(&ioc->rb_node, &service_tree->rb);
> +       service_tree->count++;
> +
> +       fiops_update_min_vios(service_tree);
> +}
> +
> +/*
> + * Update ioc's position in the service tree.
> + */
> +static void fiops_resort_rr_list(struct fiops_data *fiopsd,
> +       struct fiops_ioc *ioc)
> +{
> +       /*
> +        * Resorting requires the ioc to be on the RR list already.
> +        */
> +       if (fiops_ioc_on_rr(ioc))
> +               fiops_service_tree_add(fiopsd, ioc);
> +}
> +
> +/*
> + * add to busy list of queues for service, trying to be fair in ordering
> + * the pending list according to last request service
> + */
> +static void fiops_add_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc)
> +{
> +       BUG_ON(fiops_ioc_on_rr(ioc));
> +       fiops_mark_ioc_on_rr(ioc);
> +
> +       fiopsd->busy_queues++;
> +
> +       fiops_resort_rr_list(fiopsd, ioc);
> +}
> +
> +/*
> + * Called when the ioc no longer has requests pending, remove it from
> + * the service tree.
> + */
> +static void fiops_del_ioc_rr(struct fiops_data *fiopsd, struct fiops_ioc *ioc)
> +{
> +       BUG_ON(!fiops_ioc_on_rr(ioc));
> +       fiops_clear_ioc_on_rr(ioc);
> +
> +       if (!RB_EMPTY_NODE(&ioc->rb_node)) {
> +               fiops_rb_erase(&ioc->rb_node, ioc->service_tree);
> +               ioc->service_tree = NULL;
> +       }
> +
> +       BUG_ON(!fiopsd->busy_queues);
> +       fiopsd->busy_queues--;
> +}
> +
> +/*
> + * rb tree support functions
> + */
> +static void fiops_del_rq_rb(struct request *rq)
> +{
> +       struct fiops_ioc *ioc = RQ_CIC(rq);
> +
> +       elv_rb_del(&ioc->sort_list, rq);
> +}
> +
> +static void fiops_add_rq_rb(struct request *rq)
> +{
> +       struct fiops_ioc *ioc = RQ_CIC(rq);
> +       struct fiops_data *fiopsd = ioc->fiopsd;
> +
> +       elv_rb_add(&ioc->sort_list, rq);
> +
> +       if (!fiops_ioc_on_rr(ioc))
> +               fiops_add_ioc_rr(fiopsd, ioc);
> +}
> +
> +static void fiops_reposition_rq_rb(struct fiops_ioc *ioc, struct request *rq)
> +{
> +       elv_rb_del(&ioc->sort_list, rq);
> +       fiops_add_rq_rb(rq);
> +}
> +
> +static void fiops_remove_request(struct request *rq)
> +{
> +       list_del_init(&rq->queuelist);
> +       fiops_del_rq_rb(rq);
> +}
> +
> +static u64 fiops_scaled_vios(struct fiops_data *fiopsd,
> +       struct fiops_ioc *ioc, struct request *rq)
> +{
> +       return VIOS_SCALE;
> +}
> +
> +/* return vios dispatched */
> +static u64 fiops_dispatch_request(struct fiops_data *fiopsd,
> +       struct fiops_ioc *ioc)
> +{
> +       struct request *rq;
> +       struct request_queue *q = fiopsd->qdata.queue;
> +
> +       rq = rq_entry_fifo(ioc->fifo.next);
> +
> +       fiops_remove_request(rq);
> +       elv_dispatch_sort(q, rq);
> +
> +       return fiops_scaled_vios(fiopsd, ioc, rq);
> +}
> +
> +static int fiops_forced_dispatch(struct fiops_data *fiopsd)
> +{
> +       struct fiops_ioc *ioc;
> +       int dispatched = 0;
> +
> +       while ((ioc = fiops_rb_first(&fiopsd->service_tree)) != NULL) {
> +               while (!list_empty(&ioc->fifo)) {
> +                       fiops_dispatch_request(fiopsd, ioc);
> +                       dispatched++;
> +               }
> +               if (fiops_ioc_on_rr(ioc))
> +                       fiops_del_ioc_rr(fiopsd, ioc);
> +       }
> +       return dispatched;
> +}
> +
> +static struct fiops_ioc *fiops_select_ioc(struct fiops_data *fiopsd)
> +{
> +       struct fiops_ioc *ioc;
> +
> +       if (RB_EMPTY_ROOT(&fiopsd->service_tree.rb))
> +               return NULL;
> +       ioc = fiops_rb_first(&fiopsd->service_tree);
> +       return ioc;
> +}
> +
> +static void fiops_charge_vios(struct fiops_data *fiopsd,
> +       struct fiops_ioc *ioc, u64 vios)
> +{
> +       struct fiops_rb_root *service_tree = ioc->service_tree;
> +       ioc->vios += vios;
> +
> +       if (RB_EMPTY_ROOT(&ioc->sort_list))
> +               fiops_del_ioc_rr(fiopsd, ioc);
> +       else
> +               fiops_resort_rr_list(fiopsd, ioc);
> +
> +       fiops_update_min_vios(service_tree);
> +}
> +
> +static int fiops_dispatch_requests(struct request_queue *q, int force)
> +{
> +       struct fiops_data *fiopsd = q->elevator->elevator_data;
> +       struct fiops_ioc *ioc;
> +       u64 vios;
> +
> +       if (unlikely(force))
> +               return fiops_forced_dispatch(fiopsd);
> +
> +       ioc = fiops_select_ioc(fiopsd);
> +       if (!ioc)
> +               return 0;
> +
> +       vios = fiops_dispatch_request(fiopsd, ioc);
> +
> +       fiops_charge_vios(fiopsd, ioc, vios);
> +       return 1;
> +}
> +
> +static void fiops_insert_request(struct request_queue *q, struct request *rq)
> +{
> +       struct fiops_ioc *ioc = RQ_CIC(rq);
> +
> +       rq_set_fifo_time(rq, jiffies);
> +
> +       list_add_tail(&rq->queuelist, &ioc->fifo);
> +
> +       fiops_add_rq_rb(rq);
> +}
> +
> +/*
> + * scheduler run of queue, if there are requests pending and no one in the
> + * driver that will restart queueing
> + */
> +static inline void fiops_schedule_dispatch(struct fiops_data *fiopsd)
> +{
> +       if (fiopsd->busy_queues)
> +               kblockd_schedule_work(fiopsd->qdata.queue,
> +                                     &fiopsd->unplug_work);
> +}
> +
> +static int
> +fiops_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
> +{
> +       struct fiops_data *fiopsd = q->elevator->elevator_data;
> +       struct dev_io_context *dev_ioc;
> +       struct fiops_ioc *cic;
> +
> +       might_sleep_if(gfp_mask & __GFP_WAIT);
> +
> +       dev_ioc = queue_data_get_io_context(&ioc_builder, &fiopsd->qdata,
> +               gfp_mask);
> +       if (!dev_ioc)
> +               goto queue_fail;
> +
> +       cic = dev_ioc_to_fiops_ioc(dev_ioc);
> +
> +       /*
> +        * we hold a reference of dev_ioc and nobody else set this request,
> +        * doesn't need locking
> +        */
> +       rq->elevator_private[0] = cic;
> +
> +       return 0;
> +
> +queue_fail:
> +       fiops_schedule_dispatch(fiopsd);
> +       return 1;
> +}
> +
> +static void fiops_put_request(struct request *rq)
> +{
> +       struct fiops_ioc *ioc = RQ_CIC(rq);
> +
> +       if (ioc) {
> +               rq->elevator_private[0] = NULL;
> +               put_io_context(ioc->dev_ioc.ioc);
> +       }
> +}
> +
> +static struct request *
> +fiops_find_rq_fmerge(struct fiops_data *fiopsd, struct bio *bio)
> +{
> +       struct task_struct *tsk = current;
> +       struct dev_io_context *gen_cic;
> +       struct fiops_ioc *cic;
> +
> +       gen_cic = queue_data_cic_lookup(&fiopsd->qdata, tsk->io_context);
> +       if (!gen_cic)
> +               return NULL;
> +       cic = dev_ioc_to_fiops_ioc(gen_cic);
> +
> +       if (cic) {
> +               sector_t sector = bio->bi_sector + bio_sectors(bio);
> +
> +               return elv_rb_find(&cic->sort_list, sector);
> +       }
> +
> +       return NULL;
> +}
> +
> +static int fiops_merge(struct request_queue *q, struct request **req,
> +                    struct bio *bio)
> +{
> +       struct fiops_data *fiopsd = q->elevator->elevator_data;
> +       struct request *__rq;
> +
> +       __rq = fiops_find_rq_fmerge(fiopsd, bio);
> +       if (__rq && elv_rq_merge_ok(__rq, bio)) {
> +               *req = __rq;
> +               return ELEVATOR_FRONT_MERGE;
> +       }
> +
> +       return ELEVATOR_NO_MERGE;
> +}
> +
> +static void fiops_merged_request(struct request_queue *q, struct request *req,
> +                              int type)
> +{
> +       if (type == ELEVATOR_FRONT_MERGE) {
> +               struct fiops_ioc *ioc = RQ_CIC(req);
> +
> +               fiops_reposition_rq_rb(ioc, req);
> +       }
> +}
> +
> +static void
> +fiops_merged_requests(struct request_queue *q, struct request *rq,
> +                   struct request *next)
> +{
> +       struct fiops_ioc *ioc = RQ_CIC(rq);
> +       struct fiops_data *fiopsd = q->elevator->elevator_data;
> +       /*
> +        * reposition in fifo if next is older than rq
> +        */
> +       if (!list_empty(&rq->queuelist) && !list_empty(&next->queuelist) &&
> +           time_before(rq_fifo_time(next), rq_fifo_time(rq))) {
> +               list_move(&rq->queuelist, &next->queuelist);
> +               rq_set_fifo_time(rq, rq_fifo_time(next));
> +       }
> +
> +       fiops_remove_request(next);
> +
> +       ioc = RQ_CIC(next);
> +       /*
> +        * all requests of this task are merged to other tasks, delete it
> +        * from the service tree.
> +        */
> +       if (fiops_ioc_on_rr(ioc) && RB_EMPTY_ROOT(&ioc->sort_list))
> +               fiops_del_ioc_rr(fiopsd, ioc);
> +}
> +
> +static int fiops_allow_merge(struct request_queue *q, struct request *rq,
> +                          struct bio *bio)
> +{
> +       struct fiops_data *fiopsd = q->elevator->elevator_data;
> +       struct dev_io_context *gen_cic;
> +       struct fiops_ioc *cic;
> +
> +       /*
> +        * Lookup the ioc that this bio will be queued with. Allow
> +        * merge only if rq is queued there.
> +        */
> +       gen_cic = queue_data_cic_lookup(&fiopsd->qdata, current->io_context);
> +       if (!gen_cic)
> +               return false;
> +       cic = dev_ioc_to_fiops_ioc(gen_cic);
> +
> +       return cic == RQ_CIC(rq);
> +}
> +
> +static void fiops_exit_queue(struct elevator_queue *e)
> +{
> +       struct fiops_data *fiopsd = e->elevator_data;
> +       struct request_queue *q = fiopsd->qdata.queue;
> +
> +       cancel_work_sync(&fiopsd->unplug_work);
> +
> +       spin_lock_irq(q->queue_lock);
> +
> +       ioc_builder_exit_queue(&ioc_builder, &fiopsd->qdata);
> +
> +       spin_unlock_irq(q->queue_lock);
> +       kfree(fiopsd);
> +}
> +
> +static void fiops_kick_queue(struct work_struct *work)
> +{
> +       struct fiops_data *fiopsd =
> +               container_of(work, struct fiops_data, unplug_work);
> +       struct request_queue *q = fiopsd->qdata.queue;
> +
> +       spin_lock_irq(q->queue_lock);
> +       __blk_run_queue(q);
> +       spin_unlock_irq(q->queue_lock);
> +}
> +
> +static void *fiops_init_queue(struct request_queue *q)
> +{
> +       struct fiops_data *fiopsd;
> +
> +       fiopsd = kzalloc_node(sizeof(*fiopsd), GFP_KERNEL, q->node);
> +       if (!fiopsd)
> +               return NULL;
> +
> +       if (ioc_builder_init_queue(&ioc_builder, &fiopsd->qdata, q)) {
> +               kfree(fiopsd);
> +               return NULL;
> +       }
> +
> +       fiopsd->service_tree = FIOPS_RB_ROOT;
> +
> +       INIT_WORK(&fiopsd->unplug_work, fiops_kick_queue);
> +
> +       return fiopsd;
> +}
> +
> +static void fiops_slab_kill(void)
> +{
> +       /*
> +        * Caller already ensured that pending RCU callbacks are completed,
> +        * so we should have no busy allocations at this point.
> +        */
> +       if (fiops_ioc_pool)
> +               kmem_cache_destroy(fiops_ioc_pool);
> +}
> +
> +static int __init fiops_slab_setup(void)
> +{
> +       fiops_ioc_pool = KMEM_CACHE(fiops_ioc, 0);
> +       if (!fiops_ioc_pool)
> +               return -ENOMEM;
> +
> +       return 0;
> +}
> +
> +static struct dev_io_context *
> +fiops_alloc_ioc(struct ioc_builder *builder, struct queue_data *qdata,
> +       gfp_t gfp_mask)
> +{
> +       struct fiops_ioc *ioc = kmem_cache_alloc_node(fiops_ioc_pool,
> +               gfp_mask, qdata->queue->node);
> +       if (ioc)
> +               return &ioc->dev_ioc;
> +       return NULL;
> +}
> +
> +static void fiops_free_ioc(struct ioc_builder *builder,
> +       struct dev_io_context *dev_ioc)
> +{
> +       struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(dev_ioc);
> +       kmem_cache_free(fiops_ioc_pool, ioc);
> +}
> +
> +static void fiops_init_cic(struct queue_data *qdata,
> +       struct dev_io_context *gen_cic)
> +{
> +       struct fiops_data *fiopsd = queue_data_to_fiopsd(qdata);
> +       struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(gen_cic);
> +
> +       RB_CLEAR_NODE(&ioc->rb_node);
> +       INIT_LIST_HEAD(&ioc->fifo);
> +       ioc->sort_list = RB_ROOT;
> +
> +       ioc->fiopsd = fiopsd;
> +
> +       ioc->pid = current->pid;
> +}
> +
> +static void fiops_exit_cic(struct queue_data *qdata,
> +       struct dev_io_context *gen_cic)
> +{
> +       struct fiops_ioc *ioc = dev_ioc_to_fiops_ioc(gen_cic);
> +
> +       WARN_ON(fiops_ioc_on_rr(ioc));
> +}
> +
> +static struct ioc_builder ioc_builder = {
> +       .alloc_ioc = fiops_alloc_ioc,
> +       .free_ioc = fiops_free_ioc,
> +       .cic_init = fiops_init_cic,
> +       .cic_exit = fiops_exit_cic,
> +};
> +
> +static struct elevator_type iosched_fiops = {
> +       .ops = {
> +               .elevator_merge_fn =            fiops_merge,
> +               .elevator_merged_fn =           fiops_merged_request,
> +               .elevator_merge_req_fn =        fiops_merged_requests,
> +               .elevator_allow_merge_fn =      fiops_allow_merge,
> +               .elevator_dispatch_fn =         fiops_dispatch_requests,
> +               .elevator_add_req_fn =          fiops_insert_request,
> +               .elevator_former_req_fn =       elv_rb_former_request,
> +               .elevator_latter_req_fn =       elv_rb_latter_request,
> +               .elevator_set_req_fn =          fiops_set_request,
> +               .elevator_put_req_fn =          fiops_put_request,
> +               .elevator_init_fn =             fiops_init_queue,
> +               .elevator_exit_fn =             fiops_exit_queue,
> +               .trim =                         queue_data_free_io_context,
> +       },
> +       .elevator_name =        "fiops",
> +       .elevator_owner =       THIS_MODULE,
> +};
> +
> +static int __init fiops_init(void)
> +{
> +       if (fiops_slab_setup())
> +               return -ENOMEM;
> +       if (ioc_builder_init(&ioc_builder)) {
> +               fiops_slab_kill();
> +               return -ENOMEM;
> +       }
> +
> +       elv_register(&iosched_fiops);
> +
> +       return 0;
> +}
> +
> +static void __exit fiops_exit(void)
> +{
> +       elv_unregister(&iosched_fiops);
> +       io_context_builder_exit(&ioc_builder);
> +       fiops_slab_kill();
> +}
> +
> +module_init(fiops_init);
> +module_exit(fiops_exit);
> +
> +MODULE_AUTHOR("Jens Axboe, Shaohua Li <sh...@kernel.org>");
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("IOPS based IO scheduler");
> Index: linux/block/Kconfig.iosched
> ===================================================================
> --- linux.orig/block/Kconfig.iosched    2011-12-28 09:42:18.000000000 +0800
> +++ linux/block/Kconfig.iosched 2012-01-04 13:58:35.000000000 +0800
> @@ -43,6 +43,14 @@ config CFQ_GROUP_IOSCHED
>        ---help---
>          Enable group IO scheduling in CFQ.
>
> +config IOSCHED_FIOPS
> +       tristate "IOPS based I/O scheduler"
> +       default y
> +       ---help---
> +         This is an IOPS based I/O scheduler. It will try to distribute
> +         IOPS equally among all processes in the system. It's mainly for
> +         flash based storage.
> +
>  choice
>        prompt "Default I/O scheduler"
>        default DEFAULT_CFQ
> Index: linux/block/blk.h
> ===================================================================
> --- linux.orig/block/blk.h      2011-12-28 09:42:18.000000000 +0800
> +++ linux/block/blk.h   2011-12-28 09:42:35.000000000 +0800
> @@ -206,7 +206,7 @@ static inline void blk_throtl_exit(struc
>  static inline void blk_throtl_release(struct request_queue *q) { }
>  #endif /* CONFIG_BLK_DEV_THROTTLING */
>
> -#if IS_ENABLED(CONFIG_IOSCHED_CFQ)
> +#if IS_ENABLED(CONFIG_IOSCHED_CFQ) || IS_ENABLED(CONFIG_IOSCHED_FIOPS)
>  struct queue_data;
>  struct ioc_builder {
>        struct dev_io_context *(*alloc_ioc)(struct ioc_builder *builder,
> Index: linux/block/blk-ioc.c
> ===================================================================
> --- linux.orig/block/blk-ioc.c  2011-12-28 09:42:18.000000000 +0800
> +++ linux/block/blk-ioc.c       2011-12-28 09:42:35.000000000 +0800
> @@ -164,7 +164,7 @@ static int __init blk_ioc_init(void)
>  }
>  subsys_initcall(blk_ioc_init);
>
> -#if IS_ENABLED(CONFIG_IOSCHED_CFQ)
> +#if IS_ENABLED(CONFIG_IOSCHED_CFQ) || IS_ENABLED(CONFIG_IOSCHED_FIOPS)
>  #define CIC_DEAD_INDEX_SHIFT   1
>
>  static inline void *queue_data_dead_key(struct queue_data *qdata)

Dave Chinner, Jan 8, 2012, 5:20 PM:
Numbers like this are meaningless without knowing what the hardware
capability is and how the numbers compare to that raw capability.
They tell me only mmap based random write improves in
performance, and only one specific type of random write improves,
not all types.

That raises more questions that it answers: why do AIO based random
writes not go any faster? Is that because even with CFQ, AIO based
random writes saturate the device? i.e. is AIO based IO that much
faster than mmap based IO that there is no scope for improvement on
your hardware?

You need to present raw numbers and give us some idea of how close
those numbers are to raw hardware capability for us to have any idea
what improvements these numbers actually demonstrate.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com

Shaohua Li, Jan 8, 2012, 8:00 PM:
Yes, your guess is right. The hardware has a limitation. 12 SSDs exceed
the JBOD capability, for both throughput and IOPS; that's why only the
read/write mixed workload shows an impact. I'll use fewer SSDs in later
tests, which will demonstrate the performance better. I'll report both
raw numbers and fiops/cfq numbers then.

Thanks,
Shaohua

Shaohua Li, Jan 8, 2012, 8:20 PM:
It has. The vios will be scaled based on read/write. By default, reads
get 2.5 times more share than writes, which matches CFQ. It's quite easy
to change the ratio with sysfs knobs.

> Could you for example compare a latency of reads while running heavy
> background writing between CFQ and your scheduler? Loads like this where
> original motivation for CFQ I believe.
CFQ supports preemption and FIOPS doesn't, so I suppose CFQ's read
latency is still better in such a workload.
In this initial post, I just want to demonstrate the basic idea of the
ioscheduler. I'll post more data for both latency and throughput in the
next round.

Thanks,
Shaohua

Vivek Goyal, Jan 15, 2012, 5:30 PM:
On Wed, Jan 04, 2012 at 02:53:37PM +0800, Shaohua Li wrote:
> An IOPS based I/O scheduler
>
> Flash based storage has some different characteristics against rotate disk.
> 1. no I/O seek.
> 2. read and write I/O cost usually is much different.
> 3. Time which a request takes depends on request size.
> 4. High throughput and IOPS, low latency.
>
> CFQ iosched does well for rotate disk, for example fair dispatching, idle
> for sequential read. It also has optimization for flash based storage (for
> item 1 above), but overall it's not designed for flash based storage. It's
> a slice based algorithm. Since flash based storage request cost is very
> low, and drive has big queue_depth is quite popular now which makes
> dispatching cost even lower, CFQ's slice accounting (jiffy based)
> doesn't work well. CFQ doesn't consider above item 2 & 3.
>
> FIOPS (Fair IOPS) ioscheduler is trying to fix the gaps. It's IOPS based, so
> only targets for drive without I/O seek. It's quite similar like CFQ, but
> the dispatch decision is made according to IOPS instead of slice.

Hi Shaohua,

What problem are you trying to fix? If you just want to provide
fairness in terms of IOPS instead of time, then I think the existing
code should be easily modifiable instead of writing a new IO scheduler
altogether. It would just be a matter of switching the mode, either
based on a tunable or based on disk type.

If slice_idle is zero, we already have some code to effectively do
accounting in terms of IOPS. This is primarily for group level. We should
be able to extend it to queue level.

These are implementation details. What I am not able to understand is
what the problem is and what you are going to gain by doing accounting
in terms of IOPS.

Also, for flash based storage, isn't noop or deadline a good choice of
scheduler? Why do we want to come up with something CFQ-like? For these
fast devices any idling will kill performance, and if you don't idle,
then I think in practice most workloads will effectively get round
robin service.

Thanks
Vivek

Vivek Goyal, Jan 15, 2012, 5:30 PM:
Set slice_idle=0 and possibly increase quantum to 32 or 64 and you
should get better performance. I think in practice it should become an
IOPS based scheduler with slightly imprecise accounting.

Thanks
Vivek

Vivek Goyal, Jan 15, 2012, 5:40 PM:
On Mon, Jan 09, 2012 at 09:26:45AM +0800, Shaohua Li wrote:

[..]
> > Could you for example compare a latency of reads while running heavy
> > background writing between CFQ and your scheduler? Loads like this where
> > original motivation for CFQ I believe.
> CFQ supports preemption, FIOPS doesn't, so I suppose read latency of CFQ
> is still better in such workload.
> In this initial post, I just want to demonstrate the basic idea of the
> ioscheduler. I'll post more data for both latency and throughput in next
> round.

I think, before numbers, it would be more helpful to know what you are
trying to achieve and why the existing CFQ code cannot do that with
little modification.

Thanks
Vivek

Vivek Goyal, Jan 15, 2012, 5:50 PM:
On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:

[..]
> > You need to present raw numbers and give us some idea of how close
> > those numbers are to raw hardware capability for us to have any idea
> > what improvements these numbers actually demonstrate.
> Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> the jbod capability, for both throughput and IOPS, that's why only
> read/write mixed workload impacts. I'll use less SSD in later tests,
> which will demonstrate the performance better. I'll report both raw
> numbers and fiops/cfq numbers later.

If fiops numbers are better, please explain why those numbers are
better. If you cut down on idling, it is obvious that you will get
higher throughput on these flash devices. CFQ does disable queue idling
for non-rotational NCQ devices. If the higher throughput is due to
driving deeper queue depths, then CFQ can do that too, just by changing
quantum and disabling idling.

So I really don't understand what you are doing fundamentally
differently in the FIOPS ioscheduler.

The only thing I can think of is more accurate accounting per queue, in
terms of number of IOs instead of time, which can just serve to improve
fairness a bit for certain workloads. In practice, I think it might not
matter much.

Thanks
Vivek

Shaohua Li, Jan 15, 2012, 11:40 PM:
On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
>
> [..]
> > > You need to present raw numbers and give us some idea of how close
> > > those numbers are to raw hardware capability for us to have any idea
> > > what improvements these numbers actually demonstrate.
> > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> > the jbod capability, for both throughput and IOPS, that's why only
> > read/write mixed workload impacts. I'll use less SSD in later tests,
> > which will demonstrate the performance better. I'll report both raw
> > numbers and fiops/cfq numbers later.
>
> If fiops number are better please explain why those numbers are better.
> If you cut down on idling, it is obivious that you will get higher
> throughput on these flash devices. CFQ does disable queue idling for
> non rotational NCQ devices. If higher throughput is due to driving
> deeper queue depths, then CFQ can do that too just by changing quantum
> and disabling idling.
It's because of the quantum. Surely you can increase the quantum, and
CFQ performance will increase, but you will find CFQ is very unfair then.

> So I really don't understand that what are you doing fundamentally
> different in FIOPS ioscheduler.
>
> The only thing I can think of more accurate accounting per queue in
> terms of number of IOs instead of time. Which can just serve to improve
> fairness a bit for certain workloads. In practice, I think it might
> not matter much.
If the quantum is big, CFQ will have better performance, but it
effectively falls back to noop, with no fairness at all. Fairness is
important and is why we introduced CFQ.

In summary, CFQ isn't both fair and fast. FIOPS is trying to be fair
and have good performance. I don't think any time based accounting can
make that goal happen for NCQ and SSD (even the CFQ cgroup code has an
iops mode, so I suppose you already know this well).

Surely you can change CFQ to make it IOPS based, but this would mess up
the code a lot, and FIOPS shares a lot of code with CFQ. So I'd like to
have a separate ioscheduler which is IOPS based.

Thanks,
Shaohua

Vivek Goyal, Jan 16, 2012, 2:20 AM:
On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote:
> On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
> >
> > [..]
> > > > You need to present raw numbers and give us some idea of how close
> > > > those numbers are to raw hardware capability for us to have any idea
> > > > what improvements these numbers actually demonstrate.
> > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> > > the jbod capability, for both throughput and IOPS, that's why only
> > > read/write mixed workload impacts. I'll use less SSD in later tests,
> > > which will demonstrate the performance better. I'll report both raw
> > > numbers and fiops/cfq numbers later.
> >
> > If fiops number are better please explain why those numbers are better.
> > If you cut down on idling, it is obivious that you will get higher
> > throughput on these flash devices. CFQ does disable queue idling for
> > non rotational NCQ devices. If higher throughput is due to driving
> > deeper queue depths, then CFQ can do that too just by changing quantum
> > and disabling idling.
> it's because of quantum. Surely you can change the quantum, and CFQ
> performance will increase, but you will find CFQ is very unfair then.

Why does increasing the quantum lead to CFQ being unfair? In terms of
time it still tries to be fair. It's a different matter that with NCQ,
correct time measurement is not possible with requests from multiple
queues being in the driver/disk at the same time. So accounting in
terms of iops per queue might make sense.

>
> > So I really don't understand that what are you doing fundamentally
> > different in FIOPS ioscheduler.
> >
> > The only thing I can think of more accurate accounting per queue in
> > terms of number of IOs instead of time. Which can just serve to improve
> > fairness a bit for certain workloads. In practice, I think it might
> > not matter much.
> If quantum is big, CFQ will have better performance, but it actually
> fallbacks to Noop, no any fairness. fairness is important and is why we
> introduce CFQ.

It is not exactly noop. It still preempts writes and prioritizes reads
and direct writes.

Also, what's the real life workload where you face issues using, say,
deadline with this flash based storage?

>
> In summary, CFQ isn't both fair and good performance. FIOPS is trying to
> be fair and have good performance. I didn't think any time based
> accounting can make the goal happen for NCQ and SSD (even cfq cgroup
> code has iops mode, so suppose you should already know this well).
>
> Surely you can change CFQ to make it IOPS based, but this will mess the
> code a lot, and FIOPS shares a lot of code with CFQ. So I'd like to have
> a separate ioscheduler which is IOPS based.

I think writing a separate IO scheduler just to do accounting in IOPS
while retaining the rest of the CFQ code is not a very good idea.
Modifying the CFQ code to be able to deal with both time based and IOPS
accounting might turn out to be simpler.

Thanks
Vivek

Shaohua Li, Jan 16, 2012, 3:00 AM:
On Mon, 2012-01-16 at 02:11 -0500, Vivek Goyal wrote:
> On Mon, Jan 16, 2012 at 12:36:30PM +0800, Shaohua Li wrote:
> > On Sun, 2012-01-15 at 17:45 -0500, Vivek Goyal wrote:
> > > On Mon, Jan 09, 2012 at 09:09:35AM +0800, Shaohua Li wrote:
> > >
> > > [..]
> > > > > You need to present raw numbers and give us some idea of how close
> > > > > those numbers are to raw hardware capability for us to have any idea
> > > > > what improvements these numbers actually demonstrate.
> > > > Yes, your guess is right. The hardware has limitation. 12 SSD exceeds
> > > > the jbod capability, for both throughput and IOPS, that's why only
> > > > read/write mixed workload impacts. I'll use less SSD in later tests,
> > > > which will demonstrate the performance better. I'll report both raw
> > > > numbers and fiops/cfq numbers later.
> > >
> > > If fiops number are better please explain why those numbers are better.
> > > If you cut down on idling, it is obivious that you will get higher
> > > throughput on these flash devices. CFQ does disable queue idling for
> > > non rotational NCQ devices. If higher throughput is due to driving
> > > deeper queue depths, then CFQ can do that too just by changing quantum
> > > and disabling idling.
> > it's because of quantum. Surely you can change the quantum, and CFQ
> > performance will increase, but you will find CFQ is very unfair then.
>
> Why increasing quantum leads to CFQ being unfair? In terms of time it
> still tries to be fair.
We can dispatch a lot of requests to an NCQ SSD in a very small time
interval, and the disk can finish them in a small interval too, much
smaller than 1 jiffy. Increasing the quantum lets a task dispatch
requests faster and makes the accounting worse, because with a small
quantum the task has to wait to dispatch. You can easily verify this
with a simple fio test.

> That's a different thing that with NCQ, right
> time measurement is not possible with requests from multiple queues
> being in the driver/disk at the same time. So accouting in terms of
> iops per queue might make sense.
yes.

> > > So I really don't understand that what are you doing fundamentally
> > > different in FIOPS ioscheduler.
> > >
> > > The only thing I can think of more accurate accounting per queue in
> > > terms of number of IOs instead of time. Which can just serve to improve
> > > fairness a bit for certain workloads. In practice, I think it might
> > > not matter much.
> > If quantum is big, CFQ will have better performance, but it actually
> > fallbacks to Noop, no any fairness. fairness is important and is why we
> > introduce CFQ.
>
> It is not exactly noop. It still preempts writes and prioritizes reads
> and direct writes.
Sure, I mostly mean fairness here.

> Also, what's the real life workload where you face issues with using
> say deadline with these flash based storage.
Deadline doesn't provide fairness. It's mainly cgroup workloads.
Workloads with different ioprio have issues too, but I don't know which
real workload uses ioprio.

> >
> > In summary, CFQ isn't both fair and good performance. FIOPS is trying to
> > be fair and have good performance. I didn't think any time based
> > accounting can make the goal happen for NCQ and SSD (even cfq cgroup
> > code has iops mode, so suppose you should already know this well).
> >
> > Surely you can change CFQ to make it IOPS based, but this will mess the
> > code a lot, and FIOPS shares a lot of code with CFQ. So I'd like to have
> > a separate ioscheduler which is IOPS based.
>
> I think writing a separate IO scheduler just to do accouting in IOPS while
> retaining rest of the CFQ code is not a very good idea. Modifying CFQ code
> to be able to deal with both time based as well as IOPS accounting might
> turn out to be simpler.
Changing CFQ works, but I really want to avoid having something like

	if (iops)
		xxx
	else
		xxx

I plan to add scales for read/write, request size, etc., because
read/write costs are different and requests of different sizes have
different costs on SSD. This can be added to CFQ too, with pain. That
said, I don't completely object to making CFQ support IOPS accounting,
but my feeling is that a separate ioscheduler is cleaner.

Thanks,
Shaohua

Vivek Goyal, Jan 16, 2012, 3:30 AM:
Personally I have not run into any workload which provides deep queue
depths constantly for a very long time. I had to run fio to create such
scenarios.

Not driving deep queue depths will lead to expiration of the queue
(otherwise idling will kill performance on these fast devices). And
without idling, most of the slice and accounting logic does not help. A
queue dispatches some requests and expires (irrespective of what time
slice you had allocated it based on ioprio).

That's why I am insisting that any move in this direction should be
driven by some real workload instead of just synthetic ones.

>
> > >
> > > In summary, CFQ isn't both fair and good performance. FIOPS is trying to
> > > be fair and have good performance. I didn't think any time based
> > > accounting can make the goal happen for NCQ and SSD (even cfq cgroup
> > > code has iops mode, so suppose you should already know this well).
> > >
> > > Surely you can change CFQ to make it IOPS based, but this will mess the
> > > code a lot, and FIOPS shares a lot of code with CFQ. So I'd like to have
> > > a separate ioscheduler which is IOPS based.
> >
> > I think writing a separate IO scheduler just to do accouting in IOPS while
> > retaining rest of the CFQ code is not a very good idea. Modifying CFQ code
> > to be able to deal with both time based as well as IOPS accounting might
> > turn out to be simpler.
> changing CFQ works, but I really want to avoid having something like
> if (iops)
> xxx
> else
> xxx

I think you can provide a wrapper function and abstract out the unit of
time as a "charge" or something like that.

> I plan adding scales for read/write, request size, etc, because
> read/write cost is different and request with different size has
> different cost in SSD. This can be added to CFQ too with pain. That said
> I didn't completely object to make CFQ support IOPS accounting, but my
> feeling is a separate ioscheduler is more clean.

You can provide one function to calculate the "charge" which can be either
time or some kind of IOPS unit adjusted with request size and use that.

Given that the new IOPS scheduler will share lots of code with CFQ (I
presume 80-90% of the concepts are the same) and the only major
difference is accounting, I tend to think that modifying CFQ is better.

Jens, what do you think?

Thanks
Vivek

Shaohua Li, Jan 16, 2012, 8:10 PM:
That's true. If the workload doesn't drive deep queue depths, no
accounting can help for NCQ disks, as far as I have tried. Idling is
the only method to make accounting correct, but it impacts performance
too much.

> That's why I am insisting that it would be nice that any move in this
> direction should be driven by some real workload instead of just coming
> up with synthetic workloads.
I thought yanhai from taobao (cc-ed) has a real workload, and he found
CFQ performance suffers a lot.

Thanks,
Shaohua

Vivek Goyal, Jan 17, 2012, 4:10 AM:
Idling will kill performance, and the faster the device, the more
prominent the effects of idling. So to me, using CFQ on these fast
devices is not a very good idea, and deadline might just serve well.

>
> > That's why I am insisting that it would be nice that any move in this
> > direction should be driven by some real workload instead of just coming
> > up with synthetic workloads.
> I thought yanhai from taobao (cc-ed) has real workload and he found cfq
> performance suffers a lot.

Can we run that real workload with "deadline" and see what kind of
concerns we have? Is anybody getting starved for a long time? If not,
then we don't have to do anything.

I think trying to make CFQ work (or trying to come up with a CFQ-like
IOPS scheduler) on these fast devices might not lead us anywhere.

Thanks
Vivek

Shaohua Li, Jan 17, 2012, 8:30 PM:
If only performance matters, I'd rather use noop for SSD. There is a
requirement for cgroup support (and maybe ioprio) to give different
tasks different bandwidth.

Vivek Goyal, Jan 18, 2012, 8:10 AM:
On Wed, Jan 18, 2012 at 09:20:37AM +0800, Shaohua Li wrote:

[..]
> > I think trying to make to make CFQ work (Or trying to come up with CFQ
> > like IOPS scheduler) on these fast devices might not lead us anywhere.
> If only performance matters, I'd rather use noop for ssd. There is
> requirement to have cgroup support (maybe ioprio) to give different
> tasks different bandwidth.

Sure, but the issue is that we need to idle in an effort to prioritize
a task, and idling kills performance. So you can implement something,
but I have doubts that on fast hardware it is going to be very useful.

Another issue is that flash based storage can drive really deep queue
depths. If that's the case, then the ioscheduler alone can't solve the
prioritization issues (unless the ioscheduler stops driving deep queue
depths and kills performance). We need some kind of cooperation from
the device (like the device understanding the notion of io priority),
so that the device can prioritize the requests and one need not idle.
That way, we might be able to get service differentiation while getting
reasonable throughput.

Thanks
Vivek

Shaohua Li, Jan 18, 2012, 8:30 PM:
On Wed, 2012-01-18 at 08:04 -0500, Vivek Goyal wrote:
> On Wed, Jan 18, 2012 at 09:20:37AM +0800, Shaohua Li wrote:
>
> [..]
> > > I think trying to make to make CFQ work (Or trying to come up with CFQ
> > > like IOPS scheduler) on these fast devices might not lead us anywhere.
> > If only performance matters, I'd rather use noop for ssd. There is
> > requirement to have cgroup support (maybe ioprio) to give different
> > tasks different bandwidth.
>
> Sure but the issue is that we need to idle in an effort to prioritize
> a task and idling kills performance. So you can implement something but
> I have doubts that on a fast hardware it is going to be very useful.
I don't idle in FIOPS. If the workload iodepth is low, this causes a
fairness problem, and I just leave it be. There is no way to make a low
iodepth workload fair without sacrificing performance. CFQ on SSD has
the same problem.

> Another issue is that with flash based storage, it can drive really deep
> queue depths. If that's the case, then just ioscheduler can't solve the
> prioritazaion issues (until and unless ioscheduler does not drive deep
> queue depths and kills performance). We need some kind of cooperation
> from device (like device understanding the notion of iopriority), so
> that device can prioritize the requests and one need not to idle. That
> way, we might be able to get service differentiation while getting
> reasonable throughput.
SSD is still like a normal disk in terms of queue depth. I don't know
the iodepth of a PCIe flash card. If the queue depth of such a card is
very big (I suppose there is a limit, because after a critical point
increasing queue depth doesn't increase device performance), we
definitely need to reconsider this.

It would be great if the device let the ioscheduler know more info. In
my mind, I hope the device can dynamically adjust its queue depth. For
example, on my SSD, if the request size is 4k, I get max throughput
with queue depth 32. If the request size is 128k, a queue depth of just
2 is enough to get peak throughput. If the device could stop fetching
requests after 2 128k requests are pending, that would solve some
fairness issues for low iodepth workloads. Unfortunately, device
capability for different request sizes highly depends on the vendor.
The fiops request size scale tries to solve this, but I haven't found a
good scale model for it yet.

I suppose the device cannot do good prioritization if the workload
iodepth is low. If there are just a few requests pending, nobody
(device or ioscheduler) can make a correct judgment in such a case,
because there isn't enough info.

Thanks,
Shaohua