interactiveness during large writes


Ritesh Raj Sarraf

Aug 13, 2016, 9:57:37 AM
to bfq-i...@googlegroups.com
Hi,

As per the description, BFQ is touted as a low-latency scheduler for
interactivity. 

The website described it as:

Low latency for interactive applications
According to our results, whatever the background load is, for interactive tasks
the storage device is virtually as responsive as if it was idle. For example,
even if one or more of the following background workloads are being served in
parallel:
* one or more large files are being read or written,
* a tree of source files is being compiled,
* one or more virtual machines are performing I/O,
* a software update is in progress,
* indexing daemons are scanning the filesystems and updating their databases,


The above description covers reads as well as *writes*. But the demo videos I
saw online, IIRC, demonstrated only the read scenarios.

Hence, I'm writing here with some details about my write use case. It is
admittedly an extreme case, but I don't think it should qualify as a corner
case.


Machine:
Intel Haswell - Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz
RAM - 8 GiB
Swap - 8 GiB

rrs@chutzpah:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7387        1554         155         169        5678        5379
Swap:          8579         268        8311


Disk
Rotational Disk, with rootfs fully encrypted with dm-crypt (CPU does have crypt
extensions)

rrs@chutzpah:~$ fdisk -l /dev/sda
Disk /dev/sda: 465.8 GiB, 500107862016 bytes, 976773168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xa5703559

Device     Boot     Start       End   Sectors   Size Id Type
/dev/sda1            2048   6146047   6144000     3G 83 Linux
/dev/sda2         6146048 419432447 413286400 197.1G 8e Linux LVM
/dev/sda3  *    419432448 421386239   1953792   954M 83 Linux
/dev/sda4       421388286 976771071 555382786 264.8G  5 Extended
/dev/sda5       421388288 976771071 555382784 264.8G 83 Linux



With just a single dd I/O thread, I have always been able to bring the
machine to a hung state. The only clean exit is to leave it alone and let the
I/O thread complete. Upon completion, the kernel recovers.

Points to keep in mind in this case:

* Single I/O thread
* Ensure that you write more data than your full RAM capacity

Command run: dd if=/dev/zero of=/tmp/foo.img bs=1M count=10000 conv=fsync;

With the above command running, as soon as the memory gets full, the OS goes
into a hung state. It does recover but I'm sure I can make it thrash further if
I increase the number of processes contending for I/O.

Note: The catch is to pump more total I/O than the amount of physical RAM you
have. So, in the above example, I chose 10000 MiB because I have around 8 GiB
of physical RAM.
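A tiny sketch of that sizing rule (7387 MB is the "total" figure from the
`free -m` output above; the check itself is just illustrative):

```shell
# Write more than physical RAM: 7387 MB total (from `free -m` above),
# so 10000 MiB is comfortably beyond it.
ram_mb=7387
count_mb=10000
if [ "$count_mb" -gt "$ram_mb" ]; then
    echo "dd if=/dev/zero of=/tmp/foo.img bs=1M count=$count_mb conv=fsync"
fi
```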

Attached is the dmesg log, where the kernel went haywire, once the I/O and
Memory pressure increased.

I was hoping that BFQ was capable of improving such scenarios. But I was able
to make the kernel stall; maybe not as badly as with CFQ, but it still
stalled. IIRC, CFQ used to thrash completely for hours under this test.


So, is this test valid? If so, can BFQ be improved to tackle this scenario?
Maybe other BFQ users could try the above scenario and share their results.
Keep in mind that the issue is most quickly reproducible on a rotational
backing device, but I've also been able to reproduce it on SSDs.


--
Ritesh Raj Sarraf
RESEARCHUT - http://www.researchut.com
"Necessity is the mother of invention."
Attachments: bfq-hang-dmesg.txt, signature.asc

Eric Wheeler

Aug 13, 2016, 6:47:52 PM
to bfq-i...@googlegroups.com
On Sat, 13 Aug 2016, Ritesh Raj Sarraf wrote:

> Hi,
>
> As per the description, BFQ is touted as a low-latency scheduler for
> interactivity. 
[...]
> The above description covers reads as well as *writes*. But on some of the demo
> videos I saw online, IIRC it only demonstrated about the read scenarios.
>
> Hence, I'm writing here with some details about my write use case. It sure is an
> exceptional case, but I think it shouldn't qualify as a corner case.
[...]
> With just a single I/O thread of dd, I have been able to always bring the
> machine to a hung state. The only clean exit is to let it be as is, and the I/O
> thread to complete. Upon completion, the kernel is capable of recovering.
>
> Point to keep in mind, in this case, is:
>
> * Single I/O thread
> * Ensure that you write data more than your full RAM capacity
>
> Command run: dd if=/dev/zero of=/tmp/foo.img bs=1M count=10000 conv=fsync;
>
> With the above command running, as soon as the memory gets full, the OS goes
> into a hung state. It does recover but I'm sure I can make it thrash further if
> I increase the number of processes contending for I/O.
>
> Note: The catch is to pump total I/O, which is more than the amount of physical
> RAM you have. So, in above example, I chose 10000 MiB because I have around 8
> GiB of physical RAM.
>
> Attached is the dmesg log, where the kernel went haywire, once the I/O and
> Memory pressure increased.
>
> I was hoping that BFQ was capable of improving such scenarios. But I was able to
> make the kernel stall, maybe not as bad as CFQ, but still it stalled. IIRC, cfq
> used to completely thrash for hours, under this test.

I think there are two issues going on here. One is a VMM issue because
you have exhausted your machine to the point that your wifi driver [wl]
can't allocate even a single memory page for its use (order 0). To solve
that problem, try this:

echo 262144 > /proc/sys/vm/min_free_kbytes

The second issue you may not have measured, as it pertains to the IO latency
of dd (which you might not care about, and dd doesn't measure). The
networking group would call this issue "buffer bloat".

In your scenario, you have 8GB of buffers. You don't need that much
buffer to exhaust your IO bandwidth and keep the pipe full, but the kernel
VFS/VMM pagecache is willing to accommodate. To solve that, create a
memory cgroup, say 512mb, and run `dd` under that cgroup (or run dd with
whatever flag sets O_DIRECT to completely bypass the pagecache).
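A minimal sketch of that setup, assuming a cgroup-v1 memory controller
mounted at /sys/fs/cgroup/memory (the group name and the 512mb limit are
illustrative, and the commands need root):

```shell
# Cap dd's pagecache footprint with a 512 MiB memory cgroup (cgroup v1).
mkdir /sys/fs/cgroup/memory/ddtest
echo $((512 * 1024 * 1024)) > /sys/fs/cgroup/memory/ddtest/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/ddtest/tasks   # move this shell into the group
dd if=/dev/zero of=/tmp/foo.img bs=1M count=10000 conv=fsync

# Alternatively, bypass the pagecache entirely with O_DIRECT:
dd if=/dev/zero of=/tmp/foo.img bs=1M count=10000 oflag=direct
```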

The cgroup would give ~512mb of buffer to dd which should be plenty for a
linear write without excessive pagecache eviction; your remaining 7.5GB
could then be useful for other application page caches (filesystem and
shared library mmaps). Also, 512mb reduces the per-IO completion latency
of dd's IO path from (8192mb / disk-bandwidth) to (512mb / disk-bandwidth).
If disk-bandwidth is 128mb/sec, then you have a queue latency of 64s in the
former case, and 4s in the latter. Imagine a 64s ping!
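Spelling out that arithmetic (128mb/sec is a hypothetical figure, as above):

```shell
# Worst-case queue latency = dirty buffer size / disk bandwidth.
bw=128                   # MB/s, hypothetical rotational-disk bandwidth
echo "$((8192 / bw))s"   # ~8GB of dirty pagecache -> prints 64s
echo "$((512 / bw))s"    # 512mb cgroup cap        -> prints 4s
```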

While targeted at the network stack, this article provides a wonderful
discussion of queue-size latency considerations:
http://wiki.linuxwall.info/doku.php/en:ressources:dossiers:networking:traffic_control

In Linux net/ parlance, BFQ would be called a queuing discipline
(qdisc); the most similar queuing discipline to BFQ is probably HTB -
Hierarchical Token Bucket. A disk or net queue is still a queue, so while
disk and network queues have different fairness and implementation
considerations, the fundamental concepts are similar and both can be tuned
for latency vs bandwidth.

--
Eric Wheeler

>
>
> So, is this test valid ? If so, can BFQ be improved to tackle this scenario?
> Maybe other BFQ users could try the above scenario and share their results. Keep
> in mind that the issue is quickly reproducible on rotational backing device, but
> I've also been able to reproduce it on SSDs.
>
>
> --
> Ritesh Raj Sarraf
> RESEARCHUT - http://www.researchut.com
> "Necessity is the mother of invention."
>
> --
> You received this message because you are subscribed to the Google Groups "bfq-iosched" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to bfq-iosched...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

JW

Aug 13, 2016, 10:45:21 PM
to bfq-i...@googlegroups.com
On Sat, Aug 13, 2016 at 6:57 AM, Ritesh Raj Sarraf <r...@researchut.com> wrote:

> Command run: dd if=/dev/zero of=/tmp/foo.img bs=1M count=10000 conv=fsync;
>
> With the above command running, as soon as the memory gets full, the OS goes
> into a hung state. It does recover but I'm sure I can make it thrash further if
> I increase the number of processes contending for I/O.

Have you tried it with virtual memory disabled (swapoff)? RAM is cheap
these days, so I always run with swap off. I'm not sure whether it makes a
difference on newer kernels, given how dirty writeback is handled when
buffers are full.

This dirty writeback problem has existed for many years, and it is
still not solved. Jens Axboe has been working on a patchset that is
supposed to help, but I do not know if it is ready yet, or if it will
help in your particular case.

Here are several LWN articles, over several years, that discuss the
problem and some fixes that have been considered:

https://lwn.net/Articles/685894/

https://lwn.net/Articles/490114/

https://lwn.net/Articles/384093/

https://lwn.net/Articles/326552/

I am by no means a kernel expert, and I probably understood less than
half of what is in those articles. But I long ago discovered by trial
and error that I could mitigate the problem considerably by simply
reducing the size of the write cache.

What are your values for

$ ls /proc/sys/vm/dirty_*
/proc/sys/vm/dirty_background_bytes /proc/sys/vm/dirty_expire_centisecs
/proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
/proc/sys/vm/dirty_bytes /proc/sys/vm/dirty_writeback_centisecs

I found by trial and error that setting dirty_background_bytes and
dirty_bytes to the vicinity of 32MiB and 256MiB helps a great deal.

$ cat /proc/sys/vm/dirty_background_bytes
33554432

$ cat /proc/sys/vm/dirty_bytes
268435456

If you have a great deal of random write IO, you may want to try
somewhat higher values, but really, the benefit of a write cache
larger than about 256MiB seems small to me. I'd rather let the system
get to writing the dirty pages out as soon as possible.
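For reference, those values are just 32 MiB and 256 MiB expressed in bytes; a
quick sketch to double-check them (the commented sysctl lines are the usual
way to apply them, shown only as a hint since they need root):

```shell
# 32 MiB and 256 MiB in bytes, matching the settings above:
echo $((32 * 1024 * 1024))     # prints 33554432 (dirty_background_bytes)
echo $((256 * 1024 * 1024))    # prints 268435456 (dirty_bytes)
# To apply as root:
#   sysctl vm.dirty_background_bytes=33554432
#   sysctl vm.dirty_bytes=268435456
```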

I'm not sure why the BFQ IO scheduler is not able to give enough time
to interactive tasks when there is a very large writeback happening.
If I understand some of the comments in the articles I referenced, it
may have something to do with the kernel flusher tasks creating so
many jobs that it just overloads the IO scheduler. Or maybe when the
buffers are full the kernel flushers bypass the IO scheduler? As I
said, I am not a kernel expert.

All I know is that reducing the write cache with dirty_bytes and
dirty_background_bytes largely mitigated the issue for me.

Paolo Valente

Aug 14, 2016, 6:01:51 AM
to bfq-i...@googlegroups.com

On 14 Aug 2016, at 04:45, JW <jwilli...@gmail.com> wrote:

> On Sat, Aug 13, 2016 at 6:57 AM, Ritesh Raj Sarraf <r...@researchut.com> wrote:
>
>> Command run: dd if=/dev/zero of=/tmp/foo.img bs=1M count=10000 conv=fsync;
>>
>> With the above command running, as soon as the memory gets full, the OS goes
>> into a hung state. It does recover but I'm sure I can make it thrash further if
>> I increase the number of processes contending for I/O.
>
> Have you tried it with virtual memory disabled (swapoff)? RAM is cheap
> these days, so I always run with VM off. I'm not sure if it makes a
> difference with new kernels with how the dirty writeback is handled
> when buffers are full.
>
> This dirty writeback problem has existed for many years, and it is
> still not solved. Jens Axboe has been working on a patchset that is
> supposed to help, but I do not know if it is ready yet, or if it will
> help in your particular case.
>
> Here are several LWN articles, over several years, that discuss the
> problem and some fixes that have been considered:
>
> https://lwn.net/Articles/685894/
>
> https://lwn.net/Articles/490114/
>
> https://lwn.net/Articles/384093/
>
> https://lwn.net/Articles/326552/
>

Yes, that’s a long-standing, hard-to-solve problem.
The root of this nasty problem is that the I/O requests of unlucky processes are simply never issued. We have already discussed this problem in this group a while ago. Very briefly: when the system detects dirty-page pressure, it takes several countermeasures to reduce the writeback rate, until the state hopefully gets back to normal. These countermeasures consist of throttling read and write syscalls, with a logic that has no concern for application latency or bandwidth guarantees. Unlucky processes just get blocked, even for seconds, on each read or write attempt. If they have a lot of reads/writes to do, it is pure starvation for them. An I/O scheduler cannot help in any way, as it can do its job only on the read/write requests that it actually receives.

We have devised a solution to this problem, and tested it on smartphones and PCs. It works, but it is just a prototype. The main problem is, as always, that I don't have enough time and resources to keep up with all the tasks. Volunteers welcome as usual! :) A little kernel expertise is needed, but the solution and code are already there ...

Thanks a lot for reporting these issues and sharing your knowledge,
Paolo

> All I know is that reducing the write cache with dirty_bytes and
> dirty_background_bytes largely mitigated the issue for me.
>

Ritesh Raj Sarraf

Aug 14, 2016, 6:19:43 AM
to bfq-iosched
Sending through Google Groups.


On Sunday, August 14, 2016 at 4:17:52 AM UTC+5:30, Eric Wheeler wrote:

I think there are two issues going on here.  One is a VMM issue because
you have exhausted your machine to the point that your wifi driver [wl]
can't allocate even a single memory page for its use (order 0).  To solve
that problem, try this:

echo 262144 > /proc/sys/vm/min_free_kbytes


Yes. Understandable. 
 
The second issue you may not have measured, as it pertains to the IO latency
of dd (which you might not care about, and dd doesn't measure). The
networking group would call this issue "buffer bloat".

In your scenario, you have 8GB of buffers.  You don't need that much
buffer to exhaust your IO bandwidth and keep the pipe full, but the kernel
VFS/VMM pagecache is willing to accommodate.  To solve that, create a
memory cgroup, say 512mb, and run `dd` under that cgroup (or run dd with
whatever flag sets O_DIRECT to completely bypass the pagecache).

The cgroup would give ~512mb of buffer to dd which should be plenty for a
linear write without excessive pagecache eviction; your remaining 7.5GB
could then be useful for other application page caches (filesystem and
shared library mmaps). Also, 512mb reduces the per-IO completion latency
of dd's IO path from (8192mb / disk-bandwidth) to (512mb / disk-bandwidth).
If disk-bandwidth is 128mb/sec, then you have a queue latency of 64s in the
former case, and 4s in the latter. Imagine a 64s ping!


Thanks for explaining it. This is exactly what I've been doing, since cgroups are more generically usable.
Even general-purpose applications with buggy behavior can trigger this bug.
For example, with Digikam 4.x, which had a memory leak when writing metadata to files, you could easily bring the machine to a halt.
So I did a similar cgroup setup for digikam to make it manageable.

The only question, if I may ask:

Should it not be the kernel taking care of this?
If a userspace application is capable of bringing the OS to its knees, isn't that a kernel bug?

Many have known of this bug for a long time. The reason I brought it here on this list was the hope that bfq might have a different view of it.

Ritesh Raj Sarraf

Aug 14, 2016, 6:28:17 AM
to bfq-iosched
Replying through Google Groups. Please excuse email format.


On Sunday, August 14, 2016 at 8:15:21 AM UTC+5:30, J Will wrote:
On Sat, Aug 13, 2016 at 6:57 AM, Ritesh Raj Sarraf <r...@researchut.com> wrote:

> Command run: dd if=/dev/zero of=/tmp/foo.img bs=1M count=10000 conv=fsync;
>
> With the above command running, as soon as the memory gets full, the OS goes
> into a hung state. It does recover but I'm sure I can make it thrash further if
> I increase the number of processes contending for I/O.

Have you tried it with virtual memory disabled (swapoff)? RAM is cheap
these days, so I always run with VM off. I'm not sure if it makes a
difference with new kernels with how the dirty writeback is handled
when buffers are full.


I tried it now, after you mentioned it. And thank you. The machine was responsive throughout the entire 10 GiB write.
So it strengthens my understanding that the issue is not really the slow backing device, but rather the free-page reclaim logic.
 
This dirty writeback problem has existed for many years, and it is
still not solved. Jens Axboe has been working on a patchset that is
supposed to help, but I do not know if it is ready yet, or if it will
help in your particular case.


Initially, when blk-mq was released, there was some news claiming it'd solve the writeback issue. But I don't think that ever materialized.

I brought this topic here on the bfq list only to check whether bfq had a solution to this problem.
I tend to avoid lowering the dirty-cache limits because it has other effects. But yes, I agree that lowering the cache amount improves the situation.
It just makes the OS more power hungry.
 
I'm not sure why the BFQ IO scheduler is not able to give enough time
to interactive tasks when there is a very large writeback happening.
If I understand some of the comments in the articles I referenced, it
may have something to do with the kernel flusher tasks creating so
many jobs that it just overloads the IO scheduler. Or maybe when the
buffers are full the kernel flushers bypass the IO scheduler? As I
said, I am not a kernel expert.


The way I understand it, when memory is full, the kernel does a page scan, while at the same time the entire system's resources are exhausted by the continuous I/O pressure.

That is why the cgroups trick works well: you tell the I/O process that, in its view, it has only a subset of the total available memory.
 
All I know is that reducing the write cache with dirty_bytes and
dirty_background_bytes largely mitigated the issue for me.


Thanks for sharing your experience. I can see there are many (silent) users who still see and acknowledge the problem.

Ritesh Raj Sarraf

Aug 14, 2016, 6:37:01 AM
to bfq-iosched


On Sunday, August 14, 2016 at 3:31:51 PM UTC+5:30, paolo wrote:

The root of this nasty problem is that the I/O requests of unlucky processes are simply never issued. We have already discussed this problem in this group a while ago. Very briefly: when the system detects dirty-page pressure, it takes several countermeasures to reduce the writeback rate, until the state hopefully gets back to normal. These countermeasures consist of throttling read and write syscalls, with a logic that has no concern for application latency or bandwidth guarantees. Unlucky processes just get blocked, even for seconds, on each read or write attempt. If they have a lot of reads/writes to do, it is pure starvation for them. An I/O scheduler cannot help in any way, as it can do its job only on the read/write requests that it actually receives.

We have devised a solution to this problem, and tested it on smartphones and PCs. It works, but it is just a prototype. The main problem is, as always, that I don't have enough time and resources to keep up with all the tasks. Volunteers welcome as usual! :) A little kernel expertise is needed, but the solution and code are already there ...

Thanks a lot for reporting these issues and sharing your knowledge,
Paolo


Hello Paolo,

Is this solution (or workaround) that you mentioned part of the bfq patchsets?
If not, where can we look at it?

BTW, like I mentioned initially, bfq was able to recover eventually. Under cfq, I've had a very hard time (multiple hours) getting the machine back into a usable state. So, thank you.

Thanks,
Ritesh 