disk I/O, VFS hirunningspace

Jerry Toung

Jul 13, 2010, 6:59:08 PM

to freebsd...@freebsd.org

Hello List,
I am on 8.0 RELEASE amd64. My system has 2 RAID arrays connected to 2
separate controllers.
My I/O throughput tests jumped by ~100MB/sec on both channels when I
commented out the following piece of code from kern/vfs_bio.c:

void
waitrunningbufspace(void)
{
/*
	mtx_lock(&rbreqlock);
	while (runningbufspace > hirunningspace) {
		++runningbufreq;
		msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
	}
	mtx_unlock(&rbreqlock);
*/
}

So far, I can't observe any side effects of not running it. Am I on a time
bomb?

Thank you,
Jerry
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hacke...@freebsd.org"

Matthew Dillon

Jul 14, 2010, 2:20:52 AM

to Jerry Toung, freebsd...@freebsd.org

:void
:waitrunningbufspace(void)
:{
:/*
: mtx_lock(&rbreqlock);
: while (runningbufspace > hirunningspace) {
: ++runningbufreq;
: msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
: }
: mtx_unlock(&rbreqlock);
:*/
:}
:
:so far, I can't observe any side effects of not running it. Am I on a time
:bomb?
:
:Thank you,
:Jerry

You can bump up the related sysctl for hirunningspace if it helps
you; no kernel code modification is needed. I recommend setting it
to at least 8MB (8388608).

sysctl vfs.hirunningspace=8388608
sysctl vfs.lorunningspace=1048576

The waitrunningbufspace() code is designed to protect the system from
several degenerate situations and should be left in place.
One is where a large backlog of issued WRITE BIOs can accumulate on
block devices. Because the related buffers are locked during the I/O,
any attempt to access the data via the buffer cache will unnecessarily
stall the thread trying to access it. Without a limit several seconds
worth of BIOs can accumulate (sometimes tens of seconds worth if the
I/O is non-linear). Both accesses to file data and accesses to meta-data
can wind up stalling, reducing filesystem performance.

A second issue is that system buffer cache algorithms will become
severely inefficient if too much of the buffer cache is held in a
locked state.

That said, the defaults in bufinit() (lines 623 and 624) are a bit
too low for today's high-speed I/O subsystems. They appear to be set
to fixed assignments of 512K for lo and 1MB for hi. Even though the
defaults are too low they still ought to be enough to maintain maximum
I/O throughput since WRITE BIOs usually complete very quickly (they
just go into the target device's own write cache and complete). The
pipeline should be maintained if the hysteresis is working properly.
Perhaps there is something else broken that is causing the hysteresis
to not work properly.
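
The hysteresis described above can be sketched as a small single-threaded
model. This is only an illustration, not the kernel code: the struct and the
rs_issue()/rs_complete() names are hypothetical, and real sleeping and locking
are replaced by a flag. Writers become throttled once in-flight bytes exceed
the high-water mark and are released only when completions bring the total
down to the low-water mark.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified, single-threaded model of the running-buffer throttle. */
struct runspace {
	size_t running;   /* bytes of write I/O currently in flight */
	size_t lo;        /* low-water mark (cf. vfs.lorunningspace) */
	size_t hi;        /* high-water mark (cf. vfs.hirunningspace) */
	bool   throttled; /* true while writers would be asleep */
};

/*
 * A writer issues `bytes` of I/O and then checks the high-water mark.
 * Returns true if the writer would have to sleep.
 */
static bool
rs_issue(struct runspace *rs, size_t bytes)
{
	rs->running += bytes;
	if (rs->running > rs->hi)
		rs->throttled = true;
	return (rs->throttled);
}

/*
 * An I/O of `bytes` completes. Returns true if sleeping writers would
 * be woken, which happens only once the backlog has drained to `lo`.
 */
static bool
rs_complete(struct runspace *rs, size_t bytes)
{
	assert(bytes <= rs->running);
	rs->running -= bytes;
	if (rs->throttled && rs->running <= rs->lo) {
		rs->throttled = false;
		return (true);
	}
	return (false);
}
```

With lo = 1MB and hi = 8MB, a writer that pushes the total just past 8MB stays
throttled until completions drain the backlog to 1MB or less. The gap between
the two marks is what keeps the device pipeline fed while writers sleep.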

-Matt

Gary Jennejohn

Jul 14, 2010, 3:05:50 AM

to Jerry Toung, freebsd...@freebsd.org

On Tue, 13 Jul 2010 15:34:12 -0700
Jerry Toung <jryt...@gmail.com> wrote:

> Hello List,
> I am on 8.0 RELEASE amd64. My system has 2 RAID arrays connected to 2
> separate
> controllers.
> My I/O throughput tests jumped by ~100MB/sec on both channels, when I
> commented out the
> following piece of code from kern/vfs_bio.c
>
> void
> waitrunningbufspace(void)
> {
> /*
> mtx_lock(&rbreqlock);
> while (runningbufspace > hirunningspace) {
> ++runningbufreq;
> msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
> }
> mtx_unlock(&rbreqlock);
> */
> }
>
> so far, I can't observe any side effects of not running it. Am I on a time
> bomb?
>

Rather than commenting out the code, try setting the sysctl
vfs.hirunningspace to various powers of two. The default seems to be
1MB. I just changed it on the command line to 2MB as a test.

You can make the setting persistent in /etc/sysctl.conf.

--
Gary Jennejohn

Jerry Toung

Jul 14, 2010, 12:27:53 PM

to gljen...@googlemail.com, freebsd...@freebsd.org

On Wed, Jul 14, 2010 at 12:04 AM, Gary Jennejohn
<gljen...@googlemail.com>wrote:

>
>
> Rather than commenting out the code try setting the sysctl
> vfs.hirunningspace to various powers-of-two. Default seems to be
> 1MB. I just changed it on the command line as a test to 2MB.
>
> You can do this in /etc/sysctl.conf.
>
>

Thank you all, that did it. The settings that Matt recommended give the same
numbers I had with the code commented out. I was concerned that the lock or
msleep might be a problem.

Jerry

Ivan Voras

Jul 15, 2010, 9:01:59 AM

to freebsd...@freebsd.org

On 07/14/10 18:27, Jerry Toung wrote:
> On Wed, Jul 14, 2010 at 12:04 AM, Gary Jennejohn
> <gljen...@googlemail.com>wrote:
>
>>
>>
>> Rather than commenting out the code try setting the sysctl
>> vfs.hirunningspace to various powers-of-two. Default seems to be
>> 1MB. I just changed it on the command line as a test to 2MB.
>>
>> You can do this in /etc/sysctl.conf.
>>
>>
> thank you all, that did it. The settings that Matt recommended are giving
> the same numbers

Any objections to raising the defaults to 8 MB / 1 MB in HEAD?

Alan Cox

Jul 15, 2010, 2:53:21 PM

to Ivan Voras, freebsd...@freebsd.org

On Thu, Jul 15, 2010 at 8:01 AM, Ivan Voras <ivo...@freebsd.org> wrote:

> On 07/14/10 18:27, Jerry Toung wrote:
> > On Wed, Jul 14, 2010 at 12:04 AM, Gary Jennejohn
> > <gljen...@googlemail.com>wrote:
> >
> >>
> >>
> >> Rather than commenting out the code try setting the sysctl
> >> vfs.hirunningspace to various powers-of-two. Default seems to be
> >> 1MB. I just changed it on the command line as a test to 2MB.
> >>
> >> You can do this in /etc/sysctl.conf.
> >>
> >>
> > thank you all, that did it. The settings that Matt recommended are giving
> > the same numbers
>
> Any objections to raising the defaults to 8 MB / 1 MB in HEAD?
>
>
>

Keep in mind that we still run on some fairly small systems with limited I/O
capabilities, e.g., a typical arm platform. More generally, with the range
of systems that FreeBSD runs on today, any particular choice of constants is
going to perform poorly for someone. If nothing else, making these sysctls
a function of the buffer cache size is probably better than any particular
constants.
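
To illustrate what "a function of the buffer cache size" could look like, here
is a minimal sketch. The 1/64 ratio, the lo = hi/2 rule, and the
scale_runningspace() name are assumptions made up for this example, not
anything proposed in this thread; the clamps reuse the 1MB default and the 8MB
value mentioned earlier.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_HI ((size_t)1024 * 1024)     /* floor: the existing 1MB default */
#define MAX_HI ((size_t)8 * 1024 * 1024) /* ceiling: the 8MB suggested above */

struct watermarks {
	size_t lo;
	size_t hi;
};

/*
 * Derive the running-space watermarks from the buffer cache size.
 * The 1/64 ratio is a hypothetical choice for illustration only.
 */
static struct watermarks
scale_runningspace(size_t bufspace)
{
	struct watermarks w;

	assert(bufspace > 0);
	w.hi = bufspace / 64;
	if (w.hi < MIN_HI)
		w.hi = MIN_HI;
	if (w.hi > MAX_HI)
		w.hi = MAX_HI;
	/* Keep the 1:2 lo/hi ratio of the current 512K/1MB defaults. */
	w.lo = w.hi / 2;
	return (w);
}
```

Under these assumptions a small system with a 32MB buffer cache keeps today's
512K/1MB values, while a machine with a 1GB buffer cache is capped at 4MB/8MB.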

Alan

Attilio Rao

Jul 15, 2010, 8:26:40 PM

to a...@freebsd.org, freebsd...@freebsd.org, Ivan Voras

2010/7/15 Alan Cox <alan....@gmail.com>:

> On Thu, Jul 15, 2010 at 8:01 AM, Ivan Voras <ivo...@freebsd.org> wrote:
>
>> On 07/14/10 18:27, Jerry Toung wrote:
>> > On Wed, Jul 14, 2010 at 12:04 AM, Gary Jennejohn
>> > <gljen...@googlemail.com>wrote:
>> >
>> >>
>> >>
>> >> Rather than commenting out the code try setting the sysctl
>> >> vfs.hirunningspace to various powers-of-two. Default seems to be
>> >> 1MB. I just changed it on the command line as a test to 2MB.
>> >>
>> >> You can do this in /etc/sysctl.conf.
>> >>
>> >>
>> > thank you all, that did it. The settings that Matt recommended are giving
>> > the same numbers
>>
>> Any objections to raising the defaults to 8 MB / 1 MB in HEAD?
>>
>>
>>
> Keep in mind that we still run on some fairly small systems with limited I/O
> capabilities, e.g., a typical arm platform. More generally, with the range
> of systems that FreeBSD runs on today, any particular choice of constants is
> going to perform poorly for someone. If nothing else, making these sysctls
> a function of the buffer cache size is probably better than any particular
> constants.

What about making it an MD (machine-dependent) sysctl?

Attilio

--
Peace can only be achieved by understanding - A. Einstein

Peter Jeremy

Jul 16, 2010, 5:31:46 AM

to a...@freebsd.org, freebsd...@freebsd.org

Regarding vfs.lorunningspace and vfs.hirunningspace...

On 2010-Jul-15 13:52:43 -0500, Alan Cox <alan....@gmail.com> wrote:
>Keep in mind that we still run on some fairly small systems with limited I/O
>capabilities, e.g., a typical arm platform. More generally, with the range
>of systems that FreeBSD runs on today, any particular choice of constants is
>going to perform poorly for someone. If nothing else, making these sysctls
>a function of the buffer cache size is probably better than any particular
>constants.

That sounds reasonable but brings up a related issue - the buffer
cache. Given the unified VM system no longer needs a traditional Unix
buffer cache, what is the buffer cache still used for? Is the current
tuning formula still reasonable (for virtually all current systems
it's basically 10MB + 10% RAM)? How can I measure the effectiveness
of the buffer cache?

The buffer cache size is also very tightly constrained (vfs.hibufspace
and vfs.lobufspace differ by 64KB) and at least one of the underlying
tuning parameters has comments at variance with current reality:
In <sys/param.h>:

* MAXBSIZE - Filesystems are made out of blocks of at most MAXBSIZE bytes
* per block. MAXBSIZE may be made larger without effecting
..
*
* BKVASIZE - Nominal buffer space per buffer, in bytes. BKVASIZE is the
..
* The default is 16384, roughly 2x the block size used by a
* normal UFS filesystem.
*/
#define MAXBSIZE 65536 /* must be power of 2 */
#define BKVASIZE 16384 /* must be power of 2 */

There's no mention of the 64KiB limit in newfs(8) and I recall seeing
occasional comments from people who have either tried or suggested
trying larger blocksizes. Likewise, the default UFS blocksize has
been 16KiB for quite a while. Are the comments still valid and, if so,
should BKVASIZE be doubled to 32768 and a suitable note added to newfs(8)
regarding the maximum block size?

--
Peter Jeremy

Alan Cox

Jul 16, 2010, 7:36:11 PM

to Peter Jeremy, a...@freebsd.org, freebsd...@freebsd.org

Peter Jeremy wrote:
> Regarding vfs.lorunningspace and vfs.hirunningspace...
>
> On 2010-Jul-15 13:52:43 -0500, Alan Cox<alan....@gmail.com> wrote:
>
>> Keep in mind that we still run on some fairly small systems with limited I/O
>> capabilities, e.g., a typical arm platform. More generally, with the range
>> of systems that FreeBSD runs on today, any particular choice of constants is
>> going to perform poorly for someone. If nothing else, making these sysctls
>> a function of the buffer cache size is probably better than any particular
>> constants.
>>
>
> That sounds reasonable but brings up a related issue - the buffer
> cache. Given the unified VM system no longer needs a traditional Unix
> buffer cache, what is the buffer cache still used for?

Today, it is essentially a mapping cache. So, what does that mean?

After you've set aside a modest amount of physical memory for the kernel
to hold its own internal data structures, all of the remaining physical
memory can potentially be used to cache file data. However, on many
architectures this is far more memory than the kernel can
instantaneously access. Consider i386. You might have 4+ GB of
physical memory, but the kernel address space is (by default) only 1
GB. So, at any instant in time, only a fraction of the physical memory
is instantaneously accessible to the kernel. In general, to access an
arbitrary physical page, the kernel is going to have to replace an
existing virtual-to-physical mapping in its address space with one for
the desired page. (Generally speaking, on most architectures, even the
kernel can't directly access physical memory that isn't mapped by a
virtual address.)

The buffer cache is essentially a region of the kernel address space
that is dedicated to mappings to physical pages containing cached file
data. As applications access files, the kernel dynamically maps (and
unmaps) physical pages containing cached file data into this region.
Once the desired pages are mapped, then read(2) and write(2) can
essentially "bcopy" from the buffer cache mapping to the application's
buffer. (Understand that this buffer cache mapping is a prerequisite
for the copy out to occur.)

So, why did I call it a mapping cache? There is generally locality in
the access to file data. So, rather than map and unmap the desired
physical pages on every read and write, the mappings to file data are
allowed to persist and are managed much like many other kinds of
caches. When the kernel needs to map a new set of file pages, it finds
an older, not-so-recently used mapping and destroys it, allowing those
kernel virtual addresses to be remapped to the new pages.
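
A toy model of such a mapping cache, for illustration only. The slot count,
the struct layout, and the mc_init()/mc_map() names are hypothetical, and a
simple timestamp stands in for whatever replacement policy the kernel really
uses. A lookup either finds an existing mapping (a hit) or destroys the least
recently used one and reuses its slot (a miss).

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 4 /* hypothetical number of mapping slots */

struct map_cache {
	long          page[NSLOTS];     /* page mapped in each slot, -1 if free */
	unsigned long last_use[NSLOTS]; /* logical time of the last access */
	unsigned long clock;            /* advances on every lookup */
	unsigned long hits;
	unsigned long misses;
};

static void
mc_init(struct map_cache *mc)
{
	for (size_t i = 0; i < NSLOTS; i++) {
		mc->page[i] = -1;
		mc->last_use[i] = 0;
	}
	mc->clock = mc->hits = mc->misses = 0;
}

/* Return the slot that maps `page`, creating the mapping if needed. */
static size_t
mc_map(struct map_cache *mc, long page)
{
	size_t victim = 0;

	assert(page >= 0);
	mc->clock++;
	for (size_t i = 0; i < NSLOTS; i++) {
		if (mc->page[i] == page) {
			/* Hit: the existing mapping is simply reused. */
			mc->hits++;
			mc->last_use[i] = mc->clock;
			return (i);
		}
		if (mc->last_use[i] < mc->last_use[victim])
			victim = i;
	}
	/* Miss: destroy the least recently used mapping and reuse its slot. */
	mc->misses++;
	mc->page[victim] = page;
	mc->last_use[victim] = mc->clock;
	return (victim);
}
```

In the real kernel a miss is the expensive path, since replacing a mapping
means changing kernel page tables; the model only counts hits and misses.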

So far, I've used i386 as a motivating example. What of other
architectures? Most 64-bit machines take advantage of their large
address space by implementing some form of "direct map" that provides
instantaneous access to all of physical memory. (Again, I use
"instantaneous" to mean that the kernel doesn't have to dynamically
create a virtual-to-physical mapping before being able to access the
data.) On these machines, you could, in principle, use the direct map
to implement the "bcopy" to the application's buffer. So, what is the
point of the buffer cache on these machines?

A trivial benefit is that the file pages are mapped contiguously in the
buffer cache. Even though the underlying physical pages may be
scattered throughout the physical address space, they are mapped
contiguously. So, the "bcopy" doesn't need to worry about every page
boundary, only buffer boundaries.
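
The difference can be sketched in userland C. The page size and both function
names are assumptions for the example. Copying from scattered pages has to
split the transfer at every page boundary, while a contiguous mapping needs
only one copy per buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096 /* assumed page size */

/*
 * Copy `len` bytes starting at logical offset `off` from a set of
 * scattered pages. The copy must be split at every page boundary.
 */
static void
copy_scattered(char *dst, char *const pages[], size_t off, size_t len)
{
	while (len > 0) {
		size_t pg = off / PAGE_SZ;
		size_t pgoff = off % PAGE_SZ;
		size_t chunk = PAGE_SZ - pgoff;

		if (chunk > len)
			chunk = len;
		memcpy(dst, pages[pg] + pgoff, chunk);
		dst += chunk;
		off += chunk;
		len -= chunk;
	}
}

/* The same copy from a contiguously mapped buffer is a single call. */
static void
copy_contiguous(char *dst, const char *buf, size_t off, size_t len)
{
	memcpy(dst, buf + off, len);
}
```

Both functions produce the same bytes; the contiguous version just has no
per-page bookkeeping, which is the "trivial benefit" referred to above.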

The buffer cache also plays a role in the page replacement mechanism.
Once mapped into the buffer cache, a page is "wired", that is, it is
removed from the paging lists, where the page daemon could reclaim it.
However, a page in the buffer cache should really be thought of as being
"active". In fact, when a page is unmapped from the buffer cache, it is
placed at the tail of the virtual memory system's "inactive" list, the
same place where the virtual memory system would place a physical page
that it is transitioning from "active" to "inactive". If an application
later performs a read(2) from or write(2) to the same page, that page
will be removed from the "inactive" list and mapped back into the buffer
cache. So, the mapping and unmapping process contributes to creating an
LRU-ordered "inactive" queue.

Finally, the buffer cache limits the amount of dirty file system data
that is cached in memory.

> ... Is the current
> tuning formula still reasonable (for virtually all current systems
> it's basically 10MB + 10% RAM)?

It's probably still good enough. However, this is not a statement for
which I have supporting data. So, I reserve the right to change my
opinion. :-)

Consider what the buffer cache now does. It's just a mapping cache.
Increasing the buffer cache size doesn't affect (much) the amount of
physical memory available for caching file data. So, unlike ancient
times, increasing the size of the buffer cache isn't going to have
nearly the same effect on the amount of actual I/O that your machine
does. For some workloads, increasing the buffer cache size may have
greater impact on CPU overhead than I/O overhead. For example, all of
your file data might fit into physical memory, but you're doing random
read accesses to it. That would cause the buffer cache to thrash, even
though you wouldn't do any actual I/O. Unfortunately, mapping pages
into the buffer cache isn't trivial. For example, it requires every
processor to be interrupted to invalidate some entries from its TLB.
(This is a so-called "TLB shootdown".)

> ... How can I measure the effectiveness
> of the buffer cache?
>
>

I'm not sure that I can give you a short answer to this question.

> The buffer cache size is also very tightly constrained (vfs.hibufspace
> and vfs.lobufspace differ by 64KB) and at least one of the underlying
> tuning parameters have comments at variance with current reality:
> In <sys/param.h>:
>
> * MAXBSIZE - Filesystems are made out of blocks of at most MAXBSIZE bytes
> * per block. MAXBSIZE may be made larger without effecting
> ...
> *
> * BKVASIZE - Nominal buffer space per buffer, in bytes. BKVASIZE is the
> ...
> * The default is 16384, roughly 2x the block size used by a
> * normal UFS filesystem.
> */
> #define MAXBSIZE 65536 /* must be power of 2 */
> #define BKVASIZE 16384 /* must be power of 2 */
>
> There's no mention of the 64KiB limit in newfs(8) and I recall seeing
> occasional comments from people who have either tried or suggested
> trying larger blocksizes.

I believe that a block size larger than 64KB would fail an assertion.

> Likewise, the default UFS blocksize has
> been 16KiB for quite a while. Are the comments still valid and, if so,
> should BKVASIZE be doubled to 32768 and a suitable note added to newfs(8)
> regarding the maximum block size?
>
>

If I recall correctly, increasing BKVASIZE would only reduce the number
of buffer headers. In other words, it might avoid wasting some memory on
buffer headers that won't be used.
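
As rough arithmetic, and assuming the number of buffer headers is sized as the
buffer cache's kernel virtual address space divided by BKVASIZE (a
simplification; the helper name below is hypothetical), doubling BKVASIZE
halves the header count for the same amount of KVA:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of buffer headers needed to cover `kva_bytes` of buffer KVA
 * when each header nominally accounts for `bkvasize` bytes.
 */
static size_t
buf_headers_for(size_t kva_bytes, size_t bkvasize)
{
	/* BKVASIZE must be a non-zero power of 2. */
	assert(bkvasize != 0 && (bkvasize & (bkvasize - 1)) == 0);
	return (kva_bytes / bkvasize);
}
```

For 64MB of buffer KVA this gives 4096 headers at BKVASIZE 16384 and 2048
headers at 32768.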