
Caching control


phil-new...@ipal.net
Mar 2, 2009, 10:37:09 PM
A feature that would be useful is one where the level of caching can be
controlled for a descriptor. An ioctl() call should be fine. The value
given would specify the maximum number of page-size units of caching to
keep for the descriptor. This would be a _maximum_ and the kernel would
be allowed to cache less than this amount (which would happen if it does
the actual physical I/O faster than the process calls write). But if the
process does call write() more often, the kernel would prevent more from
being written by blocking that write() call.

It would be OK by me if, when that blocking would happen and the
descriptor is set to non-blocking mode, the write() returned with
the EAGAIN error status instead. But that would not actually be necessary
if the caching can be controlled as described above.
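
Purely to illustrate the idea (no such ioctl exists, and FIOWRCACHEMAX
below is an invented name), the call might look something like this:

    /* Hypothetical sketch of the proposed interface.  No such ioctl exists
     * in any kernel; FIOWRCACHEMAX is an invented request name used only
     * to illustrate a per-descriptor write-cache ceiling. */
    #include <sys/ioctl.h>

    #define FIOWRCACHEMAX 0   /* placeholder request number, not a real one */

    static int limit_write_cache(int fd, int max_pages)
    {
        /* ask the kernel to keep at most max_pages page-size units of
         * cached dirty data for this descriptor */
        return ioctl(fd, FIOWRCACHEMAX, &max_pages);
    }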

--
|WARNING: Due to extreme spam, googlegroups.com is blocked. Due to ignorance |
| by the abuse department, bellsouth.net is blocked. If you post to |
| Usenet from these places, find another Usenet provider ASAP. |
| Phil Howard KA9WGN (email for humans: first name in lower case at ipal.net) |

David Schwartz
Mar 4, 2009, 3:01:33 PM
On Mar 2, 7:37 pm, phil-news-nos...@ipal.net wrote:

> A feature that would be useful is one where the level of caching can be
> controlled for a descriptor.  An ioctl() call should be fine.  The value
> given would specify the maximum number of page-size units of caching to
> keep for the descriptor.  This would be a _maximum_ and the kernel would
> be allowed to cache less than this amount (which would happen if it does
> the actual physical I/O faster than the process calls write).  But if the
> process does call write() more often, the kernel would prevent more from
> being written by blocking that write() call.

There's kind of an unwritten rule that when you propose a new feature,
you have to propose at least one use case. The more reasonable the use
case, and the more awful the best solution that doesn't require a new
feature, the better your proposal.

Is there some problem that this solves best? Or is there a whole class
of problems that this solves better than everything else? If not, the
feature wouldn't be useful.

DS

phil-new...@ipal.net
Mar 6, 2009, 2:32:01 PM
On Wed, 4 Mar 2009 12:01:33 -0800 (PST) David Schwartz <dav...@webmaster.com> wrote:
| On Mar 2, 7:37 pm, phil-news-nos...@ipal.net wrote:
|
|> A feature that would be useful is one where the level of caching can be
|> controlled for a descriptor.  An ioctl() call should be fine.  The value
|> given would specify the maximum number of page-size units of caching to
|> keep for the descriptor.  This would be a _maximum_ and the kernel would
|> be allowed to cache less than this amount (which would happen if it does
|> the actual physical I/O faster than the process calls write).  But if the
|> process does call write() more often, the kernel would prevent more from
|> being written by blocking that write() call.
|
| There's kind of an unwritten rule that when you propose a new feature,
| you have to propose at least one use case. The more reasonable the use
| case, and the more awful the best solution that doesn't require a new
| feature, the better your proposal.

Don't forget that other rule: you also have to implement it yourself.

It's a silly rule, but it is asserted for just about every feature suggestion.

It's silly because not everyone has the background knowledge of all the other
code involved to accomplish the implementation anywhere near as quickly as
someone who does have that background. Or, maybe it is the case that Linux
is simply unable to do this.


| Is there some problem that this solves best? Or is there a whole class
| of problems that this solves better than everything else? If not, the
| feature wouldn't be useful.

The problem it solves is to avoid flooding RAM with a large number of pages
of data that are merely going to be written to disk. When a program is going
to write a large amount of data (at least 2 times as much as there is RAM,
and maybe a lot more), there is no point in caching that data beyond just
enough to keep the I/O rate going at full speed. What happens is that this
flooding of RAM with useless cache causes other processes to be swapped out.
That act of swapping out, and back in again, slows everything down, and the
total amount of work that can get done is reduced. If the swap space is on
the same I/O channel, or even on the same disk drive, as where the bulk data
is being written, it slows down that data writing, too.

One use case is populating a disk with an initial system install, using a
formatted and mounted filesystem, and a stream of files coming from somewhere.
More pages will need to be cached for this use case to gain advantages of the
elevator logic for ordering disk writes. But it doesn't require a massive
amount of cache. Somewhere around 16MB to 128MB would be plenty.

Another use case is similar to above, but the raw disk or partition image is
what is being written. In this case, no elevator action is needed at all,
unless the disk is in use for something else, too. Images are written in a
sequential manner. Caching of just 2 to 4 times the largest I/O write unit
is the maximum needed.

Another use case, and an increasingly common one, is making backups to external
hard drives. It could be done as a raw partition image of an unmounted
filesystem using a program like "dd", or over a file tree with a program
like "rsync".

BTW, one way I have worked around the RAM flooding problem is to turn off
swap altogether. I sized my system the usual ways and figured I needed around
2GB to 3GB of RAM for what I do, not considering the bulk writing. I rounded
that up to 4GB, then doubled it to 8GB. If I had used swap space I probably
would have 4GB of RAM and 2GB to 4GB of swap. This way I have just as much
memory. Now when I do bulk writes, it still floods RAM, but the impact is
limited. It can "dismiss" unmodified pages from existing processes, which
means they have to be swapped back in from their original place (executable
file or library) again. But fewer pages are affected, and only half the I/O
is needed for the ones that are affected. It definitely works better.

Another thing I have done to avoid the RAM flooding is to run my own program
that uses the O_DIRECT option on the open() call to the device. This is only
usable for copying raw images. It does slow down the I/O somewhat. It is for
this program I started wondering about a synchronized two-process writing
strategy of which one possible approach was asked about in another thread.
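
For reference, a minimal sketch of that kind of O_DIRECT copy loop. The
512-byte alignment of the buffer is an assumption; O_DIRECT generally
requires the buffer, offset, and transfer size to be aligned, and real
code should query the device's logical block size:

    /* Minimal sketch of a raw-image copy with O_DIRECT on the output.
     * The 512-byte alignment is an assumption; real code should query
     * the device's logical block size.  Error handling is abbreviated. */
    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    enum { BLK = 1 << 20 };        /* 1 MiB per transfer */

    int copy_image(const char *src, const char *dst)
    {
        int in  = open(src, O_RDONLY);
        int out = open(dst, O_WRONLY | O_DIRECT);
        void *buf;
        ssize_t n;

        if (in < 0 || out < 0 || posix_memalign(&buf, 512, BLK) != 0)
            return -1;

        while ((n = read(in, buf, BLK)) > 0) {
            /* a short or unaligned tail needs special handling with
             * O_DIRECT; omitted in this sketch */
            if (write(out, buf, n) != n)
                return -1;
        }
        free(buf);
        close(in);
        close(out);
        return n < 0 ? -1 : 0;
    }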

David Schwartz
Mar 6, 2009, 2:57:01 PM
On Mar 6, 11:32 am, phil-news-nos...@ipal.net wrote:

> Don't forget that other rule:  you also have to implement it yourself

No, there's no such rule.

> It's a silly rule, but it is asserted for jusy about every feature suggestion.

If it's only useful to you, you have to implement it yourself.

> It's silly because not everyone has the background knowledge of all the other
> code involved to accomplish the implementation anywhere nearly as quickly as
> someone who does have that background.  Or, maybe it is the case that Linux
> is simply unable to do this.

Nobody's going to investigate a suggestion without a use case.

> | Is there some problem that this solves best? Or is there a whole class
> | of problems that this solves better than everything else? If not, the
> | feature wouldn't be useful.

> The problem it solves is to avoid flooding RAM with a large number of pages
> of data that are merely going to be written to disk.  When a program is going
> to write a large amount of data (at least 2 times as much as there is RAM,
> and maybe a lot more), there is no point in caching that data beyond just
> enough to keep the I/O rate going at full speed.  What happens is that this
> flooding of RAM with useless cache causes other processes to be swapped out.
> That act of swapping out, and back in again, slows everything down, and the
> total amount of work that can get done is reduced.  If the swap space is on
> the same I/O channel, or even on the same disk drive, as where the bulk data
> is being written, it slows down that data writing, too.

If this happens, it's a bug in the operating system's caching logic
(or it's badly tuned, or it's a case the logic just handles badly). It
should not allow disk cache to grow large enough to push the working
set into swap.

In any event, for this use case, there is a much better solution,
posix_fadvise(POSIX_FADV_NOREUSE). This is better for three reasons:

1) It's standardized.

2) It tells the operating system the *reason* you don't want the data
kept in cache.

3) It allows the operating system to decide what to do to best handle
that situation rather than you forcing a particular solution that may
or may not be right.
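
A minimal sketch of giving that advice for a file about to be written;
the call itself is standard, though how much a given kernel actually does
with POSIX_FADV_NOREUSE is up to the kernel:

    /* Advise the kernel that data accessed through fd will not be reused,
     * so it need not be kept in the page cache.  A length of 0 means
     * "from offset to the end of the file". */
    #include <fcntl.h>

    static int advise_no_reuse(int fd)
    {
        return posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
    }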

> One use case is populating a disk with an initial system install, using a
> formatted and mounted filesystem, and a stream of files coming from somewhere.
> More pages will need to be cached for this use case to gain advantages of the
> elevator logic for ordering disk writes.  But it doesn't require a massive
> amount of cache.  Somewhere around 16MB to 128MB would be plenty.

Again, this is what posix_fadvise is for.

> Another use case is similar to above, but the raw disk or partition image is
> what is being written.  In this case, no elevator action is needed at all,
> unless the disk is in use for something else, too.  Images are written in a
> sequential manner.  Caching of just 2 to 4 times the largest I/O write unit
> is the maximum needed.

Same answer.

> BTW, one way I have done to work around the RAM flooding problem is turn off
> swap altogether.  I sized my system the usual ways and figured I needed around
> 2GB to 3GB of RAM for what I do, not considering the bulk writing.  I rounded
> that up to 4GB, then doubled it to 8GB.  If I had used swap space I probably
> would have 4GB of RAM and 2GB to 4GB of swap.  This way I have just as much
> memory.  Now when I do bulk writes, it still floods RAM, but the impact is
> limited.  It can "dismiss" unmodified pages from existing processes, which
> means they have to be swapped back in from their original place (executable
> file or library) again.  But fewer pages are affected, and only half the I/O
> is needed for the ones that are affected.  It definitely works better.

This sounds like some kind of tuning problem. Is this a recent Linux
kernel? Does it have default vm tuning? Unlimited writing should *not*
cause recently-active pages to swap out.

> Another thing I have done to avoid the RAM flooding is to run my own program
> that uses the O_DIRECT option on the open() call to the device.  This is only
> usable for copying raw images.  It does slow down the I/O somewhat.  It is for
> this program I started wondering about a syncronized two-process writing
> strategy of which one possible approach was asked about in another thread.

You should not be having this problem. You should invest some time in
figuring out why you do. Have you tinkered with settings like
overcommit_memory, overcommit_ratio, swappiness, min_free_kbytes,
vfs_cache_pressure, dirty_ratio, and so on?
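
For reference, those knobs live under /proc/sys/vm/. A small sketch that
just prints the current values of the ones named above:

    /* Print the current values of a few /proc/sys/vm tunables, the same
     * numbers "sysctl vm.swappiness" and friends would report. */
    #include <stdio.h>

    int main(void)
    {
        const char *names[] = {
            "overcommit_memory", "overcommit_ratio", "swappiness",
            "min_free_kbytes", "vfs_cache_pressure", "dirty_ratio",
        };
        char path[128], line[64];

        for (unsigned i = 0; i < sizeof names / sizeof names[0]; i++) {
            snprintf(path, sizeof path, "/proc/sys/vm/%s", names[i]);
            FILE *f = fopen(path, "r");
            if (f && fgets(line, sizeof line, f))
                printf("vm.%s = %s", names[i], line);
            if (f)
                fclose(f);
        }
        return 0;
    }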

DS

David Schwartz
Mar 6, 2009, 9:45:09 PM
On Mar 6, 11:57 am, David Schwartz <dav...@webmaster.com> wrote:

> Have you tinkered with settings like
> overcommit_memory, overcommit_ratio, swappiness, min_free_kbytes,
> vfs_cache_pressure, dirty_ratio, and so on?

Just to clarify, I'm not saying you should have to tinker with these
things to get it to work. I'm saying that if you tinkered with them
for some other reason, you may have broken the default behavior.

DS

phil-new...@ipal.net
Mar 7, 2009, 1:05:07 AM
On Fri, 6 Mar 2009 18:45:09 -0800 (PST) David Schwartz <dav...@webmaster.com> wrote:

I have tinkered with some of them, only manually (so a reboot would go back
to default settings). All but one of the ones I tried did nothing that I
could detect. One of them, I don't remember which, caused the system to
start behaving funny. Even without running the bulk write program, the
system was running "in spurts". This was months ago that I tried these.

phil-new...@ipal.net
Mar 7, 2009, 1:16:37 AM
On Fri, 6 Mar 2009 11:57:01 -0800 (PST) David Schwartz <dav...@webmaster.com> wrote:

| If this happens, it's a bug in the operating system's caching logic
| (or it's badly tuned, or it's a case the logic just handles badly). It
| should not allow disk cache to grow large enough to push the working
| set into swap.

There most certainly still are bugs. Another related one is specific to
certain processors and causes crashes when a process demands too much
memory. It happens with Intel Core2 processors, but not earlier ones,
not with any AMD ones. But it's hard to diagnose because the only effect
in this case is a crash.


| In any event, for this use case, there is a much better solution,
| posix_fadvise(POSIX_FADV_NOREUSE). This is better for three reasons:
|
| 1) It's standardized.
|
| 2) It tells the operating system the *reason* you don't want the data
| kept in cache.
|
| 3) It allows the operating system to decide what to do to best handle
| that situation rather than you forcing a particular solution that may
| or may not be right.
|
|> One use case is populating a disk with an initial system install, using a
|> formatted and mounted filesystem, and a stream of files coming from somewhere.
|> More pages will need to be cached for this use case to gain advantages of the
|> elevator logic for ordering disk writes.  But it doesn't require a massive
|> amount of cache.  Somewhere around 16MB to 128MB would be plenty.
|
| Again, this is what posix_fadvise is for.
|
|> Another use case is similar to above, but the raw disk or partition image is
|> what is being written.  In this case, no elevator action is needed at all,
|> unless the disk is in use for something else, too.  Images are written in a
|> sequential manner.  Caching of just 2 to 4 times the largest I/O write unit
|> is the maximum needed.
|
| Same answer.

I will try that one. There's also an option POSIX_FADV_SEQUENTIAL. But it
says "access". And even for POSIX_FADV_NOREUSE. This applies to write, too?


|> BTW, one way I have done to work around the RAM flooding problem is turn off
|> swap altogether.  I sized my system the usual ways and figured I needed around
|> 2GB to 3GB of RAM for what I do, not considering the bulk writing.  I rounded
|> that up to 4GB, then doubled it to 8GB.  If I had used swap space I probably
|> would have 4GB of RAM and 2GB to 4GB of swap.  This way I have just as much
|> memory.  Now when I do bulk writes, it still floods RAM, but the impact is
|> limited.  It can "dismiss" unmodified pages from existing processes, which
|> means they have to be swapped back in from their original place (executable
|> file or library) again.  But fewer pages are affected, and only half the I/O
|> is needed for the ones that are affected.  It definitely works better.
|
| This sounds like some kind of tuning problem. Is this a recent Linux
| kernel? Does it have default vm tuning? Unlimited writing should *not*
| cause recently-active pages to swap out.

Depends on how recent. It appears it causes the older pages to swap out.
The most recent ones don't get swapped out. But that changes later when
the older ones are swapped back in. Then the others are older.


|> Another thing I have done to avoid the RAM flooding is to run my own program
|> that uses the O_DIRECT option on the open() call to the device.  This is only
|> usable for copying raw images.  It does slow down the I/O somewhat.  It is for
|> this program I started wondering about a synchronized two-process writing
|> strategy of which one possible approach was asked about in another thread.
|
| You should not be having this problem. You should invest some time in
| figuring out why you do. Have you tinkered with settings like
| overcommit_memory, overcommit_ratio, swappiness, min_free_kbytes,
| vfs_cache_pressure, dirty_ratio, and so on?

Answered in my reply to your followup.

phil-new...@ipal.net
Mar 7, 2009, 1:46:39 AM
On 7 Mar 2009 06:16:37 GMT phil-new...@ipal.net wrote:

| | In any event, for this use case, there is a much better solution,
| | posix_fadvise(POSIX_FADV_NOREUSE). This is better for three reasons:
| |
| | 1) It's standardized.
| |
| | 2) It tells the operating system the *reason* you don't want the data
| | kept in cache.
| |
| | 3) It allows the operating system to decide what to do to best handle
| | that situation rather than you forcing a particular solution that may
| | or may not be right.
| |
| |> One use case is populating a disk with an initial system install, using a
| |> formatted and mounted filesystem, and a stream of files coming from somewhere.
| |> More pages will need to be cached for this use case to gain advantages of the
| |> elevator logic for ordering disk writes.  But it doesn't require a massive
| |> amount of cache.  Somewhere around 16MB to 128MB would be plenty.
| |
| | Again, this is what posix_fadvise is for.
| |
| |> Another use case is similar to above, but the raw disk or partition image is
| |> what is being written.  In this case, no elevator action is needed at all,
| |> unless the disk is in use for something else, too.  Images are written in a
| |> sequential manner.  Caching of just 2 to 4 times the largest I/O write unit
| |> is the maximum needed.
| |
| | Same answer.
|
| I will try that one. There's also an option POSIX_FADV_SEQUENTIAL. But it
| says "access". And even for POSIX_FADV_NOREUSE. This applies to write, too?

Looks like these are not flags (that can be OR'd together) but just values
that can only be used one at a time. And it looks like, from kernel source,
that POSIX_FADV_DONTNEED might be more useful, for blocks already written.
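
A rough sketch of that pattern for a sequential bulk write: every few
megabytes, tell the kernel the data written so far will not be needed
again. POSIX_FADV_DONTNEED can only drop pages that have already been
written back, which is why it is often paired with an explicit flush:

    /* Sketch of a bulk sequential write that periodically asks the kernel
     * to drop the cached pages behind the current write position.  Pages
     * that are still dirty are not dropped, so a flush (fdatasync or
     * sync_file_range) is usually issued first in practice. */
    #include <fcntl.h>
    #include <unistd.h>

    #define WINDOW (8L * 1024 * 1024)   /* drop cached data in 8 MB steps */

    ssize_t bulk_write(int fd, const void *buf, size_t len, off_t *since_drop)
    {
        ssize_t n = write(fd, buf, len);
        if (n > 0) {
            *since_drop += n;
            if (*since_drop >= WINDOW) {
                /* advisory: already-written-back pages may be discarded */
                posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
                *since_drop = 0;
            }
        }
        return n;
    }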

David Schwartz
Mar 7, 2009, 4:17:45 PM
On Mar 6, 10:16 pm, phil-news-nos...@ipal.net wrote:

> There most certainly still are bugs.  Another related one is specific to
> certain processors and causes crashes when a process demands too much
> memory.  It happens with Intel Core2 processors, but not earlier ones,
> not with any AMD ones.  But it's hard to diagnose because the only effect
> in this case is a crash.

I haven't heard of the bugs you're talking about. Are other people
experiencing them too? They may just be unique to the particular
hardware or software you are using.


> | Again, this is what posix_fadvise is for.
>
> I will try that one.  There's also an option POSIX_FADV_SEQUENTIAL.  But it
> says "access".  And even for POSIX_FADV_NOREUSE.  This applies to write, too?

Both reads and writes are accesses. Note that posix_fadvise doesn't
tell the operating system to do anything specific. It just tells it of
your expected access pattern and it's up to the operating system to
figure out the right thing and do it.

> Depends on how recent.  It appears it causes the older pages to swap out.
> The most recent ones don't get swapped out.  But that changes later when
> the older ones are swapped back in.  Then the others are older.

That's not supposed to happen. Linux has a 'swappiness' tunable (and
many other tunables that affect this a bit less directly) that, by
default, makes it very hard for disk cache to push the working set out
of physical memory. You should not have to do anything special for
this to work right, as a process writing a large file to disk on a
busy system is considered a normal use case.

DS

phil-new...@ipal.net
Mar 10, 2009, 5:40:41 PM
On Sat, 7 Mar 2009 13:17:45 -0800 (PST) David Schwartz <dav...@webmaster.com> wrote:
| On Mar 6, 10:16 pm, phil-news-nos...@ipal.net wrote:
|
|> There most certainly still are bugs.  Another related one is specific to
|> certain processors and causes crashes when a process demands too much
|> memory.  It happens with Intel Core2 processors, but not earlier ones,
|> not with any AMD ones.  But it's hard to diagnose because the only effect
|> in this case is a crash.
|
| I haven't heard of the bugs you're talking about. Are other people
| experiencing them too? They may just be unique to the particular
| hardware or software you are using.

I'm doing some things a lot of other people never do. I build systems inside
subdirectories (runnable via chroot if I wanted to and the architecture is
the same). I copy those trees in bulk to new partitions or drives. I make
copies of the image of those partitions or whole drives (usually compressed)
so I can write them back out again when needed, and in many copies. So I do
have more occasion than most to be doing these kinds of bulk writes.


|> | Again, this is what posix_fadvise is for.
|
|> I will try that one.  There's also an option POSIX_FADV_SEQUENTIAL.  But it
|> says "access".  And even for POSIX_FADV_NOREUSE.  This applies to write, too?
|
| Both reads and writes are accesses. Note that posix_fadvise doesn't
| tell the operating system to do anything specific. It just tells it of
| your expected access pattern and it's up to the operating system to
| figure out the right thing and do it.

It appears that certain posix_fadvise() commands do tell it what _could_ be
done, and it gets done right then. POSIX_FADV_DONTNEED appears to be one
such command. A quick look at the code suggests it immediately flushes
out those pages. I don't know for sure if this guarantees they will be written
or could risk them not being written (e.g. flushing not to disk, but making
them just go away ... as in "I really didn't need to write this so it does not
matter if the write to disk proceeds or not" as opposed to "I don't need to
access the data I just wrote, so it can be flushed now if it has been written
to the disk").


|> Depends on how recent.  It appears it causes the older pages to swap out.
|> The most recent ones don't get swapped out.  But that changes later when
|> the older ones are swapped back in.  Then the others are older.
|
| That's not supposed to happen. Linux has a 'swappiness' tunable (and
| many other tunables that affect this a bit less directly) that, by
| default, makes it very hard for disk cache to push the working set out
| of physical memory. You should not have to do anything special for
| this to work right, as a process writing a large file to disk on a
| busy system is considered a normal use case.

I've tried the swappiness setting. It didn't seem to affect anything overall,
not just the problem I have.

I agree that a process writing a large file to disk should be considered normal.
What has been explained about this in the past is that the system treats what
was recently written as more likely to be read back, compared to executable
pages of some other process that hasn't touched them for a longer time frame,
and as such it tries harder to keep that written data in RAM than the other
process's pages.

If posix_fadvise(POSIX_FADV_NOREUSE) really would tell the kernel that what this
descriptor is writing will not need to be read back anytime soon (and more will
be written after it than can even be held in RAM at all), then that should do it.
But it seems POSIX_FADV_NOREUSE really doesn't do anything in Linux (which is
still compliant, as this is advisory).

David Schwartz
Mar 10, 2009, 7:05:47 PM
On Mar 10, 2:40 pm, phil-news-nos...@ipal.net wrote:

> If posix_fadvise(POSIX_FADV_NOREUSE) really would tell the kernel that what this
> descriptor is writing will not need to be read back anytime soon (and more will
> be written after it than can even be held in RAM at all), then that should do it.

That's precisely what it does.

> But it seems POSIX_FADV_NOREUSE really doesn't do anything in Linux (which is
> still compliant, as this is advisory).

Because nothing should be necessary.

Your issue is that for some reason, large sequential writes are
forcing more swapping than they should. This is a normal use case and
the system is supposed to handle it sanely. The question is, why is
this causing you a problem when it doesn't cause other people
problems?

IMO, the five most likely problems are:

1) You have too little physical memory.

2) You have too little swap.

3) You've tweaked some setting that's affecting the system's swappiness
level.

4) You have a version of the Linux kernel with a bug or regression
that causes you a problem.

5) Your use case is so weird the kernel mishandles it.

What kernel version are you using? How much physical RAM do you have?
How much swap do you have? Do you have an estimate for your working
set size (just counting direct memory usage, not counting disk cache)?

DS

David Schwartz
Mar 10, 2009, 7:29:45 PM

phil-news-nos...@ipal.net wrote:

> I've tried the swappiness setting. It didn't seem to affect anything overall,
> not just the problem I have.

Just as a test, set swappiness to zero. See if you can still replicate
this problem. If you can, it's almost certainly *not* sequential writes
forcing memory pages to unmap. With swappiness set to zero, cache will
simply not cause a memory page to be unmapped.

And if zero works for you, set it to 10. Zero is unreasonable for all
but extremely latency-critical workloads. (Or as a temporary setting
while running something like 'updatedb'.)

DS

phil-new...@ipal.net
Mar 10, 2009, 8:51:13 PM
On Tue, 10 Mar 2009 16:05:47 -0700 (PDT) David Schwartz <dav...@webmaster.com> wrote:
| On Mar 10, 2:40 pm, phil-news-nos...@ipal.net wrote:
|
|> If posix_fadvise(POSIX_FADV_NOREUSE) really would tell the kernel that what this
|> descriptor is writing will not need to be read back anytime soon (and more will
|> be written after it than can even be held in RAM at all), then that should do it.
|
| That's precisely what it does.
|
|> But it seems POSIX_FADV_NOREUSE really doesn't do anything in Linux (which is
|> still compliant, as this is advisory).
|
| Because nothing should be necessary.

Since it is keeping the pages in RAM after being written, this doesn't seem
to be consistent.


| Your issue is that for some reason, large sequential writes are
| forcing more swapping than they should. This is a normal use case and
| the system is supposed to handle it sanely. The question is, why is
| this causing you a problem when it doesn't cause other people
| problems?

What it does could be considered sane with respect to the speculation that
the data being written, being newer, is more likely to be read back in the
near future, than data that was mapped in by some other process a long time
ago (like many seconds to a minute).


| IMO, the five most likely problems are:
|
| 1) You have too little physical memory.

I have way way more than is needed.


| 2) You have too little swap.

Ironically, the problem is less severe the less swap there is.


| 3) You've tweaked some setting that's affecting the system's swapiness
| level.

I didn't tweak any settings in a "persistent across reboots" way that I am
aware of.


| 4) You have a version of the Linux kernel with a bug or regression
| that causes you a problem.

Multitudes of 2.6 kernel versions. I'm currently on 2.6.26.2.


| 5) Your use case is so weird the kernel mishandles it.
|
| What kernel version are you using? How much physical RAM do you have?
| How much swap do you have? Do you have an estimate for your working
| set size (just counting direct memory usage, not counting disk cache)?

version 2.6.26.2 currently
RAM 8 GB
swap partition 2 GB
swap activated 0 GB (works best this way)
working set 1 to 2 GB est.

phil-new...@ipal.net
Mar 10, 2009, 8:53:25 PM
On Tue, 10 Mar 2009 16:29:45 -0700 (PDT) David Schwartz <dav...@webmaster.com> wrote:

| Just as a test, set swappiness to zero. See if you can still replicate
| this problem If you can, it's almost certainly *not* sequential writes
| forcing memory pages to unmap. With swappiness set to zero, cache will
| simply not cause a memory page to be unmapped.
|
| And if zero works for you, set it to 10. Zero is unreasonable for all
| but extremely latency-critical workloads. (Or as a temporary setting
| while running something like 'updatedb'.)

I'll try this. Currently the drive I've been using for this is in use for
another test, so I'll have to defer that a couple days.

David Schwartz
Mar 10, 2009, 11:49:13 PM
On Mar 10, 5:51 pm, phil-news-nos...@ipal.net wrote:

> Since it is keeping the pages in RAM after being written, this doesn't seem
> to be consistent.

Right, but that should not affect mapped memory. Maybe the problem is
that it's effectively flushing the disk cache, not that it's flushing
mapped memory?

Is it file I/O that's slow? Or is it disk I/O that's slow?

If it's disk I/O, swappiness isn't the problem. The problem is that
the disk cache is near-LRU.

DS

phil-new...@ipal.net
Mar 11, 2009, 11:08:23 PM
On Tue, 10 Mar 2009 20:49:13 -0700 (PDT) David Schwartz <dav...@webmaster.com> wrote:
| On Mar 10, 5:51 pm, phil-news-nos...@ipal.net wrote:
|
|> Since it is keeping the pages in RAM after being written, this doesn't seem
|> to be consistent.
|
| Right, but that should not affect mapped memory. Maybe the problem is
| that it's effectively flushing the disk cache, not that it's flushing
| mapped memory?

I see resident sizes of most processes decline.


| Is it file I/O that's slow? Or is it disk I/O that's slow?

Both. The swapping causes I/O contention. It utilizes device bus bandwidth
that could have otherwise been used entirely for the bulk writing. It also
causes the disk head to move further from the bulk writing location many more
times than it otherwise would.


| If it's disk I/O, swappiness isn't the problem. The problem is that
| the disk cache is near-LRU.

It's definitely including swap I/O.

A while back (maybe a couple years) I asked about setting up Linux with a hard
boundary between RAM for disk cache and RAM for all other uses. The general
answer was "that's not how Linux is designed". If I could do it, I'd set up
maybe 1/8 of RAM for disk cache and the rest for mapped spaces and anonymous
memory uses.

David Schwartz
Mar 11, 2009, 11:56:45 PM
On Mar 11, 8:08 pm, phil-news-nos...@ipal.net wrote:

> I see resident sizes of most processes decline.
>
> | Is it file I/O that's slow? Or is it disk I/O that's slow?
>
> Both.  The swapping causes I/O contention.  It utilizes device bus bandwidth
> that could have otherwise been used entirely for the bulk writing.  It also
> caues the disk head to move further from the bulk writing location many more
> times than it otherwise would.
>
> | If it's disk I/O, swappiness isn't the problem. The problem is that
> | the disk cache is near-LRU.
>
> It's definitely including swap I/O.

Sorry, I'm baffled. Linux is designed not to do this. Dropping
swappiness all the way to zero should help if it's unmapping pages.
But if it's flushing the disk cache that's the problem, there's not
much you can do.

> A while back (maybe a couple years) I asked about setting up Linux with a hard
> boundary between RAM for disk cache and RAM for all other uses.  The general
> answer was "that's not how Linux is designed".  If I could do it, I'd set up
> maybe 1/8 of RAM for disk cache and the rest for mapped spaces and anonymous
> memory uses.

That makes no sense. If the problem is mappings, reducing swappiness
is the right solution and doesn't have the downsides this change will
have. If the problem is that this keeps flushing the disk cache,
making the disk cache smaller will obviously not help.

DS

phil-new...@ipal.net
Mar 12, 2009, 2:12:35 PM
On Wed, 11 Mar 2009 20:56:45 -0700 (PDT) David Schwartz <dav...@webmaster.com> wrote:
| On Mar 11, 8:08 pm, phil-news-nos...@ipal.net wrote:
|
|> I see resident sizes of most processes decline.
|>
|> | Is it file I/O that's slow? Or is it disk I/O that's slow?
|>
|> Both.  The swapping causes I/O contention.  It utilizes device bus bandwidth
|> that could have otherwise been used entirely for the bulk writing.  It also
|> causes the disk head to move further from the bulk writing location many more
|> times than it otherwise would.
|>
|> | If it's disk I/O, swappiness isn't the problem. The problem is that
|> | the disk cache is near-LRU.
|>
|> It's definitely including swap I/O.
|
| Sorry, I'm baffled. Linux is designed not to do this. Dropping
| swappiness all the way to zero should help if it's unmapping pages.
| But if it's flushing the disk cache that's the problem, there's not
| much you can do.

Maybe it's treating all those pages that come from executable and library
images as disk cache if they aren't modified? It can, afterall, just pluck
them out of RAM without writing, which requires them to be read back in by
the processes using them.


|> A while back (maybe a couple years) I asked about setting up Linux with a hard
|> boundary between RAM for disk cache and RAM for all other uses.  The general
|> answer was "that's not how Linux is designed".  If I could do it, I'd set up
|> maybe 1/8 of RAM for disk cache and the rest for mapped spaces and anonymous
|> memory uses.
|
| That makes no sense. If the problem is mappings, reducing swappiness
| is the right solution and doesn't have the downsides this change will
| have. If the problem is that this keeps flushing the disk cache,
| making the disk cache smaller will obviously not help

As the disk cache size gets larger, the improvement increment gets smaller
and smaller. It eventually gets to a point where there is very little benefit
to caching/queuing the writes. That increased size means less RAM available
for other purposes, and the change of size even triggers more disk I/O. At
some point, the cost exceeds the benefits. Just where that point is depends
on the pattern of writing. For randomly scattered writes, more memory is
needed before reaching that point. For sequential writes, that point is
reached quite early. There is so little benefit to queuing hundreds of pages
of writes compared to a few dozen or so (depending on the maximum drive write
group size which sometimes helps). There is definitely no need to use 1/4 of
the system RAM for write cache on any but the most randomly scattered writes.

And there is no need to wait 40 to 120 seconds before starting the writes, as
an article/post by Ted Ts'o I saw last night suggested ext4 was doing.

David Schwartz
Mar 12, 2009, 5:20:54 PM
On Mar 12, 11:12 am, phil-news-nos...@ipal.net wrote:

> | Sorry, I'm baffled. Linux is designed not to do this. Dropping
> | swappiness all the way to zero should help if it's unmapping pages.
> | But if it's flushing the disk cache that's the problem, there's not
> | much you can do.

> Maybe it's treating all those pages that come from executable and library
> images as disk cache if they aren't modified?  It can, afterall, just pluck
> them out of RAM without writing, which requires them to be read back in by
> the processes using them.

Right, but those pages are mapped into memory. It would have to
invalidate/unmap them in order to discard the data from memory. If
swappiness is set very low, it's not supposed to discard mappings just
to increase disk cache.


> | That makes no sense. If the problem is mappings, reducing swappiness
> | is the right solution and doesn't have the downsides this change will
> | have. If the problem is that this keeps flushing the disk cache,
> | making the disk cache smaller will obviously not help

> As the disk cache size gets larger, the improvement increment gets smaller
> and smaller.  It eventually gets to a point where there is very little benefit
> to caching/queuing the writes.  That increased size means less RAM available
> for other purposes, and the change of size even triggers more disk I/O.  At
> some point, the cost exceeds the benefits.  Just where that point it, depends
> on the pattern of writing.  For randomly scattered writes, more memory is
> needed before reaching that point.  For sequential writes, that point is
> reached quite early.  There is so little benefit to queuing hundreds of pages
> of writes compared to a few dozen or so (depending on the maximum drive write
> group size which sometimes helps).  There is definitely no need to use 1/4 of
> the system RAM for write cache on any but the most randomly scattered writes.

I think you're missing the thrust of my analysis. The increasing disk
cache can only be a problem for one of two reasons:

1) It's pushing mappings out of memory.

2) It's pushing other things out of disk cache.

The I/O will run at the same speed regardless of how big the disk
cache is, full speed.

If the problem is pushing mappings out, swappiness is the right fix. If
the problem is pushing other things out of disk cache, a smaller disk
cache will make things worse.

There is no scenario I can think of where shrinking the disk cache is
the right fix.

> And there is no need to wait 40 to 120 seconds before starting the writes, as
> an article/post by Ted T'so I saw last night suggested ext4 was doing.

If that's happening, it's likely a bug. I agree, the writes should
start as soon as enough of them are buffered. That should not take
more than a second under any scenario I can imagine. (One exception
might be lazy allocation, but even then, 40 seconds seems completely
unreasonable to me.)

DS

phil-new...@ipal.net
Mar 20, 2009, 1:51:03 PM
On Thu, 12 Mar 2009 14:20:54 -0700 (PDT) David Schwartz <dav...@webmaster.com> wrote:
| On Mar 12, 11:12 am, phil-news-nos...@ipal.net wrote:
|
|> | Sorry, I'm baffled. Linux is designed not to do this. Dropping
|> | swappiness all the way to zero should help if it's unmapping pages.
|> | But if it's flushing the disk cache that's the problem, there's not
|> | much you can do.
|
|> Maybe it's treating all those pages that come from executable and library
|> images as disk cache if they aren't modified?  It can, after all, just pluck
|> them out of RAM without writing, which requires them to be read back in by
|> the processes using them.
|
| Right, but those pages are mapped into memory. It would have to
| invalidate/unmap them in order to discard the data from memory. If
| swapiness is set very low, it's not supposed to discard mappings just
| to increase disk cache.

Invalidating and unmapping is still cheap. It has to be done for
modified pages, too. But unmodified pages don't have the cost of
writing out to disk.


|> | That makes no sense. If the problem is mappings, reducing swappiness
|> | is the right solution and doesn't have the downsides this change will
|> | have. If the problem is that this keeps flushing the disk cache,
|> | making the disk cache smaller will obviously not help
|
|> As the disk cache size gets larger, the improvement increment gets smaller
|> and smaller.  It eventually gets to a point where there is very little benefit
|> to caching/queuing the writes.  That increased size means less RAM available
|> for other purposes, and the change of size even triggers more disk I/O.  At
|> some point, the cost exceeds the benefits.  Just where that point is depends
|> on the pattern of writing.  For randomly scattered writes, more memory is
|> needed before reaching that point.  For sequential writes, that point is
|> reached quite early.  There is so little benefit to queuing hundreds of pages
|> of writes compared to a few dozen or so (depending on the maximum drive write
|> group size which sometimes helps).  There is definitely no need to use 1/4 of
|> the system RAM for write cache on any but the most randomly scattered writes.
|
| I think you're missing the thrust of my analysis. The increasing disk
| cache can only be a problem for one of two reasons:
|
| 1) It's pushing mappings out of memory.
|
| 2) It's pushing other things out of disk cache.
|
| The I/O will run at the same speed regardless of how big the disk
| cache is, full speed.

Maybe not. At least in earlier kernels this was not true. There was so
much CPU time spent figuring out what to remove from cache, that there
was a point where increasing RAM actually _reduced_ the I/O rate. I
don't see that anymore. But I did many versions ago.


| If the problem is pushing mappings out, swapiness is the right fix. If
| the problem is pushing other things out of disk cache, a smaller disk
| cache will make things worse.

What exactly is the swappiness value related to? What are its units?

What would be clear, although not necessarily optimal, would be a
reserve, stating that a specific amount of RAM can be used only for I/O
cache, or write cache, or read cache, or mappings, etc. When the
utilization is at or below the reserve, pages in the reserve class
cannot be taken at all. The sum of all reserves, plus other fixed RAM
usage, obviously must be less than RAM.


| There is no scenario I can think of where shrinking the disk cache is
| the right fix.

When the gains by a larger disk cache are less than the losses by
smaller space for other things, then I do see that as a case where a
smaller disk cache is appropriate. When disk cache (for writing) is
considered all by itself, it should have a performance curve that
approaches leveling out. Just where that happens depends on the
randomness of the I/O requests. Sequential I/O would level out very
fast (e.g. a steep initial rise in performance). Very random I/O should
be the worst performance with a slower rise and longer leveling out.

When the (write) disk cache is considered with respect to its impact on
other things, then you have a balancing act. If there are only two
things to address, then you weight the curves by importance, find the
intersection, and that's your optimal point. When there are three or
more things to address (and in reality there are many), then there is
usually no one point optimal for everything, but there will generally be
a range of points that can at least be worked with.


|> And there is no need to wait 40 to 120 seconds before starting the writes, as
|> an article/post by Ted T'so I saw last night suggested ext4 was doing.
|
| If that's happening, it's likely a bug. I agree, the writes should
| start as soon as enough of them are buffered. That should not take
| more than a second under any scenario I can imagine. (One exception
| might be lazy allocation, but even then, 40 seconds seems completely
| unreasonable to me.)

There are people apparently claiming that deferring things like that is
the way to go. It's hard to say. If file space allocation is deferred,
then an allocation made (committed) later on when it is known where a
write has to occur can "piggy back" on the drive positioning, and result
in what may be related pages being located near each other (e.g. they
may be read back together in the future). But I'm still a believer that
writes should always proceed immediately. What I would do instead of
deferred allocation, is "re-do allocation". At a later time, if related
data has to be written somewhere else, AND if the data for this write is
still intact in RAM, do the allocation over (free the other space), and
do both writes near each other together. This would have to be weighed
against how busy the disk is, since the repeated write would slow down a
very busy disk.

BTW, I'm definitely NOT going to migrate to ext4 for quite a while. I
have posted elsewhere that it appears that POSIX itself is broken with
respect ext4 being able to claim POSIX compliance while being able to
lose data by not syncronizing a file allocation with its renaming. The
order of operations _should_ be guaranteed for _related_ data.

David Schwartz
Mar 20, 2009, 2:59:01 PM
On Mar 20, 10:51 am, phil-news-nos...@ipal.net wrote:

> On Thu, 12 Mar 2009 14:20:54 -0700 (PDT) David Schwartz <dav...@webmaster.com> wrote:

> | Right, but those pages are mapped into memory. It would have to
> | invalidate/unmap them in order to discard the data from memory. If
> | swapiness is set very low, it's not supposed to discard mappings just
> | to increase disk cache.

> Invalidating and unmapping is still cheap.  It has to be done for
> modified pages, too.  But unmodified pages don't have the cost of
> writing out to disk.

That's why swappiness defaults to a high value. Clean mappings are as
easy to discard as clean disk cache. In principle, a system with a
unified memory architecture shouldn't care whether disk data in cache
is mapped or not.

However, there are some pathological cases where preventing the system
from unmapping pages in response to unmapped writes is helpful. This
is why Linux provides the 'swappiness' tunable.

By default, Linux operates in the pure, technically right mode.

> | I think you're missing the thrust of my analysis. The increasing disk
> | cache can only be a problem for one of two reasons:
> |
> | 1) It's pushing mappings out of memory.
> |
> | 2) It's pushing other things out of disk cache.
> |
> | The I/O will run at the same speed regardless of how big the disk
> | cache is, full speed.

> Maybe not.  At least in earlier kernels this was not true.  There was so
> much CPU time spent figuring out what to remove from cache, that there
> was a point where increasing RAM actually _reduced_ the I/O rate.  I
> don't see that anymore.  But I did many versions ago.

That's just a bug. Sure, you might need to do crazy things to
workaround a bug. But if the problem turns out to be due to a bug, the
first thing we should try to do is fix it. (If that fails, then we can
search for workarounds.)

> | If the problem is pushing mappings out, swapiness is the right fix. If
> | the problem is pushing other things out of disk cache, a smaller disk
> | cache will make things worse.

> What exactly is the swapiness value related to?  What are its units?

Swappiness controls how likely the system is to discard a mapped page
rather than an unmapped page. Its default value is 100, which means
that mappings are treated the same as non-mapped disk cache. This is
"technically correct". There's no reason a page read through a disk
file that's 'mmap'ped has any more right to stay in memory than one
read with 'read'. If you turn it all the way to zero, a mapped page
will almost never be discarded to make more space to fit unmapped
pages.

0 is dangerous as it can allow large mappings to make normal read/
write I/O slow to a crawl. But very small values should not be
pathological in most cases.

> What would be clear, although not necessarily optimal, would be a
> reserve, stating that a specific amount of RAM can be used only for I/O
> cache, or write cache, or read cache, or mappings, etc.  When the
> utilization is at or below the reserve, pages in the reserve class
> cannot be taken at all.  The sum of all reserves, plus other fixed RAM
> usage, obviously must be less than RAM.

If you make this small, it does no good. If you make it large, it does
harm. This is not done because it's almost never what people really
want.

> | There is no scenario I can think of where shrinking the disk cache is
> | the right fix.

> When the gains by a larger disk cache are less than the losses by
> smaller space for other things, then I do see that as a case where a
> smaller disk cache is appropriate.

What "other things"? You mean mappings? If you mean mappings, the fix
is dropping swappiness, which gives mapping priority. Do you mean
memory allocated for kernel structures? They already have priority
over the disk cache. If you mean something else, do tell.

>  When disk cache (for writing) is
> considered all by itself, it should have a performance curve that
> approaches leveling out.  Just where that happens depends on the
> randomness of the I/O requests.  Sequential I/O would level out very
> fast (e.g. a steep initial rise in performance).  Very random I/O should
> be the worst performance with a slower rise and longer leveling out.
>
> When the (write) disk cache is considered with respect to its impact on
> other things, then you have a balancing act.  If there are only two
> things to address, then you weight the curves by importance, find the
> intersection, and that's your optimal point.  When there are three or
> more things to address (and in reality there are many), then there is
> usually no one point optimal for everything, but there will generally be
> a range of points that can at least be worked with.

What "other things" are you talking about? If you mean mapped pages,
you have 'swappiness' for that.

If you mean unmapped clean pages, the problem is very different and no
change in the cache size will help. (Unless you separate clean and
dirty data, but then you'll wind up pushing the problem elsewhere.)

> |> And there is no need to wait 40 to 120 seconds before starting the writes, as
> |> an article/post by Ted T'so I saw last night suggested ext4 was doing.
> |
> | If that's happening, it's likely a bug. I agree, the writes should
> | start as soon as enough of them are buffered. That should not take
> | more than a second under any scenario I can imagine. (One exception
> | might be lazy allocation, but even then, 40 seconds seems completely
> | unreasonable to me.)

> There are people apparently claiming that deferring things like that is
> the way to go.  It's hard to say.  If file space allocation is deferred,
> then an allocation made (committed) later on when it is known where a
> write has to occur can "piggy back" on the drive positioning, and result
> in what may be related pages being located near each other (e.g. they
> may be read back together in the future).  But I'm still a believer that
> writes should always proceed immediately.  What I would do instead of
> deferred allocation, is "re-do allocation".  At a later time, if related
> data has to be written somewhere else, AND if the data for this write is
> still intact in RAM, do the allocation over (free the other space), and
> do both writes near each other together.  This would have to be weighted
> against how busy the disk is, since the repeated write would slow down a
> very busy disk.

That's a very expensive proposition. But allocating 16MB chunks will
get you about as much performance as there is to get. And you should
be able to collect 16MB of data in no more than a second or so,
typically.

> BTW, I'm definitely NOT going to migrate to ext4 for quite a while.  I
> have posted elsewhere that it appears that POSIX itself is broken with
> respect ext4 being able to claim POSIX compliance while being able to
> lose data by not syncronizing a file allocation with its renaming.  The
> order of operations _should_ be guaranteed for _related_ data.

POSIX itself is broken in quite a few ways, unfortunately. Don't ever
get me started on POSIX directory reading functions.

DS

phil-new...@ipal.net
Mar 21, 2009, 4:20:03 PM
On Fri, 20 Mar 2009 11:59:01 -0700 (PDT) David Schwartz <dav...@webmaster.com> wrote:
| On Mar 20, 10:51 am, phil-news-nos...@ipal.net wrote:
|
|> On Thu, 12 Mar 2009 14:20:54 -0700 (PDT) David Schwartz <dav...@webmaster.com> wrote:
|
|> | Right, but those pages are mapped into memory. It would have to
|> | invalidate/unmap them in order to discard the data from memory. If
|> | swappiness is set very low, it's not supposed to discard mappings just
|> | to increase disk cache.
|
|> Invalidating and unmapping is still cheap.  It has to be done for
|> modified pages, too.  But unmodified pages don't have the cost of
|> writing out to disk.
|
| That's why swapiness defaults to a high value. Clean mappings are as
| easy to discard as clean disk cache. In principle, a system with a
| unified memory architecture shouldn't care whether disk data in cache
| is mapped or not.
|
| However, there are some pathological cases where preventing the system
| from unmapping pages in response to unmapped writes is helpful. This
| is why Linux provides the 'swapiness' tunable.
|
| By default, Linux operates in the pure, technically right mode.

There should be separate settings that influence the swappiness of
unmodified pages that can be stolen without any writing, and modified
pages that have to be written out to the swap space in order to steal
their RAM slot. Given that there is only one setting, I'd like to know
(and without having to trace it all through the source code) just how
this setting influences BOTH of these types of swapouts. In particular,
how is it balanced between them? Is there a weighting factor? I'd like
to know just how the numbers actually influence things.


|> | If the problem is pushing mappings out, swapiness is the right fix. If
|> | the problem is pushing other things out of disk cache, a smaller disk
|> | cache will make things worse.
|
|> What exactly is the swappiness value related to?  What are its units?
|
| Swapiness controls how likely the system is to discard a mapped page
| rather than an unmapped page. Its default value is 100, which means
| that mappings are treated the same as non-mapped disk cache. This is
| "technically correct". There's no a reason a page read through a disk
| file that's 'mmap'ped has any more right to stay in memory than one
| read with 'read'. If you turn it all the way to zero, a mapped page
| will almost never by discarded to make more space to fit unmapped
| pages.

When you refer to "non-mapped disk cache", are you referring to pages
that were read in from files (or raw disk devices), or pages written by
processes that are destined to be written out to the file (or disk device)?


| 0 is dangerous as it can allow large mappings to make normal read/
| write I/O slow to a crawl. But very small values should not be
| pathological in most cases.

So is this value just a statistical thing?

I'm trying to relate this to just how much RAM will get used for the
various classes of usage. And I do classify things differently than
just mapped or unmapped.

It seems in one respect I should use maximum (100?) swappiness, whereas
in another respect, I should use a low value (20?). So that tells me I
need something more than just this one setting.


|> What would be clear, although not necessarily optimal, would be a
|> reserve, stating that a specific amount of RAM can be used only for I/O
|> cache, or write cache, or read cache, or mappings, etc.  When the
|> utilization is at or below the reserve, pages in the reserve class
|> cannot be taken at all.  The sum of all reserves, plus other fixed RAM
|> usage, obviously must be less than RAM.
|
| If you make this small, it does no good. If you make it large, it does
| harm. This is not done because it's almost never what people really
| want.

Ideally the sum of all reserves (counting things like kernel code as a
reserve, too) should be less than 50% of all of RAM, preferably as low
as 25%. Then the remainder can be used dynamically for whatever is
needed. Even then, there should also be some kind of time-based change
impedance. For example, if 80% happens to be used for one class of use,
and a big demand begins for a different class, it shouldn't suddenly make
a big change. It should have a "soft" reserve that slowly changes with
the demand. The rate of that change should be tunable, with well chosen
defaults for common configurations.


|> | There is no scenario I can think of where shrinking the disk cache is
|> | the right fix.
|
|> When the gains by a larger disk cache are less than the losses by
|> smaller space for other things, then I do see that as a case where a
|> smaller disk cache is appropriate.
|
| What "other things"? You mean mappings? If you mean mappings, the fix
| is dropping swapiness, which gives mapping priority. Do you mean
| memory allocated for kernel structures? They already have priority
| over the disk cache. If you mean something else, do tell.

By "other things" I mean any other kind of demand that could be forced
to do more I/O if the disk cache demand pushes on it. Since Linux does
not have a swapped kernel, the kernel code is fixed in RAM, and that
class of usage won't be relevant to this issue. Any class of usage that,
when pressed on by disk write caching, would do its own I/O could then be
taking I/O bandwidth (especially severe if this includes head seek time)
away from the disk writes, slowing down the overall useful data rate.

At some point of building up a cache of dirty write data, the cache
should not grow any more, and the write() should block (and, if my
suggestion of allowing non-blocking I/O for file and disk writes is
enabled, an attempt to write on a descriptor with non-blocking enabled
should result in EAGAIN, or the legacy EWOULDBLOCK).  This point should
be high enough to allow some _reasonable_ level of write-order
optimization to maximize physical device throughput, while specifically
_avoiding_ forcing other classes of memory usage to do any I/O that
exceeds the performance gains on this writing.
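
The closest thing I know of in Linux today is sync_file_range(), which
at least lets a bulk writer start writeback early and wait on an older
range, so the dirty data for that one file stays bounded.  A rough
sketch of that pattern -- the 4 MB window is just an example figure:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    #define WINDOW (4 * 1024 * 1024)    /* example ceiling on dirty data */

    /* Write one chunk at *pos, keeping roughly WINDOW bytes dirty at most. */
    ssize_t bounded_write(int fd, const void *buf, size_t len, off_t *pos)
    {
        ssize_t n = pwrite(fd, buf, len, *pos);
        if (n <= 0)
            return n;

        /* start asynchronous writeback of what was just written */
        sync_file_range(fd, *pos, n, SYNC_FILE_RANGE_WRITE);

        /* wait for the window before this one, so the backlog of dirty
           pages for this descriptor stays near WINDOW bytes */
        if (*pos >= WINDOW)
            sync_file_range(fd, *pos - WINDOW, WINDOW,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER);

        *pos += n;
        return n;
    }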


|> When the (write) disk cache is considered with respect to its impact on
|> other things, then you have a balancing act.  If there are only two
|> things to address, then you weight the curves by importance, find the
|> intersection, and that's your optimal point.  When there are three or
|> more things to address (and in reality there are many), then there is
|> usually no one point optimal for everything, but there will generally be
|> a range of points that can at least be worked with.
|
| What "other things" are you talking about? If you mean mapped pages,
| you have 'swappiness' for that.

I'm not being specific; I'm being general.  I mean anything that can
incur I/O bandwidth usage that slows down the writing.

When I use O_DIRECT to write to disk, I avoid the issue of competing I/O
because these writes are not cached.  Thus other memory uses won't slow
down the writing.  The catch with O_DIRECT is that, when the physical I/O
is done, there's nothing ready for the driver to immediately start on the
device again.  So a round trip back to the process has to be done to get
the next data.  O_DIRECT implies synchronous I/O.
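
For reference, a bare-bones O_DIRECT write looks roughly like this; the
buffer address and transfer size have to be aligned (typically to 512
bytes or 4 KB, depending on the device), hence posix_memalign().  The
file name and sizes here are just placeholders:

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ALIGN 4096             /* assume 4 KB logical blocks */
    #define CHUNK (1024 * 1024)    /* 1 MB per write */

    int main(void)
    {
        void *buf;
        int fd = open("bulkfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd < 0 || posix_memalign(&buf, ALIGN, CHUNK) != 0)
            return 1;
        memset(buf, 0, CHUNK);

        /* blocks until the device has taken the data -- no write-behind */
        if (write(fd, buf, CHUNK) != CHUNK)
            return 1;

        free(buf);
        close(fd);
        return 0;
    }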

The ideal scenario for what I am trying to do would be a SMALL FINITE
cache.  In most cases I'm writing sequentially, so there is no gain from
a large cache.  It's bulk writing that won't be read later, so there is
no gain from leaving it in RAM for something to read soon.  What I need
is for there to be JUST ENOUGH queued to immediately keep the device busy
BEFORE making the round trip back to the process to get more data that
it is writing.  In theory, I should need no more write cache than the
size that can be collected together for big I/O to the disk, times two
(to allow for data to be ready for the driver to start immediately when
the previous I/O is complete).

I'm putting a program together now to run 2 or more parallel processes
doing writes (using pwrite() calls to a descriptor opened with O_DIRECT).
The first should have its data going to disk immediately, while the
process is blocked on its pwrite() call until that I/O is done.  The
second would soon do its own pwrite() call, which also blocks, but this
data sits waiting for the disk to become ready (when the first write
physically gets done).  Hopefully this is queued all the way down to the
driver, so the instant the interrupt gets handled in the device driver,
it can start the second chunk of I/O right then, keeping the device busy
to the max.  Two such writer processes should do the job.  I'm making
the program tunable so I can evaluate whether 3 or 4 might work better
on systems that are also busy doing other things (e.g. to help keep the
write queue supplied with something for the driver to write).
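
A stripped-down sketch of the kind of harness I have in mind -- the file
name, chunk size, and writer count are all placeholders to tune.  Each
child writes an interleaved set of chunks with pwrite(), so while one is
blocked in pwrite() the next chunk is already queued behind it:

    #define _GNU_SOURCE            /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NWRITERS 2                 /* try 2, 3, 4 ... */
    #define CHUNK    (4 * 1024 * 1024) /* 4 MB per pwrite() */
    #define NCHUNKS  256               /* chunks per writer */

    int main(void)
    {
        /* target: a scratch file or test disk -- NOT one holding data */
        int fd = open("bulkfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd < 0)
            return 1;

        for (int w = 0; w < NWRITERS; w++) {
            if (fork() == 0) {
                void *buf;
                if (posix_memalign(&buf, 4096, CHUNK) != 0)
                    _exit(1);
                memset(buf, w, CHUNK);

                /* writer w handles chunks w, w+NWRITERS, w+2*NWRITERS, ... */
                for (long i = w; i < (long)NCHUNKS * NWRITERS; i += NWRITERS)
                    if (pwrite(fd, buf, CHUNK, (off_t)i * CHUNK) != CHUNK)
                        _exit(1);
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;
        close(fd);
        return 0;
    }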

I would rather have had a SMALL FINITE write cache WITH non-blocking I/O
on the descriptor actually work for disk, so I could do it all in ONE
process.  But even without the non-blocking I/O, a manageable cache would
be useful.  This would mean an ability to specify, per open descriptor,
a ceiling on the write cache size.  When the cached data hits that limit,
it should behave the same way it would if there were otherwise no place
to put the data the process is passing through via write().  But instead
of that condition being one that impacts the whole system, it would be
set small (1M or so) so there is minimal impact on the system.
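
No such interface exists today as far as I know, but to make the
proposal concrete, it might look something like this.  FIOSETWRCACHE is
an invented name, not a real ioctl:

    #include <sys/ioctl.h>

    /* HYPOTHETICAL: not a real Linux ioctl; this is only what the
       proposed per-descriptor write-cache ceiling could look like. */
    #define FIOSETWRCACHE _IOW('f', 0x7f, int)

    int cap_write_cache(int fd, int pages)
    {
        /* ask the kernel to keep at most `pages` page-size units of
           dirty write-behind data for this descriptor; further write()
           calls block, or fail with EAGAIN if O_NONBLOCK is set */
        return ioctl(fd, FIOSETWRCACHE, &pages);
    }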

Clearly, if the system has to do other I/O for legitimate reasons, and
that I/O goes through a controller channel, bus path, or physical device
whose bandwidth the bulk writing would also be using, there is an
unavoidable competition for it.  But I most certainly do not want the
bulk writing to EVER cause other processes to swap out or otherwise be
forced to do I/O that they would not otherwise do (assuming the other
processes are reasonable and not currently trying to take all the system
resources, either).

My new computer has 6 SATA ports. If I put 6 drives in them, I should be
able to run 6 bulk writers, one for each, all at full speed and run the
drives at their very top speed, without impacting each other, and without
flooding the cache (using 24M of RAM for cache, 4M per disk, would not be
a flood).


| If you mean unmapped clean pages, the problem is very different and no
| change in the cache size will help. (Unless you separate clean and
| dirty data, but then you'll wind up pushing the problem elsewhere.)

I don't know what you mean by "unmapped clean pages".  Is that just
residual data, such as what was written by a process, then written to
the disk, and is now still sitting there in RAM just in case something
happens to read that disk block and can use this data as is?


|> |> And there is no need to wait 40 to 120 seconds before starting the writes, as
|> |> an article/post by Ted Ts'o I saw last night suggested ext4 was doing.
|> |
|> | If that's happening, it's likely a bug. I agree, the writes should
|> | start as soon as enough of them are buffered. That should not take
|> | more than a second under any scenario I can imagine. (One exception
|> | might be lazy allocation, but even then, 40 seconds seems completely
|> | unreasonable to me.)
|
|> There are people apparently claiming that deferring things like that is
|> the way to go.  It's hard to say.  If file space allocation is deferred,
|> then an allocation made (committed) later on when it is known where a
|> write has to occur can "piggy back" on the drive positioning, and result
|> in what may be related pages being located near each other (e.g. they
|> may be read back together in the future).  But I'm still a believer that
|> writes should always proceed immediately.  What I would do instead of
|> deferred allocation, is "re-do allocation".  At a later time, if related
|> data has to be written somewhere else, AND if the data for this write is
|> still intact in RAM, do the allocation over (free the other space), and
|> do both writes near each other together.  This would have to be weighted
|> against how busy the disk is, since the repeated write would slow down a
|> very busy disk.
|
| That's a very expensive proposition. But allocating 16MB chunks will
| get you about as much performance as there is to get. And you should
| be able to collect 16MB of data in no more than a second or so,
| typically.

The bug of the day in ext4 (and POSIX) is that writing data to even a small
file (won't come close to 16M), then "twisting" the directory references to
put the new file in place of the old (the usual hardlink the old file to a
backup name, then rename the new file to take the previous place, leaving
the old file with only its old name link), loses data (because the new file
did not have its _data_ synchronized).  It is argued that the process should
do an fsync() on the file before closing it and doing the link twisting.
What I argue is that this should be the default, and a program that does
not need a file to be immediately synchronized should be able to specify
that by some means (an open() or fcntl() flag).  Correctness should be
paramount by default, and risky performance should be an available option.
And yes, I would use the performance options often.  But I would expect
exactly correct operation by default, in particular because something to
the effect of this is being done at layers nowhere near syscalls (e.g. by
shell scripts).
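
For reference, the "careful" sequence applications are being told to use
today looks roughly like this (file names are placeholders, and the
backup-hardlink step is omitted to keep it short); the fsync() before
rename() is the part I'm arguing should effectively be the default:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Replace "config" with new contents without risking an empty file. */
    int replace_file(const char *data, size_t len)
    {
        int fd = open("config.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return -1;
        if (write(fd, data, len) != (ssize_t)len) {
            close(fd);
            return -1;
        }

        /* without this, delayed allocation can commit the rename before
           the data, so a crash leaves a zero-length "config" */
        if (fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        close(fd);

        return rename("config.new", "config");    /* atomic name switch */
    }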

I define correctness as achieving the same outcome as if every operation
were done completely synchronously, within the scope of view of related
processes (which would likely be those in the same process group).  That
should be the default.  Then performance options could include flags and
other features to allow the program to specify what it does not need to
have happen.

That's my personal philosophy: correctness first, by default, absolutely,
and performance options always available.


|> BTW, I'm definitely NOT going to migrate to ext4 for quite a while.  I
|> have posted elsewhere that it appears that POSIX itself is broken with
|> respect to ext4 being able to claim POSIX compliance while being able to
|> lose data by not synchronizing a file allocation with its renaming.  The
|> order of operations _should_ be guaranteed for _related_ data.
|
| POSIX itself is broken in quite a few ways, unfortunately. Don't ever
| get me started on POSIX directory reading functions.

If you were to start a thread here, or a blog on the web, about POSIX being
broken, I would be interested and want to read it (and maybe reply).
