
is there a user mode way to flush disk cache


Eric Taylor

Oct 19, 2005, 11:03:38 PM
I'm using 2.6 from rhel4 and I need to write 2 gig files quickly.

I don't care if the 2 gigs is all in the cache and nothing has
been written to disk yet. I've got plenty of memory.

I need the 2 gig writes to "complete" in less than 30 seconds.
Sometimes when it gets stuck, it can take over 5 minutes and
the disk led is on solid. This seems slow to me.

I find that I can get the writes to complete the fastest if the
disk cache is nearly empty before I start. Otherwise, even
though one would think there was plenty of room left, something
hangs my program while doing writes.

My klutzy way to flush the caches is to write 8 two-gig files in
succession and then rm them all.

There's gotta be a better way than this.

thanks
eric


John Reiser

Oct 20, 2005, 12:57:52 AM
> I'm using 2.6 from rhel4 and I need to write 2 gig files quickly.
>
> I don't care if the 2 gigs is all in the cache and nothing has
> been written to disk yet. I've got plenty of memory.

If you have plenty of RAM, then why write to disk at all?
How much memory is there, and what is the _measured_ latency of
uncached non-overlapping memory-to-memory copy with length 50MB?
Are you using write(), mmap(), or something else for the files?
What is the length of each write()?

> I need the 2 gig writes to "complete" in less than 30 seconds.

2GB in 30 seconds is 67 MB/s. What are the characteristics of the
connection between memory and disk: EIDE, SCSI, sATA, ...; raw rates
of the channel, bus [if any], and memory controller?
What is the _measured_ latency of a 50MB transfer from memory
to filesystem on the disk [including a sync()]? Even though
you don't care if none of the first 2GB gets to disk in 30 seconds,
it probably matters for the second 2GB, or the third 2GB, etc.
Your first 2GB file may follow someone else's multiple 2GB files.

> Sometimes when it gets stuck, it can take over 5 minutes and
> the disk led is on solid. This seems slow to me.

Which on-disk filesystem? Could block allocation with journaling
be a bottleneck? Have you tried an extent-based filesystem?
What about truncate(, 2GB) immediately after open(), in order to
allocate all the space at once? Have you tried using open(, O_RDWR)
then overwriting pre-allocated space? Did you mount the filesystem
with options nodiratime and noatime? Is the disk free of low-
level hardware errors?
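
A minimal sketch of the preallocate-then-overwrite idea (the file
name and 2GB size are placeholders; error handling trimmed):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)
static char buf[CHUNK];

int main(void)
{
    off_t size = (off_t)2000 * CHUNK;   /* just under 2GB */
    off_t off;
    int fd = open("ckpt.dat", O_CREAT | O_RDWR, 0644);

    if (fd == -1)
        return 1;
    memset(buf, 0, sizeof buf);

    /* Setup pass, done once: write real data so every block is
       allocated on disk (a bare truncate/seek may leave holes). */
    for (off = 0; off < size; off += CHUNK)
        write(fd, buf, CHUNK);
    fsync(fd);

    /* Checkpoint pass: rewind and overwrite the already-allocated
       blocks -- no block allocation, less journal traffic. */
    lseek(fd, 0, SEEK_SET);
    for (off = 0; off < size; off += CHUNK)
        write(fd, buf, CHUNK);

    close(fd);
    return 0;
}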

--

Chris Friesen

Oct 20, 2005, 1:28:41 AM
Eric Taylor wrote:

> My klutzy way to flush the caches is to write 8 two-gig files in
> succession and then rm them all.
>
> There's gotta be a better way than this.

"sync"

Chris

Eric Taylor

Oct 20, 2005, 2:22:24 AM

John Reiser wrote:

> > I'm using 2.6 from rhel4 and I need to write 2 gig files quickly.
> >
> > I don't care if the 2 gigs is all in the cache and nothing has
> > been written to disk yet. I've got plenty of memory.
>
> If you have plenty of RAM, then why write to disk at all?
> How much memory is there, and what is the _measured_ latency of
> uncached non-overlapping memory-to-memory copy with length 50MB?
> Are you using write(), mmap(), or something else for the files?
> What is the length of each write()?

These are full memory checkpoints for a large simulation.


>
>
> > I need the 2 gig writes to "complete" in less than 30 seconds.
>
> 2GB in 30 seconds is 67 MB/s. What are the characteristics of the
> connection between memory and disk: EIDE, SCSI, sATA, ...; raw rates
> of the channel, bus [if any], and memory controller?
> What is the _measured_ latency of a 50MB transfer from memory
> to filesystem on the disk [including a sync()]? Even though
> you don't care if none of the first 2GB gets to disk in 30 seconds,
> it probably matters for the second 2GB, or the third 2GB, etc.
> Your first 2GB file may follow someone else's multiple 2GB files.

In 2.4 this works because all the writes go into the file cache. I am
just looking for a way to have this happen in 2.6 where it no longer
seems to work as before.

I've written small test programs. In some cases I simply do a 2 gig write
call. In other cases I seeked to the end first. None of these make any
difference. Sometimes it blasts it out, other times not and I can't see
any pattern.

However, if top says the cache is nearly empty (100 meg or so) then
the writes go into the cache until the cache gets to about 7 gig. At this
time there is typically 2 gig free memory, with the rest being used by
the simulation and other tasks.

It could be the journaling. I will need to talk to my system admin
about mounting differently.

Oh, I've got 5 identically configured systems (12 gig ram, 2xsmp)
with scsi or ide drives. All systems behave the same, so I don't think
it's a hardware issue. When these systems were running 2.4 this problem
did not occur.


Eric Taylor

Oct 20, 2005, 2:36:58 AM
If only it was this easy. Unfortunately this does not 'empty' the cache
which I guess I forgot to mention.

If the cache is not emptied, then the next series of file writes will
not complete quickly as the system will hang my process while it
does some cache flushing operations.

Peter T. Breuer

Oct 20, 2005, 2:50:17 AM
Eric Taylor <e...@rocketship1.com> wrote:
> It could be the journaling.

It would be the journalling (data journalling, that is). You don't have
a 2GB journal, do you? And what about atime, mtime, etc?

> I will need to talk to my system admin about mounting differently.

You first need to control the parameters, and that means finding out
what they are. What FS are you using, what are the bdflush parameters,
and so on.

To make stuff go to buffers and stay there, you need to make buffers
never age, make sure that the buffer cache never goes sync until 100%
of cache is used, and start off with a sync(2).

You are also going to be making progressive metadata changes as more
blocks are assigned to the inode, so you need to turn off even metadata
journalling, as that will be synchronous. Perhaps you can avoid that
via preallocation.

Oh - don't just seek to +2GB, fill it in. Then run your test rewriting
to the allocation (you'll be evicting cache at that point).

Peter

Peter T. Breuer

Oct 20, 2005, 2:55:22 AM
Eric Taylor <e...@rocketship1.com> wrote:
> If only it was this easy. Unfortunately this does not 'empty' the cache
> which I guess I forgot to mention.

Sync does not touch the cache. It empties buffers.

> If the cache is not emptied, then the next series of file writes will

There is no way to empty cache, except by filling buffers.

> not complete quickly as the system will hang my process while it
> does some cache flushing operations.

There is no such thing as a "cache flushing operation". I think you
have buffer and cache mixed up!

I do not think having cache filled when you start will affect you in any
noticable way. But if you are a fetishist for an empty cache, start by
filling buffers with your target one time, and then empty them with
a sync. That will fill cache with your target, and then doing your
write for real will overwrite cache. But you will never have empty
cache when you write unless you have full buffers, which you don't want.

Peter

Kasper Dupont

Oct 20, 2005, 5:20:16 AM
John Reiser wrote:
>
> What about truncate(, 2GB) immediately after open(), in order to
> allocate all the space at once?

Real file systems don't allocate space on a truncate call.
They delay allocation until writes are performed. There
will only be space allocated for those blocks which are
actually written to.

--
Kasper Dupont
Note to self: Don't try to allocate
256000 pages with GFP_KERNEL on x86.

Kasper Dupont

Oct 20, 2005, 5:30:54 AM
Eric Taylor wrote:
>
> My klutzy way to flush the caches is to write 8 two-gig files in
> succession and then rm them all.

Don't write files, just allocate a lot of anonymous memory
and write to it all. Here is an example:

http://www.daimi.au.dk/~kasperd/use400m.c

Other than that there is of course BLKFLSBUF ioctl. But
AFAIK it only flushes buffers for one device.
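
In the same spirit (a sketch only, not the linked program): grab a
big anonymous region and dirty one byte per page, so the VM has to
evict page-cache pages to make room:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    size_t mb = (argc > 1) ? (size_t)atoi(argv[1]) : 400;
    size_t len = mb << 20, i;
    char *p = malloc(len);

    if (p == NULL) {
        perror("malloc");
        return 1;
    }
    for (i = 0; i < len; i += 4096)   /* one write per 4K page */
        p[i] = 1;
    free(p);
    return 0;
}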

Peter T. Breuer

Oct 20, 2005, 5:49:14 AM
Kasper Dupont <kas...@daimi.au.dk> wrote:
> Eric Taylor wrote:
>>
>> My klutzy way to flush the caches is to write 8 two-gig files in
>> succession and then rm them all.

> Don't write files, just allocate a lot of anonymous memory
> and write to it all. Here is an example:

Flushes buffers (which then turn into cache), not cache.

Peter

John Reiser

Oct 20, 2005, 11:12:22 AM
> These are full memory checkpoints for a large simulation.

Try using fork(), with the child doing the write()s and the parent
continuing to simulate. Depending on the rate at which the parent
dirties the forked pages, the parallelism may slacken the need for
30-second response on the checkpoint operation. (You did say that
you had plenty of RAM, so dirtying a page costs only the kernel
page-duplicating time, plus the copy-on-write interrupt overhead.)
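
A bare-bones sketch of that (names are placeholders; error handling
and child reaping mostly omitted):

#include <fcntl.h>
#include <unistd.h>

/* Parent returns to simulating immediately; the child holds a
   copy-on-write snapshot of the address space and dumps it. */
void checkpoint(const char *path, const char *base, size_t len)
{
    size_t off, chunk = 4 << 20;   /* 4MB per write() */

    if (fork() == 0) {             /* child */
        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0666);
        for (off = 0; off < len; off += chunk)
            write(fd, base + off,
                  off + chunk <= len ? chunk : len - off);
        close(fd);
        _exit(0);
    }
    /* parent: reap the child later, e.g. waitpid(-1, NULL, WNOHANG) */
}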

Also, consider doing the write()s over a gigabit ethernet connection
to other machine(s). The desired effect is to "borrow" their RAM
and parallelize their disks for the duration of the operation.

> It could be the journaling. I will need to talk to my system admin
> about mounting differently.

Overwriting a pool of 10 pre-allocated 2GB files would save the cost
of disk allocation.

--

Eric Taylor

Oct 20, 2005, 12:37:24 PM

"Peter T. Breuer" wrote:

> > If the cache is not emptied, then the next series of file writes will
>
> There is no way to empty cache, except by filling buffers.
>
> > not complete quickly as the system will hang my process while it
> > does some cache flushing operations.
>
> There is no such thing as a "cache flushing operation". I think you
> have buffer and cache mixed up!

You are correct here. To me, the cache is the memory that top
says is being cached. I don't really know how this translates
into buffers, although the number top labels "buff" seems to
move up/down with the value of "cached".

I have not been able to find a tutorial on how disk file caching
works on linux, do you know of a link?

But what I have found, through some testing (just writing a series
of 2 gig files and timing how long it takes) is that the behavior on
2.4 vs 2.6 seems different.

On the 2.4 system, the writes go fast and the cache and number of
buff (in top) go up quickly until the amount of free mem is near zero.

On the 2.6 system, things begin to bog down when there's still 2 gigs
of free memory, and the cached value never gets more than 8gig on
a 12 gig system (while 2 gig are still free).

So, something must be different. But I will look into the journalling
parameters.

thanks


Peter T. Breuer

Oct 20, 2005, 12:56:59 PM
Eric Taylor <e...@rocketship1.com> wrote:


> "Peter T. Breuer" wrote:

>> > If the cache is not emptied, then the next series of file writes will
>>
>> There is no way to empty cache, except by filling buffers.
>>
>> > not complete quickly as the system will hang my process while it
>> > does some cache flushing operations.
>>
>> There is no such thing as a "cache flushing operation". I think you
>> have buffer and cache mixed up!

> You are correct here. To me, the cache is the memory that top
> says is being cached. I don't really know how this translates
> into buffers, although the number that top says: buff for seems to
> move up/down with the value of cached.

Buffers contain data that is being written to disk but has not yet
arrived on the disk. Cache contains data that is already present on
disk, thus might be data that has either been read from or written
to disk.

If a buffer is written to disk, then it becomes cache (and can be
usurped).


> I have not been able to find a tutorial on how disk file caching
> works on linux, do you know of a link?

It "works" in the obvious way. The only question is what imanagement
strategy is used. For that you have the bdflush controls on buffer
aging and so on. I suspect there is a text on memory tuning in the
Documentation directory of the kernel source.

> But what I have found, through some testing (just writing a series
> of 2 gig files and timing how long it takes) is that the behavior on
> 2.4 vs 2.6 seems different.

It is - they have different memory managers (at least, pre 2.4.10).
The older memory manager is predictive (the control circuit uses at
least first derivative info), and the newer is not (it's essentially
stop-start, bang-bang or whatever one calls a square step response
function).

> On the 2.4 system, the writes go fast and the cache and number of
> buff (in top) go up quickly until the amount of free mem is near zero.

> On the 2.6 system, things begin to bog down when there's still 2 gigs
> of free memory, and the cached value never gets more than 8gig on
> a 12 gig system (while 2 gig are still free).

You are misinterpreting what you see, I think. It may well be that
newer kernels have params set differently in terms of being more or
less aggressive about caching versus buffering versus swapping etc,
but that's not the fundamental thing. One can set those numbers any
way one likes on both.

Peter

Joe Pfeiffer

Oct 20, 2005, 11:59:02 AM
Eric Taylor <e...@rocketship1.com> writes:

> I'm using 2.6 from rhel4 and I need to write 2 gig files quickly.
>
> I don't care if the 2 gigs is all in the cache and nothing has
> been written to disk yet. I've got plenty of memory.
>
> I need the 2 gig writes to "complete" in less than 30 seconds.
> Sometimes when it gets stuck, it can take over 5 minutes and
> the disk led is on solid. This seems slow to me.
>
> I find that I can get the writes to complete the fastest if the
> disk cache is nearly empty before I start. Otherwise, even
> though one would think there was plenty of room left, something
> hangs my program while doing writes.

I'm not sure whether you mean the system cache or the drive's on-board
cache (I suspect both?).

sync() flushes the system buffers. But of course that leaves the
data in the device's cache (as I understand it, anyway).

It looks (browsing the /usr/include/linux for a while) like some disks
support a flush-cache ioctl() call. See /usr/include/linux/ide.h
--
Joseph J. Pfeiffer, Jr., Ph.D. Phone -- (505) 646-1605
Department of Computer Science FAX -- (505) 646-1002
New Mexico State University http://www.cs.nmsu.edu/~pfeiffer
skype: jjpfeifferjr

John Fusco

Oct 20, 2005, 1:20:46 PM

I have experienced the same issue with the filesystem cache and never
found a good solution.

You can also free up cache by allocating a bunch of memory, touch each
page with a write, then free it. Touching the memory ejects the disk
cache entries, then freeing it leaves you with a less full filesystem
cache. On IA32, you probably can't get 2G contiguous memory, so you
would need to do simultaneous forks to consume more memory.

Have you considered using O_DIRECT, which would bypass the cache? This
would be slower, but should be more deterministic.
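
A sketch of the O_DIRECT variant (the 4096 alignment here is an
assumption; the right value depends on the device):

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int fd = open("junk.dat", O_CREAT | O_WRONLY | O_DIRECT, 0666);

    /* O_DIRECT requires the buffer, length, and file offset all
       aligned to the device block size. */
    if (fd == -1 || posix_memalign(&buf, 4096, 1 << 20) != 0)
        return 1;
    write(fd, buf, 1 << 20);   /* bypasses the page cache */
    close(fd);
    return 0;
}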

I also found that write performance improved if I turned off swap. On my
system, swap was on a low speed IDE disk while I had high speed saves
going to a RAID0. The high speed writes with the cache full caused some
(not many) swaps to occur. If your swap disk is slow, this can really
affect overall write performance.


Cheers,

John

Peter T. Breuer

Oct 20, 2005, 1:59:57 PM
John Fusco <fusco...@yahoo.com> wrote:
> You can also free up cache by allocating a bunch of memory, touch each
> page with a write, then free it.

This is correct. But none of this matters since the OP isn't clear
about the difference between cache and buffers, and what he wants is
to write to buffers (which he wants to never go to disk, and hence never
become cache).

Yes - you are right. You have a clever way of wiping out cache entries.
But if he was REALLY being slowed by the presence of cached data (;),
then there would be something fundamentally wrong with memory
management.

It is possible to imagine that when memory is full and he asks for more
buffers, then more buffers are flushed first instead of cache being
freed. That would be a question of management policy (how hard to try
to preserve cache in the face of competition), and can be changed via
bdflush parameters.

It is also possible to imagine that the _overhead_ of freeing cache
is slowing him down. But I can't do it! The point of the memory design
that is there is to let cache be freed when it's required to be freed.
And he wouldn't see a "sudden" slowdown.

All I can think is that he has memory parameters set to favour
scavenging for free buffers instead of using cache.


> Touching the memory ejects the disk
> cache entries, then freeing it leaves you with a less full filesystem
> cache. On IA32, you probably can't get 2G contiguous memory, so you
> would need to do simultaneous forks to consume more memory.

Sure.

> Have you considered using O_DIRECT, which would bypass the cache? This
> would be slower, but should be more deterministic.

He doesn't want to write to disk AT ALL.

Peter

Eric Taylor

Oct 20, 2005, 8:15:57 PM

"Peter T. Breuer" wrote:

> John Fusco <fusco...@yahoo.com> wrote:
> > You can also free up cache by allocating a bunch of memory, touch each
> > page with a write, then free it.
>
> This is correct. But none of this matters since the OP isn't clear
> about the difference between cache and buffers, and what he wants is
> to write to buffers (which he wants to never go to disk, and hence never
> become cache).

Well, actually, I do want it to go to disk - eventually, over 5-10 minutes.
I just want my program to think the writes are done as soon as possible
so the program can proceed. (And this program is soooo complicated
already from 20 years of development costing many 100's of millions,
that you gotta realize why I can't just jump in and change this beast)

Ok, now that I see the distinction between cache and buffers,
let me restate some things:

First, I have a simulation that crunches on nearly 3.5 gigs of
memory, in a somewhat random (virtual memory unfriendly) way.
I expect that my process will simply be given enough of a
working set so that all of its virtual memory ends up residing
in physical ram. Experience says this is the case. Whatever I
eventually do, I can't break this or the simulation would bog
down too much. So, I have to be careful what physical memory
I reclaim. I want to ONLY reclaim disk buffers and disk cache.

Second, every 30 minutes I need to create a snapshot of this
memory (in case we get a crash and need to restart from one of
these snapshots – or checkpoints as we call them). I have 12
gigs of physical ram on our system. We have mechanisms to
recover from most sorts of crashes, including changing code
and restarting – from the snapshots. This all works pretty
well - but this is getting off topic except that we don't worry
too much about performance if we are restoring from a crash,
so we don't care if these snapshots are flushed from cache, after
they are written to disk.


When writing this snapshot, the simulation must pause so the
snapshot is consistent. If the pause time is on the order of 5
minutes or more, this eats into the time before the simulation
can resume again. In the past, we would write the 3.5
gigs of memory into 2 files and this would be finished in less
than 1 minute. So, out of 30 minutes, we only lose 1 minute of
processing to have our snapshots. BTW, this simulation often
runs for a week or two. Think of it as a long running video
game, where we don’t want the players to notice the freeze
when we save the game state. (This is actually very close to
what we are doing - it's a simulation but with live interactions,
but 1 minute of down time goes unnoticed - 5 minutes does not).

So, by starting after a reboot with a clean cache and few
buffers, we have a lot of free memory, nearly 10 gig. When I
write the first 2 files (2 gig and 1.5 gig) this takes about
40 seconds to complete. Now, I realize that the data is not
yet out to the disk, but it will get there in probably 5-10
minutes. No sweat, since we have 30 minutes before we need to
do this again. And I get to overlap the continuing simulation
with this actual “flushing out” time. I got a dual smp.

In a 2.4 kernel, (actually redhat enterprise 3) or at least
the default parameters that come with 2.4 (we don’t mess with
them) each snapshot goes this quickly. Sometimes it takes a
little more but never more than 1.5 minutes.

In the 2.6 kernel (or really the r.h. ent 4 distro we are
using) with no tuning, we see that the first snapshot works
equally well. However, after that, I find that subsequent
snapshots sometimes take 5-10 minutes.

This is the problem I am trying to solve. BUT, if I do
my little cache/buffer emptying thing, I can get around
this problem.

My first crack at solving this, w/o writing any new code, was
to create a script that would run a small test program I had
lying around that writes large files. I write six 2 gig
files, and then delete them all via “rm”, and wait for the rm
command to complete. I do this in a second command window
while the simulation is still running. This activity does not
appear to slowdown the simulation, although there does seem to
be a lot of cpu time used up – but we have the smp so it’s ok.

My question was whether someone knew of a better way (but w/o
needing to write a lot of complicated code) to do this. And
after all, the effect of my script appears to “flush” the
cache and the buffers. At least the numbers I see in the “top”
program reflect this.

But my management thinks this is too clumsy a way to
do things, and it does require that we have some 12 gigs of
free disk space.


tom

Oct 20, 2005, 8:31:02 PM


What about using tmpfs to write the snapshot to and spawn a thread that
copies the file from tmpfs to a harddrive. This way the snapshot will be
instant (because tmpfs's backend is RAM) and in the next 30minutes the
spawned thread could easily copy the file onto a harddrive to secure the
file in case you have to reboot or the whole computer crashes.
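
For instance (mount point and size limit are just examples; needs
root):

mount -t tmpfs -o size=4g tmpfs /mnt/ckpt
# snapshot lands in RAM, so the write returns quickly;
# then trickle it out to real disk in the background:
cp /mnt/ckpt/snap.dat /data/snap.dat &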

tmpfs is great, I'm using it for /tmp and /var/tmp. Sometimes I even
copy World of Warcraft data files into /tmp and 'mount --bind' the files
so WoW reads from tmpfs rather than from the harddrive.. it's an
incredible speedup. Someone even tried putting /usr/lib into a tmpfs.
(there's a howto on gentoo-wiki.org)

tom

Peter T. Breuer

Oct 21, 2005, 5:10:50 AM
Eric Taylor <e...@rocketship1.com> wrote:
> down too much. So, I have to be careful what physical memory
> I reclaim. I want to ONLY reclaim disk buffers and disk cache.

Keep your program in memory using mlockall().
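
For example:

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Pin all current and future pages of this process in RAM so
       the VM never reclaims the simulation's working set. Needs
       root or a suitable RLIMIT_MEMLOCK. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");
    /* ... run the simulation ... */
    return 0;
}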

> Second, every 30 minutes I need to create a snapshot of this
> memory (in case we get a crash and need to restart from one of

The 3.5G, or whatever? Snapshotting memory is not straightforward .. I
don't know how you would do that, let alone efficiently.

> When writing this snapshot, the simulation must pause so the
> snapshot is consistent.

Indeed. But if the snapshot is of "memory in general", then everything
must pause. I don't know how you would identify "your" memory.

> can resume again.. In the past, we would write the 3.5
> gigs of memory into 2 files and this would be finished in less
> than 1 minute. So, out of 30 minutes, we only lose 1 minute of
> processing to have our snapshots.

OK. This is an interesting problem. It sounds made for a lvm-like
snapshot technology working in ram instead of on disk :-).


> So, by starting after a reboot with a clean cache and few
> buffers, we have a lot of free memory, nearly 10 gig. When I
> write the first 2 files (2 gig and 1.5 gig) this takes about
> 40 seconds to complete. Now, I realize that the data is not
> yet out to the disk, but it will get there in probably 5-10
> minutes. No sweat, since we have 30 minutes before we need to
> do this again. And I get to overlap the continuing simulation
> with this actual flushing out time. I got a dual smp.

Dual SMP is helpful. You may be able to locate the simulation on one
and dedicate the other to doing the snapshot (hey, three would be better
:-).

> In a 2.4 kernel, (actually redhat enterprise 3) or at least
> the default parameters that come with 2.4 (we don't mess with
> them) each snapshot goes this quickly. Sometimes it takes a
> little more but never more than 1.5 minutes.

> In the 2.6 kernel (or really the r.h. ent 4 distro we are
> using) with no tuning, we see that the first snapshot works
> equally well. However, after that, I find that subsequent
> snapshots sometimes take 5-10 minutes.

One really wants to find out why. Your claim is that it is cache
management overhead beyond about the 2GB mark. Try emptying cache.


> This is the problem I am trying to solve. BUT, if I do
> my little cache/buffer emptying thing, I can get around
> this problem.

That's the problem - what you did seemed to not be sufficiently
controlled to distinguish between buffers and cache (afair). Can you
try again and convince us that the difference is really an empty CACHE?

> My first crack at solving this, w/o writing any new code, was
> to create a script that would run a small test program I had
> lying around that writes large files.

This will usurp cache with buffers. The buffers will then vanish
too. That sounds OK.

> I write six 2 gig

> files, and then delete them all via rm, and wait for the rm
> command to complete.

The (block) cache has vanished, but there are file-system reservoirs
that may still be hanging around.

> My question was whether someone knew of a better way (but w/o
> needing to write a lot of complicated code) to do this. And
> after all, the effect of my script appears to flush the
> cache and the buffers. At least the numbers I see in the top
> program reflect this.

I don't know why you prefer "top" over "free"! Or just cat
/proc/meminfo and friends for the raw info.

It seems to me that the best method is getting six processes to write
2GB of stuff each into memory and then handshake and die. That is much
quicker than dealing with the disk.

But I would talk to the memory manager people in the kernel if what
you are seeing is some point at which cache management overhead ceases
working the way it is supposed to!


> But my management thinks this is too clumsy a way to
> do things, and it does require that we have some 12 gigs of
> free disk space.

Peter


--
---------------------------------------------------------------------
Peter T. Breuer MA CASM PhD. Ing., Prof. Ramon y Cajal
Area de Ingenieria Telematica E-mail: p...@it.uc3m.es
Dpto. Ingenieria Tel: +34 91 624 91 80
Universidad Carlos III de Madrid Fax: +34 91 624 94 30/65
Butarque 15, E-28911 Leganes ES RL: http://www.it.uc3m.es/~ptb

John McCallum

Oct 21, 2005, 6:18:45 AM
Eric Taylor wrote:

> When writing this snapshot, the simulation must pause so the
> snapshot is consistent. If the pause time is on the order of 5
> minutes or more, this eats into the time before the simulation
> can resume again. In the past, we would write the 3.5
> gigs of memory into 2 files and this would be finished in less
> than 1 minute. So, out of 30 minutes, we only lose 1 minute of
> processing to have our snapshots. BTW, this simulation often
> runs for a week or two. Think of it as a long running video
> game, where we don’t want the players to notice the freeze
> when we save the game state. (This is actually very close to
> what we are doing - it's a simulation but with live interactions,
> but 1 minute of down time goes unnoticed - 5 minutes does not).

I must be misunderstanding something here. It seems to me that by far the
easiest way of accomplishing this is to fork() a child process to do the
disk write. Linux fork() uses copy_on_write pages and so the memory
overhead may even be relatively small, depending on how quickly your
simulation updates the pages. The separate process would simply dump to
disk, and even if it took 5 minutes, it wouldn't hold up the main
simulation task at all.

Cheers,
--
John McCallum
Artesyn CP, Edinburgh

For email, leave the web and we're not so small.

Peter T. Breuer

Oct 21, 2005, 6:35:32 AM
John McCallum <joh...@itsy-bitsy.spider.web.com> wrote:
> I must be misunderstanding something here. It seems to me that by far the

The main simulation must pause in order to allow a snapshot of something
coherent (i.e. its state) to be taken. Just like you ask people not to
move while you take a picture with your camera!

Peter

Robert Redelmeier

Oct 21, 2005, 8:00:33 AM
Peter T. Breuer <p...@oboe.it.uc3m.es> wrote:
> The main simulation must pause in order to allow a snapshot
> of something coherent (i.e. its state) to be taken.
> Just like you ask people not to move while you take a
> picture with your camera!

If `sync` isn't enough for you, mount/flag the disk ops as
`synchronous`. Write won't return until it's done.

Checkpointing is an important business. If you have 8 GB to
write out, check your disk speed with `hdparm`. I don't think
you'll get more than 30 MB/s on most disks, so this'll take at
least 5 minutes to run. Much longer if it has to thrash in pages
from disk. Consider turning swap off if you have enough RAM.

If you don't care about stopping at whole complete iterations, you
could checkpoint on the fly (async) then call close() and sync().
I wouldn't worry too much about buffers. From a memory write,
Linux will use the userland pages. It will only buffer metadata.

This brings up some good points. You're using tons of contiguous
memory and disk. So configure accordingly. The largest available
blocksize at least for disk. And consider 4MB pages for the
kernel to save overhead.

-- Robert

Peter T. Breuer

Oct 21, 2005, 8:39:38 AM
Robert Redelmeier <red...@ev1.net.invalid> wrote:
> Peter T. Breuer <p...@oboe.it.uc3m.es> wrote:
>> The main simulation must pause in order to allow a snapshot
>> of something coherent (i.e. its state) to be taken.
>> Just like you ask people not to move while you take a
>> picture with your camera!

> If `sync` isn't enough for you, mount/flag the disk ops as
> `synchronous`. Write won't return until it's done.

There is no "write" as I understand it. It's in-memory state of the
program! And I don't think he needs help pausing the program ;-).

> Checkpointing is an important business. If you have 8 GB to
> write out, check you disk speed with `hdparm`.

He says he has 3.5GB, I think I recall. But that doesn't matter .. he
can write the snapshot to memory (he has 12GB) and take his own sweet
time about transfering to disk.

His problem is that the memory-to-memory write is taking more than the
30s to 1 minute it was previously taking under kernel 2.4.

Peter

John McCallum

Oct 21, 2005, 8:41:34 AM
Peter T. Breuer wrote:

> The main simulation must pause in order to allow a snapshot of something
> coherent (i.e. its state) to be taken. Just like you ask people not to
> move while you take a picture with your camera!

Yup, I'd understood that. The point being that fork() takes a snapshot of
the memory state of the process when fork() was called (unless it is in
shared memory or some such annoyance, in which case this would be of no
use... partial updates and all the rest).

The child has a snapshot of memory from when fork() was called but it is
faster than a simple copy. This is because the snapshot can actually be the
same physical RAM, removing some of the copying overhead until the
simulation actually changes something. That is, extra page table entries
are created for the new process pointing to the same place. If the main
simulation writes to the pages then the OS copies the memory, but you may
find that some pages do not need to be copied and some expensive copies are
saved.

Peter T. Breuer

Oct 21, 2005, 8:55:07 AM
John McCallum <joh...@itsy-bitsy.spider.web.com> wrote:
> Peter T. Breuer wrote:

>> The main simulation must pause in order to allow a snapshot of something
>> coherent (i.e. its state) to be taken. Just like you ask people not to
>> move while you take a picture with your camera!

> Yup, I'd understood that. The point being that fork() takes a snapshot of
> the memory state of the process when fork() was called (unless it is in

That's very interesting as an idea. I hadn't thought of THAT snapshot
technique! I thought he would simply do some sort of dump based on
knowledge of the internals, such as a map of DB hash buckets.

But you are right - if he is ONLY snapshotting memory, he can fork,
and get a copy of that precise state at that instant. Yowww. I think
you have the jackpot. It doesn't matter how long he takes to send it
to disk.

(agreed all the provisos I am sure you listed!)

Peter

Robert Redelmeier

Oct 21, 2005, 9:04:42 AM
Peter T. Breuer <p...@oboe.it.uc3m.es> wrote:
> His problem is that the [3.5 GB] memory-to-memory write
> is taking more than the 30s to 1 minute it was previously
> taking under kernel 2.4.

Thank you for the synopsis. I had trouble following.

memcpy() is a very important userland function that has to be
optimized for the exact processor (P6/P7/AMD). In kernel space,
is there some reason to believe the minor page faults take longer
to handle? Is he thrashing the pagetables out of cache?

-- Robert

John Reiser

Oct 21, 2005, 11:59:35 AM
My summary of some highlights regarding checkpointing 2GB of address
space every 30 minutes, to enable recovery from errors or accident
while a process runs for many hours or days:

Choice of "on-disk" filesystem layout and policy matters.
Consider using a filesystem that allocates space using extents
(regions of arbitrarily many contiguous blocks) instead of single
blocks or small fixed-size groups of blocks.

Avoid writing the 2GB checkpoint file to a journaling filesystem
such as ext3. If nothing else, then create a separate 6.5GB ext2
filesystem [no journal] to hold the 3 most recent checkpoints,
and do a sync() after each checkpoint.
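
Setup might look like this (the device name is only an example):

mke2fs /dev/sdb1                    # ext2, no journal
mount -o noatime,nodiratime /dev/sdb1 /ckpt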

Writing the checkpoint to a tmpfs filesystem has appeal. tmpfs
is a RAM-resident filesystem that shares memory space with the
kernel page cache, and grows and shrinks automatically, subject to
controllable limits. See the Linux kernel source
Documentation/filesystems/tmpfs for info. By default, most RedHat
systems already mount /dev/shm as a tmpfs that may grow to 1/2 of RAM.
Copy "at leisure" from tmpfs to real disk for power-fail safety.

However, if the observed latency of several minutes really is due
to contention for the kernel page cache, then using a ramfs might be
better because the RAM for a ramfs is dedicated to the filesystem,
and not shared with the kernel page cache. So permanently dedicating
2.1 GB of RAM for a ramfs would guarantee that one checkpoint can be taken
without contention, limited only by the speed of memory-to-memory copy.
Again, copy from ramfs to real disk for power-fail safety.

The fastest logical checkpoint is fork(). The child gets an instantaneous
snapshot of the parent's address space. [However, explicit shmat() or
mmap(,,,MAP_SHARED,,) still will be shared.] The kernel RAM requirements
are at most the page table for the child process (2MB for a 2GB process),
and sometimes even the page table can be shared with the parent. The
kernel implements fork() using copy-on-write page protection. The first
actual write to a copy-on-write page, by either the parent or the child,
causes a page fault. If the page is still shared, then the kernel
duplicates the contents, updates the page table, and continues the
process. So the parent can continue simulating while the child writes
its snapshot to disk, and the only downside is the overhead of faulting
for pages that get duplicated. If the child writes 4MB blocks to disk and
munmap()s the RAM as it goes, then perhaps much of the physical copying
can be avoided: the share count might have been reduced to 1 by the time
the pagefault occurs. Depending on the rates (child writing to disk,
parent breaking copy-on-write), the child may want to use a tmpfs or
ramfs as an intermediate buffer before copying to hard disk. The fork()
allows the parent to continue immediately, and the child's unmapping of
pages copied to RAM-based filesystem enables the kernel to skip the
copy when the parent writes to those pages.
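
The child's write-and-release loop might look like this (a sketch;
'base' and 'len' stand for the snapshot region, which must be
page-aligned):

#include <sys/mman.h>
#include <unistd.h>

/* After fork(), in the child: dump the snapshot in 4MB pieces and
   munmap() each piece once written. Dropping the child's reference
   means a later parent write to those pages may need no physical
   copy. */
void child_dump(int fd, char *base, size_t len)
{
    size_t off, chunk = 4 << 20;

    for (off = 0; off < len; off += chunk) {
        size_t n = off + chunk <= len ? chunk : len - off;
        write(fd, base + off, n);
        munmap(base + off, n);
    }
    _exit(0);
}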

--

tom

Oct 21, 2005, 1:13:35 PM
John Reiser wrote:
> My summary of some highlights regarding checkpointing 2GB of address
> space every 30 minutes, to enable recovery from errors or accident
> while a process runs for many hours or days:
>
> Choice of "on-disk" filesystem layout and policy matters.
> Consider using a filesystem that allocates space using extents
> (regions of arbitrarily many contiguous blocks) instead of single
> blocks or small fixed-size groups of blocks.
>
> Avoid writing the 2GB checkpoint file to a journaling filesystem
> such as ext3. If nothing else, then create a separate 6.5GB ext2
> filesystem [no journal] to hold the 3 most recent checkpoints,
> and do a sync() after each checkpoint.
>

Also consider using XFS realtime subvolumes.. they are great if you're
handling large files (videos etc).

tom

John Fusco

Oct 21, 2005, 4:10:27 PM
Wouldn't copy-on-write take care of that? This sounds like a decent
solution to me.

John

Peter T. Breuer

Oct 21, 2005, 6:05:46 PM
John Fusco <fusco...@yahoo.com> wrote:
> Peter T. Breuer wrote:
>> John McCallum <joh...@itsy-bitsy.spider.web.com> wrote:
>>
>>>I must be misunderstanding something here. It seems to me that by far the
>>
>> The main simulation must pause in order to allow a snapshot of something
>> coherent (i.e. its state) to be taken. Just like you ask people not to
>> move while you take a picture with your camera!
>>
> Wouldn't copy-on-write take care of that? This sounds like a decent
> solution to me.

I suppose it would, if (non-shared) memory is all he is snapshotting.
But we don't know that.

Peter

Nix

Oct 22, 2005, 5:44:53 PM
On Thu, 20 Oct 2005, Peter T. Breuer uttered the following:

> Buffers contain data that is being written to disk but has not yet
> arrived on the disk. Cache contains data that is already present on
> disk, thus might be data that has either been read from or written
> to disk.
>
> If a buffer is written to disk, then it becomes cache (and can be
> usurped).

My understanding is that the buffer cache and page cache are completely
separate entities: one caches information keyed by (block device, block
number) while the other caches information keyed by (dev, inum); both
can have entries that are dirty (requiring flushing before discarding)
and entries that are clean (which can just be discarded). They're
maintained by separate functions and never turn into each other.

Now you can redefine `buffer' and `cache' to mean something completely
different, but then you might have trouble communicating with other
people :)

>> But what I have found, through some testing (just writing a series
>> of 2 gig files and timing how long it takes) that the behavior on
>> 2.4 vs 2.6 seems different.
>
> It is - they have different memory managers (at least, pre 2.4.10).

post-2.4.10 too: 2.6.11+ has four-level page tables, the swap token,
reverse-mapping (`objrmap'), and the block device scheduler to contend
with (although none of these are likely to be terribly significant in
the case of Damn Big Writes with not much memory pressure).

--
`"Gun-wielding recluse gunned down by local police" isn't the epitaph
I want. I am hoping for "Witnesses reported the sound up to two hundred
kilometers away" or "Last body part finally located".' --- James Nicoll

Peter T. Breuer

Oct 22, 2005, 6:14:58 PM
Nix <nix-ra...@esperi.org.uk> wrote:
> On Thu, 20 Oct 2005, Peter T. Breuer uttered the following:
>> Buffers contain data that is being written to disk but has not yet
>> arrived on the disk. Cache contains data that is already present on
>> disk, thus might be data that has either been read from or written
>> to disk.
>>
>> If a buffer is written to disk, then it becomes cache (and can be
>> usurped).

> My understanding is that the buffer cache and page cache are completely
> separate entities:

My understanding is that they are the same thing - all are "buffers"
(in the kernel sense) as pointed to by buffer heads, and it is merely a
question of which list they are on (and which flags they have) as to
where they will be counted in the totals.

>> It is - they have different memory managers (at least, pre 2.4.10).

> post-2.4.10 too: 2.6.11+ has four-level page tables, the swap token,
> reverse-mapping (`objrmap'), and the block device scheduler to contend

Block device scheduler? What? Are you saying that the order in which
the block device request functions are triggered is subject to a more
sophisticated algorithm than "in order of major" (or whatever it was)?

> with (although none of these are likely to be terribly significant in
> the case of Damn Big Writes with not much memory pressure).

Well, if they have managed to make a block device scheduler which is
clever enough not to cycle trying to release pressure on the same
device again and again when it is that device causing the pressure ...
yes, it is significant.

Peter

Eric Taylor

Oct 24, 2005, 1:47:36 PM
Hmmm, fork is beginning to look very attractive. I think I
will concentrate on this approach for now.

I think our checkpointing problem is an interesting one, and
I’ve always thought that it should be something that is
supported by the kernel. I imagine there are just not enough
users who need this, however. So, since it’s not in the
kernel, I’ve been working to find a solution for the last 5-6
years (after our first port to linux).

So, for the curious, here is more detail about our long-
standing “checkpointing” problem.

The system in question is written in the commercially provided
simscript language. This specialized simulation language has
been around for over 20 years and it supports checkpointing.
The compiler outputs C code but does not use malloc for memory
allocation. It does use standard C absolute non-relocatable
pointers, and so any checkpoints taken have to be restored to
the same exact virtual memory addresses.

It should also be fairly evident why we can’t break this up
into a bunch of smaller cooperating processes. This would
require a whole new approach, and probably a rewrite in a
different language. With 64 bit around the corner, we’re just
trying to keep the 32-bit version going a few more years.

(Some years ago an attempt to create a checkpoint by writing
out all the little pieces of memory was a bust. With millions
of lines of code it was too difficult to find all the pieces,
and we would have had to deal with the relocation of pointers
ourselves. It would also have been a lot slower with all the
piecemeal writes, not to mention all the piecemeal allocations
on a restart).

This need for using the same virtual addresses across runs of
the program has been a sticky problem for years. The original
version allocated all memory in one contiguous chunk with sbrk
(a port from solaris) and would run into problems with shared
libraries getting in the way. Part of the problem was that our
needs were in conflict with the linux philosophy that one
should not use absolute virtual memory addresses. My attempts
to get official changes to linux were not successful. Sticky
problems like exec shield also exacerbated the problem (and
our users want to use that feature, so turning it off is not a
solution).

We break up our code into 2 pieces: a static executable and a
dynamic library. The dynamic library can be changed between a
checkpoint and a restore. This is in case we need to fix some
code after running the simulation for a few days or weeks.

But let me fast forward over the many prior solutions to the
one I thought would finally solve all our difficulties.

First off we use a redhat hugemem kernel to statically
allocate nearly 3.8 gigs of user VA space, leaving the
remaining .2 gigs for code and stack.

The problem with the location of shared libraries was finally
solved using a “blessed” way of getting the 3.8 gigs of
absolute contiguous virtual address space that I could be
assured would be there (or the program wouldn’t even run at
all). The breakthrough was the change to the elf loader to
support the notion of “huge bss” segments. This causes bss
segments to get loaded into memory before the shared libraries
and is now a standard feature of linux.

I did need to allocate memory via an asm statement, since the
gcc compiler will not let you directly allocate an array over
2 gigs in size. So, I allocate memory like this:

asm(" .comm hugebss,0xf4000000,0x1000" );

By requesting 3.8 gigs of bss memory, I can be assured that
the virtual memory range of .2 gig through 3.8 gig must have
been statically allocated for my bss segment. It simply can’t
fit anywhere else.

The remaining .2 gigs is enough room for the code and all
shared libraries. Our simulation uses very little stack so
that is not a problem. I then completely manage this 3.6 gigs
of VA space with my own memory allocation code. This lets me
snapshot memory in a few (very) large writes, although I
actually do it in 10 meg chunks so I can output a % progress
display to the terminal window.
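
The write loop is essentially this (a sketch; 'hugebss', 'total',
and 'fd' stand in for the real names):

#include <stdio.h>
#include <unistd.h>

/* Dump the big bss region in 10 meg slices so a progress
   percentage can be printed between write() calls. */
void dump_region(int fd, const char *hugebss, size_t total)
{
    size_t off, chunk = 10 << 20;

    for (off = 0; off < total; off += chunk) {
        size_t n = off + chunk <= total ? chunk : total - off;
        write(fd, hugebss + off, n);
        printf("%3u%% \r", (unsigned)(100ULL * (off + n) / total));
        fflush(stdout);
    }
    printf("\n");
}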

BTW, the ultimate solution is in the works – a 64 bit AMD
system. But this won’t be ready for another year or two, and
we would still have many of the same checkpointing problems to
solve. So, for now, I will look into trying a fork approach.

thanks for all the help from all posters

Eric Taylor

Oct 24, 2005, 3:11:42 PM

"Peter T. Breuer" wrote:

> Eric Taylor <e...@rocketship1.com> wrote:
> > down too much. So, I have to be careful what physical memory
> > I reclaim. I want to ONLY reclaim disk buffers and disk cache.
>
> Keep your program in memory using mlockall().
>
> > Second, every 30 minutes I need to create a snapshot of this
> > memory (in case we get a crash and need to restart from one of
>
> The 3.5G, or whatever? Snapshotting memory is not straightforward .. I
> don't know how you would do that, let alone efficiently.

Since we statically allocate all memory via a bss segment, and then
manage that memory with our own "malloc-like" routine, all the
memory we need to save and restore is in a small number of
large contiguous pieces.

So, if you have what amounts to a couple of very large contiguous
chunks of memory, using some large write calls will write
this out fairly well, sans the cache/buffer difficulties that have
surfaced with the new kernel we are using (redhat ent 4's
own version of a 2.6 kernel).

>
>
> > This is the problem I am trying to solve. BUT, if I do
> > my little cache/buffer emptying thing, I can get around
> > this problem.
>
> That's the problem - what you did seemed to not be sufficiently
> controlled to distinguish between buffers and cache (afair). Can you
> try again and convince us that the difference is really an empty CACHE?

Here's a little test case I ran. I call it fr.c

gcc -o fr fr.c

then to write 3 x 2 gig files:

./fr w junkfile1 2000
./fr w junkfile2 2000
./fr w junkfile3 2000

and it will write a 2 gig file, reporting progress at the terminal.

What I found was that on a 12 gig 2.6 system with about 8 gigs free,
on the third run of the program there would be a pause in the middle,
with the disk light solid busy, even though there was still over
2 gigs of free memory. I don't know what was occurring at this time.
Seems to me that it should have gone flat out until all memory was
nearly gone.

(below is kinda ugly, I apologize in advance; it was originally a Windows
test program too and I cut out all the extra junk not relevant here).


#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define SIZE (4 * 1024 * 250)   /* ~1 MB per write() call */
#define ERR 100
#define OK 0

static char buffer[SIZE];

/* Return the current time of day as "HH:MM:SS". */
static char *timeis(void)
{
    static char buf[100];
    time_t aclock;

    time(&aclock);
    strcpy(buf, asctime(localtime(&aclock)) + 11);
    buf[8] = '\0';
    return buf;
}

/* Write 'megs' chunks of SIZE bytes to 'name', checksumming each
   chunk as we go and printing progress every 10 chunks. The file is
   not truncated first, so an existing file is overwritten in place. */
static int writefile(const char *name, int megs)
{
    int fh1, i, j;
    long long tb = 0;
    unsigned int sum = 0;
    ssize_t byteswritten;

    fh1 = open(name, O_CREAT | O_RDWR, 0777);
    if (fh1 == -1) {
        perror("open failed on output file");
        printf("%s\n", name);
        return ERR;
    }

    for (i = 0; i < SIZE; i++)
        buffer[i] = i & 0xff;

    for (i = 0; i < megs; i++) {
        byteswritten = write(fh1, buffer, SIZE);
        if (byteswritten < 0) {
            perror("Problem writing file");
            break;
        }
        tb += byteswritten;
        for (j = 0; j < byteswritten; j++)
            sum += buffer[j] & 0xff;
        if (i != 0 && i % 10 == 0) {
            printf("%4d meg \r", i);
            fflush(stdout);
        }
    }
    printf("%15u %lld bytes written\n", sum, tb);
    close(fh1);
    return OK;
}

int main(int argc, char **argv)
{
    int r;

    if (argc <= 3 || argv[1][0] != 'w') {
        printf("\nUsage: fr w file megs -- write file and init to size megs\n");
        return 0;
    }
    r = writefile(argv[2], atoi(argv[3]));
    printf("return = %d, finished at %s\n", r, timeis());
    return r;
}


JosephKK

Nov 2, 2005, 11:33:15 PM
Peter T. Breuer wrote:

> For that you have the bdflush controls on buffer
> aging and so on. I suspect there is a text on memory tuning in the
> Documentation directory of the kernel source.

Could you give me a pointer to some documentation on how to set these
parameters? I have never been able to figure out where it is.
--
JosephKK

Peter T. Breuer

Nov 3, 2005, 2:46:55 AM
JosephKK <jose...@lanset.com> wrote:
> Peter T. Breuer wrote:
>> aging and so on. I suspect there is a text on memory tuning in the
>> Documentation directory of the kernel source.
>>
> Could you give me a pointer to some documentation on how to set these
> parameters? I have never been able to figure out where it is.

The bdflush man page used to work for me. It says ...

The set of parameters, their values, and their legal
ranges are defined in the kernel source file fs/buffer.c.

I'm pretty sure there is a memory tuning text in the kernel Docu dir,
but I don't see it at a glance.
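
For 2.6 kernels the bdflush knobs are gone; the rough equivalents
live under /proc/sys/vm and are documented in
Documentation/sysctl/vm.txt in the 2.6 source. A sketch of the sort
of tuning discussed above (values purely illustrative; needs root):

echo 60 > /proc/sys/vm/dirty_ratio
echo 40 > /proc/sys/vm/dirty_background_ratio
echo 3000 > /proc/sys/vm/dirty_expire_centisecs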

Peter
