VM: dynamic swap remapping (patch)

Vladimir Dozen

Sep 29, 2001, 7:57:20 AM

to hac...@freebsd.org

ehlo.

(Sorry for long pre-history, I believe it is necessary.)

My current employer develops large CORBA-based data mining servers.
They usually run under HP-UX, but, following the current fashion
for building processing farms, I was tasked with building a version for
free Unices. The initial platform was Linux, and the build itself went
smoothly, but very soon we hit a problem: we use pthreads; to be more
precise, we use a thread-per-client model. This means that at any one
time we may be computing anywhere from one to a few dozen client
sessions. Each session may eat as much as 1G of address space, or even
more (actually, there are no limits except hardware ones).

The problem was how Linux (and FreeBSD, as we soon discovered) treats
an out-of-memory (OOM) situation.

Under HP-UX memory is precommitted (i.e., swap is reserved for every
allocated page), so as soon as we run out of memory, malloc() returns
NULL or operator new throws an exception. That gives us the opportunity
to unwind the stack, tell the client we cannot perform his request right
now and, most importantly, continue executing other clients' requests.

Linux and FreeBSD simply killed our whole process, and we had no
chance to learn that we were out of memory! All the data of all our
clients (some of which had been processing for days) was lost. :(((((

Very unfriendly, and, perhaps more importantly, this kind of interaction
(the absence of it, really) between OS and application reduces the chances
of porting really large applications to FreeBSD, because no one can
trust an OS that can simply trash user data with no warning.

It seems to me that the OS must use every chance to continue executing
an application instead of killing it. I do think it is the Right Way.

I have written a patch that modifies the behaviour (have I spelled this
word right? ;) of the VM when we are out of memory. Instead of killing
the largest process, we remap parts of its address space onto temporary
files (exactly as HP-UX does when swapping into a directory is turned on).
Of course, we cannot do this when we are absolutely out of swap, so we do
it a bit earlier, when the swap daemon finds that free swap has dropped
to nswap_lowat.

I called this patch OOM Keeper, as opposed to the OOM Killer used in
Linux (yah, I prefer BSD).

Here is the general algorithm:

1. The swap daemon finds vm_swap_size < nswap_lowat; it calls
   vm_oomkeeper_swap_almost_full().
2. vm_oomkeeper_swap_almost_full() searches for the process having the
   largest vm_object of type OBJT_SWAP, and sends it a signal
   (proposed name: SIGXMEM).
3. The process gets the signal and calls a special syscall (proposed
   name: remap).
4. (We are in the kernel again, this time in vm_oomkeeper_process, and
   curproc is our big process.)
   While free swap blocks are below nswap_hiwat, we do the following:
   a) find the largest object of type OBJT_SWAP in the current process
   b) create a temporary file and unlink() it
   c) save the first 1M of the object into the file
   d) cut the first 1M of the map (this is where we get free swap blocks)
   e) mmap the file onto the place where the data was before.

If any of the above fails, the old killproc() will trigger,
so the system will still be able to drop buggy processes.
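
Steps b) through e) can be imitated from user space, which may make the idea easier to follow. The sketch below is not the kernel patch (vm_oomkeeper.c is not shown in this thread); it is an illustration that assumes a single-threaded caller, a writable /tmp, and a page-aligned anonymous region. The name remap_chunk_to_file is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose mkstemp() and friends under strict -std modes */
#endif
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move `len` bytes of an anonymous mapping at `addr` onto an unlinked
 * temporary file while keeping the data visible at the same address.
 * `addr` and `len` must be page-aligned. Returns 0 on success, -1 on
 * failure (where the real patch would fall back to killproc()).
 * Hypothetical helper; /tmp is an assumed location.
 */
int remap_chunk_to_file(void *addr, size_t len)
{
    char path[] = "/tmp/oomk.XXXXXX";
    int fd = mkstemp(path);              /* b) create a temporary file ... */
    if (fd < 0)
        return -1;
    unlink(path);                        /* ... and unlink it right away   */

    /* c) save the chunk into the file */
    const char *p = addr;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    /*
     * d) + e) MAP_FIXED replaces the anonymous pages at `addr` with a
     * shared mapping of the file, so the old pages (and any swap behind
     * them) are released and the data stays where it was.
     */
    void *m = mmap(addr, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_FIXED, fd, 0);
    close(fd);                           /* the mapping keeps the file alive */
    return m == MAP_FAILED ? -1 : 0;
}
```

A caller would invoke remap_chunk_to_file(region, 1 << 20) repeatedly on successive 1M chunks. In a threaded program another thread could modify the chunk between the write and the mmap, which is one reason the real work belongs in the kernel.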

Note: the process now has a chance to do something in an OOM situation.
It can simply ignore the signal, and it will be killed soon. It can
call remap(), and it will be remapped onto files -- this will
slow things down, but will allow processing to continue. It can
free some space (e.g., by unmapping an anonymous mmap). Finally, it
can save its current data and terminate, if none of the above is
acceptable.

Note also that ulimits and quotas remain in effect, since the files
are created under the process's credentials.

This patch was tested on my home PC with 64M RAM and 64M swap; I was
able to run processes with up to 512M of _committed_ address space
in various scenarios: a large malloc followed by commit, small
incremental mallocs with immediate commit, random commit, two or
three such memory eaters running in parallel, etc. No doubt it
requires additional testing.

The whole patch lives in a separate file, vm_oomkeeper.c, and
it requires only a single intrusion point in the current code: one
added line in swap_pager.c:swp_sizechk().

But, to implement it fully, I have to add a new signal and a new
syscall to the system. I do not want to go that far until I know
whether my patch is acceptable to the FreeBSD team.

To make it fully controllable it would also be useful to set
nswap_{hi,lo}wat via the sysctl interface. In any case, when using OOMK
these two should be raised about 4 to 8 times (from 400K to 2-4M).

It would also be valuable if the default action for SIGXMEM were not
SIG_IGN, but a call to remap(). This requires patching libc. A special
environment variable ($REMAPDIR) might be used to set the location of
the temporary files.

I can send vm_oomkeeper.c on request (it is 12K long, and I
do not want to post it to the mailing list without permission).

Comments?

--
dozen @ home

To Unsubscribe: send mail to majo...@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message

Alfred Perlstein

Sep 29, 2001, 8:10:45 AM

to Vladimir Dozen, hac...@freebsd.org

* Vladimir Dozen <vladimi...@mail.ru> [010929 06:57] wrote:
> ehlo.
>
> (Sorry for long pre-history, I believe it is necessary.)

[snip]
> Comments?

Wow! This is really awesome work you've done, perhaps you can put
the patch up on a URL someplace? If not mail it to me in private
and I can put it up for people to see. One thing though, I think
that this behaviour should be toggled via a sysctl, but I think I
can manage doing that for you.

One other question, why not just set an option to make FreeBSD not
overcommit? I've always wanted the ability to turn off overcommit
for exactly the same reasons you do.

--
-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Wilko Bulte

Sep 29, 2001, 8:14:10 AM

to Alfred Perlstein, Vladimir Dozen, hac...@freebsd.org

On Sat, Sep 29, 2001 at 07:10:24AM -0500, Alfred Perlstein wrote:
> * Vladimir Dozen <vladimi...@mail.ru> [010929 06:57] wrote:
> > ehlo.
> >
> > (Sorry for long pre-history, I believe it is necessary.)
> [snip]
> > Comments?
>
> Wow! This is really awesome work you've done, perhaps you can put
> the patch up on a URL someplace? If not mail it to me in private
> and I can put it up for people to see. One thing though, I think
> that this behaviour should be toggled via a sysctl, but I think I
> can manage doing that for you.
>
> One other question, why not just set an option to make FreeBSD not
> overcommit? I've always wanted the ability to turn off overcommit
> for exactly the same reasons you do.

FWIW: Tru64 has had this capability since day one. You can select
swap-overcommit mode by removing a symlink (/sbin/swapdefault -> /dev/foob)
where /dev/foob is the primary swap partition.

W/

--
| / o / /_ _ email: wi...@FreeBSD.org
|/|/ / / /( (_) Bulte Arnhem, The Netherlands

Matt Dillon

Sep 29, 2001, 12:55:53 PM

to Wilko Bulte, Alfred Perlstein, Vladimir Dozen, hac...@freebsd.org

:> overcommit? I've always wanted the ability to turn off overcommit
:> for exactly the same reasons you do.
:
:FWIW: Tru64 has had this capability since day one. You can select
:swap-overcommit mode by removing a symlink (/sbin/swapdefault -> /dev/foob)
:were /dev/foob is the primary swap partition.
:
:W/
:
:--
:| / o / /_ _ email: wi...@FreeBSD.org
:|/|/ / / /( (_) Bulte Arnhem, The Netherlands

Well, the overcommit argument comes up once or twice a year. Frankly
I don't see much of a point to it. While it is true that you could
implement a signal, the plain fact of the matter is that having to deal
with the possibility in a program at the N points (generally hundreds of
points) where that program allocates memory, either directly or
indirectly, virtually guarantees that you will introduce bugs into the
system. You also cannot guarantee that your process will have time to
clean up before the system kills it, nor can you guarantee that all the
standard system utilities and daemons will be able to gracefully handle
the out-of-memory condition. In other words, you could implement
the signal and even have the program use it, but you will still likely
leave gaping holes in the implementation that will result in lost data.

It is much easier to manage memory manually. For example, if these
programs require 1G of independent memory to run, it ought to be a
fairly simple matter to create a 1GB file for each process
(using dd rather than ftruncate() to create the file so the blocks are
preallocated), mmap() it using PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NOSYNC,
and do your memory management out of that. The memory space will be
backed by the file rather than by swap. You get all the benefits of
the standard overcommit capabilities of the system as well as the
ability to pre-reserve the main workspace for the programs, and you
automatically get persistent storage for the data. Problem solved.
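
As an illustration only, the file-backed workspace described above might be set up roughly as follows. The function name open_workspace and the zero-filling loop (standing in for dd) are assumptions of this sketch; MAP_NOSYNC is FreeBSD-specific, so it is defined away on systems that lack it.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose POSIX declarations under strict -std modes */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0        /* FreeBSD-only flag; treated as a no-op elsewhere */
#endif

/*
 * Create `path`, write `size` bytes of zeros so the blocks really are
 * allocated (ftruncate() alone would leave a sparse file), and map the
 * file shared. Returns the base address, or NULL on failure.
 * Hypothetical helper for illustration.
 */
void *open_workspace(const char *path, size_t size)
{
    static const char zeros[65536];
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return NULL;

    size_t done = 0;
    while (done < size) {
        size_t want = size - done < sizeof zeros ? size - done : sizeof zeros;
        ssize_t n = write(fd, zeros, want);
        if (n <= 0) {                    /* e.g. ENOSPC: fail now, not mid-run */
            close(fd);
            unlink(path);
            return NULL;
        }
        done += (size_t)n;
    }

    void *base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_NOSYNC, fd, 0);
    close(fd);                           /* the mapping stays valid */
    return base == MAP_FAILED ? NULL : base;
}
```

The point of filling the file up front is that a shortage of disk space shows up as a NULL return at startup, where the program can still react, rather than as a failure partway through a long computation. The allocator that carves objects out of the returned region is left to the application.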

-Matt

Rik van Riel

Sep 29, 2001, 2:39:36 PM

to Vladimir Dozen, hac...@freebsd.org

On Sat, 29 Sep 2001, Vladimir Dozen wrote:

> I have wrote a patch that modifies behaivour (have I spelled this
> word right? ;) of VM when we are out of memory. Instead of killing
> largest process, we remap parts of it's address space onto temporal
> files (exactly as HP-UX does when swapping into dir turned on).

This is not instead of killing, this is just a way to
delay the killing of processes longer. Once your disk
is full you'd still run into the choice between a
deadlock and a kill...

It's an awesome way of delaying the out of memory
problem, though, because a suspended application won't
be able to allocate anything more, giving the system a
better chance to let the running apps run to completion.

Alternatively, the one leaky application is suspended
and the rest of the system continues to run without any
problems.

In short, I like it ;)

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to aard...@nl.linux.org (spam digging piggy)

Vladimir Dozen

Sep 29, 2001, 3:38:58 PM

to Matt Dillon, Wilko Bulte, Alfred Perlstein, hac...@freebsd.org

ehlo.

> You also cannot guarentee that your process will have time to
> cleanup prior to the system killing, nor can you guarentee that all the
> standard system utilities and daemons will be able to gracefully handle
> the out of memory condition. In otherwords, you could implement
> the signal and even have the program use it, but you will still likely
> leave gaping holes in the implementation that will result in lost data.

Actually, things as I coded them are best suited precisely to poorly
written daemons that never check the result of malloc(). Precommit will
kill them as soon as malloc() returns NULL and they dereference it.
killproc() will kill them too. Remapping will save them. Disks are
now large enough to keep them alive until root notices that
they have grown too much and does something (kills them manually, probably ;).

> It is much easier to manage memory manually. For example, if these
> programs require 1G of independant memory to run it ought to be a
> fairly simple matter to simply create a 1GB file for each process
> (using dd rather then ftruncate() to create the file so the blocks are
> preallocated), mmap() it using PROT_READ|PROT_WRITE,MAP_SHARED|MAP_NOSYNC,
> and do your memory management out of that.

First of all, it is NOT easier. Doing your own memory management is not
that simple, especially with threads and SMP -- we have seen a 50%
performance hit when two threads on two processors were doing intensive
allocations (it was not FreeBSD, and these were kernel threads).

Second, the application does not always grow to 1G; most of the time it
stays as small as 500M ;). Why should we precommit 1G for 500M of data?
Doing multi-mmap memory management is additional pain.

Third, swapping to a device is faster, and, while we have enough swap,
I would prefer to swap there. Even a few percent matters for a 5-day
computation.

> Problem solved.

If I'm the developer -- probably, yes. What if I'm a system administrator
and have to run something large _and important_? The day I notice
that the monster is creating swap files, I'll know I have to add RAM.
I will have time, since it still works; it was not killed.

P.S. Anyway, I do NOT insist that my solution is better, or even that it
is good for anything at all. It was fun for me to hack on the BSD kernel,
it was an interesting challenge, and I feel the need to share the results
with others. At worst, I will recommend that our customer set up the
processing farm under FreeBSD with the patch applied.

--
dozen @ home

Alfred Perlstein

Sep 29, 2001, 6:57:15 PM

to Vladimir Dozen, Matt Dillon, Wilko Bulte, hac...@freebsd.org

* Vladimir Dozen <vladimi...@mail.ru> [010929 14:38] wrote:

> P.S. Anyway, I do NOT insist my solution is better, and even that it
> is good for anything at all. It was fun for me to hack in BSD kernel,
> and it was interesting challenge, and I feel need to share results
> with others. At worst, I will recommend our customer to setup
> processing farm under FreeBSD with applied patch.

I'm really impressed with the work you put into this, but it seems
that you've tried to tackle two problems at the same time, and by
tying them together made it less flexible and possibly more error
prone.

My suggestion, (but not my final say, i'm still open to ideas):

Implement a memory status signal to notify processes of changes
in the relative amount of system memory.

When memory reaches a low or high watermark, the signal is
broadcast to all running processes.

The default disposition will be to ignore the signal.

The signal will be named SIGMEMINFO. (SIGXfoo means
'process has exceeded resource foo')

The signal will pass via the siginfo struct information
such that the process can determine if the system has
just exceeded the low watermark (danger) or has reclaimed
down to the high watermark (enough free memory).

This is just to provide processes with a warning to scale back
consumption, exit, or release resources. The good part is that
it's broadcast, so all interested parties will do something,
hopefully the right thing.
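
A rough sketch of how a process might consume such a notification is shown below. SIGMEMINFO does not exist, so SIGUSR1 stands in for it, and the MEM_LOW / MEM_OK codes carried in si_value are invented for this example; a real kernel implementation would define its own siginfo contents.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose sigaction() under strict -std modes */
#endif
#include <signal.h>
#include <string.h>

#define SIGMEMINFO_STANDIN SIGUSR1       /* stand-in: SIGMEMINFO is only a proposal */
enum { MEM_LOW = 1, MEM_OK = 2 };        /* hypothetical watermark codes */

static volatile sig_atomic_t mem_low;    /* 1 while below the low watermark */

/* One handler for both events; the siginfo payload says which one it was. */
static void on_meminfo(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    mem_low = (info->si_value.sival_int == MEM_LOW);
}

/* Install the handler; returns 0 on success, -1 on failure. */
int install_meminfo_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_meminfo;
    sa.sa_flags = SA_SIGINFO;            /* ask for the three-argument handler */
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGMEMINFO_STANDIN, &sa, NULL);
}

/* Polled by allocation wrappers to decide whether to refuse new work. */
int memory_is_low(void)
{
    return mem_low;
}
```

The handler only records the state in a flag. The application checks memory_is_low() at points where it is safe to shed load, for instance before accepting a new client session, instead of trying to free memory from inside the signal handler.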

To achieve nearly the same effect as your patch, I would implement
the above low/high water mark notification, then either:

a) over allocate swap a bit and set the low watermark carefully.
b) do the following enhancement:

Provide a system whereby you can swap to the filesystem without
additional upcalls/syscalls from userspace, basically, provide
some means of paging to the filesystem automatically.

then, set your lowwater mark to the size of your swap partition,
now your system will alert your processes and automatically swap
_anyone_ to the filesystem.

I really think that this would be more flexible and still allow
you to achieve what you want... What do you think?

--
-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Karsten W. Rohrbach

Sep 29, 2001, 10:34:00 PM

to Vladimir Dozen, hac...@freebsd.org

Vladimir Dozen(vladimi...@mail.ru)@2001.09.29 15:59:41 +0000:

> ehlo.
>
> (Sorry for long pre-history, I believe it is necessary.)
>
> My current employer develops large CORBA-based data mining servers.
> They are usually run under HP-UX, but, following the current fashion
> to build processing farms, I was targeted to build version for free
> unices. Initial platform was Linux, and build itself was done smoothly,
> but very soon we were got problem: we use pthreads; to be more precise,
> we use thread-per-client model. This means that at the same time we may
> compute from single to a few tens client sessions. Each session may eat
> as much as 1G of address space, and even more (actually, there is no
> limits except for hardware ones).

IIRC from the problems we had with a project some while ago, mm might
help. [http://www.engelschall.com/sw/mm/]

it wraps malloc() and friends into a neat api, including preallocation
in fs space (the features are somewhat os dependent) and fast shared
memory.

/k

--
> Did you know that there are 71.9 acres of nipple tissue in the U.S.?
KR433/KR11-RIPE -- WebMonster Community Founder -- nGENn GmbH Senior Techie
http://www.webmonster.de/ -- ftp://ftp.webmonster.de/ -- http://www.ngenn.net/
karsten&rohrbach.de -- alpha&ngenn.net -- alpha&scene.org -- ca...@spam.de
GnuPG 0x2964BF46 2001-03-15 42F9 9FFF 50D4 2F38 DBEE DF22 3340 4F4E 2964 BF46
Please do not remove my address from To: and Cc: fields in mailing lists. 10x

Matt Dillon

Sep 30, 2001, 3:54:02 AM

to Vladimir Dozen, Wilko Bulte, Alfred Perlstein, hac...@freebsd.org

: Second, application not always grows to 1G, most of the time it keeps
: as small as 500M ;). Why should we precommit 1G for 500M data? Doing
: multi-mmap memory management is additional pain.

Why not? Disk space is cheap. For a problem like this I would simply
throw in two 30G+ hard drives and partition them with 16G of swap each,
giving me 32G of swap for the machine. If you needed to do it cheaply
you could even use IDE, though personally I would use SCSI for
reliability. Depending on the amount of real memory in the machine
you might have to tweak a few kernel options (like matching NSWAP to
the actual number of swap devices), but basically it should just work.

Even using file-backed memory is fairly trivial. You don't need to
do multi-mmap memory management or do any kernel tweaking. Just
reserve 1G and use a single mmap() and file per process.

-Matt

Vladimir Dozen

Sep 30, 2001, 4:02:30 AM

to Alfred Perlstein, Vladimir Dozen, Matt Dillon, Wilko Bulte, hac...@freebsd.org

ehlo.

> My suggestion, (but not my final say, i'm still open to ideas):
>
> Implement a memory status signal to notify processes of changes
> in the relative amount of system memory.
>
> When memory reaches a low or high watermark, the signal is
> broadcast to all running processes.
>
> The default disposition will be to ignore the signal.
>
> The signal will be named SIGMEMINFO. (SIGXfoo means
> 'process has exceeded resource foo')

Agreed. As for SIG_IGN, can anyone tell me -- can I force an
existing application to use my signal handler? For example,
by preloading some shared library? If so, there are no
arguments against ignoring the signal by default.

> The signal will pass via the siginfo struct information
> such that the process can determine if the system has
> just exceeded the low watermark (danger) or has reclaimed
> down to the high watermark (enough free memory).

Passing more info is always better. Agreed.

> a) over allocate swap a bit and set the low watermark carefully.
> b) do the following enhancement:
>
> Provide a system whereby you can swap to the filesystem without
> additional upcalls/syscalls from userspace, basically, provide
> some means of paging to the filesystem automatically.
>
> then, set your lowwater mark to the size of your swap partition,
> now your system will alert your processes and automatically swap
> _anyone_ to the filesystem.
>
> I really think that this would be more flexible and still allow
> you to achieve what you want... What do you think?

I can't say anything until I get the details. Sorry, English is neither
my native language nor one I use often, so I may easily miss important
details, but here are my random comments:

Initially, I tried the same (I think) approach, but there were
some problems. Some kernel functions refused to work with VM objects
of processes other than curproc. I.e., it could be hard to work
with bigproc inside the swap daemon, and the swap daemon is the only place
where we can detect the OOM condition; that's why I used a signal to
transfer control to user space, and then back into the kernel -- now in
another process. Another reason to do it this way is to make all limits
and quotas work automatically. Also, I did not want to keep the swap
daemon busy for too long.

Also, what does "over allocate swap a bit" mean? How do we compute the
value of that bit? At what moment should we preallocate? Should we repeat
the preallocation after getting SIGMEMINFO (himark)?

Also, you cannot set the low mark to the size of the swap partition. To
create file-based swap you need some memory (file operations require it).
So, the low mark should be a bit lower (that's why I raised the value of
nswap_lowat).

Finally, if you want to over allocate swap for every process in the
system, the whole swap can wind up consisting of nothing but preallocations.
Resource management is the role of the kernel. Any hard reservation
interferes with that.

--
dozen @ home

Alfred Perlstein

Sep 30, 2001, 4:10:26 AM

to Matt Dillon, Vladimir Dozen, Wilko Bulte, hac...@freebsd.org

* Matt Dillon <dil...@earth.backplane.com> [010930 02:53] wrote:
>
> : Second, application not always grows to 1G, most of the time it keeps
> : as small as 500M ;). Why should we precommit 1G for 500M data? Doing
> : multi-mmap memory management is additional pain.
>
> Why not? Disk space is cheap. For a problem like this I would simply
> throw in two 30G+ hard drives and partition them with 16G of swap each,
> giving me 32G of swap for the machine. If you needed to do it cheaply
> you could even use IDE, though personally I would use SCSI for
> reliability. Depending on the amount of real memory in the machine
> you might have to tweek a few kernel options (like matching NSWAP to
> the actual number of swap devices), but basically it should just work.
>
> Even using file-backed memory is fairly trivial. You don't need to
> do multi-mmap memory management or do any kernel tweaking. Just
> reserve 1G and use a single mmap() and file per process.

What he needs is a system to inform him that things aren't looking
so good, check my email for what I think is a pretty good solution.

--
-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Poul-Henning Kamp

Sep 30, 2001, 4:49:47 AM

to Matt Dillon, Vladimir Dozen, Wilko Bulte, Alfred Perlstein, hac...@freebsd.org

In message <200109300752...@earth.backplane.com>, Matt Dillon writes:
>: Second, application not always grows to 1G, most of the time it keeps
>: as small as 500M ;). Why should we precommit 1G for 500M data? Doing
>: multi-mmap memory management is additional pain.
>
> Even using file-backed memory is fairly trivial. You don't need to
> do multi-mmap memory management or do any kernel tweaking. Just
> reserve 1G and use a single mmap() and file per process.

I once had a patch to phkmalloc() which backed all malloc'ed VM with
hidden files in the user's homedir. It was written to put the VM
usage under QUOTA control, but it had many useful side effects as well.

I can't seem to find it right now, but it is trivial to do: just
replace the sbrk(2) with mmap(). The only downside is the needed
file descriptor, which some shells don't like.
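
The original patch is not available, so the following is only a guess at the general shape of such a back end: a page source that an allocator could call where it would otherwise call sbrk(2). The name file_backed_pages and the ".vmback." file prefix are made up for this sketch, and unlike sbrk(2) the returned regions are not guaranteed to be contiguous.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose mkstemp()/ftruncate() under strict -std modes */
#endif
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static int backing_fd = -1;   /* the one extra descriptor mentioned above */
static off_t backing_size;    /* bytes of the file handed out so far      */

/*
 * Hand out `len` fresh bytes backed by a hidden, unlinked file in `dir`
 * (for example the user's home directory). `len` must be a multiple of
 * the page size. Returns NULL on failure. Hypothetical helper.
 */
void *file_backed_pages(const char *dir, size_t len)
{
    if (backing_fd < 0) {
        char path[1024];
        int n = snprintf(path, sizeof path, "%s/.vmback.XXXXXX", dir);
        if (n < 0 || (size_t)n >= sizeof path)
            return NULL;
        backing_fd = mkstemp(path);
        if (backing_fd < 0)
            return NULL;
        unlink(path);         /* the space is reclaimed when the process exits */
    }

    off_t start = backing_size;
    if (ftruncate(backing_fd, start + (off_t)len) != 0)
        return NULL;          /* growing the file failed */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                   backing_fd, start);
    if (p == MAP_FAILED)
        return NULL;
    backing_size = start + (off_t)len;
    return p;
}
```

Because ftruncate() creates a sparse file, disk blocks (and therefore quota) are only charged when pages are first written, so running over quota would surface as a fault on access rather than as a NULL return here.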

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
p...@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

Alfred Perlstein

Sep 30, 2001, 4:55:56 AM

to Vladimir Dozen, Matt Dillon, Wilko Bulte, hac...@freebsd.org

* Vladimir Dozen <vladimi...@mail.ru> [010930 03:02] wrote:
> ehlo.
>
> > My suggestion, (but not my final say, i'm still open to ideas):
> >
> > Implement a memory status signal to notify processes of changes
> > in the relative amount of system memory.
> >
> > When memory reaches a low or high watermark, the signal is
> > broadcast to all running processes.
> >
> > The default disposition will be to ignore the signal.
> >
> > The signal will be named SIGMEMINFO. (SIGXfoo means
> > 'process has exceeded resource foo')
>
> Agreed. As for SIG_IGN, can anyone tell me -- can I force
> existing application to use my signal handler? For example,
> by preallocating some shared library? If so, there are no
> contras for ignoring signal by default.

Yes, it's kind of evil, but you need to do this:

make a .c file that has your signal handler and a function
called _init that enables it. You also might want to
export whether we're below the low watermark via a sysctl
variable, so that at startup you can set a variable to do
things like make a malloc wrapper fail...

compile it like so:
gcc -shared -fpic -fPIC -o t2.So -c t2.c ; ld t2.So -shared -o t2.so

then install it someplace, then set this in the environment:

LD_PRELOAD=/path/to/where/you/put/it/t2.so

All non-setuid and non-setgid programs will respect this.
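
For illustration, a file like t2.c might look roughly as follows. This sketch uses the GCC/Clang constructor attribute instead of a hand-written _init, which has the same effect of running code when the library is loaded. SIGUSR1 is only a stand-in, since the proposed memory signal does not exist, and the handler merely prints a message.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* glibc: expose sigaction() under strict -std modes */
#endif
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Handler injected into the host program; uses only async-signal-safe calls. */
static void on_memlow(int sig)
{
    static const char msg[] = "preload: memory-low signal received\n";
    (void)sig;
    if (write(STDERR_FILENO, msg, sizeof msg - 1) < 0) {
        /* nothing useful can be done about a failed write here */
    }
}

/* Runs when the shared object is loaded, before the program's main(). */
__attribute__((constructor))
static void install_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_memlow;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);       /* stand-in for the proposed signal */
}
```

With a current toolchain this can be built in one step with cc -shared -fPIC -o t2.so t2.c. A program that installs its own handler for the same signal will override this one, so the trick only helps applications that leave the signal alone.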

Now just because I told you that, doesn't mean you should run off
now and use your solution, I hope you take the time to consider
what I've got to say here. :)

> > The signal will pass via the siginfo struct information
> > such that the process can determine if the system has
> > just exceeded the low watermark (danger) or has reclaimed
> > down to the high watermark (enough free memory).
>
> Passing more info is always better. Agreed.

It's just a trick so you only need one signal instead of a signal
for both SIGMEMLOW (signal memory low) and SIGMEMOK (signal memory
is ok, or enough is free).

> > a) over allocate swap a bit and set the low watermark carefully.
> > b) do the following enhancement:
> >
> > Provide a system whereby you can swap to the filesystem without
> > additional upcalls/syscalls from userspace, basically, provide
> > some means of paging to the filesystem automatically.
> >
> > then, set your lowwater mark to the size of your swap partition,
> > now your system will alert your processes and automatically swap
> > _anyone_ to the filesystem.
> >
> > I really think that this would be more flexible and still allow
> > you to achieve what you want... What do you think?
>
> I can't say anything until I'll got detail. Sorry, English is neither
> my native nor used often, so I may easely miss important details, but
> here is my random comments:
>
> Initally, I was trying the same (I think) approach, but there was
> some problems. Some kernel function refused to work with VM objects
> of processes differing from curproc. I.e., it could be hard to work
> with bigproc inside swap daemon; and swap daemon is the only place
> where we can detect OOM condition; that's why I used signal to transfer
> control to user space, and then back into kernel -- already in another
> process. Another reason to do it -- to make all limits and quota work
> automatically. Also, I did not wanted to make swap daemon busy too long.

Well you can simply grab curproc to do this and steal the context,
most likely you'll be in the context of a process that has faulted,
the only trick you need to do is to write the file instead of directly
to the device. The idea is to cause the next fault by any program
to give you a 'curproc' to do filesystem operations.

You could also wakeup() the swapper and set a flag to tell it to
allocate some filesystem space, the problem (as you've stated) is
that you can tie swapper up too long doing this.

> Also, what means "over allocate swap a bit"? How to compute the value
> of that bit? At what moment should we preallocate? Should we repeat
> preallocation after getting SIGMEMINFO (himark)?

You're still thinking of the combined solution, just think of a
system where all you have right now is the signals I mentioned.

Now remember, your solution depends on spare space in the filesystem...

Your spare space is most likely not known.

Instead of depending on possibly non-existent spare space, just
make your swap a little bigger and set your low watermark a bit
lower. Now you have more time to do something when memory is low.

> Also, you cannot set low mark to size of swap partition. To create
> file-based swap you need some memory (file operations requires it).
> So, low mark should be a bit lower (that's why I raised value of
> nswap_lowat).

Yes, you're right, I didn't consider this. You can start swapping
to the filesystem at an earlier point then.

> Finally, if you want to over allocate swap for every process in
> system, the whole swap can wind up consisting of only preallocations.
> Resource management is the role of kernel. Any hard reservation
> interfere with that.

I'm not arguing for pre-allocation, I'm just saying that at a
certain point you're going to run out of swap. If that swap is in
your filesystem you can still run out of filesystem space (if you
even have any at that point). That's why you over allocate and
set your low watermark appropriately.

So instead of having 1gig of swap, and depending on having N blocks
free in the filesystem, you allocate 1gig+N blocks of swap and
watch for the signal to start freeing resources.

Just think what happens if your filesystems are full and you run
out of swap...

--
-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Rik van Riel

Sep 30, 2001, 5:12:56 AM

to Alfred Perlstein, Vladimir Dozen, Matt Dillon, Wilko Bulte, hac...@freebsd.org

On Sat, 29 Sep 2001, Alfred Perlstein wrote:
> * Vladimir Dozen <vladimi...@mail.ru> [010929 14:38] wrote:
>
> > P.S. Anyway, I do NOT insist my solution is better, and even that it
> > is good for anything at all. It was fun for me to hack in BSD kernel,
> > and it was interesting challenge, and I feel need to share results
> > with others. At worst, I will recommend our customer to setup
> > processing farm under FreeBSD with applied patch.
>
> I'm really impressed with the work you put into this, but it seems
> that you've tried to tackle two problems at the same time,

Indeed, the whole idea of swapping tasks to the filesystem
is nice, but having the task do this all by itself isn't a
good option for many people...

> My suggestion, (but not my final say, i'm still open to ideas):
>
> Implement a memory status signal to notify processes of changes
> in the relative amount of system memory.
>
> When memory reaches a low or high watermark, the signal is
> broadcast to all running processes.
>
> The default disposition will be to ignore the signal.
>
> The signal will be named SIGMEMINFO. (SIGXfoo means
> 'process has exceeded resource foo')

That'd be SIGDANGER, right ?

> b) do the following enhancement:
>
> Provide a system whereby you can swap to the filesystem without
> additional upcalls/syscalls from userspace, basically, provide
> some means of paging to the filesystem automatically.

Sounds like a winner, when swap runs out a process gets
suspended onto the filesystem automatically and SIGDANGER
is sent out to give others a chance to clean themselves
up.

If enough space is freed, the suspended process can get
back into the system.

This should also preserve leaky applications while at the
same time leaving the system intact...

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to aard...@nl.linux.org (spam digging piggy)

Vladimir Dozen

Sep 30, 2001, 5:16:06 AM

to Matt Dillon, Vladimir Dozen, Wilko Bulte, Alfred Perlstein, hac...@freebsd.org

ehlo.

> : Second, application not always grows to 1G, most of the time it keeps
> : as small as 500M ;). Why should we precommit 1G for 500M data? Doing
> : multi-mmap memory management is additional pain.
>
> Why not? Disk space is cheap.

Developer time is expensive. Someone already wrote good allocation
routines, and they are inside libc. Reinventing the bicycle in every
new large-scale application doesn't sound good to me.

> For a problem like this I would simply
> throw in two 30G+ hard drives and partition them with 16G of swap each,
> giving me 32G of swap for the machine.

As was said here before, there are actually two problems: notification
(avoiding silent kills) and getting more paging space. The second can
be solved by adding swap space. The first cannot. As a developer, I'm
more interested in the first. The current solution with killproc() is not
acceptable.

Just imagine any OS documentation that says: "the OS may
terminate a process at any point with no warning or notification". Would
you like to use it? But this is exactly what FreeBSD does at OOM.

> Even using file-backed memory is fairly trivial. You don't need to
> do multi-mmap memory management or do any kernel tweaking. Just
> reserve 1G and use a single mmap() and file per process.

As I already said, it is not trivial. It involves writing or adapting
some allocation code. It means time & human resources -> money.

--
dozen @ home

Alfred Perlstein

Sep 30, 2001, 5:23:09 AM

to Rik van Riel, Vladimir Dozen, Matt Dillon, Wilko Bulte, hac...@freebsd.org

* Rik van Riel <ri...@conectiva.com.br> [010930 04:12] wrote:
> On Sat, 29 Sep 2001, Alfred Perlstein wrote:
> > * Vladimir Dozen <vladimi...@mail.ru> [010929 14:38] wrote:
> >
> > > P.S. Anyway, I do NOT insist my solution is better, and even that it
> > > is good for anything at all. It was fun for me to hack in BSD kernel,
> > > and it was interesting challenge, and I feel need to share results
> > > with others. At worst, I will recommend our customer to setup
> > > processing farm under FreeBSD with applied patch.
> >
> > I'm really impressed with the work you put into this, but it seems
> > that you've tried to tackle two problems at the same time,
>
> Indeed, the whole idea of swapping tasks to the filesystem
> in nice, but having the task do this all by itself isn't a
> good option for many people...
>
> > My suggestion, (but not my final say, i'm still open to ideas):
> >
> > Implement a memory status signal to notify processes of changes
> > in the relative amount of system memory.
> >
> > When memory reaches a low or high watermark, the signal is
> > broadcast to all running processes.
> >
> > The default disposition will be to ignore the signal.
> >
> > The signal will be named SIGMEMINFO. (SIGXfoo means
> > 'process has exceeded resource foo')
>
> That'd be SIGDANGER, right ?

Sort of.

>
> > b) do the following enhancement:
> >
> > Provide a system whereby you can swap to the filesystem without
> > additional upcalls/syscalls from userspace, basically, provide
> > some means of paging to the filesystem automatically.
>
> Sounds like a winner, when swap runs out a process gets
> suspended onto the filesystem automatically and SIGDANGER
> is sent out to give others a chance to clean themselves
> up.

Well, no, the idea is to have a low and high watermark so that
flip-flopping on the boundary doesn't generate a lot of signals.

SIGDANGER is ok for a name, but slightly misleading because
I wanted to piggyback some info in the siginfo to tell processes
when the danger has passed. Well ok, the name is ok, but
I do want an upcall when the situation is alleviated.
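
The low/high watermark idea can be sketched as a small state machine. This is only an illustration: the names mem_watch, mem_event and mem_watch_update() are hypothetical and come from neither this thread nor the FreeBSD sources, and a real kernel version would live next to the existing swap accounting. The point is that one event is produced per crossing, so hovering around a single boundary does not produce a stream of signals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes; the thread only proposes the idea. */
enum mem_event { MEM_EVENT_NONE, MEM_EVENT_DANGER, MEM_EVENT_CLEAR };

struct mem_watch {
    size_t low_water;   /* free pages at or below this: danger */
    size_t high_water;  /* free pages at or above this: danger has passed */
    int    in_danger;   /* current state, 0 or 1 */
};

/*
 * Feed in the current free-page count and get back the event (if any)
 * that should be broadcast. Between the two marks nothing is sent.
 */
static enum mem_event
mem_watch_update(struct mem_watch *w, size_t free_pages)
{
    assert(w->low_water < w->high_water);

    if (!w->in_danger && free_pages <= w->low_water) {
        w->in_danger = 1;
        return MEM_EVENT_DANGER;
    }
    if (w->in_danger && free_pages >= w->high_water) {
        w->in_danger = 0;
        return MEM_EVENT_CLEAR;
    }
    return MEM_EVENT_NONE;
}
```

A caller would broadcast the signal only when the return value is not MEM_EVENT_NONE, and could pass the event code along in the siginfo.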

Let me also state that it may be wise to add heuristics to the
system to not SIGDANGER anything that is completely swapped
out or hasn't run in a long time; this would avoid a spike
in thrashing at the time of the broadcast.

> If enough space is freed, the suspended process can get
> back into the system.
>
> This should also preserve leaky applications while at the
> same time leaving the system intact...

Hopefully, having a SIGDANGER handler may also be an indication
to the kernel to give you a second chance before shooting at
you. I know it could be used to subvert behavior to have another
naive program killed; however, that could be a tunable to give
those trying to do the right thing a second chance.

--
-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Vladimir Dozen

Sep 30, 2001, 5:41:55 AM

to Alfred Perlstein, Vladimir Dozen, Matt Dillon, Wilko Bulte, hac...@freebsd.org

ehlo.

> You're still thinking of the combined solution, just think of a
> system where all you have right now is the signals I mentioned.

Yah, now I think I got it. Well, actually, the signal(s) is all
I need. The remapping was just a bonus. To be more precise,
I need only one signal -- when the low mark is passed. Some other
application might be interested in the second -- high mark --
signal, but mine isn't.

SIGDANGER is the signal from Irix, AFAIR?

So, how about accepting this name (just to not increase the entropy
of the Universe) and sending it to all processes when nswap_lowat
is reached?

The only point -- I would prefer to be able to set nswap_lowat
via sysctl, since I cannot predict how much memory can
be consumed while freeing memory ;) (e.g., throwing an exception
in C++ may eat memory due to creating the exception object; logging
may eat memory also).

> Just think what happens if your filesystems are full and you run
> out of swap...

The same that happens today -- killproc() will kill me.
The situation doesn't become worse with remapping, it is just
... mmm... prolonged.

--
dozen @ home

Alfred Perlstein

Sep 30, 2001, 5:44:32 AM

to Vladimir Dozen, Matt Dillon, Wilko Bulte, hac...@freebsd.org

* Vladimir Dozen <vladimi...@mail.ru> [010930 04:41] wrote:
> ehlo.
>
> > You're still thinking of the combined solution, just think of a
> > system where all you have right now is the signals I mentioned.
>
> Yah, now I think I got it. Well, actually, signal(s) is all
> I need. The remapping was just a bonus. To be more precise,
> I need the only signal -- at low mark passed. Some other
> application might be interested in second -- hi mark --
> signal, but my doesn't.
>
> SIGDANGER is the signal from Irix, AFAIR?
>
> So, how about to accept this name (just to not increase entropy
> of the Universe) and send it to all processes when nswap_lowat
> reached?
>
> The only point -- I prefer to have ability to set nswap_lowat
> via sysctl since I cannot predict what amount of memory can
> be consumed while freeing memory ;) (e.g., throwing exception
> in C++ may eat memory due to creating exception object; logging
> may eat memory also).

You want to submit a patch? If not I can take a look at it,
but it's been a bit since I've looked at the vm system.

>
> > Just think what happens if your filesystems are full and you run
> > out of swap...
>
> The same that happens today -- killproc() will kill me.
> The situation doesn't becomes worse with remapping, it just
> ... mmm... prolonges.
>
> --
> dozen @ home

--

-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Vladimir Dozen

Sep 30, 2001, 6:58:00 AM

to Alfred Perlstein, Vladimir Dozen, Matt Dillon, Wilko Bulte, hac...@freebsd.org

ehlo.

> You want to submit a patch? If not I can take a look at it,
> but it's been a bit since I've looked at the vm system.

Except for the sysctl, the patch is quite simple, due to the fact
that hysteresis is already implemented in swap_pager.c. Something
like:

============================================
diff vm/swap_pager.c vm.new/swap_pager.c
217a218,219
> struct proc* p;
>
218a221,225
> /* warn all processes */
> for( p = allproc.lh_first; p != 0; p = p->p_list.le_next )
> {
> psignal(p,SIGDANGER);
> }
============================================

============================================
diff sys/signal.h sys.new/signal.h
105a106,109
> #ifndef _POSIX_SOURCE
> #define SIGDANGER 32 /* close to out-of-memory */
> #endif
>
============================================

============================================
diff kern/kern_sig.c kern.new/kern_sig.c
165a166
> SA_IGNORE /* SIGDANGER */
============================================

--
dozen @ home

Vladimir Dozen

Sep 30, 2001, 7:16:33 AM

to Vladimir Dozen, Alfred Perlstein, Matt Dillon, Wilko Bulte, hac...@freebsd.org

ehlo.

> ============================================
> diff vm/swap_pager.c vm.new/swap_pager.c
> 217a218,219
> > struct proc* p;
> >
> 218a221,225
> > /* warn all processes */
> > for( p = allproc.lh_first; p != 0; p = p->p_list.le_next )
> > {
> > psignal(p,SIGDANGER);
> > }
> ============================================

Oops, it doesn't work. All processes died. Why?
Does something need to be changed in libc?

Alfred Perlstein

Sep 30, 2001, 7:19:10 AM

to Vladimir Dozen, Matt Dillon, Wilko Bulte, hac...@freebsd.org

* Vladimir Dozen <vladimi...@mail.ru> [010930 06:16] wrote:
> ehlo.
>
> > ============================================
> > diff vm/swap_pager.c vm.new/swap_pager.c
> > 217a218,219
> > > struct proc* p;
> > >
> > 218a221,225
> > > /* warn all processes */
> > > for( p = allproc.lh_first; p != 0; p = p->p_list.le_next )
> > > {
> > > psignal(p,SIGDANGER);
> > > }
> > ============================================
>
> Oops, it doesn't work. All processes died. Why?
> Something should be changed in libc?

I'll take a look at implementing it sometime this week.

I want to do the siginfo thing if possible.

--
-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Jos Backus

Sep 30, 2001, 1:55:23 PM

to hac...@freebsd.org

On Sun, Sep 30, 2001 at 01:44:37PM +0000, Vladimir Dozen wrote:
> SIGDANGER is the signal from Irix, AFAIR?

AIX has SIGDANGER.

--
Jos Backus _/ _/_/_/ Santa Clara, CA
_/ _/ _/
_/ _/_/_/
_/ _/ _/ _/
jo...@cncdsl.com _/_/ _/_/_/ use Std::Disclaimer;

Alfred Perlstein

Sep 30, 2001, 3:23:42 PM

to Jos Backus, hac...@freebsd.org

* Jos Backus <jo...@cncdsl.com> [010930 12:55] wrote:
> On Sun, Sep 30, 2001 at 01:44:37PM +0000, Vladimir Dozen wrote:
> > SIGDANGER is the signal from Irix, AFAIR?
>
> AIX has SIGDANGER.

Anyone care to tell me how it works in AIX? If the interface is
nice, cloning it would be kind of cool.

--
-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Jos Backus

Sep 30, 2001, 3:35:10 PM

to hac...@freebsd.org

On Sun, Sep 30, 2001 at 02:23:26PM -0500, Alfred Perlstein wrote:
> * Jos Backus <jo...@cncdsl.com> [010930 12:55] wrote:
> > AIX has SIGDANGER.
>
> Anyone care to tell me how it works in AIX? If the interface is
> nice, cloning it would be kind of cool.

I don't currently have access to an AIX system, but

http://as400bks.rochester.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/pag_space_under.htm

has some (useful) info.

--
Jos Backus _/ _/_/_/ Santa Clara, CA
_/ _/ _/
_/ _/_/_/
_/ _/ _/ _/
jo...@cncdsl.com _/_/ _/_/_/ use Std::Disclaimer;

Alfred Perlstein

Sep 30, 2001, 3:56:24 PM

to Jos Backus, hac...@freebsd.org

* Jos Backus <jo...@cncdsl.com> [010930 14:35] wrote:
> On Sun, Sep 30, 2001 at 02:23:26PM -0500, Alfred Perlstein wrote:
> > * Jos Backus <jo...@cncdsl.com> [010930 12:55] wrote:
> > > AIX has SIGDANGER.
> >
> > Anyone care to tell me how it works in AIX? If the interface is
> > nice, cloning it would be kind of cool.
>
> I don't currently have access to an AIX system, but
>
> http://as400bks.rochester.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/pag_space_under.htm
>
> has some (useful) info.

It sure does!

I think I'm going to make a proposal on -arch about this, to be perfectly
honest, AIX has a good implementation, I haven't read it all yet, but
it doesn't look like it gives the applications a notification when
the danger is gone, we'll have to figure that out, or I'll have to
read more into this.

--
-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Matt Dillon

Sep 30, 2001, 3:59:56 PM

to Poul-Henning Kamp, Vladimir Dozen, Wilko Bulte, Alfred Perlstein, hac...@freebsd.org

:

:In message <200109300752...@earth.backplane.com>, Matt Dillon writes:
:>: Second, application not always grows to 1G, most of the time it keeps
:>: as small as 500M ;). Why should we precommit 1G for 500M data? Doing
:>: multi-mmap memory management is additional pain.
:>
:> Even using file-backed memory is fairly trivial. You don't need to
:> do multi-mmap memory management or do any kernel tweaking. Just
:> reserve 1G and use a single mmap() and file per process.
:
:I once had a patch to phkmalloc() which backed all malloc'ed VM with
:hidden files in the users homedir. It was written to put the VM
:usage under QUOTA control, but it had many useful side effects as well.
:
:I can't seem to find it right now, but it is trivial to do: just
:replace the sbrk(2) with mmap(). Only downside is the needed
:filedescriptor which some shells don't like.
:
:--
:Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
:p...@FreeBSD.ORG | TCP/IP since RFC 956

I think the file descriptor problem can be solved easily... simply
open the file, mmap() the entire 1G segment for this special application,
and then close() the file. Then have sbrk() just eat out of the mapped
segment. Alternatively sbrk() could open/mmap/close in large 1MB or 4MB
segments, again leaving no file descriptors dangling.

-Matt

Alfred Perlstein

Sep 30, 2001, 4:06:47 PM

to Matt Dillon, Poul-Henning Kamp, Vladimir Dozen, Wilko Bulte, hac...@freebsd.org

Won't that cause fragmentation? You're forgetting the need to
ftruncate or pre-zero the file unless that's been fixed.

--
-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Matt Dillon

Sep 30, 2001, 4:22:37 PM

to Alfred Perlstein, Poul-Henning Kamp, Vladimir Dozen, Wilko Bulte, hac...@freebsd.org

:> :Poul-Henning Kamp | UNIX since Zilog Zeus 3.20

:> :p...@FreeBSD.ORG | TCP/IP since RFC 956
:>
:> I think the file descriptor problem can be solved easily... simply
:> open the file, mmap() the entire 1G segment for this special application,
:> and then close() the file. Then have sbrk() just eats out of the mapped
:> segment. Alternatively sbrk() could open/mmap/close in large 1MB or 4MB
:> segments, again leaving no file descriptors dangling.
:
:Won't that cause fragmentation? You're forgettng the need to
:ftruncate or pre-zero the file unless that's been fixed.
:
:--
:-Alfred Perlstein [alf...@freebsd.org]

You have to pre-zero the file. You can do it in reasonably-sized
chunks (like 4M) without causing fragmentation. You *CANNOT* use
ftruncate() to extend the file - that will virtually guarantee massive
fragmentation.
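
A rough userland sketch of the approach discussed above: pre-zero the file with write() in 4M chunks, map it once, and close the descriptor straight away. The function name map_backing_file() and the mkstemp() template are assumptions made for the example, not something from the patch or from phkmalloc.

```c
#define _DEFAULT_SOURCE  /* glibc: keep POSIX/BSD declarations visible under strict -std= modes */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define ZERO_CHUNK ((size_t)4 * 1024 * 1024)   /* 4M, the chunk size suggested above */

/*
 * Create an unlinked temporary file of `size` bytes, pre-zeroed with
 * write() rather than extended with ftruncate(), map it, and close the
 * descriptor. `tmpl` is a writable mkstemp() template such as
 * "/tmp/heapXXXXXX". Returns the mapping, or MAP_FAILED on error.
 */
static void *
map_backing_file(char *tmpl, size_t size)
{
    assert(tmpl != NULL && size > 0);

    int fd = mkstemp(tmpl);
    if (fd < 0)
        return MAP_FAILED;
    unlink(tmpl);                 /* the file goes away once it is unmapped */

    char *zeros = calloc(1, ZERO_CHUNK);
    if (zeros == NULL) {
        close(fd);
        return MAP_FAILED;
    }

    size_t left = size;
    while (left > 0) {
        size_t n = left < ZERO_CHUNK ? left : ZERO_CHUNK;
        ssize_t wr = write(fd, zeros, n);
        if (wr < 0 && errno == EINTR)
            continue;             /* interrupted, retry the same chunk */
        if (wr <= 0) {            /* disk full or other error */
            free(zeros);
            close(fd);
            return MAP_FAILED;
        }
        left -= (size_t)wr;
    }
    free(zeros);

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                    /* the mapping keeps the file alive */
    return p;
}
```

Because the descriptor is closed before the function returns, nothing is left dangling for a shell or a child process to trip over.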

-Matt

Warner Losh

Sep 30, 2001, 11:32:20 PM

to Vladimir Dozen, hac...@freebsd.org

In message <200109301...@eix.do-labs.spb.ru> Vladimir Dozen writes:
: SIGDANGER is the signal from Irix, AFAIR?

AIX.

Warner

Greg Lehey

Sep 30, 2001, 11:49:30 PM

to Alfred Perlstein, Jos Backus, hac...@freebsd.org

On Sunday, 30 September 2001 at 14:55:58 -0500, Alfred Perlstein wrote:
> * Jos Backus <jo...@cncdsl.com> [010930 14:35] wrote:
>> On Sun, Sep 30, 2001 at 02:23:26PM -0500, Alfred Perlstein wrote:
>>> * Jos Backus <jo...@cncdsl.com> [010930 12:55] wrote:
>>>> AIX has SIGDANGER.
>>>
>>> Anyone care to tell me how it works in AIX? If the interface is
>>> nice, cloning it would be kind of cool.
>>
>> I don't currently have access to an AIX system, but
>>
>> http://as400bks.rochester.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/pag_space_under.htm
>>
>> has some (useful) info.
>
> It sure does!
>
> I think I'm going to make a proposal on -arch about this, to be
> perfectly honest, AIX has a good implementation, I haven't read it
> all yet, but it doesn't look like it gives the applications a
> notification when the danger is gone, we'll have to figure that out,
> or I'll have to read more into this.

If it's any help, I have an AIX box here. It belongs to IBM, so I
have to respect security issues, but I'll do what I can.

Greg
--
See complete headers for address and phone numbers

Alfred Perlstein

Oct 1, 2001, 12:41:34 AM

to Greg Lehey, Jos Backus, hac...@freebsd.org

* Greg Lehey <gr...@FreeBSD.org> [010930 22:49] wrote:
> On Sunday, 30 September 2001 at 14:55:58 -0500, Alfred Perlstein wrote:
> > * Jos Backus <jo...@cncdsl.com> [010930 14:35] wrote:
> >> On Sun, Sep 30, 2001 at 02:23:26PM -0500, Alfred Perlstein wrote:
> >>> * Jos Backus <jo...@cncdsl.com> [010930 12:55] wrote:
> >>>> AIX has SIGDANGER.
> >>>
> >>> Anyone care to tell me how it works in AIX? If the interface is
> >>> nice, cloning it would be kind of cool.
> >>
> >> I don't currently have access to an AIX system, but
> >>
> >> http://as400bks.rochester.ibm.com/doc_link/en_US/a_doc_lib/aixbman/admnconc/pag_space_under.htm
> >>
> >> has some (useful) info.
> >
> > It sure does!
> >
> > I think I'm going to make a proposal on -arch about this, to be
> > perfectly honest, AIX has a good implementation, I haven't read it
> > all yet, but it doesn't look like it gives the applications a
> > notification when the danger is gone, we'll have to figure that out,
> > or I'll have to read more into this.
>
> If it's any help, I have an AIX box here. It belongs to IBM, so I
> have to respect security issues, but I'll do what I can.

Well, Jos seems to have provided a pretty interesting document on
how it works in AIX, but I was wondering if they do anything wrt
low/high watermarks like my idea.

Basically you'd like to inform processes that the danger has been
alleviated so that they can cautiously start accepting more work
rather than freaking out and shutting out clients forever...

This might lead to a situation where SIGDANGER starts getting
sent, informing that things are looking bleak; then processes
start freeing resources; they get the second SIGDANGER to let
them know that things are looking ok, so they ramp up again and
the cycle repeats. I guess that's not optimal, but I'd like FreeBSD
to let processes know that things are looking better so they can
go from "scrooge mode" to "thrifty mode".

--
-Alfred Perlstein [alf...@freebsd.org]
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'

Jos Backus

Oct 1, 2001, 1:54:27 AM

to hac...@freebsd.org

On Sun, Sep 30, 2001 at 11:41:14PM -0500, Alfred Perlstein wrote:
> > If it's any help, I have an AIX box here. It belongs to IBM, so I
> > have to respect security issues, but I'll do what I can.

I seem to remember that one could set a watermark using the no command, but I
could be wrong. No AIX to verify this, maybe Greg can. The link below has some
info, too:

http://nscp.upenn.edu/aix4.3html/aixbman/prftungd/tunableaixparms.htm

--
Jos Backus _/ _/_/_/ Santa Clara, CA
_/ _/ _/
_/ _/_/_/
_/ _/ _/ _/
jo...@cncdsl.com _/_/ _/_/_/ use Std::Disclaimer;

Terry Lambert

Oct 1, 2001, 1:56:47 AM

to Alfred Perlstein, Greg Lehey, Jos Backus, hac...@freebsd.org

Alfred Perlstein wrote:
[ ... SIGDANGER ... ]

> Well Joe seems to have provided a pretty interesting document on
> how it works in AIX, but I was wondering if they do anything wrt
> low/high watermarks like my idea.
>
> Basically you'd like to inform processes that the danger has been
> alliviated so that they can cautiously start accepting more work
> rather than freaking out and shutting out clients forever...

The process is supposed to return unused memory to the system
when it gets the signal, if it can.

It's not supposed to shed all load until it gets the "all clear"
signal.

I don't know if there are any good books on Windows Internals,
but the Windows VM system does the same thing: it notifies all
kernel subsystems that they need to free up memory, if they can.
The VFAT32 IFS will basically return exactly one page out of
many thousands it is using for cache, when it gets the request
(it is implemented as a callback, which you must provide when
you register for VM services).

> This might lead to a situation where SIGDANGER starts getting
> sent informing that things are looking bleak, then processes
> start freeing resources, they get the second SIGDANGER to let
> them know that things are looking ok so they ramp up again and
> the cycle repeats, I guess that's not optimal, but I'd like FreeBSD
> to let processes know that things are looking better so they can
> go from "scrooge mode" to "thrifty mode".

The idea is just to free resources, if you can, and to mark the
processes which are "precious" by whether or not they have a
signal handler. A close reading of the other document posted
(it seemed to be the admin manual from the URL) will indicate
that the follow-on SIGKILL is not sent to the processes that have
a SIGDANGER handler registered. Note that this does not mean
that your process won't be killed off as a result of a page not
present fault, so abusing the interface is not really tolerated
very well by the system.
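
As a rough illustration of that policy, victim selection might look like the sketch below. The struct proc_info layout and pick_victim() are hypothetical names invented for the example; this is not AIX's actual algorithm, only the rule described above (skip anything with a handler) combined with the "kill the largest process" rule mentioned at the start of this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process summary; the field names are illustrative only. */
struct proc_info {
    int    pid;
    size_t pages;        /* paging space in use */
    int    has_handler;  /* nonzero if a SIGDANGER handler is installed */
};

/*
 * Pick the process to kill: the largest consumer among those that did
 * NOT install a handler. Returns NULL when every process has one.
 */
static const struct proc_info *
pick_victim(const struct proc_info *procs, size_t n)
{
    const struct proc_info *victim = NULL;

    assert(procs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (procs[i].has_handler)
            continue;                       /* "precious", skip it */
        if (victim == NULL || procs[i].pages > victim->pages)
            victim = &procs[i];
    }
    return victim;
}
```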

I think signalling an "all clear" is really a bad idea; a soft
hysteresis loop is much less prone to pendulum swings than a
hard hysteresis loop (lesson #1 in the book "Fuzzy Logic").

-- Terry

Vladimir Dozen

Oct 1, 2001, 4:37:57 AM

to Alfred Perlstein, Greg Lehey, Jos Backus, hac...@freebsd.org

ehlo.

> Well Joe seems to have provided a pretty interesting document on
> how it works in AIX, but I was wondering if they do anything wrt
> low/high watermarks like my idea.
>
> Basically you'd like to inform processes that the danger has been
> alliviated so that they can cautiously start accepting more work
> rather than freaking out and shutting out clients forever...

Actually, most applications assume that everything is OK unless
something tells them it's not. Regular OOM protection may be built
as:

void on_sigdanger(int)
{
    throw std::runtime_error("out of memory");
}
...

while( there_are_more_requests )
{
    try
    {
        do_some_work_eating_lot_of_memory();
    }
    catch(const std::exception& ex)
    {
        std::cerr << ex.what() << std::endl;
    }
}

I.e., we will attempt to execute user requests while we have them
in our queue, but we will get exceptions and stop processing if
the system is out of memory. As soon as the system has enough free space
again, we will continue normal processing without any special handling
on our side.

It means that a signal that is the opposite of SIGDANGER is rarely required,
if required at all. You should be glad, it reduces the work to do. ;)

P.S. I know that throwing inside a signal handler is bad technique, but it works
(and works better than setting a flag and testing it everywhere).
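
For comparison, the flag-based alternative mentioned in the P.S. could look roughly like this in C. It is a sketch under assumptions: SIGDANGER is not defined on FreeBSD or Linux, so SIGUSR1 stands in when the macro is missing, and the helper names install_sigdanger_handler() and take_low_memory_flag() are made up for the example.

```c
#define _DEFAULT_SOURCE  /* glibc: keep POSIX declarations visible under strict -std= modes */
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Assumption: SIGDANGER exists on AIX only; SIGUSR1 stands in elsewhere. */
#ifndef SIGDANGER
#define SIGDANGER SIGUSR1
#endif

/* The only thing the handler touches: a flag that is safe to set from a signal. */
static volatile sig_atomic_t low_memory;

static void
on_sigdanger(int signo)
{
    (void)signo;
    low_memory = 1;
}

/* Install the handler; returns 0 on success, -1 on error (as sigaction does). */
static int
install_sigdanger_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigdanger;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGDANGER, &sa, NULL);
}

/*
 * Report whether a warning arrived since the last call, and reset the flag.
 * A signal landing between the read and the clear would be missed, which
 * is acceptable for a sketch.
 */
static int
take_low_memory_flag(void)
{
    int was_set = low_memory;

    low_memory = 0;
    return was_set;
}
```

The request loop would call take_low_memory_flag() between requests (or at other safe points) and refuse or abort work when it returns 1.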

dozen

Karsten W. Rohrbach

Oct 1, 2001, 12:13:51 PM

to Greg Lehey, Alfred Perlstein, Jos Backus, hac...@freebsd.org

i got a (way old) ppc 604e, in the corner of my office.
it's a 74p, latest 4.3.3 patchlevel from one month ago or so installed.

i could arrange ssh access to the box if somebody cares, although i am
not available 24x7 for remote hands ;-)
there's nothing critical on it, the box got 128mb ram, so contact me
off-list if you want to play around with it.

/k

Greg Lehey(gr...@FreeBSD.org)@2001.10.01 13:19:51 +0000:

--
> Gravity is an unforgiving motherfucker.
KR433/KR11-RIPE -- WebMonster Community Founder -- nGENn GmbH Senior Techie
http://www.webmonster.de/ -- ftp://ftp.webmonster.de/ -- http://www.ngenn.net/
karsten&rohrbach.de -- alpha&ngenn.net -- alpha&scene.org -- ca...@spam.de
GnuPG 0x2964BF46 2001-03-15 42F9 9FFF 50D4 2F38 DBEE DF22 3340 4F4E 2964 BF46
Please do not remove my address from To: and Cc: fields in mailing lists. 10x

Vladimir Dozen

Oct 3, 2001, 3:32:29 PM

to Poul-Henning Kamp, Matt Dillon, Vladimir Dozen, Wilko Bulte, Alfred Perlstein, hac...@freebsd.org

ehlo.

> I once had a patch to phkmalloc() which backed all malloc'ed VM with
> hidden files in the users homedir. It was written to put the VM
> usage under QUOTA control, but it had many useful side effects as well.
>
> I can't seem to find it right now, but it is trivial to do: just
> replace the sbrk(2) with mmap(). Only downside is the needed
> filedescriptor which some shells don't like.

One small point -- mechanical replacement leads to segmentation faults,
since brk(tail) is expected always to allocate a new block ending at tail,
while mmap() can refuse to do so.

Actually, I repeated your work, and found that mmap() refused to map
a block at the 128M border; instead, it moved it somewhat higher. At the
same time, the routines in libc/stdlib/malloc.c expected exactly the
address they requested. I've patched them to take the mapped address
from map_pages().
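
The behaviour described here follows from how mmap() treats its first argument: without MAP_FIXED the address is only a hint. A minimal sketch of a heap-growing helper that trusts the returned address rather than the requested one (grow_heap() is a hypothetical name, not a function from malloc.c):

```c
#define _DEFAULT_SOURCE  /* glibc: expose MAP_ANON under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Ask for `len` bytes near `hint`, but believe only the address that
 * mmap() actually returns. On success the block start is returned and
 * *new_end is set to one past its last byte; on failure NULL is
 * returned and *new_end is left alone.
 */
static void *
grow_heap(void *hint, size_t len, void **new_end)
{
    assert(len > 0 && new_end != NULL);

    void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                   MAP_ANON | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    *new_end = (char *)p + len;   /* computed from p, never from hint */
    return p;
}
```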

I've added a new malloc configuration flag: 'F' (turn on file swapping) and
'f' (turn it off). Then I've replaced brk/sbrk in the code with mmap-based
emulations. It works. Currently my whole home host is running with
'F' in /etc/malloc.conf.

I've tested it with the famous 'life' game, and it showed that performance
with pure mmap() (no file swapping) increased a bit (about 2%) compared
to the original sbrk() implementation, and file swapping is about 5% slower
than sbrk(). It depends on hardware, of course.

My implementation uses a single file descriptor, but dups it to
512 (or lower) to avoid the problems with shells mentioned here. The mapped
file is grown as necessary, and additional mmap()s are called on it.
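
A sketch of the descriptor move, assuming fcntl(F_DUPFD) is used in place of the dup2() loop in the patch. F_DUPFD returns the lowest free descriptor at or above the requested number, so it never silently replaces a descriptor the application already has open at 512. The name move_fd_high() is invented for the example.

```c
#define _DEFAULT_SOURCE  /* glibc: keep POSIX declarations visible under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Move `fd` to the lowest free descriptor >= `min_fd`. If the process
 * descriptor limit is below `min_fd`, the target is halved until a
 * duplicate succeeds. Returns the new descriptor (the original is
 * closed), or -1 with the original left open.
 */
static int
move_fd_high(int fd, int min_fd)
{
    int target, newfd = -1;

    assert(fd >= 0 && min_fd > 2);

    for (target = min_fd; target > 2; target /= 2) {
        newfd = fcntl(fd, F_DUPFD, target);
        if (newfd >= 0)
            break;
    }
    if (newfd < 0)
        return -1;

    close(fd);
    return newfd;
}
```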

Here is the patch for 4.3-RELEASE-p20:

/usr/src/libc/stdlib/malloc.c:
=============================================
100c100
<
---
>
248,250d247
< /* my last break. */
< static void *malloc_brk;
<
264a262
>
299a298,442
> * file swap options
> */
> static int malloc_file_swap;
> static char* malloc_file_swap_dir;
> static int malloc_file_swap_num;
> static int malloc_file_swap_fd;
> static int malloc_file_swap_offset;
> static int malloc_file_swap_size;
>
> /*
> * mmap-based brk/sbrk emulation
> */
> static char *malloc_brk;
> static char* sbrk_emulation(int incr)
> {
> if( incr == 0 ) return malloc_brk;
> wrterror("unsupported sbrk argument");
> };
>
> /**
> * brk emulation
> *
> * note that return value is different from brk!
> * @result 0 allocation failed, ptr -- start of new block
> * @param new_brk desired location of new top of heap
> *
> */
> static char* brk_emulation(char* new_brk)
> {
> char* p;
> char buf[4096];
> int filegrow,wr,blocksize;
> int stage;
> int tmp_fd;
>
> /* size of requested block */
> blocksize = new_brk-malloc_brk;
>
> /* increase heap size */
> if( blocksize > 0 )
> {
> if( malloc_file_swap )
> {
> /* create file at first call */
> if( malloc_file_swap_num == 0 )
> {
> /* where to put swap file */
> if( !malloc_file_swap_dir ) malloc_file_swap_dir = getenv("SWAPDIR");
> if( !malloc_file_swap_dir ) malloc_file_swap_dir = getenv("TMPDIR");
> if( !malloc_file_swap_dir ) malloc_file_swap_dir = "/tmp";
>
> /* generate random file name and open it */
> do
> {
> snprintf(buf,sizeof(buf),"%s/%08x.swap",
> malloc_file_swap_dir,malloc_file_swap_num);
> malloc_file_swap_num *= 11;
> malloc_file_swap_num += 13;
> malloc_file_swap_fd = open(buf,O_CREAT|O_EXCL|O_RDWR|O_NOFOLLOW,0600);
> }
> while( malloc_file_swap_fd < 0 && errno == EEXIST );
> if( malloc_file_swap_fd < 0 ) return 0;
>
> /*
> * some shell scripts (GNU configure?) can be
> * unhappy if we use descriptor 4 or 5; dup descriptor
> * into large enough descriptor and close original
> */
> tmp_fd = 512;
> while( tmp_fd >= 0 && dup2(malloc_file_swap_fd,tmp_fd) < 0 ) tmp_fd--;
> if( tmp_fd < 0 ) return 0;
> close(malloc_file_swap_fd);
> malloc_file_swap_fd = tmp_fd;
>
> /* unlink file to autoremove it at last reference lost */
> unlink(buf);
> }
>
> if( malloc_file_swap_offset+blocksize > malloc_file_swap_size )
> {
> /* fill tail of file with zeroes */
> memset(buf,0,sizeof(buf));
>
> /*
> * grow file
> * critical grow: if any error happens here, allocation fails
> * supplemental grow: errors are ignored
> */
> for( stage=0; stage<2; stage++ )
> {
> if( stage == 0 ) filegrow = blocksize;
> else filegrow = 1024*1024;
>
> while( filegrow > 0 )
> {
> /* note that file position is always at end of file */
> wr = write(malloc_file_swap_fd,
> buf,sizeof(buf)<filegrow?sizeof(buf):filegrow);
> if( wr < 0 )
> {
> if( errno == EINTR ) continue;
> if( stage == 0 ) return 0;
> break;
> }
> filegrow -= wr;
>
> /* keep file size for next time */
> malloc_file_swap_size += wr;
> }
> }
> }
>
> /* map file tail into address space */
> p = mmap(malloc_brk,blocksize,
> PROT_READ|PROT_WRITE,
> MAP_SHARED|MAP_NOSYNC|MAP_INHERIT,
> malloc_file_swap_fd,
> malloc_file_swap_offset);
> if( p == MAP_FAILED ) return 0;
>
> /* shift offset to use it next time in mmap */
> malloc_file_swap_offset += blocksize;
> }
> else
> {
> /* FIXME: we might use file swap if regular swapping failed;
> * but this may only happen when limit reached; should
> * we break limits with mmap()? */
> p = mmap(malloc_brk,new_brk-malloc_brk,
> PROT_READ|PROT_WRITE,
> MAP_ANON|MAP_PRIVATE,MMAP_FD,0);
> if( p == MAP_FAILED ) return 0;
> }
>
> malloc_brk = p+blocksize;
> return p;
> }
> else
> {
> /* here we must unmap memory */
> return 0;
> }
> }
>
> /*
307c450
< result = (caddr_t)pageround((u_long)sbrk(0));
---
> result = (caddr_t)pageround((u_long)sbrk_emulation(0));
310c453,454
< if (brk(tail)) {
---
> result = brk_emulation(tail);
> if( result == 0 ) {
315a460
> tail = result + (pages << malloc_pageshift);
318,321c463
< malloc_brk = tail;
<
< if ((last_index+1) >= malloc_ninfo && !extend_pgdir(last_index))
< return 0;;
---
> if ((last_index+1) >= malloc_ninfo && !extend_pgdir(last_index)) return 0;;
430a573,574
> case 'f': malloc_file_swap = 0; break;
> case 'F': malloc_file_swap = 1; break;
467c611
< malloc_origo = ((u_long)pageround((u_long)sbrk(0))) >> malloc_pageshift;
---
> malloc_origo = ((u_long)pageround((u_long)sbrk_emulation(0))) >> malloc_pageshift;
481c625
< * We can sbrk(2) further back when we keep this on a low address.
---
> * We can sbrk_emulation(2) further back when we keep this on a low address.
516c660
< if ((void*)pf->page >= (void*)sbrk(0))
---
> if ((void*)pf->page >= (void*)sbrk_emulation(0))
547,548d690
< size >>= malloc_pageshift;
<
550,551c692,693
< if (!p)
< p = map_pages(size);
---
> size >>= malloc_pageshift;
> if (!p) p = map_pages(size);
923c1065
< malloc_brk == sbrk(0)) { /* ..and it's OK to do... */
---
> malloc_brk == sbrk_emulation(0)) { /* ..and it's OK to do... */
932,933c1074,1075
< brk(pf->end);
< malloc_brk = pf->end;
---
> /* FIXME: here we must check returned address */
> brk_emulation(pf->end);
=============================================

--
dozen @ home

Vladimir Dozen

Oct 4, 2001, 12:38:50 PM

to hac...@freebsd.org, Poul-Henning Kamp, Matt Dillon, Wilko Bulte, Alfred Perlstein

ehlo.

I was told that the diff format I used is inappropriate for most cases,
so I redid it in unified (-u) format.

Purpose: to allow developers of large applications to use the system
memory allocation routines for allocating in an mmap()ed file
instead of writing their own. Also, to allow running applications that
may use huge amounts of memory (like Gimp) without reconfiguring
swap.

Patch description: the patch implements file-backed memory
allocation for the regular malloc() routine. If the 'F' flag is set
in the malloc options, then instead of doing mmap(MAP_ANON), malloc()
maps regions from a temporary file. The file is grown as necessary,
and new regions are mapped from the same file.
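
The grow-then-map idea described above can be sketched in isolation. This is
a minimal illustration, not the patch itself: the struct and function names
(file_heap, file_heap_open, file_heap_alloc) and the "/tmp/fileheap.XXXXXX"
template are hypothetical, and the file is extended with ftruncate() for
brevity, whereas the patch writes blocks of zeroes to the end of the file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one backing file (names are illustrative). */
struct file_heap {
    int   fd;      /* descriptor of the already unlinked backing file */
    off_t size;    /* current length of the file in bytes */
    off_t offset;  /* first byte of the file not yet handed out */
};

/* Create the backing file; returns 0 on success, -1 on failure. */
static int file_heap_open(struct file_heap *h)
{
    char path[] = "/tmp/fileheap.XXXXXX";  /* assumed location */

    h->fd = mkstemp(path);
    if (h->fd < 0)
        return -1;
    unlink(path);  /* the file vanishes once the descriptor is closed */
    h->size = 0;
    h->offset = 0;
    return 0;
}

/*
 * Hand out `len` bytes backed by the tail of the file, growing the file
 * first if it is too short. `len` must be a multiple of the page size,
 * because mmap() requires a page-aligned file offset.
 */
static void *file_heap_alloc(struct file_heap *h, size_t len)
{
    void *p;

    if (h->offset + (off_t)len > h->size) {
        /* ftruncate() stands in for the patch's zero-filling write() loop */
        if (ftruncate(h->fd, h->offset + (off_t)len) < 0)
            return NULL;
        h->size = h->offset + (off_t)len;
    }

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
             h->fd, h->offset);
    if (p == MAP_FAILED)
        return NULL;

    h->offset += (off_t)len;  /* the next region starts after this one */
    return p;
}
```

Note that ftruncate() may leave the file sparse, so a full disk can show up
later as a fault on first touch rather than as a failed allocation; writing
zeroes, as the patch does, reports the error at allocation time instead.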

Details: to avoid using two methods of allocation (brk() and mmap()) in
the same file, regular allocation was altered to use mmap(). This
is done by writing emulators (brk_emulation() and sbrk_emulation()).
The file allocator uses a single descriptor (usually fd==512). The file is
created in the directory specified by $SWAPDIR, $TMPDIR or "/tmp"
(in this order). $SWAPDIR is introduced because people often use a
memory file system for /tmp. The temporary file is unlinked after
creation, so it is deleted automatically at exit.
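
The directory lookup and descriptor handling described above could look
roughly like the following sketch. It is an illustration under assumptions,
not the code from the patch: swap_dir() and open_swap_file() are hypothetical
names, an empty environment variable is treated as unset, mkstemp() replaces
the hand-rolled file name loop, and fcntl(F_DUPFD) replaces the dup2() loop.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Pick the directory in the order given above: $SWAPDIR, $TMPDIR, "/tmp". */
static const char *swap_dir(void)
{
    const char *d = getenv("SWAPDIR");

    if (d == NULL || *d == '\0')
        d = getenv("TMPDIR");
    if (d == NULL || *d == '\0')
        d = "/tmp";
    return d;
}

/*
 * Create the backing file, unlink it at once, and move its descriptor to
 * the lowest free number >= min_fd. Returns the descriptor, or -1 on error.
 */
static int open_swap_file(int min_fd)
{
    char path[1024];
    int n, fd, high;

    n = snprintf(path, sizeof(path), "%s/malloc.XXXXXX", swap_dir());
    if (n < 0 || n >= (int)sizeof(path))
        return -1;               /* directory name too long */

    fd = mkstemp(path);          /* creates the file with mode 0600 */
    if (fd < 0)
        return -1;
    unlink(path);                /* removed when the last reference goes */

    high = fcntl(fd, F_DUPFD, min_fd);
    if (high < 0)
        return fd;               /* e.g. descriptor limit below min_fd */
    close(fd);
    return high;
}
```

A caller would use something like open_swap_file(512). One reason to prefer
F_DUPFD here is that dup2() silently closes whatever is already open at the
target number, while F_DUPFD only ever returns a descriptor that was free.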

Informal testing shows no performance hit compared with old-style
brk() allocation, and a small hit when using file-backed allocation.

Here is the patch (made on 4.3-RELEASE-p20):
===============================
--- malloc.c.old Tue Oct 2 12:52:25 2001
+++ malloc.c Thu Oct 4 20:05:52 2001
@@ -97,7 +97,7 @@
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
-
+
/*
* This structure describes a page worth of chunks.
*/
@@ -245,9 +245,6 @@
#define UTRACE(a,b,c)
#endif /* HAS_UTRACE */

-/* my last break. */
-static void *malloc_brk;
-
/* one location cache for free-list holders */
static struct pgfree *px;

@@ -262,6 +259,7 @@
mmap(0, (size), PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, \
MMAP_FD, 0);

+
/*
* Necessary function declarations
*/
@@ -297,6 +295,167 @@
}

/*
+ * file swap options
+ */
+static int malloc_file_swap;
+static char* malloc_file_swap_dir;
+static int malloc_file_swap_num;
+static int malloc_file_swap_fd;
+static int malloc_file_swap_offset;
+static int malloc_file_swap_size;
+
+/*
+ * mmap-based brk/sbrk emulation
+ */
+static char *malloc_brk;
+static char* sbrk_emulation(int incr)
+{
+ if( incr == 0 ) return malloc_brk;
+ wrterror("unsupported sbrk argument");
+};
+
+/**
+ * brk emulation
+ *
+ * note that return value is different from brk!
+ * @result 0 allocation failed, ptr -- start of new block
+ * @param new_brk desired location of new top of heap
+ *
+ */
+static char* brk_emulation(char* new_brk)
+{
+ char* p;
+ char buf[4096];
+ int filegrow,wr,blocksize;
+ int stage;
+ int tmp_fd;
+
+ /* size of requested block */
+ blocksize = new_brk-malloc_brk;
+
+ /* increase heap size */
+ if( blocksize > 0 )
+ {
+ if( malloc_file_swap )
+ {
+ /* create file at first call */
+ if( malloc_file_swap_num == 0 )
+ {
+ /* where to put swap file */
+ if( !malloc_file_swap_dir ) malloc_file_swap_dir = getenv("SWAPDIR");
+ if( !malloc_file_swap_dir ) malloc_file_swap_dir = getenv("TMPDIR");
+ if( !malloc_file_swap_dir ) malloc_file_swap_dir = "/tmp";
+
+ /* generate random file name and open it */
+ do
+ {
+ snprintf(buf,sizeof(buf),"%s/%08x.swap",
+ malloc_file_swap_dir,malloc_file_swap_num);
+ malloc_file_swap_num *= 11;
+ malloc_file_swap_num += 13;
+ malloc_file_swap_fd = open(buf,O_CREAT|O_EXCL|O_RDWR|O_NOFOLLOW,0600);
+ }
+ while( malloc_file_swap_fd < 0 && errno == EEXIST );
+ if( malloc_file_swap_fd < 0 ) return 0;
+
+ /*
+ * some shell scripts (GNU configure?) can be
+ * unhappy if we use descriptor 4 or 5; also qmail-send
+ * uses descriptors up to 6 in normal mode.
+ * so we dup descriptor into large enough and close original
+ */
+ tmp_fd = 512;
+ while( tmp_fd >= 0 && dup2(malloc_file_swap_fd,tmp_fd) < 0 ) tmp_fd--;
+ if( tmp_fd < 0 ) return 0;
+ close(malloc_file_swap_fd);
+ malloc_file_swap_fd = tmp_fd;
+
+ /* unlink file to autoremove it at last reference lost */
+ unlink(buf);
+ }
+
+ if( malloc_file_swap_offset+blocksize > malloc_file_swap_size )
+ {
+ /* fill tail of file with zeroes */
+ memset(buf,0,sizeof(buf));
+
+ /*
+ * grow file
+ * critical grow:
+ * allocate requested size; if any error happens here,
+ * whole allocation fails;
+ * supplemental grow:
+ * pre-allocate one more megabyte; errors are ignored
+ */
+ for( stage=0; stage<2; stage++ )
+ {
+ if( stage == 0 ) filegrow = blocksize;
+ else filegrow = 1024*1024;
+
+ while( filegrow > 0 )
+ {
+ /* note that file position is always at end of file */
+ wr = write(malloc_file_swap_fd,
+ buf,sizeof(buf)<filegrow?sizeof(buf):filegrow);
+ if( wr < 0 )
+ {
+ if( errno == EINTR ) continue;
+ if( stage == 0 ) return 0;
+ break;
+ }
+ filegrow -= wr;
+
+ /* keep file size for next time */
+ malloc_file_swap_size += wr;
+ }
+ }
+ }
+
+ /* map file tail into address space */
+ p = mmap(malloc_brk,blocksize,
+ PROT_READ|PROT_WRITE,
+ MAP_SHARED|MAP_NOSYNC|MAP_INHERIT,
+ malloc_file_swap_fd,
+ malloc_file_swap_offset);
+ if( p == MAP_FAILED ) return 0;
+
+ /* shift offset to use it next time in mmap */
+ malloc_file_swap_offset += blocksize;
+ }
+ else
+ {
+ /* FIXME: we might use file swap if regular swapping failed;
+ * but this may only happen when limit reached; can
+ * we break limits with mmap()? */
+ p = mmap(malloc_brk,new_brk-malloc_brk,
+ PROT_READ|PROT_WRITE,
+ MAP_ANON|MAP_PRIVATE,MMAP_FD,0);
+ if( p == MAP_FAILED ) return 0;
+ }
+
+ malloc_brk = p+blocksize;
+ return p;
+ }
+ else
+ {
+ /* here we must unmap memory */
+ if( malloc_file_swap )
+ {
+ /* for file-backed allocation just shift offset back */
+ malloc_file_swap_offset -= blocksize;
+ return malloc_brk;
+ }
+ else
+ {
+ /* blocksize is <= 0 here, so unmap -blocksize bytes above new_brk */
+ munmap(new_brk,-blocksize);
+ malloc_brk = new_brk;
+ return malloc_brk;
+ }
+ }
+}
+
+/*
* Allocate a number of pages from the OS
*/
static void *
@@ -304,21 +463,20 @@
{
caddr_t result, tail;

- result = (caddr_t)pageround((u_long)sbrk(0));
+ result = (caddr_t)pageround((u_long)sbrk_emulation(0));

tail = result + (pages << malloc_pageshift);

- if (brk(tail)) {
+ result = brk_emulation(tail);
+ if( result == 0 ) {
#ifdef EXTRA_SANITY
wrterror("(ES): map_pages fails\n");
#endif /* EXTRA_SANITY */
return 0;
}
+ tail = result + (pages << malloc_pageshift);

last_index = ptr2index(tail) - 1;
- malloc_brk = tail;
-
- if ((last_index+1) >= malloc_ninfo && !extend_pgdir(last_index))
- return 0;;
+ if ((last_index+1) >= malloc_ninfo && !extend_pgdir(last_index)) return 0;;

return result;
}
@@ -428,6 +586,8 @@
case 'X': malloc_xmalloc = 1; break;
case 'z': malloc_zero = 0; break;
case 'Z': malloc_zero = 1; break;
+ case 'f': malloc_file_swap = 0; break;
+ case 'F': malloc_file_swap = 1; break;
default:
j = malloc_abort;
malloc_abort = 0;
@@ -464,7 +624,7 @@
* We need a maximum of malloc_pageshift buckets, steal these from the
* front of the page_directory;
*/
- malloc_origo = ((u_long)pageround((u_long)sbrk(0))) >> malloc_pageshift;
+ malloc_origo = ((u_long)pageround((u_long)sbrk_emulation(0))) >> malloc_pageshift;
malloc_origo -= malloc_pageshift;

malloc_ninfo = malloc_pagesize / sizeof *page_dir;
@@ -478,7 +638,7 @@

/*
* This is a nice hack from Kaleb Keithly (ka...@x.org).
- * We can sbrk(2) further back when we keep this on a low address.
+ * We can sbrk_emulation(2) further back when we keep this on a low address.
*/
px = (struct pgfree *) imalloc (sizeof *px);

@@ -513,7 +673,7 @@
wrterror("(ES): zero entry on free_list\n");
if (pf->page > pf->end)
wrterror("(ES): sick entry on free_list\n");
- if ((void*)pf->page >= (void*)sbrk(0))
+ if ((void*)pf->page >= (void*)sbrk_emulation(0))
wrterror("(ES): entry on free_list past brk\n");
if (page_dir[ptr2index(pf->page)] != MALLOC_FREE)
wrterror("(ES): non-free first page on free-list\n");
@@ -544,11 +704,9 @@
wrterror("(ES): allocated non-free page on free-list\n");
#endif /* EXTRA_SANITY */

- size >>= malloc_pageshift;
-
/* Map new pages */
- if (!p)
- p = map_pages(size);
+ size >>= malloc_pageshift;
+ if (!p) p = map_pages(size);

if (p) {

@@ -920,7 +1078,7 @@
if (!pf->next && /* If we're the last one, */
pf->size > malloc_cache && /* ..and the cache is full, */
pf->end == malloc_brk && /* ..and none behind us, */
- malloc_brk == sbrk(0)) { /* ..and it's OK to do... */
+ malloc_brk == sbrk_emulation(0)) { /* ..and it's OK to do... */

/*
* Keep the cache intact. Notice that the '>' above guarantees that
@@ -929,8 +1087,8 @@
pf->end = (char *)pf->page + malloc_cache;
pf->size = malloc_cache;

- brk(pf->end);
- malloc_brk = pf->end;
+ /* FIXME: here we must check returned address */
+ brk_emulation(pf->end);

index = ptr2index(pf->end);
last_index = index - 1;

Matthew Dillon

Mar 14, 2002, 2:25:14 AM

to Vladimir Dozen, hac...@freebsd.org, Poul-Henning Kamp, Wilko Bulte, Alfred Perlstein

Ok, there are two rather serious problems with this patch:

(1) When you use a MAP_PRIVATE mapping, modifications to the mapped memory
are backed by swap, not by the file. That is what MAP_PRIVATE does
by definition.

(2) You can't safely use MAP_SHARED unless you also deal with fork()
or you will share what is supposed to be process-private memory
across the fork().
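
Both points can be checked with a small experiment. The sketch below is not
part of the patch or of this review; the helper names (map_temp_page,
file_sees_write, parent_sees_child_write) and the "/tmp/mapdemo.XXXXXX"
template are hypothetical. It maps one page of a temporary file, then tests
whether a store reaches the file and whether a store made by a forked child
is visible in the parent.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one page of a fresh, unlinked temporary file; NULL on failure. */
static char *map_temp_page(int map_flags, int *fd_out)
{
    char path[] = "/tmp/mapdemo.XXXXXX";  /* assumed location */
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *p;
    int fd = mkstemp(path);

    if (fd < 0)
        return NULL;
    unlink(path);
    if (ftruncate(fd, (off_t)len) < 0) { close(fd); return NULL; }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, map_flags, fd, 0);
    if (p == MAP_FAILED) { close(fd); return NULL; }
    *fd_out = fd;
    return p;
}

/* 1 if a store through the mapping reaches the file, 0 if not, -1 on error. */
static int file_sees_write(int map_flags)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char on_disk = 0;
    int fd, result;
    char *p = map_temp_page(map_flags, &fd);

    if (p == NULL)
        return -1;
    p[0] = 'x';
    msync(p, len, MS_SYNC);  /* flushes shared pages; private ones stay put */
    result = (pread(fd, &on_disk, 1, 0) == 1) ? (on_disk == 'x') : -1;
    munmap(p, len);
    close(fd);
    return result;
}

/* 1 if the parent sees a store made by a forked child, 0 if not, -1 on error. */
static int parent_sees_child_write(int map_flags)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int fd, status, result;
    pid_t pid;
    char *p = map_temp_page(map_flags, &fd);

    if (p == NULL)
        return -1;
    pid = fork();
    if (pid == 0) {          /* child: modify the page and leave */
        p[0] = 'c';
        _exit(0);
    }
    if (pid < 0 || waitpid(pid, &status, 0) < 0)
        result = -1;
    else
        result = (p[0] == 'c');
    munmap(p, len);
    close(fd);
    return result;
}
```

With MAP_PRIVATE both functions should return 0: the store lands in a
private copy of the page, which is backed by swap rather than by the file.
With MAP_SHARED both should return 1, which is exactly the sharing across
fork() that a malloc() implementation would have to deal with.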

-Matt
Matthew Dillon
<dil...@backplane.com>
:ehlo.

:...

Matthew Dillon

Mar 14, 2002, 2:26:15 AM

to Vladimir Dozen, hac...@freebsd.org, Poul-Henning Kamp, Wilko Bulte, Alfred Perlstein

Oops. I'm sorry! This was an email all the way from last October! I didn't
mean to resurrect the thread :-)

-Matt
:ehlo.

:
: I was told that the diff format I used is inappropriate for most cases,
: so I redid it in unified (-u) format.
:
: Purpose: to allow developers of large applications to use system

:...