I was quite surprised by what I discovered in a recent conversation that
my program had with the malloc function. The dialog went something
like this:
program: Hey malloc! I'd like a few megabytes please.
malloc: No problem, here you go -- it's at this address.
program: Thanks very much. I'll just go and fill this memory, and --
malloc: Just kidding! I didn't really give you the memory. And by
the way, you're going to die if you continue to use it. Ha!
It turns out that when my program went to fill in the memory, it was sent
a SIGKILL by AIX. This was clearly a bug of some kind, so I reported it.
I sent a demo program that just malloc'd as big a buffer as it could get,
and then started zeroing a byte every 4k. When I started up a couple of
them, they all died. Some got bus errors, others were sent SIGKILL.
IBM's response is that this is working as designed!
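A minimal sketch of the core of such a demo program, for readers who want to reproduce the test. The function names and the halving search are assumptions; the original demo was not posted. Run it against a small buffer first, since touching everything malloc will give you is exactly what gets processes killed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Zero one byte in every `stride`-byte chunk of buf. Writing (not just
 * reading) is what forces the system to back each page with real storage.
 * Returns the number of bytes written, i.e. one per chunk. */
size_t touch_buffer(volatile char *buf, size_t size, size_t stride)
{
    size_t touched = 0;

    assert(stride > 0);
    if (buf == NULL)
        return 0;
    for (size_t off = 0; off < size; off += stride) {
        buf[off] = 0;
        touched++;
    }
    return touched;
}

/* Ask malloc for `start` bytes, halving the request after each failure.
 * Stores the size actually obtained in *got (0 if nothing was obtained). */
void *grab_largest(size_t start, size_t *got)
{
    assert(got != NULL);
    for (size_t n = start; n > 0; n /= 2) {
        void *p = malloc(n);
        if (p != NULL) {
            *got = n;
            return p;
        }
    }
    *got = 0;
    return NULL;
}
```

A driver would call grab_largest() with some large starting size and then touch_buffer(p, got, 4096). Several copies started at once should show whether a nonzero return from malloc really means the memory is usable.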
IBM said that they don't allocate the paging space until it's needed, in
order to accommodate programs which ask for large amounts of memory that
they never use (some nonsense about sparse arrays). I said that I need
to know if the memory is really there before I start using it. They
referred me to some sample code. If it weren't so sad, it would have
been funny.
The sample code contained a malloc wrapper that returns nonzero only if
the memory is actually there. It sets up a handler for SIGDANGER, then
calls malloc. If it returns nonzero (a virtual certainty) it proceeds
to touch pages of memory. If it gets SIGDANGER, the handler longjmps to
code which "untouches" the memory, frees the successfully-malloc'd-but-
not-really-there buffer and causes the malloc wrapper to return zero.
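The wrapper described above can be sketched roughly as follows. This is not IBM's sample code: the name checked_malloc and the page parameter are made up, the "untouch" step is left out, and the handler is installed only after malloc returns so that the longjmp can never land in the middle of malloc itself. SIGDANGER exists only on AIX, so on other systems the sketch just touches the pages.

```c
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef SIGDANGER                      /* AIX only */
static sigjmp_buf danger_env;

static void on_danger(int sig)
{
    (void)sig;
    siglongjmp(danger_env, 1);        /* abandon the touch loop */
}
#endif

/* Allocate size bytes and write to one byte per page before returning.
 * Returns NULL if malloc fails or, on AIX, if SIGDANGER arrives while
 * the pages are being touched. */
void *checked_malloc(size_t size, size_t page)
{
    assert(page > 0);

    char *buf = malloc(size);
    if (buf == NULL)
        return NULL;

#ifdef SIGDANGER
    struct sigaction sa, old;
    sigaction(SIGDANGER, NULL, &old); /* remember the previous disposition */
    if (sigsetjmp(danger_env, 1) != 0) {
        /* We got here from the handler: give the buffer back. */
        sigaction(SIGDANGER, &old, NULL);
        free(buf);
        return NULL;
    }
    sa.sa_handler = on_danger;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGDANGER, &sa, NULL);
#endif

    /* volatile keeps the compiler from dropping or merging the stores */
    volatile char *p = buf;
    for (size_t off = 0; off < size; off += page)
        p[off] = 0;

#ifdef SIGDANGER
    sigaction(SIGDANGER, &old, NULL);
#endif
    return buf;
}
```

Note that a clean return only says the pages were there at the time of the call. Nothing stops another process from eating the paging space afterwards.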
I have questions:
1) Has anyone seen a system where static memory may not really be
there, or where a nonzero malloc doesn't guarantee the successful
usage of the memory?
2) Has anyone heard of SIGDANGER before?
3) Read the POSIX standard for malloc from a "legal" standpoint.
If IBM claims POSIX compliance, can I use this as a weapon?
4) Even if I use this malloc wrapper everywhere in my own code,
how do I deal with third-party code I purchase that calls the
unwrapped malloc?
You might also note that "malloc(0)" currently returns "0", although it is not
guaranteed to do this in the future. This is what breaks the current version of
g++ (version 2.2.2) if you call new[] with a size of 0.
Cheers,
Tom McConnell
--
Tom McConnell | Internet: tmcc...@sedona.intel.com
Intel, Corp. C3-91 | Phone: (602)-554-8229
5000 W. Chandler Blvd. | The opinions expressed are my own. No one in
Chandler, AZ 85226 | their right mind would claim them.
When you malloc a huge chunk of memory, you
do indeed have the memory mapped into your address space. You just
haven't allocated any pages from swap. If you read this newly
malloc-ed memory (kinda boring since it's all zeros) you still don't
allocate any new pages from swap. You have to actually dirty a page
in order for the system to give you one.
About sparse arrays: Some algorithms require a very large address
space, but actually write to a very small portion of it. Such
algorithms run efficiently under AIX because of this design feature.
|> 2) Has anyone heard of SIGDANGER before?
Only on AIX
|> 3) Read the POSIX standard for malloc from a "legal" standpoint.
|> If IBM claims POSIX compliance, can I use this as a weapon?
I have no idea. Personally, I don't think it's a bad feature. Just
different. I can see your point, but I can also see advantages in
AIX as well. All in all, I think AIX does a nice job with
memory management. For instance, AIX will make some use of all your
RAM, even when you aren't running enough processes to fill it up.
AIX just uses any spare RAM for caching disk data. SunOS will leave
RAM lying about totally unused if you have more than you need.
|> 4) Even if I use this malloc wrapper everywhere in my own code,
|> how do I deal with third-party code I purchase that calls the
|> unwrapped malloc?
You can run lsps -a before running any third-party code to see how much
swap is available. I admit this is crude.
I haven't seen the malloc wrapper, but it sounds like it might be
overly complex. There is a function called psdanger() that reports
the amount of free swap space. Instead of writing a signal handler
for SIGDANGER and all that, you could simply see if the system has
enough swap before you malloc. Admittedly, some other process could
start gobbling pages before you could dirty all of yours, and your
process would get killed in that case. However, what would you do
if you ever got a SIGDANGER? Probably exit.
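A rough sketch of the check-before-malloc idea. It assumes that psdanger(SIGDANGER) returns the number of free paging-space blocks above the warning level, which is my reading of the AIX documentation, and that a block is 4096 bytes; verify both on your system. The helper names are made up. On non-AIX systems the check is skipped and the call falls through to plain malloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#ifdef _AIX
#include <signal.h>
#include <sys/vminfo.h>
#endif

#define PS_BLOCK_SIZE 4096            /* assumed paging-space block size */

/* Number of blocks needed to back `bytes` bytes, rounded up. */
long blocks_needed(size_t bytes, size_t block_size)
{
    assert(block_size > 0);
    return (long)(bytes / block_size + (bytes % block_size != 0));
}

/* Store the number of blocks left above the warning level in *room.
 * Returns 1 if the figure is known, 0 if this system cannot say. */
int query_room(long *room)
{
#ifdef _AIX
    *room = (long)psdanger(SIGDANGER);  /* assumed semantics, see above */
    return 1;
#else
    *room = 0;
    return 0;
#endif
}

/* malloc that refuses up front when the request would push the system
 * past the paging-space warning level. */
void *guarded_malloc(size_t bytes)
{
    long room;

    if (query_room(&room) && blocks_needed(bytes, PS_BLOCK_SIZE) > room)
        return NULL;
    return malloc(bytes);
}
```

As noted above, this is only a snapshot. Another process can still gobble the pages between the check and the moment you dirty yours.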
--
Steve Losen s...@virginia.edu
University of Virginia Academic Computing Center
Yeah, on AIX :-)
> 2) Has anyone heard of SIGDANGER before?
You bet. I've had entire PRODUCTION SYSTEMS come crashing down because of
this. Believe it or not, I have had >curses< programs get SIGDANGER when
the machine was heavily loaded. The result is a nice program crash.
Further, I have had processes receive SIGDANGER when only one of two paging
spaces on the system was close to filling. It seems that if >any< page
space on an AIX box gets close to being full you get the SIGDANGER, even if
there are other paging spaces with lots of room left. Yikes! So much for
the performance advantages of spreading the page space across spindles!
> 3) Read the POSIX standard for malloc from a "legal" standpoint.
> If IBM claims POSIX compliance, can I use this as a weapon?
Probably ;-)
> 4) Even if I use this malloc wrapper everywhere in my own code,
> how do I deal with third-party code I purchase that calls the
> unwrapped malloc?
You don't, other than to allow it to die. Oh, you had something >important<
in that program going on, like perhaps a financial transaction? Too bad --
that SIGKILL you just received can't be caught! So much for reliable
software.
This is one of the reasons I hate AIX. There are lots of them, but this is
definitely one of the top 5. When malloc() returns non-NULL, you are supposed
to have the space available >period<. Same is true for static arrays -- I
typically will declare these for things I >must< be able to get at and can't
afford a NULL malloc() return for. It is quite a surprise to get SIGDANGER
or SIGKILL when you don't expect it, and have no way to deal with it.
Oh, you mean that large static array I have declared really >can't< be
used, and you won't tell me ahead of time?! Oh, that array is declared in a
library (like internal to Curses)? Now what the hell do I do about it?
This is a >big< problem on heavily-loaded machines.
--
Karl Denninger Inet: kden...@hq.videocart.com
VideOcart Inc. Voice: (312) 987-5022
Yes, Apollo's Aegis version 9.7 behaved this way. In later versions,
I believe they changed it to the more "traditional" method where the backing
store is also allocated at malloc time, but I believe there was a (possibly
undocumented) option to revert to the old behavior to satisfy those who
wanted to use sparse arrays, etc.
--
Tim Chase Introl Corp. Milwaukee, WI USA
Email: t...@introl.com Phone: +1 (414) 327-7171
From General Concepts and Procedures, GC23-2202-02, page 17-2:
Fri Sep 4 21:09:06 MEZ 1992 Copyright (c) 1991 IBM Corporation Page 1
Understanding Paging Space Allocation Policies
The amount of paging space required depends on the type of activities
performed on the system. If paging space runs low, processes may be
lost, and if paging space runs out, the system may panic. When a paging
space low condition is detected, additional paging space should be
defined.
The system monitors the number of free paging space blocks and detects
when a paging space shortage condition exists. When the number of free
paging space blocks falls below a threshold known as the paging space
warning level, the system informs all processes (except kprocs) of this
condition by sending the SIGDANGER signal. If the shortage continues
and falls below a second threshold known as the paging space kill level,
the system sends the SIGKILL signal to processes that are the major
users of paging space and that do not have a signal handler for the
SIGDANGER signal (the default action for the SIGDANGER signal is to
ignore the signal). The system continues sending SIGKILL signals until
the number of free paging space blocks is above the paging space kill
level.
Processes that dynamically allocate memory can ensure that sufficient
paging space exists by monitoring the paging space levels with the
psdanger subroutine or by using special allocation routines (see the
psmalloc.c file for sample code which uses memory allocation routines
that allocate paging space at memory allocation time). Processes can
keep from getting ended when the paging space kill level is reached by
defining a signal handler for the SIGDANGER signal and by releasing
memory and paging space resources allocated in their data and stack
areas and in shared memory segments, using the disclaim subroutine.
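One way to act on the last paragraph is sketched below: the handler only sets a flag, and the program releases what it can at a point where its data structures are consistent. The cache and all function names are hypothetical. On AIX you would pass SIGDANGER to watch_for_danger(). Whether free() actually hands paging space back to the system depends on the allocator; the manual names the disclaim subroutine for that, and it is not shown here.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>

static volatile sig_atomic_t danger_pending = 0;

/* The handler does nothing but record that the signal arrived. */
static void note_danger(int sig)
{
    (void)sig;
    danger_pending = 1;
}

/* Install the handler for signo. Returns 0 on success, -1 on failure. */
int watch_for_danger(int signo)
{
    return signal(signo, note_danger) == SIG_ERR ? -1 : 0;
}

/* A stand-in for memory the program could live without. */
static void *cache = NULL;

int cache_fill(size_t n)
{
    assert(n > 0);
    void *p = malloc(n);
    if (p == NULL)
        return -1;
    free(cache);
    cache = p;
    return 0;
}

/* Call this at safe points in the main loop. Returns 1 if the warning
 * was pending and the cache was released, 0 otherwise. */
int release_if_danger(void)
{
    if (!danger_pending)
        return 0;
    danger_pending = 0;
    free(cache);
    cache = NULL;
    return 1;
}
```

Keeping the real work out of the handler means it never runs while the program's own data is half-updated.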
--
Best regards, Thomas Braunbeck
AS Software Service AIX, Germany
All the opinions expressed are my own and
do not necessarily reflect those of IBM
| Mail brau...@mazvm02.vnet.ibm.com
| brau...@aixserv.mainz.ibm.de
| DEIBM3M3 at IBMMAIL
| fu...@obelix.ncs.mainz.ibm.com (IBM internal)
| Voice +49-6131-84-2445, FAX +49-6131-84-6585
In article <1992Sep04.1...@watson.ibm.com>, fu...@obelix.ncs.mainz.ibm.com (Thomas Braunbeck/131000) writes:
|> Only processes that do not have a signal handler for the SIGDANGER signal
|> will get the SIGKILL.
|> {...}
|> From General Concepts and Procedures, GC23-2202-02, page 17-2:
|> Fri Sep 4 21:09:06 MEZ 1992 Copyright (c) 1991 IBM Corporation Page 1
|>
|> {...} Processes can
|> keep from getting ended when the paging space kill level is reached by
|> defining a signal handler for the SIGDANGER signal and by releasing
^^^^^^^^^
|> memory and paging space resources allocated in their data and stack
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|> areas and in shared memory segments, using the disclaim subroutine.
|> --
|>
|> Best regards, Thomas Braunbeck
As can be seen by the entire post, trapping for SIGDANGER in and of itself does
nothing. If you trap for, and receive a SIGDANGER, you must free page space
immediately, or you will receive a SIGKILL shortly thereafter.
jon
--
Jon Alperin
Bell Communications Research
---> Internet: jo...@iscp.bellcore.com
---> Voicenet: (908) 699-8674
---> UUNET: uunet!bcr!jona
* All opinions and stupid questions are my own *
[He bumped into the malloc virtual allocation nonsense again.
I still get mad just thinking about it.]
>I have questions:
> 1) Has anyone seen a system where static memory may not really be
> there, or where a nonzero malloc doesn't guarantee the successful
> usage of the memory?
Alas, yes. When this thread started about 18 months ago, I tried a
malloc-touch loop on a DG Aviion. No NULLs from malloc, processes
get killed. Damned! I didn't try large static arrays.
DG has been reasonably good at not reinventing the wheel; this
wonderful feature may come to us from AT&T; netters confirmed this.
I was told, however, that it is not a part of the SVID.
> 2) Has anyone heard of SIGDANGER before?
It's an IBM innovation.
> 4) Even if I use this malloc wrapper everywhere in my own code,
> how do I deal with third-party code I purchase that calls the
> unwrapped malloc?
Or with the pre-linked code in libc.a?
Get your money back. (Hah! right.)
Some arguments to the effect that vapour-memory was a good thing were:
-Lets you use gigantic sparse arrays.
-Lets vendors ship Fortran binaries with static arrays dimensioned
to maximum size, and yet have them run on small machines for small
problems that use only part of the arrays.
I'm skeptical. Sparse arrays at 4kB/page? As for the Fortran bit, it
only makes sense on machines dedicated to a single application. That
sure isn't the way we use ours.
--
--Pierre Asselin, Magnetoresistive Head Engineering, Applied Magnetics.
p...@appmag.com the usual disclaimers apply.
This shouldn't happen. You should only get SIGDANGER or SIGKILL when the
TOTAL paging space runs low. It might be that one of your paging spaces
wasn't activated.
Scott L. Porter IBM PSP Austin / AIX Kernel Development
Internet: sc...@glasnost.austin.ibm.com
Exactly!
:This is one of the reasons I hate AIX. There are lots of them, but this is
I don't have any other important reason for AIX-hating, but the horrendously
bogus malloc() semantics may be enough...
--
Email: mart...@cadlab.sublink.org Phone: ++39 (51) 6130360
CAD.LAB s.p.a., v. Ronzani 7/29, Casalecchio, Italia Fax: ++39 (51) 6130294
:In <1992Sep3.1...@medtron.medtronic.com>
: sh0001@israel (Scott Hansohn) writes:
:
:[He bumped into the malloc virtual allocation nonsense again.
: I still get mad just thinking about it.]
DITTO! And the excuses we get about it look just like that, EXCUSES!
:Some arguments to the effect that vapour-memory was a good thing were:
: -Lets you use gigantic sparse arrays.
: -Lets vendors ship Fortran binaries with static arrays dimensioned
: to maximum size, and yet have them run on small machines for small
: problems that use only part of the arrays.
:
:I'm skeptical. Sparse arrays at 4kB/page? As for the Fortran bit, it
:only makes sense on machines dedicated to a single application. That
:sure isn't the way we use ours.
Dedicating a WS to a single (set of) application IS the typical way that
cad.lab customers would use their machines, and we are heavy Fortran
users, and the malloc()-but-not-really idea STILL stinks. We are
selling INDUSTRIAL-STRENGTH applications, that will be used for CRUCIAL
PRODUCTION WORK; it's COMPLETELY UNACCEPTABLE for our customers to lose
data because the application dumps abruptly!!! So our apps are full of
error-checks.
In particular, we do NOT place data, that will grow for large problems,
inside Fortran arrays; they reside, instead, in areas which are
dynamically allocated by an underlying library written in C, and
accessed via functions or subroutines by the Fortran portions. On
machines where malloc() semantics make sense, the C routine will return
an error indicator to the Fortran portion if it's unable to get the
memory requested; in this case, the application communicates to the
interactive user that the requested operation cannot be completed due to
running out of virtual memory, but the app is still alive and the user
can save hir work so far, and restart from there presumably after having
swapspace reconfigured.
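The arrangement just described, a C allocation layer that reports failure to its Fortran callers instead of dying, could look roughly like this. The names, the handle table and the status codes are invented for illustration and are not cad.lab's library. Arguments are pointers because Fortran 77 passes by reference; the external name a Fortran compiler expects (for example a trailing underscore) varies by platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { AREA_OK = 0, AREA_NOMEM = 1, AREA_BADARG = 2 };

#define MAX_AREAS 64
static void *areas[MAX_AREAS];        /* slot i holds handle i + 1 */

/* Allocate *nbytes bytes and return a handle (1..MAX_AREAS) in *handle.
 * On any failure *handle is 0 and a nonzero status is returned, so the
 * caller can tell the user and carry on. */
int area_alloc(const int *nbytes, int *handle)
{
    assert(nbytes != NULL && handle != NULL);
    *handle = 0;
    if (*nbytes <= 0)
        return AREA_BADARG;
    for (int i = 0; i < MAX_AREAS; i++) {
        if (areas[i] == NULL) {
            areas[i] = malloc((size_t)*nbytes);
            if (areas[i] == NULL)
                return AREA_NOMEM;
            *handle = i + 1;
            return AREA_OK;
        }
    }
    return AREA_NOMEM;                /* handle table is full */
}

/* Release the area behind *handle. */
int area_free(const int *handle)
{
    assert(handle != NULL);
    if (*handle < 1 || *handle > MAX_AREAS || areas[*handle - 1] == NULL)
        return AREA_BADARG;
    free(areas[*handle - 1]);
    areas[*handle - 1] = NULL;
    return AREA_OK;
}
```

Of course this only helps on systems where malloc returns NULL when the memory is not there.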
We've been particularly careful that nothing in the save-to-disk
subsystem NEEDS to allocate extra memory, so that the saving will work
even in crucial memory-low situations; we even had to recode the
output-to-file portions as C subroutines running over low-level
systemcalls, as we found with surprise that Fortran I/O, and C stdio, on
some platforms, may need a malloc() to succeed and will die if it fails
(and, yes, our applications ARE and WILL REMAIN extremely portable
code).
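A save routine in that spirit, running straight over write(2) with no heap allocation of its own, might look like the sketch below. The function name is made up; the actual cad.lab routines were not posted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write all len bytes of data to fd using only the write() system call.
 * Retries after partial writes and after EINTR.
 * Returns 0 on success, -1 on error (errno is left as set by write). */
int save_bytes(int fd, const void *data, size_t len)
{
    const char *p = data;

    assert(data != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;             /* interrupted, try again */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

The caller supplies the buffer, so the routine can work out of static or already-allocated storage when memory is short.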
All this care, of course, is for naught on the IBM R/6000 (thankfully we
don't presently run on DG Aviion, where malloc() is reportedly similarly
broken). And no, we can't just set "limit datasize" appropriately,
because it depends on what the user is doing exactly: sometimes the 3D
modeler will be running alone, other times it will be scheduled together
with the 2D drafter and/or the surface renderer and/or the relational
database and/or the tool which builds programs for numerically
controlled tools and/or... each of these applications is written to be
able to run alone OR communicate with its brethren.
We've tried the tricks IBM suggested to stop our application from dying
in unexpected places, but what happens then is that OTHER processes
die -- and the first to go is typically the X server (a memory hog, I
guess!), so the user cannot communicate with the apps to ask to save...
and NO, we CANNOT just do the saving from the SIGDANGER handler as a
safetynet; the handler can be basically entered from anywhere in the
application, including "critical sections" where the data structures
are in transition and inconsistent (and NO, we CANNOT protect the
critical sections by turning off signals there, or we'll die for
lack of SIGDANGER handling).
Yes, I know that a thousand clever tricks spring to mind to work around
one or the other of these problems, but believe me: we must have tried
at least 900 of them and they don't work. We've spent more time and
effort on battling this malloc() idiocy than on any other single porting
problem EVER (and with the huge list of platforms we've supported over
the years we've had quite SOME such problems, believe you me!)!!! Most
porting problems come from bugs in the target system, some from bugs in
our code, but here we're fighting against something BROKEN AS DESIGNED
-- ***HORRIBLY*** BROKEN. I would say it's been half the cost of the
IBM R/6000 port, if it weren't for the fact that the monstrously slow
linker (thankfully remedied in 3.2, but this port was started right at
system announcement...) and the bugs in the early X have driven that
cost way up. Anyway, at the end, we've given up and just document to
our customers how AND WHY their work may go up in smoke on IBM R/6000
and not on DEC, Olivetti, Sun, HP, Sony or other platforms.
If IBM ever gives us a malloc() WHICH WORKS, we'll be glad to use it.
And I hope that periodically rekindled flames about it will do some
good -- if we could get together with everybody who's suffered for
this and blackmail IBM into it the world would become a better place
in at least this small way...
>Yes, I know that a thousand clever tricks spring to mind to work around
>one or the other of these problems, but believe me: we must have tried
>at least 900 of them and they don't work. We've spent more time and
>effort on battling this malloc() idiocy than on any other single porting
>problem EVER (and with the huge list of platforms we've supported over
>the years we've had quite SOME such problems, believe you me!)!!! Most
>porting problems come from bugs in the target system, some from bugs in
>our code, but here we're fighting against something BROKEN AS DESIGNED
>-- ***HORRIBLY*** BROKEN. I would say it's been half the cost of the
>IBM R/6000 port, if it weren't for the fact that the monstrously slow
>linker (thankfully remedied in 3.2, but this port was started right at
>system announcement...) and the bugs in the early X have driven that
>cost way up. Anyway, at the end, we've given up and just document to
>our customers how AND WHY their work may go up in smoke on IBM R/6000
>and not on DEC, Olivetti, Sun, HP, Sony or other platforms.
In another posting IBM has recommended setting MALLOCTYPE=3.1, saying this works
at runtime. Have you tried this? If so, did it work?
--
-------------------------------------------------------------------------------
Beirne Konarski | Reading maketh a full man, conference a
bei...@summitis.com | ready man, and writing an exact man.
| -- Francis Bacon
Setting MALLOCTYPE isn't related to the problems being discussed in this thread.
Here the issue is AIX's policy of not actually allocating malloc'ed storage
until it is touched and this basic policy is part of both 3.1 and 3.2 AIX.
--
John Gerth ge...@watson.ibm.com (914) 784-7639
In article <1992Sep10.1...@cadlab.sublink.org> mart...@cadlab.sublink.org (Alex Martelli) writes:
>
>I don't have any other important reason for AIX-hating, but the horrendously
>bogus malloc() semantics may be enough...
I've had the same frustrations with the malloc/SIGKILL problem,
BSD tty code and other AIXisms.
Does anybody know about a port of SVR4 that IBM had contracted out
to one of the UNIX/386 houses? I remember reading about this a year
ago in some UNIX magazine (maybe in "Unix Today"). I believe
that this was a special product to be used only for large bids
that explicitly asked for SVR4.
Is there any interest out there in a straight port of SVR4 to the
RS/6000? Does anybody know if this is really happening? Any other
details?
I was curious what would be the minimal requirements for such a product.
My requirements would be:
- Binary compatible with AIX 3.X applications. Not sure how
the SVR4 shared objects and the AIX shared library models
match.
- Support for most IBM provided hardware (though I don't
care about diskless workstations in particular).
- Good support.
My non-requirements would be:
- No compatibility with kernel extensions or device drivers.
- No disk space compatibility, i.e. JFS filesystems or logical
volumes would be useless, reformatting and a different filesystem
would be required (UFS maybe VXFS).
What would be your requirements?
--- Disclaimer: the sender doesn't have any interest in this posting,
--- just acting as a gateway.
Let me just add another problem with malloc: We use the amd automounter here
which every now and then dies without any obvious reason. If you run
critical applications on such machines, your apps won't be able to do any
saving just because the filesystem isn't there anymore.
Malte
I think you misunderstand the purpose of MALLOCTYPE. If so set, it will
make AIX 3.2's malloc() revert to AIX 3.1's algorithms (which I think
are: round up (requested size+overhead) to power of two, look in
appropriate bin out of 24 or whatever that is, allocate full pages,
etc), which, I believe, are faster but potentially wasteful of memory
(e.g. ask for 1000 bytes, suppose 32 of overhead are added, then 1032
is rounded to 2048, and so on). I'm not sure exactly which unportable
quirks of AIX 3.1's malloc() programs ported from the old AIX might be
relying on, but anyway MALLOCTYPE's purpose is to reproduce those quirks
to let such unportable programs still work.
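The rounding arithmetic in the example above can be worked through in a few lines. This only illustrates the power-of-two rounding as described; it is not the AIX 3.1 allocator, and the 32-byte overhead is the supposed figure from the post, not a documented one.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two that is >= n. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    /* n must be at least 1 and small enough that the result fits */
    assert(n >= 1 && n <= ((size_t)-1 >> 1) + 1);
    while (p < n)
        p <<= 1;
    return p;
}

/* Size of the bin a request lands in once the overhead is added. */
size_t bucket_size(size_t request, size_t overhead)
{
    return round_up_pow2(request + overhead);
}

/* Bytes in the bin that serve neither the request nor the overhead. */
size_t wasted(size_t request, size_t overhead)
{
    return bucket_size(request, overhead) - (request + overhead);
}
```

For the 1000-byte request with 32 bytes of overhead, bucket_size() gives 2048 and wasted() gives 1016, so roughly half the bin goes unused.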
This is most definitely NOT our problem -- our programs are mighty portable,
thank you, and the crazy semantics of "here's-the-memory-BUT-you'll-die-if-
you-try-to-really-USE-it" were in AIX 3 from the start anyway:-(.
This feature was introduced into SunOS 4.0 around 1987 and has been
there ever since.
--
-Barry Shein
Software Tool & Die | b...@world.std.com | uunet!world!bzs
Purveyors to the Trade | Voice: 617-739-0202 | Login: 617-739-WRLD
Well, I thought people may be interested to know that the
algorithm for allocating memory was not the only thing changed
with v3.2 of AIX's malloc package. The procedure that free uses
has changed with the new OS. Free used to just leave the unallocated
memory alone after it was called. In the new version, free also
zeroes out the freed block.
We discovered the above behavior while trying to discover why an
old program here (about 8 yrs) was failing. We were trying to decide
whether we should change it to be compatible with v3.2 malloc.
We could find no places where it was using more memory than it
requested, but we did discover that it was trying to read an area
of memory that had just been freed. Not sure why the previous
programmer did this, but we now know what free is doing.
--
David Knight French Email Addresses: Phone:
UCSD - Dept of Chemistry INTERNET: dkfr...@ucsd.edu (voice/fax)
9500 Gilman Drive UUCP: ucsd!dkfrench (619)534-4193 /
San Diego, CA 92093-0314 BITNET: dkfr...@ucsd.bitnet (619)534-6255
Note that the behavior of malloc(0) is implementation-defined. It is up to the system
to decide whether to return a pointer to a small amount of space, or to
return NULL. G++ works the way it does probably because GNU malloc()
returns a pointer to a small chunk of memory. There are other non-AIX
systems which work as AIX does (as well as ones which don't ...).
Quite simply, the GNU people should stick to the standards as written, if
they expect their software to be portable.
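Portable code that must not care which way a given system goes can sidestep the question with a thin wrapper, along these lines. The names are made up for illustration, and this is not what G++ does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Like malloc, except that a zero-byte request is bumped to one byte,
 * so a NULL return always means the allocation failed. */
void *malloc_nonzero(size_t n)
{
    return malloc(n == 0 ? 1 : n);
}

/* Allocate an array of count elements of elem_size bytes each.
 * count may be zero. Returns NULL if the total size would overflow
 * or if the allocation fails. */
void *array_alloc(size_t count, size_t elem_size)
{
    assert(elem_size > 0);
    if (count > (size_t)-1 / elem_size)
        return NULL;
    return malloc_nonzero(count * elem_size);
}
```

The pointer returned for a zero-length array can be passed to free() like any other, but must not be dereferenced as an element.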
ObDisclaimer: I don't speak for IBM.
--
John F. Haugh II | MaBellNet: (512) 823-1078 | SneakerNet: 042/2F068
InterNet: jfh%snowball.au...@ibmpa.awdpa.ibm.com [TSAKC]
< stuff deleted >
> As can be seen by the entire post, trapping for SIGDANGER in and of itself does
> nothing. If you trap for, and receive a SIGDANGER, you must free page space
> immediately, or you will receive a SIGKILL shortly thereafter.
Still not the truth. If a process catches SIGDANGER then it is exempt from
being sent SIGKILL due to a low paging space condition. The reason for sending
SIGDANGER is to notify processes to free up resources so that the system won't
run out. If paging space isn't freed up and it continues to run out then some
other process (which doesn't catch SIGDANGER) will get the SIGKILL.
>Let me just add another problem with malloc: We use the amd automounter here
>which every now and then dies without any obvious reason. If you run
>critical applications on such machines, your apps won't be able to do any
>saving just because the filesystem isn't there anymore.
what has this 2 do w/ malloc?
amd is a 3rd party PD thing rite?
have you tried the latest supported automounter on 3.2?
it is in U404456
does it coredump on you also?
--
--
cu...@aixwiz.austin.ibm.com (Curt L. Finch) | AIX NFS/NIS Field Quality
My views are unrelated to those of IBM | Austin, TX
There'll be too many elderly in 30 years for your kids to afford all the FICA.
Get a clue, guys.
Amd is a system critical piece of software that mallocs memory. AIX
may decide to kill it at random if the system goes low on paging
space. If this happens, nobody can get to their files and the system
crashes.
The fact that IBM has its own automounter is irrelevant. There are
good reasons for running the amd automounter (we run it on ALL of our
machines). More importantly, there are good reasons why IBM customers
might want to have applications that aren't gunned down at random.
Face it, guys. The whole lazy memory allocation / SIGDANGER hack is a
design flaw. Killing processes at random is not an acceptable way to
handle resource limitations. Requiring otherwise portable code to
trap for SIGDANGER and deal with it is not a reasonable constraint
just to be able to run the code on AIX. The AIX kernel simply
shouldn't be doing things like this.
Damn it, why does IBM insist on gratuitous incompatibilities in every
part of their system?
"Pray that you never have to use this system"
-- me, 2 years ago, on my first exposure to AIX
--
Keith Moore / U.Tenn CS Dept / 107 Ayres Hall / Knoxville TN 37996-1301
Internet: mo...@cs.utk.edu BITNET: moore@utkvx
No, it worked that way because someone made a mistake.
> Quite simply, the GNU people should stick to the standards as
> written, if they expect their software to be portable.
I'll omit the requisite flame.
--
\ Charles Hannum, myc...@ai.mit.edu
/\ White heterosexual atheist male (WHAM) pride!
Named 4.8.3 is a system critical piece of software in that if you can't
reach your nameserver on AIX you get to sit through MINUTES of retries on a
telnet login before the system finally allows you to get a login prompt.
While it doesn't get KILLed, it DOES freeze. Running AIX's version (which
is about 5 revs old) isn't an option -- that one will decide to go out to
infinite-loop lunch at random, eating all available CPU and not answering
any requests either. So what kind of option is available here? Don't run
BIND? /etc/hosts files? Are you serious?
>The fact that IBM has its own automounter is irrelevant. There are
>good reasons for running the amd automounter (we run it on ALL of our
>machines). More importantly, there are good reasons why IBM customers
>might want to have applications that aren't gunned down at random.
Like named 4.8.3, the current version which is running all over the Internet
- except on AIX machines where it hangs 'cause it feels like it.
I've NEVER seen this on any other system.
>Face it, guys. The whole lazy memory allocation / SIGDANGER hack is a
>design flaw. Killing processes at random is not an acceptable way to
>handle resource limitations. Requiring otherwise portable code to
>trap for SIGDANGER and deal with it is not a reasonable constraint
>just to be able to run the code on AIX. The AIX kernel simply
>shouldn't be doing things like this.
Worse, you CANNOT deal with this effectively. What are you supposed to do
if you get a SIGDANGER? Exit? Then why bother trapping it? And if you
don't exit when you get one? You may get a SIGKILL shortly thereafter
which you >cannot< trap or do anything about!
What if your program MUST continue to run, no ifs ands or buts? What if
that program is (for example) reading financial transactions and its exit
means that you lose transactions (and therefore $$$)? Then you are hosed.
Shall people just forget about AIX machines for critical applications where
it cannot be relied on to leave processes alone or give them fair notice
when something isn't going to work? Why not just return NULL from that
malloc anyway?
Note that our company here >does< have mission-critical applications
running on this hardware.... for now. We too want this fixed -- and
fixed pronto.
>Damn it, why does IBM insist on gratuitous incompatibilities in every
>part of their system?
>
>"Pray that you never have to use this system"
>
> -- me, 2 years ago, on my first exposure to AIX
--
Whoa there.
I believe some versions of the Mach operating system operate on the 'lazy
allocation' scheme as well. They will tell you that a malloc(MAXINT)
succeeded but the machine will run out of VM and the processes will crash
and burn if you actually try to use those pages. This was done at CMU
precisely to support sparse arrays for some Common LISP implementations.
(The "nonsense" about sparse arrays while discussing this topic awhile
back). This means that running code on Mach machines can also cause your
program to die in mid-run even though it never got a NULL from malloc().
On the Mach based NeXT machines (at least in the early versions, I don't
know about more recent ones) they handle this in a somewhat more elegant
manner. The "swap space" is actually a normally accessible file in the
filesystem that shares disk with all the other normal files. When the
disk fills up and swapping becomes impossible, the processes that would
have overexpanded the swap file are suspended and a warning dialog comes
up telling you to free up some disk space by deleting files or killing the
larger processes. Then, after there is sufficient space for swapping to
begin again, the processes are awoken and continue running.
The reason for SIGDANGER (which is IBM only as far as I know) was to allow
programs to make up for this behavior. In AIX 3.1 the (unfortunate)
behavior of the kernel was to kill the largest process, which frequently
was the one you most wanted to continue running, or X, or some other
process whose death made you want to kick the machine. In AIX 3.2 the
kernel killing behavior has changed to factor in the process's age and to
try and not kill the more recently active processes.
There are arguments for both behaviors of malloc and given there is a
mechanism for handling the differences brought about by the one chosen for
AIX, and given the runtime speedup gained by that choice it seems to have
been the right one _for most cases_ though as you found out there are
always exceptions.
+ Marc Pawliger IBM Advanced Workstations Division Palo Alto, CA +
| Internet ma...@awdpa.ibm.com UUCP uunet!ibminet!marc |
| IBMinet ma...@ibmpa.awdpa.ibm.com phone (415) 855-3493 |
+ VNET MARCP at AUSTIN IBM phone T/L 465-3493 +
These are my views, not those of my employer, etc etc etc...
I haven't checked Release 2, but in Release 1 it didn't do anything
that nice; it just popped a little box saying "SHIIIIT!!! I gave you
too much memory! G'night!" (not quite in those words), and panicked.
This is even worse than what AIX does.
One old idea is to allocate backing store for COW and zero pages, so
the VM is always there. This could require very large swap spaces for
some applications, but it would alleviate this problem. Having this as
an option couldn't hurt, anyway.
This type of behavior is also seen on SGI machines with the default
value of a sysgen parameter, which can be changed to the more
traditional behavior with a change in the parameter. SGI makes the
argument that for Fortran programs, since Fortran does not have
any dynamic memory allocation capability (at least Fortran 77 does
not), you traditionally make your arrays as large as the largest
case you might wish to run. If you run a smaller case, then the
arrays although in your address space are never really created,
and no space is required for them. As a result you can get
more jobs done in a period of time with this sort of behavior.
Thus the behavior IBM has selected has reasons for it; the best
solution would be to enable the SYSTEM Administrator to turn
the feature on and off as needed.
Like the title says, how does the O/S know a SIGDANGER was caught?
--
Bruce E. Parkin
br...@ncs.com
We have seen that the X server will just go away without a trace occasionally,
when running our application (it is an imaging application - very memory
intensive). It causes screen lock ups and very unhappy customers.
If what was stated above is true, then if the X server caught SIGDANGER,
a lot of *old* problems would go away.
1) Is my assumption correct - does the IBM version of the X server not
catch SIGDANGER?
2) Why not?
3) Are there any plans for it to do so in the future?
> In article 16876, mo...@cs.utk.edu (Keith Moore) writes:
Bunch of stuff deleted here....
> was the one you most wanted to continue running, or X, or some other
> process whose death made you want to kick the machine. In AIX 3.2 the
> kernel killing behavior has changed to factor in the process's age and to
> try and not kill the more recently active processes.
NOT! The youngest process is killed on 3.2, not the oldest or the middling,
but it kills them according to age, youngest to oldest....
Richard Hasting, Computer Literacy 2000 Project Office
ric...@kunikpok.uucp
"When all is said and done, more is said than done."
UNIX has always allowed memory allocated thru brk(2) (and other anonymous
memory) to be used without the user having to worry about memory
being backed by swap or not. If the memory is successfully allocated,
then all of it can be used.
The fact that some versions of Mach break this traditional UNIX semantic
is no excuse for not implementing the UNIX behaviour. Mach is not UNIX;
AIX is supposed to be UNIX. If the designers of AIX wanted to allow
sparse address spaces, or sparse mappings within an address space, then
they should have created mechanisms that allowed the non-standard behaviour
to be requested (on a per process basis).
It is important to note that processes on real UNIXes can also be killed
when they attempt to grow their stack and the growth fails because there
is no paging space available to back it. This behaviour is not seen very
frequently though.
>...
>
>There are arguments for both behaviors of malloc and given there is a
>mechanism for handling the differences brought about by the one chosen for
>AIX, ...
The SIGDANGER mechanism doesn't seem to be enough to handle the difference.
>... and given the runtime speedup ...
There is no runtime cost associated with reserving paging space when
some amount of anonymous memory is allocated. Simply decrement a counter
by that amount; the counter reserves that amount of paging space for
the mapping. When the pages are finally created (e.g. on first store
for brk(2), or when breaking copy-on-writes that resulted from a fork(2),
etc.), the paging space is actually allocated. Allocation can also be
done at pageout time to aid in clustering pageout operations.
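A minimal sketch of the reservation counter described above, in C. The names swap_account, swap_reserve and swap_release are hypothetical and are not taken from AIX or any other kernel. A real kernel would also need locking around the counter, which is left out here.

```c
#include <stddef.h>

/* Hypothetical accounting state: pages of paging space not yet promised
 * to any mapping. Nothing is written to disk at reservation time. */
struct swap_account {
    size_t unreserved_pages;
};

/* Reserve npages when anonymous memory is allocated.
 * Returns 0 on success, -1 if the space cannot be promised, in which
 * case the allocation itself should fail (e.g. brk() returns an error). */
int swap_reserve(struct swap_account *acct, size_t npages)
{
    if (npages > acct->unreserved_pages)
        return -1;
    acct->unreserved_pages -= npages;
    return 0;
}

/* Give the reservation back when the mapping is freed. */
void swap_release(struct swap_account *acct, size_t npages)
{
    acct->unreserved_pages += npages;
}
```

The check happens at allocation time, so the caller sees a failure it can handle instead of being killed on first store.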
>... gained by that choice it seems to have
>been the right one _for most cases_ though as you found out there are
>always exceptions.
I believe that a more fundamental problem is that AIX uses large
arrays in the kernel and assumes that these arrays are backed by
paging space that is lazily allocated. Even if anonymous memory
allocated to back the user's address space was backed by paging space
it would still be possible for the system to run out of paging space
when it attempts to grow some of its sparsely populated internal
tables.
Ramon Pantin
Our IBM SE told us AIX 3.2 would "suspend" the process hogs, not kill them.
1) Is he wrong in this?
2) If right, what does he mean by suspend (he didn't know what it meant)?
It was my understanding that the process which requested the page was sent the
signal (and killed), not necessarily the largest process (as was the case
in 3.1.5). Anyone from the kernel group want to finally put this to rest?
--
Jon Alperin
Bell Communications Research
---> Internet: jo...@iscp.bellcore.com
---> Voicenet: (908) 699-8674
---> UUNET: uunet!bcr!jona
* All opinions and stupid questions are my own *
Will the person who wrote the bloody code please stand up and tell us
what it does, or at least what it's *supposed* to do? I've seen about
5 or 6 different stories, mostly from IBM people, and they all
disagree.
1) Processes are definitely killed, at least some.
2) What, *exactly*, is the algorithm used to determine which processes
get SIGDANGER, and which ones are killed?
3) Is there, in fact, a way a process can prevent itself being killed?
If so, how?
4) What's the easiest way to touch the memory I allocate? Does
calloc() actually touch newly allocated zero pages, or try to optimize
that away?
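One way to touch freshly allocated memory is to write a byte into each page, as in the sketch below. touch_pages is a hypothetical helper, not a library call, and the 4096-byte fallback page size is an assumption. Note that this only forces the pages to be backed now. On a lazily allocating system the store itself is where the process can be killed, so on its own it detects nothing.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: store a zero into one byte per page-sized stride
 * of buf, plus the last byte in case buf is not page aligned.
 * Overwrites data, so use it only on a buffer that holds nothing yet.
 * Returns the number of strided stores performed. */
size_t touch_pages(void *buf, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096;  /* fallback is an assumption */
    volatile char *p = buf;                    /* volatile: keep the stores */
    size_t touched = 0;

    for (size_t off = 0; off < len; off += page) {
        p[off] = 0;
        touched++;
    }
    if (len > 0)
        p[len - 1] = 0;   /* covers a final partial page */
    return touched;
}
```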
I certainly agree that, conceptually, a lazy memory allocation can be
beneficial to users. However, as a software engineer, I would like to
politely disagree with the last paragraph. The best solution might
involve the system administrator to some degree, but most certainly
should involve the program itself.
FORTRAN programs should, perhaps, default to lazy memory allocation,
with a compiler switch or some other mechanism for switching to
non-lazy memory allocation. On the other hand, C programmers tend to
have an idea of whether or not their code will benefit from sparse
memory management. So C programs should probably default to non-lazy
memory allocations with a compiler switch, pragma, compiler macro, or
some other mechanism which allows the code to specifically use lazy
allocation where desired.
Then, on top of that, allow the system administrator to turn off lazy
allocation entirely, thus forcing attempts at explicit lazy allocation
to actually be non-lazy. I don't think you want the system
administrator to be able to force all allocations to be lazy, because
that would end up with the exact problem we have now: SIGKILL.
--
The above statements are not the opinions or policies of SPSS Inc.
The above statements may not be the opinions of Brent Lambert.
The first disclaimer is a policy of SPSS Inc.
Subsequent disclaimers are probably the opinion of Brent Lambert.
> Whoa there.
>
> I believe some versions of the Mach operating system operate on the 'lazy
> allocation' scheme as well. They will tell you that a malloc(MAXINT)
> succeeded but the machine will run out of VM and the processes will crash
> and burn if you actually try to use those pages. This was done at CMU
> precisely to support sparse arrays for some Common LISP implementations.
> (The "nonsense" about sparse arrays while discussing this topic awhile
> back). This means that running code on Mach machines can also cause your
> program to die in mid-run even though it never got a NULL from malloc().
Mach is research code. AIX is supposed to be a production-quality OS.
> The reason for SIGDANGER (which is IBM only as far as I know) was to allow
> programs to make up for this behavior. In AIX 3.1 the (unfortunate)
> behavior of the kernel was to kill the largest process, which frequently
> was the one you most wanted to continue running, or X, or some other
> process whose death made you want to kick the machine. In AIX 3.2 the
> kernel killing behavior has changed to factor in the process's age and to
> try and not kill the more recently active processes.
You're missing the point. The kernel shouldn't be making these kinds
of decisions at all.
If you had a special linker option that said "I don't need backing
store for my unused heap space", and maybe another that said "you can
nuke me from orbit if the system is out of memory", those might be
desirable features for non-critical Fortran programs.
> There are arguments for both behaviors of malloc and given there is a
> mechanism for handling the differences brought about by the one chosen for
> AIX, and given the runtime speedup gained by that choice it seems to have
> been the right one _for most cases_ though as you found out there are
> always exceptions.
Gee...the "right" choice for most cases is one that sacrifices program
correctness in favor of speed. And I thought *I* was from a strange
planet.
So when my financial application, which has been running since the system
booted, and is receiving transactions from a wire feed decides to get kicked
by the OS I can say "ok, you made the right choice" -- right?
Should I conclude that AIX is not suitable for financial applications, or
those which need to be running for long periods, or those which might cost
the company money if the application is crashed by the OS under arbitrary
conditions which I cannot predict?
It definitely did. If you're dumb enough to run anything important
under *any* version of Unix, you deserve it. B-)/2
> [...] if the application is crashed by the OS under arbitrary
> conditions which I cannot predict?
I would hardly call "running out of promised backing store" an
"arbitrary condition". The real answer is not to run anything else on
that machine; you shouldn't anyway, as doing so greatly increases the
risk of the machine crashing.
>In article <1992Sep16.1...@awdprime.austin.ibm.com>, cu...@ekhadafi.austin.ibm.com (Curt Finch 903 2F021 cu...@aixwiz.austin.ibm.com 512-838-2806) writes:
>> ma...@techfak.uni-bielefeld.de (Malte Uhl) writes:
>> >Let me just add another problem with malloc: We use the amd automounter here
>> >which every now and then dies without any obvious reason. If you run
>> >critical applications on such machines, your apps won't be able to do any
>> >saving just because the filesystem isn't there anymore.
>> have you tried the latest supported automounter on 3.2?
>> it is in U404456
>> does it coredump on you also?
>The fact that IBM has its own automounter is irrelevant. There are
>good reasons for running the amd automounter (we run it on ALL of our
>machines). More importantly, there are good reasons why IBM customers
>might want to have applications that aren't gunned down at random.
It is possible that AMD is dying because the MALLOCTYPE environment
variable isn't set to "3.1". BTW -- is there any way to select the
"old" malloc form within a C application? I tried putenv() without
success, and am wondering if there's a magic incantation of xmalloc()
that will work... Otherwise I'm faced with a shell-script wrapper or
a putenv() followed by an exec() variant.
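The "putenv() followed by an exec()" route might look like the sketch below. The helper names are hypothetical. The only things taken from the post are the MALLOCTYPE variable and its "3.1" value. That a plain putenv() fails because the allocator has already been chosen by the time main() runs is an assumption, not something confirmed here.

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns 1 if MALLOCTYPE is already "3.1" in this process's environment. */
int malloctype_is_31(void)
{
    const char *cur = getenv("MALLOCTYPE");
    return cur != NULL && strcmp(cur, "3.1") == 0;
}

/* Hypothetical helper, meant to be called first thing in main().
 * Returns 0 if the setting is already in place. Otherwise it sets the
 * variable and re-executes the program, so it returns (with -1) only
 * if putenv() or the exec fails. */
int reexec_with_old_malloc(char **argv)
{
    static char setting[] = "MALLOCTYPE=3.1";  /* putenv keeps this pointer */

    if (malloctype_is_31())
        return 0;
    if (putenv(setting) != 0)
        return -1;
    execvp(argv[0], argv);   /* does not return on success */
    return -1;
}
```

The re-executed copy finds the variable already set and falls straight through, so the exec happens at most once.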
-John
--
John P. Eisenmenger ###### MAIL ADDRESS CHANGED 8/1/92 !!!!!
Duke University
Department of Electrical Engineering
Box 90291
Durham, NC 27708-0291
Part of the problem is that the algorithm has changed over time. In releases
prior to AIX 3.2 the process selected to be killed was the process with the
largest amount of paging space. In AIX 3.2 the process selected is the
youngest process (u.u_start) which is eligible to be killed.
The following information is common to all releases:
- SIGDANGER is sent to all active processes (except kprocs) when the paging
space warning threshold is reached
- Processes which catch SIGDANGER are exempt from being killed -- they will
never get a SIGKILL due to a low paging space condition
- An 'eligible' process is any active process (other than kprocs or 'init')
which does not catch SIGDANGER
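The selection rules above can be sketched in C as follows. This is an illustration of the description only, not the kernel code. struct proc_info and both function names are hypothetical, and start_time merely stands in for u.u_start.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical per-process record; not the AIX proc or user structure. */
struct proc_info {
    int    pid;
    time_t start_time;        /* stands in for u.u_start */
    size_t paging_pages;      /* paging space in use */
    int    is_kproc;
    int    is_init;
    int    catches_sigdanger;
};

/* Eligible: not a kproc, not init, and does not catch SIGDANGER. */
static int eligible(const struct proc_info *p)
{
    return !p->is_kproc && !p->is_init && !p->catches_sigdanger;
}

/* Rule described for AIX 3.2: youngest eligible process.
 * Returns an index into procs, or -1 if nothing is eligible. */
int pick_victim_youngest(const struct proc_info *procs, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!eligible(&procs[i]))
            continue;
        if (best < 0 || procs[i].start_time > procs[best].start_time)
            best = (int)i;
    }
    return best;
}

/* Rule described for earlier releases: largest paging space user. */
int pick_victim_largest(const struct proc_info *procs, size_t n)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!eligible(&procs[i]))
            continue;
        if (best < 0 || procs[i].paging_pages > procs[best].paging_pages)
            best = (int)i;
    }
    return best;
}
```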
To answer your last question, calloc() does touch the allocated memory.
----
Scott L. Porter IBM AWD Austin / AIX Kernel Development
Owner of the "bloody code" Internet: sc...@glasnost.austin.ibm.com
Whenever I've seen the X server go away and the keyboard has been "locked
up", I've been able to get the HFT and keyboard back by switching the device
out of "monitor mode" and into "keyboard send/receive mode." The IOCTL
command for entering monitor mode is "HFSMON", and for leaving is "HFCMON",
so you would need to write a small C program that does the following -
-- begin hftmon.c --
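/* hftmon: switch an HFT device into or out of monitor mode. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/hft.h>

int main (int argc, char **argv)
{
    int fd;
    int on;

    if (argc != 3) {
        fprintf (stderr, "usage: hftmon device [on|off]\n");
        exit (1);
    }
    /* Open the (possibly stuck) HFT device without blocking. */
    if ((fd = open (argv[1], O_RDWR|O_NDELAY)) == -1) {
        perror (argv[1]);
        exit (1);
    }
    /* "on" enters monitor mode (HFSMON); anything else leaves it (HFCMON). */
    on = strcmp (argv[2], "on") == 0;
    if (ioctl (fd, on ? HFSMON : HFCMON) == -1) {
        perror ("ioctl");
        exit (1);
    }
    exit (0);
}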
--
I've not compiled this, and IBM does not support this code, but it should
do what you need, and is provided as an example of how to switch between
monitor mode and normal mode. No warranty is made as to whether or not
this code will solve your problem; it is provided on an "AS-IS" basis.
To use this command, compile as "hftmon" and invoke as "hftmon <device
that is stuck> off". If, for example, you opened X on /dev/hft/0 and it
is stuck, "hftmon /dev/hft/0 off" should unstick it.
--
John F. Haugh II | MaBellNet: (512) 823-8817 | SneakerNet: 042/2D034
InterNet: j...@eureka.aixserv.austin.ibm.com [TSAKC]
Disclaimer: I am not a representative of IBM. I speak for myself only.
--
+-----All Views Expressed Are My Own And Are Not Necessarily Shared By------+
+------------------------------My Employer----------------------------------+
+ Ronald S. Woan wo...@cactus.org or wo...@vnet.ibm.com +
+ other email addresses Prodigy: XTCR74A Compuserve: 73530,2537 +
Let me just mention that our Suns (after applying the obligatory patches)
*never* crashed. I mean, the average uptime between crashes is measured in
months, not hours.
Malte
> In article <FJDFRB...@kunikpok.UUCP> ric...@kunikpok.UUCP (Richard A. Has
> >ma...@ibmpa.awdpa.ibm.com (Marc Pawliger) writes:
> >
> >> In article 16876, mo...@cs.utk.edu (Keith Moore) writes:
> >
> >Bunch of stuff deleted here....
> >
> >> was the one you most wanted to continue running, or X, or some other
> >> process whose death made you want to kick the machine. In AIX 3.2 the
> >> kernel killing behavior has changed to factor in the process's age and to
> >> try and not kill the more recently active processes.
> >
> >NOT! The youngest process is killed on 3.2, not the oldest or the middling,
> >but it kills them according to age, youngest to oldest....
> >
> >
>
> Our IBM SE told us AIX 3.2 would "suspend" the process hogs, not kill them.
>
> 1) Is he wrong in this?
> 2) If right, what does he mean by suspend (he didn't know what it meant)?
>
>
> --
>
> Bruce E. Parkin
> br...@ncs.com
This is a different item altogether. The SE is right, and I am right.
The process hogs are the CPU hogs, not the memory hogs... There are also
provisions for paging hogs. These may or may not be the same as the
memory hogs, but when the OS determines thrashing, for example, it
suspends some of the offending processes. This might just be for a
second at a time, or for several seconds (complex algorithm here which I
don't want to get wrong) and then it releases some of the offenders. If
thrashing comes back, it suspends the offenders. When thrashing goes
below the threshold it reactivates "some" of these suspended processes
again... and so forth. I don't remember the details of the CPU hogs.
There was an article in the spring on this in one of the AIX mags put out
by IBM. I could look it up if anyone wishes.
Richard
Richard Hasting, Computer Literacy 2000 Project Office
Why should an otherwise portable C program that runs unmodified on
every other UNIX system in the world have to trap SIGDANGER just so it
can run on AIX? Why should I, as an application developer, have to
even waste one millisecond of my time learning about, working around,
and testing for such things? Why does IBM feel like it can change
well-established interfaces on a whim and call the result compatible
with Unix?
Why should a user or sysadmin of AIX boxes have to worry about whether
all of the applications she wants to use have added AIX-specific code
to trap for SIGDANGER in order that they not be nuked by the kernel?
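For reference, the AIX-specific code in question could be as small as the sketch below. The function and flag names are hypothetical. SIGDANGER exists only where the system headers define it, so the sketch compiles to a no-op elsewhere. What a program should actually do once the flag is set (free caches, checkpoint, exit cleanly) is left open.

```c
#include <signal.h>
#include <string.h>

/* Set by the handler; the main loop can poll it and shed memory. */
volatile sig_atomic_t paging_space_low = 0;

void on_sigdanger(int signo)
{
    (void)signo;
    paging_space_low = 1;   /* only async-signal-safe work here */
}

/* Returns 1 if a handler was installed, 0 if this platform has no
 * SIGDANGER, -1 if sigaction() failed. */
int install_sigdanger_handler(void)
{
#ifdef SIGDANGER
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigdanger;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGDANGER, &sa, NULL) == 0 ? 1 : -1;
#else
    return 0;
#endif
}
```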
The claim is that the lazy allocation strategy improves performance,
but I don't understand why it's a big performance hit for AIX to
simply maintain a count of the number of pages of backing store that
are still available, decrement this count every time storage is
allocated, and return an error if there aren't enough pages available.
Maybe you should tell that to the piece-o'-junk SPARC 2 I use, running
SunOS 4.1.2. It's basically useless as anything other than an X
terminal.
--
\ / Charles Hannum, myc...@ai.mit.edu
/\ \ RSA public key available on request.
Scheme White heterosexual atheist male (WHAM) pride!
(I deal with the paging space problem by having enough paging space.
In my case (a three user system) that is 160 megabytes, or about $320
worth of disk space.)
--
Marc Auslander (IBM)<ma...@marc.watson.ibm.com> 914 784-6699
(Internet)<ma...@watson.ibm.com>
Some mention has been made about why the AIX method of allocating paging space
is good. If you aren't impressed with an application's ability to utilize
sparse arrays then consider the following. The kernel defines many of its
data structures in virtual memory and makes them pageable. These structures
are defined to be large to support a wide range of system configurations and
loads. The sysadmin doesn't have to re-build the AIX kernel in situations where
other kernels would have to be rebuilt to increase the size of these structures.
What the sysadmin DOES have to do is make a reasonable estimate of the peak
utilization of paging space for the system and define enough paging space so
that the SIGDANGER event never occurs.
Granted, this doesn't take care of run-away processes or other unexpected uses
of paging space. Granted, the whole SIGDANGER implementation is ugly and
breaks the conventions previously associated with the use of malloc(). But if
you are running mission-critical applications it seems reasonable to require
the system to be configured (i.e. enough paging space defined) to handle
peak loads.
Providing the semantics of a malloc() which allocates paging space up front is
not an easy job on a pageable kernel. If it were, we would have done so long
ago.
Scott L. Porter IBM AWD Austin / AIX Kernel Development
Internet: sc...@austin.ibm.com
That's funny, I get the same behavior out of my SPARC 2 boxes that Malte
gets, and I use my RS6000 220 as an Xterminal into the Sparc. It looks
good running OW3 ;-).
--
Dewey Paciaffi ed...@huber.com