The above-mentioned server has a problem: postfix keeps terminating
at regular intervals. I believe I have narrowed the problem down to a
program eating a lot of memory, likely due to multiple forks each
wanting to allocate 10 GB, and causing postfix to quit. I have no
control over that program's source code. CPU utilisation is above 95%,
nearing 100%, when this happens.
I am not sure if it's the kernel's OOM killer or postfix itself which
causes it to quit. I do not see a mention in /var/log/messages about the
OOM killer killing postfix, but I do see an entry in /var/log/mail.info
reading "postfix/master[14959]: terminating on signal 15", which is
the default kill signal.
Would postfix kill itself if it failed to allocate memory? To me that
seems the preferred behaviour for any such program. Or what is postfix'
behaviour under such high load conditions? It's being run with -o stress.
Is it possible to find out if in fact the OOM killer killed it, lacking
a specific log entry indicating it?
Thank you,
Jeroen
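For readers who want to check the kernel log for OOM-killer activity, here is a minimal Python sketch. The message patterns are assumptions based on wording commonly seen in Linux kernel logs ("invoked oom-killer", "Out of memory: Kill process ..."); the exact text varies between kernel versions, and the function name find_oom_events is hypothetical.

```python
import re

# Assumed patterns: typical Linux kernel OOM messages. The wording differs
# between kernel versions, so treat this as a starting point only.
OOM_PATTERN = re.compile(
    r"invoked oom-killer|Out of memory: Kill(?:ed)? process (\d+)",
    re.IGNORECASE,
)


def find_oom_events(lines):
    """Return a list of (pid_or_None, line) for lines that look like OOM events."""
    events = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            # The pid is only present on the "Kill process <pid>" form.
            pid = int(match.group(1)) if match.group(1) else None
            events.append((pid, line.rstrip("\n")))
    return events
```

It could be fed an open log file, for example the /var/log/messages file mentioned above, and the reported pids compared with the pid in the postfix/master log lines.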
> I am not sure if it's the kernel's OOM killer or postfix itself which
> causes it to quit. I do not see a mention in /var/log/messages about the
> OOM killer killing postfix, but I do see a mention in /var/log/mail.info
> about reading "postfix/master[14959]: terminating on signal 15" which is
> the default kill signal.
>
> Would postfix kill itself if it failed to allocate memory? To me that
> seems the preferred behaviour for any such program. Or what is postfix'
> behaviour under such high load conditions? It's being run with -o stress.
The master(8) signal handler is receiving SIGTERM and thus terminating.
Postfix is only killing itself because it's being *told* to do so. I am not
sure how you can debug *who* is transmitting the signal.
--
Sahil Tandon <sa...@tandon.net>
1. The OOM killer _always_ uses SIGKILL. A program is not even given the
possibility to react.
2. root processes (like master) have a much lower probability of getting
killed than the processes of other users.
> Would postfix kill itself if it failed to allocate memory?
That would be either a SIGSEGV if the kernel is unable to find any
suitable memory or SIGABRT if the system library does it.
> Or what is postfix'
> behaviour under such high load conditions? It's being run with -o stress.
"-o stress" or "-o stress=yes"?
> Is it possible to find out if in fact the OOM killer killed it, lacking
> a specific log entry indicating it?
No. At least, it does not match the behaviour of the OOM killer.
If you want to know the culprit, patch master to output the information
in the siginfo_t structure, which includes things like the sender
(kernel/userland) and pid.
Bastian
--
She won' go Warp 7, Cap'n! The batteries are dead!
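To illustrate the siginfo_t idea above: this is a standalone Python sketch, not a patch to master (which is written in C and would install its handler via sigaction() with SA_SIGINFO). It shows which fields identify the sender of a signal. The function name wait_for_signal_sender is hypothetical, and the "si_code <= 0 means userland" test is the Linux convention, assumed here.

```python
import signal


def wait_for_signal_sender(signum=signal.SIGTERM, timeout=None):
    """Block `signum`, wait for it, and report who sent it.

    Returns a dict with the signal number, sender pid and uid, and whether
    the signal came from userland, or None if `timeout` seconds elapse.
    """
    # The signal must be blocked so it stays pending instead of running
    # the default action (which would terminate the process).
    signal.pthread_sigmask(signal.SIG_BLOCK, {signum})
    if timeout is None:
        info = signal.sigwaitinfo({signum})
    else:
        info = signal.sigtimedwait({signum}, timeout)
    if info is None:
        return None
    return {
        "signo": info.si_signo,
        "pid": info.si_pid,   # pid of the sending process
        "uid": info.si_uid,   # real uid of the sending process
        # Assumption: on Linux, si_code <= 0 marks a signal sent by a
        # user process (kill, tgkill, sigqueue); > 0 means kernel-generated.
        "from_user": info.si_code <= 0,
    }
```

Run in a test process and sent a SIGTERM from another shell, this would report the pid of the sender, which is the information needed to tell a monitoring script apart from anything else.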
I understand.
I should add that I was misinterpreting the "terminating on signal
15". That is in fact my monitor doing its work, which I installed after
noticing postfix would regularly quit. It's likely the actual kill
signal is a SIGKILL. And as you explain, you wouldn't find a log entry
about that in postfix's logs.
> That would be either a SIGSEGV if the kernel is unable to find any
> suitable memory or SIGABRT of the system library does it.
Thanks, good to know.
> No. It does not at least match the behaviour of the OOM killer.
I turned off the kernel's overcommitting (echo "2" >
/proc/sys/vm/overcommit_memory) of memory and gave postfix a priority of
-1 and the memory consuming processes a priority of 1. That appears to
have some effect, in that it only died once instead of a few times.
> If you want to know the culprit, patch master to output the informations
> of the siginfo_t structure, which includes things like sender
> (kernel/userland) and pid.
It's a production server so I am not able to do as much as I would like to.
Another interesting thing, postfix is installed on reiserfs. Are there
known problems with this filesystem and postfix?
Thank you,
Jeroen
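As a small aid for the overcommit setting discussed above, the following Python sketch reads the current mode and describes it. The mode descriptions paraphrase the Linux kernel's overcommit accounting documentation; the function names are hypothetical, and the code only reads the setting, it does not change it.

```python
# Meaning of /proc/sys/vm/overcommit_memory values on Linux.
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit, never refuse an allocation",
    2: "strict accounting, allocations beyond the commit limit fail",
}


def parse_overcommit(text):
    """Turn the file contents (e.g. '2\\n') into (mode, description)."""
    mode = int(text.strip())
    return mode, OVERCOMMIT_MODES.get(mode, "unknown mode")


def read_overcommit(path="/proc/sys/vm/overcommit_memory"):
    """Read and describe the overcommit mode of the running kernel."""
    with open(path) as handle:
        return parse_overcommit(handle.read())
```

With mode 2, oversized allocations are refused up front instead of being granted and later resolved by the OOM killer, which is consistent with the reduced number of deaths reported above.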
Internally, "postfix stop" uses SIGTERM to terminate the master
daemon. This signal is also used by system shutdown procedures.
Other processes have no business sending SIGTERM to the Postfix
master daemon.
Wietse
Right, the monitor will use /etc/init.d/postfix stop|start in order to
attempt to restart postfix.
Greetings,
Jeroen
Postfix has proven to be rock solid, and there is no need to
make it less reliable with trigger-happy babysitters.
Wietse
I am not giving any value judgement to postfix. In fact it's one of the
few MTAs I would trust to run on any systems I manage.
The problem is not postfix, it's that something I have no direct control
over at regular daily intervals (say in the middle of the night)
consumes a lot of memory and CPU time causing postfix to quit. And I
assume it is the kernel's oom killer doing that, precisely because I do
not think postfix would just die. Nor do I know of any other
mechanism, except the oom killer, which would cause such behaviour.
Until I have found a better solution (that's why I wrote the list) a
monitor executing a restart is the best "solution". I know fully well
that's a lame solution and only avoids the real problem. Which probably
lies in the realm of convincing uppity developers that doing up to 8
forks each allocating 10 GB of memory is a bad thing.
Best regards,
Jeroen
Postfix usually logs an error if it cannot proceed. What are the
logfile symptoms (excluding those caused by your babysitter)?
Postfix master does not allocate memory after it is initialized,
so the oom killer theory seems a red herring (apart from the fact
that the oom killer would not send SIGTERM).
Wietse
Around the time the problems start there are far fewer log entries from
postfix. Normally it is writing log entries all the time, since it's a
busy server. That sounds like the server is getting overloaded and
running programs start to behave erratically.
Without the babysitter the logs would suddenly not show any postfix
entries any more and postfix wouldn't be running at all, without an
indication why. It would only be noticed once someone complained or I'd
check by hand.
I am pretty sure the babysitter does a few false restarts just because
the system is overloaded and it's not getting a response soon enough.
But without it you'd have an MTA-less system for days. It's checking
connectivity on 127.0.0.1:25 with a timeout of 35 seconds, which I
believe is rather liberal.
> Postfix master does not allocate memory after it is initialized,
> so the oom killer theory seems a red herring (apart from the fact
> that the oom killer would not send SIGTERM).
Right, I was misinterpreting the SIGTERM, which clearly comes from the
init script run by the babysitter.
Greetings,
Jeroen
It's unproductive to kill off Postfix under overload. At the very
least you should increase your 35-second deadline.
There is an entire webpage devoted to how Postfix handles overload
and what recovery mechanisms already exist.
http://www.postfix.org/STRESS_README.html
Wietse
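A less trigger-happy version of the monitor discussed in this thread could look like the Python sketch below. It probes the SMTP port with a longer deadline and only recommends a restart after several consecutive failures. The 120-second timeout and the threshold of 3 are illustrative assumptions, not values recommended anywhere in this thread, and both function names are hypothetical.

```python
import socket


def smtp_banner_ok(host="127.0.0.1", port=25, timeout=120.0):
    """Return True if the server sends a 220 greeting within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(512)
    except OSError:
        # Covers refused connections and timeouts alike.
        return False
    return banner.startswith(b"220")


def should_restart(results, threshold=3):
    """Decide on a restart from probe history (list of bools, newest last).

    Only returns True when the last `threshold` probes all failed, so a
    single slow response under load does not trigger a restart.
    """
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

The monitor would append each smtp_banner_ok() result to a history list and run the init script only when should_restart() returns True.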