[erlang-questions] heart prevents beam from creating crash dumps

Richard Carlsson

Aug 25, 2012, 3:39:05 PM

to erlang-bugs, erlang-questions

We have had a long-standing problem with not getting any Erlang crash
dumps at all on our live servers. I finally figured out why it happens.
I have already reported this to the OTP folks, but I thought I should
send a summary to the mailing lists for documentation and to give people
a heads-up.

The problem occurs when you start Erlang with the -heart flag
(http://www.erlang.org/doc/man/heart.html). This spawns a small external
C program connected through a port. From Erlang's point of view it's
like any other port program. The heart program pings the Erlang side
every now and then, and if it gets no reply within HEART_BEAT_TIMEOUT
seconds, or if the connection to Erlang breaks, it assumes the Beam
process has gone bad and kills it off with a SIGKILL, and then restarts
Erlang using whatever HEART_COMMAND is set to. So far so good.
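
A minimal sketch of the watchdog decision described above, assuming the two processes talk over a pipe. The names heart_wait and enum heart_action are hypothetical and not taken from OTP's heart.c; the sketch only shows that a timeout with no message, or EOF because the Erlang end of the pipe was closed, both count as a dead node.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result of one heartbeat wait (illustrative names,
 * not taken from OTP's heart.c). */
enum heart_action { HEART_OK, HEART_KILL_AND_RESTART };

/* Wait up to timeout_sec seconds for a message from the Erlang side
 * on fd. A timeout, a poll error, or EOF (the Erlang end of the pipe
 * was closed) all mean the node is considered dead. */
enum heart_action heart_wait(int fd, int timeout_sec)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_sec * 1000);
    if (n <= 0)              /* 0: timeout; -1: error (EINTR is not retried here) */
        return HEART_KILL_AND_RESTART;

    char buf[64];
    ssize_t r = read(fd, buf, sizeof buf);
    if (r <= 0)              /* 0: EOF, connection broken; -1: read error */
        return HEART_KILL_AND_RESTART;
    return HEART_OK;         /* some data arrived, so the node is alive */
}
```

On HEART_KILL_AND_RESTART a real watchdog would go on to send SIGKILL to the Beam process and run HEART_COMMAND; that part is left out of the sketch.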

Normally, when Beam detects a critical situation (e.g., out of memory)
and decides to shut down, it will create an erl_crash.dump file (or
whatever ERL_CRASH_DUMP is set to). This information can greatly help in
figuring out what went wrong. But if the system that crashed was large,
the crash dump file can take quite a long time to create. In order to
make it possible to restart the node (reusing the node name) while the
old defunct system is still writing the crash dump, Beam wants to drop
its connection to the EPMD service before it starts writing the dump,
making it look like the old node has disappeared.
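
As a small illustration of the file-name rule above (a sketch, not the emulator's code; crash_dump_path is a hypothetical helper), the dump goes to the file named by ERL_CRASH_DUMP when that variable is set, and to erl_crash.dump in the current working directory otherwise.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: the crash dump file name is taken from the
 * ERL_CRASH_DUMP environment variable if it is set, and otherwise
 * defaults to erl_crash.dump (relative to the current directory). */
const char *crash_dump_path(void)
{
    const char *p = getenv("ERL_CRASH_DUMP");
    return p != NULL ? p : "erl_crash.dump";
}
```

What matters in the paragraph is the ordering: the EPMD connection is dropped first, and only after that is this file opened and written.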

The code that does this is the function prepare_crash_dump() in
erts/emulator/sys/unix/sys.c. The problem from the perspective of the C
code is that the connection to EPMD is on some unknown file descriptor
(just like heart, this has been started as a port from Erlang code). The
solution they chose, and which has been part of the OTP system for
years, is to close _all_ file descriptors except 0-2. This certainly has
the desired effect that EPMD releases the node name for reuse. But it
also, when the loop gets to file descriptor 10 or thereabouts (probably
depending on your system), has the effect of breaking the connection to
the heart program.
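
A sketch of the close-everything strategy, assuming max_fd comes from something like sysconf(_SC_OPEN_MAX). The names close_all_but_std and fd_is_open are illustrative, not the actual code in sys.c. The point is that the loop has no way to tell the EPMD socket from the heart pipe, so both get closed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Close every descriptor from 3 up to max_fd - 1, keeping only
 * stdin, stdout and stderr. Illustrative only, not the sys.c code. */
void close_all_but_std(int max_fd)
{
    for (int fd = 3; fd < max_fd; fd++)
        close(fd);   /* EBADF for unused numbers is simply ignored */
}

/* Returns 1 if fd refers to an open descriptor, 0 if it is closed. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}
```

In the real system the other end of the heart pipe belongs to the separate heart process, which sees EOF (read() returning 0) as soon as Beam's end is closed.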

In these multicore days, the effect is almost instantaneous. The heart
program immediately wakes up due to the broken pipe and sends SIGKILL to
Beam for good measure, to make sure it's really gone, and then it starts
a new Erlang node. Meanwhile, the old node is still busy closing file
descriptors. Sometimes it makes it as far as 12 before SIGKILL arrives.
The poor thing never has a chance to even open the crash dump file for
writing. And your operations people only see a weird restart without any
further clues.

I don't have a good solution right now, except "don't use -heart". And
it might be that one wants to separate the automatic restarting of a
crashed node from the automatic killing of an unresponsive node anyway.
Suggestions are welcome.

/Richard
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Steve Vinoski

Aug 25, 2012, 3:48:42 PM

to Richard Carlsson, erlang-questions, erlang-bugs

Hi Richard, I hit this problem a few years ago. Here's the thread
starting from where I posted a temporary solution:

http://erlang.org/pipermail/erlang-questions/2010-August/052970.html

Unfortunately no patches came out of that conversation, but in a
follow-up to the post linked above, Ulf suggested an idea that might be
worth exploring.

--steve

Richard Carlsson

Aug 25, 2012, 4:03:02 PM

to Steve Vinoski, erlang-questions, erlang-bugs

Yes, I had seen that. (It was pretty much the only thing Google
came up with for this particular topic.) But the key point that was
missing from that discussion is that, ironically, it is the act of
preparing to write a crash dump that ends up killing the system before
it can write the crash dump.

It would be great if there were a simple way of figuring out the file
descriptor numbers used for EPMD and/or heart from the C code. Then it
would be easy to fix this. One possibility is to add a new BIF that
stores the current EPMD port in a C variable. Then the loop that closes
all file descriptors could be replaced with a single close.
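
A sketch of that proposal, in which epmd_fd, record_epmd_fd() and drop_epmd_connection() are all hypothetical names standing in for the proposed BIF and its C-side state. The descriptor is remembered when the node registers with EPMD, and only that one is closed before the dump is written.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <unistd.h>

/* Hypothetical: descriptor of the EPMD connection, stored by the
 * proposed BIF when the node registers. -1 means "unknown". */
static int epmd_fd = -1;

/* Hypothetical hook that the proposed BIF would call. */
void record_epmd_fd(int fd)
{
    epmd_fd = fd;
}

/* Instead of closing every descriptor, close only the EPMD one, so
 * EPMD frees the node name while the heart pipe stays connected.
 * Returns close()'s result, or -1 if no descriptor was recorded. */
int drop_epmd_connection(void)
{
    if (epmd_fd < 0)
        return -1;
    int rc = close(epmd_fd);
    epmd_fd = -1;   /* never close the same number twice */
    return rc;
}
```

One open design question is what to do when nothing was recorded: falling back to the old close-everything loop would bring the heart problem back.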

/Richard
