We have had a long-standing problem with not getting any Erlang crash dumps at all on our live servers. I finally figured out why it happens. I have already reported this to the OTP folks, but I thought I should send a summary to the mailing lists for documentation and to give people a heads-up.
The problem occurs when you start Erlang with the -heart flag (http://www.erlang.org/doc/man/heart.html). This spawns a small external C program connected through a port. From Erlang's point of view it's like any other port program. The Erlang side sends the heart program a heartbeat every now and then, and if heart gets none within HEART_BEAT_TIMEOUT seconds, or if the connection to Erlang breaks, it assumes the Beam process has gone bad and kills it off with a SIGKILL, and then restarts Erlang using whatever HEART_COMMAND is set to. So far so good.
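To make the two failure conditions concrete, here is a minimal sketch of the watchdog side of such an arrangement: wait a bounded time for one byte from the other end of a pipe, and tell a timeout apart from a closed connection. This is not the code of the real heart program, whose protocol is message based; the names wait_for_heartbeat and wait_result are made up for illustration, and the one-byte heartbeat is an assumption.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical outcome of one wait on the heartbeat pipe. */
enum wait_result { GOT_HEARTBEAT, TIMED_OUT, PIPE_BROKEN };

/* Wait up to timeout_ms milliseconds for a single heartbeat byte on fd.
 * A timeout and a closed pipe are reported separately, since a watchdog
 * would react to both but for different reasons. */
enum wait_result wait_for_heartbeat(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);

    if (ready == 0)
        return TIMED_OUT;       /* nothing arrived in time */
    if (ready < 0)
        return PIPE_BROKEN;     /* sketch: treat any poll error as a lost peer */

    char byte;
    ssize_t n = read(fd, &byte, 1);
    /* read() returning 0 means every writer has closed its end. */
    return n == 1 ? GOT_HEARTBEAT : PIPE_BROKEN;
}
```

A watchdog loop built on this would call wait_for_heartbeat() repeatedly and kill and restart its peer on either TIMED_OUT or PIPE_BROKEN.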
Normally, when Beam detects a critical situation (e.g., out of memory) and decides to shut down, it will create an erl_crash.dump file (or whatever ERL_CRASH_DUMP is set to). This information can greatly help in figuring out what went wrong. But if the system that crashed was large, the crash dump file can take quite a long time to create. To make it possible to restart the node (reusing the node name) while the old defunct system is still writing the crash dump, Beam wants to drop its connection to the EPMD service before it starts writing the dump, making it look like the old node has disappeared.
The code that does this is the function prepare_crash_dump() in erts/emulator/sys/unix/sys.c. The problem from the perspective of the C code is that the connection to EPMD is on some unknown file descriptor (just like heart, this has been started as a port from Erlang code). The solution they chose, which has been part of the OTP system for years, is to close _all_ file descriptors except 0-2. This certainly has the desired effect that EPMD releases the node name for reuse. But when the loop gets to file descriptor 10 or thereabouts (probably depending on your system), it also breaks the connection to the heart program.
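For readers who have not seen this idiom, a rough sketch of a "close everything except 0-2" loop follows. It is an illustration only, not the actual body of prepare_crash_dump(); the function names and the max_fd parameter are assumptions of mine. The point is that the loop cannot tell the EPMD socket from the heart pipe: whatever sits on a descriptor above 2 gets closed.

```c
#include <fcntl.h>
#include <unistd.h>

/* Close every descriptor in the range [3, max_fd), leaving stdin, stdout
 * and stderr alone. Returns how many descriptors were actually open and
 * got closed. The loop has no idea what each descriptor is used for. */
int close_all_but_stdio(int max_fd)
{
    int closed = 0;
    for (int fd = 3; fd < max_fd; fd++) {
        if (close(fd) == 0)
            closed++;
    }
    return closed;
}

/* Helper for experiments: returns 1 if fd refers to an open descriptor. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}
```

One way to pick the upper bound is sysconf(_SC_OPEN_MAX), although that can be a large number on some systems.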
In these multicore days, the effect is almost instantaneous. The heart program immediately wakes up due to the broken pipe and sends SIGKILL to Beam for good measure, to make sure it's really gone, and then it starts a new Erlang node. Meanwhile, the old node is still busy closing file descriptors. Sometimes it makes it as far as 12 before SIGKILL arrives. The poor thing never has a chance to even open the crash dump file for writing. And your operations people only see a weird restart without any further clues.
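The sequence above can be reproduced in miniature with two processes and a pipe. In this sketch the parent plays the role of heart and the child plays the role of Beam; everything here (the function name, the single pipe, the sleep() standing in for a slow dump) is a simplification I made up for illustration, not code from OTP.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the "emulator" child was killed by SIGKILL before it got
 * to finish its pretend crash dump, 0 if it exited on its own, and -1 if
 * the experiment itself failed. */
int watchdog_kills_before_dump(void)
{
    int hb[2];                      /* hb[0]: watchdog reads, hb[1]: emulator writes */
    if (pipe(hb) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Emulator side: "prepare" for the dump by closing everything
         * above stderr, which includes both ends of the heartbeat pipe. */
        int top = hb[0] > hb[1] ? hb[0] : hb[1];
        for (int fd = 3; fd <= top; fd++)
            close(fd);
        sleep(5);                   /* stand-in for a slow crash dump write */
        _exit(0);                   /* only reached if nobody killed us */
    }

    /* Watchdog side: block until a byte arrives or the pipe is closed. */
    close(hb[1]);
    char byte;
    ssize_t n = read(hb[0], &byte, 1);
    close(hb[0]);
    if (n == 0)
        kill(pid, SIGKILL);         /* broken connection: assume the peer is bad */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

The child never sends a heartbeat, so the only thing that can wake the parent is the end-of-file caused by the child's own close loop, and the child is then killed during its "dump".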
I don't have a good solution right now, except "don't use -heart". And it might be that one wants to separate the automatic restarting of a crashed node from the automatic killing of an unresponsive node anyway. Suggestions are welcome.
Hi Richard, I hit this problem a few years ago. Here's the thread
starting from where I posted a temporary solution:
> On Sat, Aug 25, 2012 at 3:39 PM, Richard Carlsson
> <carlsson.rich...@gmail.com> wrote:
Yes, I had seen that. (It was pretty much the only thing that Google came up with for this particular topic.) But the key point missing from that discussion was that, ironically enough, it is the act of preparing to write a crash dump that ends up killing the system before it can write the dump.
It would be great if there were a way of simply figuring out the file descriptor numbers used for EPMD and/or heart from the C code. Then it would be easy to fix this. One possibility is to add a new BIF that stores the file descriptor of the current EPMD connection in a C variable. Then the loop that closes all file descriptors could be replaced with a single close.
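As a rough sketch of what I mean, assuming some BIF could hand the descriptor down to C: record the EPMD descriptor once, and at crash time close only that one. No such BIF or variable exists in OTP as far as this thread is concerned; epmd_fd, remember_epmd_fd and release_epmd_connection are hypothetical names.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical: descriptor of the EPMD connection, -1 if not recorded. */
static int epmd_fd = -1;

/* Would be called from the (hypothetical) BIF once the connection exists. */
void remember_epmd_fd(int fd)
{
    epmd_fd = fd;
}

/* Close only the recorded EPMD descriptor and leave everything else,
 * including the pipe to heart, untouched. Returns 0 on success and -1 if
 * nothing was recorded or close() failed. */
int release_epmd_connection(void)
{
    if (epmd_fd < 0)
        return -1;
    int rc = close(epmd_fd);
    epmd_fd = -1;
    return rc;
}

/* Helper for experiments: returns 1 if fd refers to an open descriptor. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}
```

With this, the crash dump preparation would call release_epmd_connection() instead of looping over all descriptors, so heart would see no broken pipe while the dump is being written.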