heart prevents beam from creating crash dumps
From: Richard Carlsson <carlsson.rich...@gmail.com>
Date: Sat, 25 Aug 2012 21:39:05 +0200
Subject: [erlang-questions] heart prevents beam from creating crash dumps
We have had a long-standing problem with not getting any Erlang crash
dumps at all on our live servers. I finally figured out why it happens.
I have already reported this to the OTP folks, but I thought I should
send a summary to the mailing lists for documentation and to give people
a heads-up.

The problem occurs when you start Erlang with the -heart flag
(http://www.erlang.org/doc/man/heart.html). This spawns a small external
C program connected through a port. From Erlang's point of view it's
like any other port program. The heart program pings the Erlang side
every now and then, and if it gets no reply within HEART_BEAT_TIMEOUT
seconds, or if the connection to Erlang breaks, it assumes the Beam
process has gone bad and kills it off with a SIGKILL, and then restarts
Erlang using whatever HEART_COMMAND is set to. So far so good.
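
As a rough illustration of the watchdog decision described above (not the
actual heart source), here is a minimal C sketch. The names wait_for_reply
and peer_state are hypothetical, and the timeout is given in milliseconds
rather than being read from HEART_BEAT_TIMEOUT. It distinguishes three
cases: a reply arrived, nothing arrived in time, or the connection broke.

```c
#include <poll.h>
#include <unistd.h>

/* Hypothetical result type for this sketch; not taken from heart.c. */
enum peer_state { PEER_ALIVE, PEER_TIMED_OUT, PEER_GONE };

/* Wait up to timeout_ms for the peer to send something on fd.
 * A timeout and a broken connection are both reasons for a watchdog
 * to kill and restart the peer. */
enum peer_state wait_for_reply(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN, .revents = 0 };
    int ready = poll(&p, 1, timeout_ms);

    if (ready == 0)
        return PEER_TIMED_OUT;      /* nothing arrived within the timeout */
    if (ready < 0)
        return PEER_GONE;           /* treat poll errors as a dead peer */

    char buf[64];
    ssize_t n = read(fd, buf, sizeof buf);
    /* read() returning 0 means every write end of the pipe was closed. */
    return n > 0 ? PEER_ALIVE : PEER_GONE;
}
```

In this sketch a watchdog would treat both PEER_TIMED_OUT and PEER_GONE as
a signal to kill and restart the node.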

Normally, when Beam detects a critical situation (e.g., out of memory)
and decides to shut down, it will create an erl_crash.dump file (or
whatever ERL_CRASH_DUMP is set to). This information can greatly help
in figuring out what went wrong. But if the system that crashed was large,
the crash dump file can take quite a long time to create. In order to
make it possible to restart the node (reusing the node name) while the
old defunct system is still writing the crash dump, Beam wants to drop
its connection to the EPMD service before it starts writing the dump,
making it look like the old node has disappeared.

The code that does this is the function prepare_crash_dump() in
erts/emulator/sys/unix/sys.c. The problem from the perspective of the C
code is that the connection to EPMD is on some unknown file descriptor
(just like heart, this has been started as a port from Erlang code). The
solution they chose, and which has been part of the OTP system for
years, is to close _all_ file descriptors except 0-2. This certainly has
the desired effect that EPMD releases the node name for reuse. But it
also, when the loop gets to file descriptor 10 or thereabouts (probably
depending on your system), has the effect of breaking the connection to
the heart program.
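
The sequence of events can be reproduced in miniature with the C sketch
below. It is only a model under stated assumptions, not the code from
sys.c: the function names are hypothetical, a plain pipe stands in for the
heart port, the parent process plays the role of heart, and the descriptor
loop is capped at 1024 to keep the demo fast.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the "close everything except 0-2" step. */
static void close_all_but_stdio(long max_fd)
{
    for (long fd = 3; fd < max_fd; fd++)
        (void) close((int) fd);     /* errors on unused slots are ignored */
}

/* Returns 1 if the watchdog saw the pipe break and the "node" then died
 * from SIGKILL, 0 if not, and -1 if the setup itself failed. */
int simulate_heart_race(void)
{
    int link[2];
    if (pipe(link) != 0)
        return -1;

    pid_t node = fork();
    if (node < 0) {
        close(link[0]);
        close(link[1]);
        return -1;
    }

    if (node == 0) {
        /* Child: the crashing node. It holds the write end of the pipe. */
        close(link[0]);
        close_all_but_stdio(1024);  /* also closes the link to the watchdog */
        for (;;)
            pause();                /* stands in for writing a large dump */
    }

    /* Parent: the watchdog. It wakes up as soon as the pipe breaks. */
    close(link[1]);
    char byte;
    ssize_t n = read(link[0], &byte, 1);   /* 0 means EOF: peer closed */
    close(link[0]);

    kill(node, SIGKILL);            /* "for good measure" */
    int status = 0;
    if (waitpid(node, &status, 0) != node)
        return -1;

    return n == 0 && WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

The child never gets past pause(), which mirrors a node that is killed
before it has written anything to the crash dump file.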

In these multicore days, the effect is almost instantaneous. The heart
program immediately wakes up due to the broken pipe and sends SIGKILL to
Beam for good measure, to make sure it's really gone, and then it starts
a new Erlang node. Meanwhile, the old node is still busy closing file
descriptors. Sometimes it makes it as far as 12 before SIGKILL arrives.
The poor thing never has a chance to even open the crash dump file for
writing. And your operations people only see a weird restart without any
further clues.

I don't have a good solution right now, except "don't use -heart". And
it might be that one wants to separate the automatic restarting of a
crashed node from the automatic killing of an unresponsive node anyway.
Suggestions are welcome.

     /Richard
_______________________________________________
erlang-questions mailing list
erlang-questi...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions


 
Discussion subject changed to "[erlang-bugs] heart prevents beam from creating crash dumps" by Steve Vinoski
From: Steve Vinoski <vino...@ieee.org>
Date: Sat, 25 Aug 2012 15:48:42 -0400
Subject: Re: [erlang-questions] [erlang-bugs] heart prevents beam from creating crash dumps
On Sat, Aug 25, 2012 at 3:39 PM, Richard Carlsson

Hi Richard, I hit this problem a few years ago. Here's the thread
starting from where I posted a temporary solution:

http://erlang.org/pipermail/erlang-questions/2010-August/052970.html

Unfortunately no patches came out of that conversation, but Ulf had an
idea that might be worth exploring in a followup to the post linked
above.

--steve


 
From: Richard Carlsson <carlsson.rich...@gmail.com>
Date: Sat, 25 Aug 2012 22:03:02 +0200
Subject: Re: [erlang-questions] [erlang-bugs] heart prevents beam from creating crash dumps
On 08/25/2012 09:48 PM, Steve Vinoski wrote:

Yes, I had seen that. (It was pretty much the only thing that Google
came up with for this particular topic.) But the key point that was
missing from that discussion was that it ironically enough is the act of
preparing to write a crash dump that ends up killing the system before
it can write the crash dump.

It would be great if there was a way of simply figuring out the file
descriptor numbers used for EPMD and/or heart from the C code. Then it
would be easy to fix this. One possibility is to add a new BIF that
stores the current EPMD port in a C variable. Then the loop that closes
all file descriptors could be replaced with a single close.
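
A minimal C sketch of that idea follows. Everything in it is an
assumption made for illustration: the variable epmd_fd and the functions
remember_epmd_fd, release_epmd_connection and fd_is_open are hypothetical
names, not existing ERTS APIs, and how the descriptor number would reach
the C side (for example through a new BIF) is left open.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical: descriptor of the EPMD connection, -1 if unknown. */
static int epmd_fd = -1;

/* Would be called once the EPMD connection has been set up. */
void remember_epmd_fd(int fd)
{
    epmd_fd = fd;
}

/* Replacement for the close-everything loop: drop only the EPMD
 * connection so the node name is released, and leave every other
 * descriptor (including the heart port) untouched.
 * Returns 0 on success or if there is nothing to close. */
int release_epmd_connection(void)
{
    if (epmd_fd < 0)
        return 0;
    int rc = close(epmd_fd);
    epmd_fd = -1;
    return rc;
}

/* Small helper for checking the effect: 1 if fd is open, 0 otherwise. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}
```

With this shape the crash dump preparation would call
release_epmd_connection() once, and the pipe to heart would stay open
while the dump is written.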

     /Richard
