[erlang-questions] How to debug "Kernel pid terminated"

David Mercer

May 15, 2012, 4:48:24 PM

to erlang-q...@erlang.org

I have a distributed application that I run on a couple of nodes. I have had various problems where one node spontaneously decides another node is not available and starts up its own instance of the application, but this one is a first for me: One of my failover nodes exited after printing the following messages:

=ERROR REPORT==== 14-May-2012::19:43:24 ===
** Generic server dist_ac terminating
** Last message in was {internal_restart_appl,cron}
** When Server state == {state,
                         [{appl,cron,
                           {failover,cron_main@MWRD},
                           5000,
                           [cron_main@MWRD,
                            {cron_failover@MWRD,cron_failover@merced}],
                           [{cron_failover@MWRD,true}]}],
                         [],[],
                         [cron_failover@MWRD],
                         [cron],
                         [],[],[],[],[]}
** Reason for termination ==
** {{case_clause,
     {'EXIT',
      {timeout,
       {gen_server,call,
        [application_controller,which_applications]}}}},
    [{dist_ac,restart_appl,2,[{file,"dist_ac.erl"},{line,952}]},
     {dist_ac,handle_info,2,[{file,"dist_ac.erl"},{line,697}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}

=ERROR REPORT==== 14-May-2012::19:43:24 ===
server: clickon_backup_server
error: enoent
path: <<"\\\\ftp-corp2\\SFTP-MW\\70350\\Upload\\837">>

{error_logger,{{2012,5,14},{19,43,25}},std_info,[{application,kernel},{exited,shutdown},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_terminated,kernel,shutdown}"}

Kernel pid terminated (application_controller) ({application_terminated,kernel,shutdown})

Abnormal termination

I am guessing this node (cron_failover@MWRD) somehow lost contact with the main node (cron_main@MWRD) on the same host. I am not sure, however, why this would cause the whole Erlang node to crash. How would I go about debugging this? (1) What circumstances caused this node to lose contact with the other node on the same host? (2) What can I do to gracefully handle this situation?
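
For reference, the dist_ac state in the crash report looks like what the kernel `distributed` parameter would produce for a setup along the following lines. This is only a sketch reconstructed from that state: the `distributed` entry mirrors the application name, timeout and node list shown in the log, while the `sync_nodes_optional` and `sync_nodes_timeout` values are assumptions that do not come from the thread.

```erlang
%% sys.config sketch for cron_failover@MWRD (hypothetical reconstruction).
[{kernel,
  [%% cron normally runs on cron_main@MWRD. If that node goes down, the
   %% application is restarted on one of the two failover nodes after
   %% waiting 5000 ms. Nodes grouped in a tuple have equal priority.
   {distributed,
    [{cron, 5000,
      [cron_main@MWRD, {cron_failover@MWRD, cron_failover@merced}]}]},
   %% Assumed values, not taken from the thread: the other nodes to wait
   %% for at startup, and how long to wait for them (in milliseconds).
   {sync_nodes_optional, [cron_main@MWRD, cron_failover@merced]},
   {sync_nodes_timeout, 30000}]}].
```

As far as I understand the OTP documentation, every node involved needs the same `distributed` and `sync_nodes_timeout` values, so a mismatch between the nodes' config files is one thing worth ruling out.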

Here’s my thought process so far, which doesn’t really answer any of my questions:

1. The error message seems to point me to the case statement on line 952 of dist_ac.erl (restart_appl/2). This is a call to start_appl/3, which expects either {ok, _, _} or {error, _}, but not {'EXIT', …}, which is what it received.

2. Looking at start_appl/3, I doubt it is the keysearch which is throwing the EXIT, so I’m going to assume that it is the call to start_distributed/6.

3. I can continue down this rabbit hole, but I’m not sure how it will answer either of my questions.

Can someone who perhaps knows the workings of distributed applications better than I please give me a few pointers? Please advise. Thank you.
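
To make the shape of the termination reason concrete, here is a minimal, self-contained sketch of how a timed-out call can surface as a case_clause. It is not the dist_ac code: the module name and all function names are hypothetical, and `slow_call/0` merely exits with the same reason that appears in the crash report. The point is only that `catch` turns an exit into an `{'EXIT', Reason}` tuple, which then falls through a case that anticipates just the `ok` and `error` shapes.

```erlang
-module(case_clause_demo).
-export([run/0]).

%% Hypothetical stand-in for a call that times out. It exits with the
%% same reason a timed-out gen_server:call/2 produced in the log.
slow_call() ->
    exit({timeout,
          {gen_server, call, [application_controller, which_applications]}}).

%% A caller that only anticipates two result shapes. `catch` converts the
%% exit into {'EXIT', Reason}, which matches neither clause.
start(F) ->
    case catch F() of
        {ok, Result} -> {ok, Result};
        {error, Reason} -> {error, Reason}
    end.

%% Trap the resulting case_clause error so the unmatched value is visible.
run() ->
    try
        start(fun slow_call/0)
    catch
        error:{case_clause, Value} -> {unhandled, Value}
    end.
```

Calling `case_clause_demo:run()` returns the unmatched `{'EXIT', {timeout, ...}}` value wrapped in `{unhandled, ...}`. That suggests the question worth chasing is why application_controller did not answer `which_applications` in time, rather than the case statement itself.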

David Mercer

May 16, 2012, 1:00:46 PM

to erlang-q...@erlang.org

As a follow-up question, since I had a problem again overnight where the failover took over for the main, even though the main was still running: Are Erlang distributed applications not intended to be run on multiple nodes on the same host?

Anyone have any success doing this in production? I can get it to work, it just doesn’t seem to work long-term.

I guess I don’t often see any posts on this list about the built-in distributed application functionality of Erlang/OTP. Does anyone actually use it, or am I behind the times and I should be using some sort of custom system developed by the RabbitMQ folks or something? Just wondering, because it makes a really good demo when I show people; it just doesn’t seem to be working for me long-term.
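
One way to gather evidence on whether the nodes really lose contact with each other is to log node up/down events together with the reason. The following is a minimal sketch, assuming it is started on each node after distribution is up. The module name `node_watch` is hypothetical; `net_kernel:monitor_nodes/2` with the `nodedown_reason` option is the standard kernel call.

```erlang
-module(node_watch).
-export([start/0, describe/1]).

%% Spawn a process that subscribes to node up/down events and logs them.
start() ->
    spawn(fun() ->
        case net_kernel:monitor_nodes(true, [nodedown_reason]) of
            ok -> loop();
            Other -> exit({monitor_nodes_failed, Other})
        end
    end).

loop() ->
    receive
        Event ->
            error_logger:info_msg("~s~n", [describe(Event)]),
            loop()
    end.

%% Turn a monitor message into a readable line.
describe({nodeup, Node, _Info}) ->
    io_lib:format("node ~p is up", [Node]);
describe({nodedown, Node, Info}) ->
    Reason = proplists:get_value(nodedown_reason, Info, unknown),
    io_lib:format("node ~p is down, reason: ~p", [Node, Reason]);
describe(Other) ->
    io_lib:format("unexpected message: ~p", [Other]).
```

If the failover logs a nodedown for the main node shortly before it takes over, the reported reason (for example a tick timeout versus a closed connection) would help narrow down what happened.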

Cheers,

DBM

JD Bothma

May 16, 2012, 2:39:37 PM

to David Mercer, erlang-q...@erlang.org

Hi

I don't really have professional experience with Erlang yet, but for a
university project we looked into many levels of redundancy and fault
tolerance with Erlang. I also felt that too little of the intent
behind the fault tolerance tools was explicitly documented, but the
feeling I got was that there are many cases where one might be tempted
to run two nodes on one machine when that isn't really useful.

This is all my opinion and it's not battle-proven, but I've given
this a bit of thought on a big project :)

The idea I have is that, for redundancy, you can run two nodes on
different machines, and perhaps failover a distributed application
that way. Two nodes on the same machine only helps if the node
crashes. That can happen, but I think that's a more serious issue that
should be found and fixed theoretically rather than trying to do
failover. A second node on the machine might be affected anyway, since
the only VM crash I've seen has been when it ran out of memory - the
other nodes on the machine didn't have memory either. We ended up
having to do some explicit garbage collection, although the mistake
was using a process pool rather than spawning a process for each job.

Instead the processes in the application should be supervised
properly. Since your supervisors should only have one task -
supervision - it should be very easy to ensure they are "perfect". If
they crash, the supervisor above them can restart the subtree, etc.
The top supervisor should never need to crash since its job should be
the most simple - as long as the top supervisor is there, the
application stays up and functional units will be restarted if they
crash. As long as you don't crash the VM, the VM and the application
should stay up. If the application _does_ crash, you can set the
release to shut down when a critical application goes down, and the
next layer can kick in: the hot standby VM on another machine, or a
restarted VM using heartbeat, or perhaps your hot standby on the same
machine.
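
As a rough illustration of the layering described above, here is a minimal top-level supervisor sketch. Everything in it is hypothetical (module name, worker, restart limits); the worker is just a linked process that waits for messages, so that the example is self-contained.

```erlang
-module(top_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, start_worker/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

%% Hypothetical worker: a linked process that simply waits for messages.
start_worker() ->
    {ok, spawn_link(fun worker_loop/0)}.

worker_loop() ->
    receive _ -> worker_loop() end.

init([]) ->
    %% one_for_one: restart only the child that died.
    %% If more than 5 restarts happen within 10 seconds, the supervisor
    %% gives up and terminates, so the layer above it can react.
    RestartStrategy = {one_for_one, 5, 10},
    Worker = {worker, {?MODULE, start_worker, []},
              permanent, 5000, worker, [?MODULE]},
    {ok, {RestartStrategy, [Worker]}}.
```

The supervisor itself does nothing but supervise. If it ever gives up, an application started as `permanent` takes the whole node down with it, which is the point where a standby node or heartbeat would take over.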

If you have a hot standby machine+vm+distributed application, that can
then kick in. As to the nodes getting confused about whether the other
is online - I have no idea. I guess an intermittent connection is a
big problem for a distributed application, but I think we ended up not
using distributed applications - I have no experience of them.

Good luck!

JD

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

David Mercer

May 16, 2012, 3:43:06 PM

to JD Bothma, erlang-q...@erlang.org

On May 16, JD Bothma wrote:

> Two nodes on the same machine only helps if the node
> crashes. That can happen, but I think that's a more serious issue that
> should be found and fixed theoretically rather than trying to do
> failover.

As we saw here, one node crashed, the other didn't. I am trying to figure
out how to diagnose this crash.

My use case for multiple nodes on the same host is to assist in upgrades.
Rather than planning an upgrade release, it's much easier just to crash the
node and restart it. However, when we crash it, we would like to have the
failover take over. In this case, it's OK for the failover to be on the
same host. We do also have a third failover on a different host in case the
main host goes down.

Over time, a few low-priority jobs have also been added to the failover's
role, so we can't just turn it off. We could migrate them over to the main
and then shut off the failover for good, but my question now is whether
anyone sees any reason why application distribution shouldn't work in this
case.

Thanks.

Cheers,

DBM

David Mercer

May 21, 2012, 10:57:11 AM

to Rudolph van Graan, erlang-q...@erlang.org

On Monday, May 21, 2012, Rudolph van Graan wrote:

> I had a quick look at your crash below. If I have to guess, you are calling
> an external system in a gen_server's init function and this function is
> causing a crash. It seems as though this init is part of a supervision tree.
> My recommendation is that you move any code like this out of the inits that
> can crash the whole tree. I would probably have implemented a different
> process that managed the call to the external application and make it robust
> so that it can't crash when this condition occurs.

Thanks for the hint. I'm calling Richard Carlsson's file_monitor module in
init, so I'll delve into the source code there to see what it is doing.
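
The advice quoted above could look roughly like the following sketch: init returns immediately, and the call that can fail is made afterwards from handle_info, with a retry instead of a crash. This is not file_monitor's API. The module name, the `connect` message, the retry interval and the use of `file:list_dir/1` as the "external" call are all assumptions made for illustration.

```erlang
-module(deferred_init).
-behaviour(gen_server).
-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-define(RETRY_MS, 5000).  % assumed retry interval
-record(state, {path, ready = false}).

start_link(Path) ->
    gen_server:start_link(?MODULE, Path, []).

%% Return at once; the risky work is deferred to handle_info(connect, ...).
init(Path) ->
    self() ! connect,
    {ok, #state{path = Path}}.

%% Try to reach the directory. On failure, log and retry later instead of
%% crashing, so a missing share cannot take the supervision tree down.
handle_info(connect, State = #state{path = Path}) ->
    case file:list_dir(Path) of
        {ok, _Files} ->
            {noreply, State#state{ready = true}};
        {error, Reason} ->
            error_logger:warning_msg("~p unavailable (~p), will retry~n",
                                     [Path, Reason]),
            erlang:send_after(?RETRY_MS, self(), connect),
            {noreply, State}
    end;
handle_info(_Other, State) ->
    {noreply, State}.

%% Lets callers ask whether the directory has been reached yet.
handle_call(ready, _From, State) ->
    {reply, State#state.ready, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```

With a path that does not exist (such as the one in the enoent report), the server starts, answers `false` to `gen_server:call(Pid, ready)`, and keeps retrying in the background.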
