How to debug "Kernel pid terminated"
From: "David Mercer" <dmer...@gmail.com>
Date: Tue, 15 May 2012 15:48:24 -0500
Subject: [erlang-questions] How to debug "Kernel pid terminated"

I have a distributed application that I run on a couple of nodes.  I have
had various problems where one node spontaneously decides another node is
not available and starts up its own instance of the application, but this
one is a first for me: One of my failover nodes exited after printing the
following messages:

=ERROR REPORT==== 14-May-2012::19:43:24 ===
** Generic server dist_ac terminating
** Last message in was {internal_restart_appl,cron}
** When Server state == {state,
                            [{appl,cron,
                                 {failover,cron_main@MWRD},
                                 5000,
                                 [cron_main@MWRD,
                                  {cron_failover@MWRD,cron_failover@merced}],
                                 [{cron_failover@MWRD,true}]}],
                            [],[],
                            [cron_failover@MWRD],
                            [cron],
                            [],[],[],[],[]}
** Reason for termination ==
** {{case_clause,
        {'EXIT',
            {timeout,
                {gen_server,call,
                    [application_controller,which_applications]}}}},
    [{dist_ac,restart_appl,2,[{file,"dist_ac.erl"},{line,952}]},
     {dist_ac,handle_info,2,[{file,"dist_ac.erl"},{line,697}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}

=ERROR REPORT==== 14-May-2012::19:43:24 ===
    server: clickon_backup_server
    error: enoent
    path: <<"\\\\ftp-corp2\\SFTP-MW\\70350\\Upload\\837">>

{error_logger,{{2012,5,14},{19,43,25}},std_info,[{application,kernel},{exited,shutdown},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_terminated,kernel,shutdown}"}

Kernel pid terminated (application_controller) ({application_terminated,kernel,shutdown})

Abnormal termination

I am guessing this node (cron_failover@MWRD) somehow lost contact with the
main node (cron_main@MWRD) on the same host.  I am not sure, however, why
this would cause the whole Erlang node to crash.  How would I go about
debugging this?  (1) What circumstances caused this node to lose contact
with the other node on the same host?  (2) What can I do to gracefully
handle this situation?
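
For readers unfamiliar with the setup, the `appl` entry in the dist_ac state above appears to correspond to a kernel `distributed` setting along the following lines. This is only a sketch: the file name is hypothetical, the `distributed` tuple is read off the crash report, and the `sync_nodes_*` values are assumptions, since the thread does not show them.

```erlang
%% failover.config (hypothetical file name), for node cron_failover@MWRD.
[{kernel,
  [%% Read off the crash report: application cron, 5000 ms timeout,
   %% cron_main@MWRD first, then the two failover nodes with equal priority.
   {distributed,
    [{cron, 5000,
      [cron_main@MWRD, {cron_failover@MWRD, cron_failover@merced}]}]},
   %% Assumed values: the thread does not show these settings.
   {sync_nodes_optional, [cron_main@MWRD, cron_failover@merced]},
   {sync_nodes_timeout, 30000}]}].
```

The `distributed` value has to be identical on every participating node, while the `sync_nodes_*` lists name the other nodes, so they differ per node.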

Here's my thought process so far, which doesn't really answer any of my
questions:

1. The error message seems to point me to the case statement on line
   952 of dist_ac.erl (restart_appl/2).  This is a call to start_appl/3,
   which expects either {ok, _, _} or {error, _}, but not {'EXIT', ...},
   which is what it received.

2. Looking at start_appl/3, I doubt it is the keysearch which is
   throwing the EXIT, so I'm going to assume that it is the call to
   start_distributed/6.

3. I can continue down this rabbit hole, but I'm not sure how it will
   answer either of my questions.
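
The mechanism described in step 1 can be reproduced in isolation. The sketch below does not reproduce the dist_ac source; the module name, `strict/1`, and the unresponsive process are all hypothetical stand-ins. It only shows how a caught `gen_server:call` timeout becomes an `{'EXIT', ...}` tuple, and how a case expression that handles only `{ok, _, _}` and `{error, _}` then fails with `case_clause`.

```erlang
-module(case_clause_demo).
-export([run/0]).

%% Handles only the two shapes mentioned in the post; anything else
%% raises error:{case_clause, Value}.
strict(Result) ->
    case Result of
        {ok, _, _} -> ok;
        {error, _} -> error
    end.

run() ->
    %% A process that never replies, standing in for a busy
    %% application_controller (an assumption made for this demo).
    Pid = spawn(fun() -> receive stop -> ok end end),
    %% With `catch`, the timeout exit is turned into the value
    %% {'EXIT', {timeout, {gen_server, call, [Pid, which_applications, 50]}}}.
    Result = (catch gen_server:call(Pid, which_applications, 50)),
    Pid ! stop,
    try strict(Result)
    catch
        error:{case_clause, Value} -> {case_clause, Value}
    end.
```

Calling `case_clause_demo:run()` returns a `{case_clause, {'EXIT', {timeout, ...}}}` tuple with the same shape as the termination reason in the report, which suggests the root event was the 5-second default timeout of a call to application_controller rather than the case statement itself.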

Can someone who perhaps knows the workings of distributed applications
better than I please give me a few pointers?  Please advise.  Thank you.

David Mercer

_______________________________________________
erlang-questions mailing list
erlang-questi...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions


 
From: "David Mercer" <dmer...@gmail.com>
Date: Wed, 16 May 2012 12:00:46 -0500
Subject: Re: [erlang-questions] How to debug "Kernel pid terminated"

As a follow-up question, since I had a problem again overnight where the
failover took over for the main, even though the main was still running: Are
Erlang distributed applications not intended to be run on multiple nodes on
the same host?

Anyone have any success doing this in production?  I can get it to work, it
just doesn't seem to work long-term.

I guess I don't often see any posts on this list about the built-in
distributed application functionality of Erlang/OTP.  Does anyone actually
use it, or am I behind the times and I should be using some sort of custom
system developed by the RabbitMQ folks or something?  Just wondering,
because it makes a really good demo when I show people; it just doesn't seem
to be working for me long-term.

Cheers,

DBM


 
From: JD Bothma <jbot...@gmail.com>
Date: Wed, 16 May 2012 20:39:37 +0200
Subject: Re: [erlang-questions] How to debug "Kernel pid terminated"
Hi

I don't really have professional experience with Erlang yet, but for a
university project we looked into many levels of redundancy and fault
tolerance with Erlang. I also felt that too little of the intent behind
the fault tolerance tools was explicitly documented, but the feeling I
got was that there are many cases where one might want to run two
nodes on one machine even though that isn't really useful.

This is all just my opinion and it's not battle-proven, but I've given
this a bit of thought on a big project :)

The idea I have is that, for redundancy, you can run two nodes on
different machines, and perhaps failover a distributed application
that way. Two nodes on the same machine only helps if the node
crashes. That can happen, but I think that's a more serious issue that
should be found and fixed theoretically rather than trying to do
failover. A second node on the machine might be affected anyway, since
the only VM crash I've seen has been when it ran out of memory - the
other nodes on the machine didn't have memory either. We ended up
having to do some explicit garbage collection, although the mistake
was using a process pool rather than spawning a process for each job.

Instead the processes in the application should be supervised
properly. Since your supervisors should only have one task -
supervision - it should be very easy to ensure they are "perfect". If
they crash, the supervisor above them can restart the subtree, etc.
The top supervisor should never need to crash since its job should be
the most simple - as long as the top supervisor is there, the
application stays up and functional units will be restarted if they
crash. As long as you don't crash the VM, the VM and the application
should stay up. If the application _does_ crash, you can set the
release to shut down when a critical application goes down, and the
next layer can kick in: the hot standby VM on another machine, or a
restarted VM using heartbeat, or perhaps your hot standby on the same
machine.
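
As a rough illustration of the supervision idea above, here is a minimal sketch of a top supervisor that does nothing except restart one child. The module name, the child id, and the trivial worker loop are hypothetical; a real worker would normally be a gen_server.

```erlang
-module(restart_sketch).
-behaviour(supervisor).
-export([start_link/0, start_worker/0]).
-export([init/1]).

start_link() ->
    supervisor:start_link(?MODULE, []).

%% The supervisor only describes its children; it contains no other logic.
init([]) ->
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Child = #{id => worker,
              start => {?MODULE, start_worker, []},
              restart => permanent,
              shutdown => 5000,
              type => worker},
    {ok, {SupFlags, [Child]}}.

%% Hypothetical worker: a linked process that just waits for messages.
start_worker() ->
    {ok, proc_lib:spawn_link(fun worker_loop/0)}.

worker_loop() ->
    receive _ -> worker_loop() end.
```

If the worker dies, the supervisor starts a fresh one. Only when the worker dies more than 5 times within 10 seconds does the supervisor itself give up, which is the point at which the next layer of failover would have to take over.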

If you have a hot standby machine+vm+distributed application, that can
then kick in. As to the nodes getting confused about whether the other
is online - I have no idea. I guess an intermittent connection is a
big problem for a distributed application, but I think we ended up not
using distributed applications - I have no experience of them.

Good luck!

JD

From: "David Mercer" <dmer...@gmail.com>
Date: Wed, 16 May 2012 14:43:06 -0500
Subject: Re: [erlang-questions] How to debug "Kernel pid terminated"
On May 16, JD Bothma wrote:

> Two nodes on the same machine only helps if the node
> crashes. That can happen, but I think that's a more serious issue that
> should be found and fixed theoretically rather than trying to do
> failover.

As we saw here, one node crashed, the other didn't.  I am trying to figure
out how to diagnose this crash.
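
One way to gather evidence about when and why nodes lose sight of each other is to log node up/down events on each node. The sketch below is an assumption about how that could be done, not something from the thread; the module and function names are hypothetical, while `net_kernel:monitor_nodes/2` with the `nodedown_reason` option is a standard kernel call.

```erlang
-module(node_watch).
-export([start/0, describe/1]).

%% Start a process that logs every nodeup/nodedown event it sees.
start() ->
    spawn(fun() ->
        %% With an option list, events arrive as 3-tuples:
        %% {nodeup, Node, Info} and {nodedown, Node, Info}.
        net_kernel:monitor_nodes(true, [nodedown_reason]),
        loop()
    end).

loop() ->
    receive
        stop ->
            ok;
        Event when element(1, Event) =:= nodeup;
                   element(1, Event) =:= nodedown ->
            error_logger:info_msg("node event: ~p~n", [describe(Event)]),
            loop();
        _Other ->
            loop()
    end.

%% Reduce an event to {up | down, Node, Reason}.
describe({nodeup, Node, _Info}) ->
    {up, Node, undefined};
describe({nodedown, Node, Info}) ->
    {down, Node, proplists:get_value(nodedown_reason, Info, unknown)}.
```

Running `node_watch:start()` on both the main and the failover node would give timestamps and reasons (for example a missed net tick versus a closed connection) to compare against the moment the failover decided to take over.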

My use case for multiple nodes on the same host is to assist in upgrades.
Rather than planning an upgrade release, it's much easier just to crash the
node and restart it.  However, when we crash it, we would like to have the
failover take over.  In this case, it's OK for the failover to be on the
same host.  We do also have a third failover on a different host in case the
main host goes down.

Over time, a few low-priority jobs have also been added to the failover's
role, so we can't just turn it off.  We could migrate them over to the main
and then shut off the failover for good, but my question now is whether
anyone sees any reason why application distribution shouldn't work in this
case.

Thanks.

Cheers,

DBM

From: "David Mercer" <dmer...@gmail.com>
Date: Mon, 21 May 2012 09:57:11 -0500
Subject: Re: [erlang-questions] How to debug "Kernel pid terminated"

On Monday, May 21, 2012, Rudolph van Graan wrote:
> I had a quick look at your crash below. If I have to guess, you are calling
> an external system in a gen_server's init function and this function is
> causing a crash. It seems as though this init is part of a supervision tree.
> My recommendation is that you move any code like this out of the inits that
> can crash the whole tree. I would probably have implemented a different
> process that managed the call to the external application and make it robust
> so that it can't crash when this condition occurs.

Thanks for the hint.  I'm calling Richard Carlsson's file_monitor module in
init, so I'll delve into the source code there to see what it is doing.
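
The recommendation quoted above can be sketched as a gen_server whose init/1 returns immediately and defers the risky external access to a message it sends itself. This is a minimal sketch under assumptions: the module name and retry interval are hypothetical, and `file:list_dir/1` merely stands in for whatever file_monitor does with the path, which the thread does not show.

```erlang
-module(deferred_init).
-behaviour(gen_server).
-export([start_link/1, status/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(RETRY_MS, 5000).  % assumed retry interval

start_link(Path) ->
    gen_server:start_link(?MODULE, Path, []).

status(Pid) ->
    gen_server:call(Pid, status).

%% init/1 touches nothing external, so it cannot fail on a missing path
%% and take the supervision tree down during startup.
init(Path) ->
    self() ! setup,
    {ok, #{path => Path, status => pending}}.

handle_info(setup, State = #{path := Path}) ->
    case file:list_dir(Path) of
        {ok, _Files} ->
            {noreply, State#{status := ready}};
        {error, Reason} ->
            %% Record the problem and try again later instead of crashing.
            erlang:send_after(?RETRY_MS, self(), setup),
            {noreply, State#{status := {unavailable, Reason}}}
    end;
handle_info(_Other, State) ->
    {noreply, State}.

handle_call(status, _From, State = #{status := Status}) ->
    {reply, Status, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

With a path that does not exist, such as the unreachable share in the `enoent` report, the server stays up and `deferred_init:status(Pid)` reports `{unavailable, enoent}` until the path becomes reachable.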

Cheers,

DBM
