I have a distributed application that I run on a couple of nodes. I have had various problems where one node spontaneously decides another node is not available and starts up its own instance of the application, but this one is a first for me: One of my failover nodes exited after printing the following messages:
=ERROR REPORT==== 14-May-2012::19:43:24 ===
** Generic server dist_ac terminating
** Last message in was {internal_restart_appl,cron}
** When Server state == {state,
                         [{appl,cron,
                           {failover,cron_main@MWRD},
                           5000,
                           [cron_main@MWRD,
                            {cron_failover@MWRD,cron_failover@merced}],
                           [{cron_failover@MWRD,true}]}],
                         [],[],
                         [cron_failover@MWRD],
                         [cron],
                         [],[],[],[],[]}
** Reason for termination ==
** {{case_clause,
     {'EXIT',
      {timeout,
       {gen_server,call,
        [application_controller,which_applications]}}}},
    [{dist_ac,restart_appl,2,[{file,"dist_ac.erl"},{line,952}]},
     {dist_ac,handle_info,2,[{file,"dist_ac.erl"},{line,697}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,597}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}
=ERROR REPORT==== 14-May-2012::19:43:24 ===
server: clickon_backup_server
error: enoent
path: <<"\\\\ftp-corp2\\SFTP-MW\\70350\\Upload\\837">>
{error_logger,{{2012,5,14},{19,43,25}},std_info,[{application,kernel},{exited,shutdown},{type,permanent}]}
{"Kernel pid terminated",application_controller,"{application_terminated,kernel,shutdown}"}
Kernel pid terminated (application_controller) ({application_terminated,kernel,shutdown})
Abnormal termination
I am guessing this node (cron_failover@MWRD) somehow lost contact with the main node (cron_main@MWRD) on the same host. I am not sure, however, why this would cause the whole Erlang node to crash. How would I go about debugging this? (1) What circumstances caused this node to lose contact with the other node on the same host? (2) What can I do to gracefully handle this situation?
Here’s my thought process so far, which doesn’t really answer any of my questions:
1. The error message points to the case statement on line 952 of dist_ac.erl (restart_appl/2). That case wraps a call to start_appl/3 and matches either {ok, _, _} or {error, _}, but not {'EXIT', …}, which is what it received.
2. Looking at start_appl/3, I doubt the keysearch is what produces the EXIT, so I’m going to assume it comes from the call to start_distributed/6.
3. I can continue down this rabbit hole, but I’m not sure how it will answer either of my questions.
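A minimal sketch of how a value shaped like the one in the report can arise, assuming the failing call is evaluated under a catch somewhere along that path (an assumption, not a reading of the dist_ac source). The module name case_clause_demo, the silent process, and the 100 ms timeout are hypothetical and exist only for illustration. Note that the argument list in the report, [application_controller,which_applications], has no timeout element, which is consistent with gen_server:call/2 and its default 5-second timeout.

```erlang
-module(case_clause_demo).
-export([run/0]).

%% Hypothetical demo: reproduces the *shape* of the failure, not dist_ac itself.
run() ->
    %% A process that never replies, so the call below times out.
    Pid = spawn(fun() -> receive stop -> ok end end),
    %% `catch` turns the timeout exit into an ordinary {'EXIT', Reason} term.
    Result = (catch gen_server:call(Pid, which_applications, 100)),
    exit(Pid, kill),
    try
        %% Same two shapes as described in point 1 above.
        case Result of
            {ok, _, _} -> started;
            {error, _} -> failed
        end
    catch
        %% Neither clause matches {'EXIT', ...}, so a case_clause error is raised.
        error:{case_clause, Unmatched} -> {unmatched, Unmatched}
    end.
```

If this is what happened, the case_clause is only a symptom. The underlying event would be that application_controller did not answer which_applications within the call timeout.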
Can someone who knows the workings of distributed applications better than I do please give me a few pointers? Thank you.
David Mercer
As a follow-up question, since I had a problem again overnight where the failover took over for the main, even though the main was still running: Are Erlang distributed applications not intended to be run on multiple nodes on the same host?
Has anyone had any success doing this in production? I can get it to work; it just doesn’t seem to keep working long-term.
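For reference, the dist_ac state in the error report implies a kernel configuration roughly like the sketch below. It is reconstructed from the state dump, not copied from the actual config file. The sync_nodes_optional and sync_nodes_timeout entries, and their values, are assumptions added for illustration, written as they might appear on cron_main@MWRD.

```erlang
%% sys.config sketch reconstructed from the dist_ac state in the error report.
[{kernel,
  [%% From the state dump: application cron, 5000 ms restart time,
   %% cron_main@MWRD first, then the two failover nodes at equal priority.
   {distributed,
    [{cron, 5000,
      [cron_main@MWRD,
       {cron_failover@MWRD, cron_failover@merced}]}]},
   %% ASSUMPTION: the real values are not shown anywhere in the report.
   {sync_nodes_optional, [cron_failover@MWRD, cron_failover@merced]},
   {sync_nodes_timeout, 30000}]}].
```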
I don’t often see posts on this list about the built-in distributed application functionality of Erlang/OTP. Does anyone actually use it, or am I behind the times, and should I be using some sort of custom system developed by the RabbitMQ folks or something? Just wondering, because it makes a really good demo when I show people; it just doesn’t seem to be working for me long-term.
Cheers,
DBM