[erlang-questions] How to diagnose stuck Erlang node

362 views
Skip to first unread message

Kirill Zaborsky

unread,
Oct 26, 2011, 3:42:22 AM10/26/11
to Erlang-Questions (E-mail)
Recently we have found some problems with our Erlang application:
For some time system works ok (e.g. before today it run with no problems for at least 17 days). Then something happens and it "stucks". It does not repond to pings, http interface (mochiweb) gives no replies. The only thing that can be observed is standard "ALIVE" message sent to stdout every 15 minutes when there is no output to stdout. Messages from logs show nothing special before logging stops.
The only thing I could do is just kill the emulator. That gives me opportunity to restart the system but gives no additional information about the roots of the problem.
On JVM it's possible to get program thread dump (using QUIT signal) is there some ways to "manually" force Erlang emulator to produce crash dump without using erlang:halt/1?
Are there some other ways to diagnose this problem which I should take a look at?

Kind regars,
Kirill Zaborksy

Ahmed Omar

unread,
Oct 26, 2011, 4:19:44 AM10/26/11
to Kirill Zaborsky, Erlang-Questions (E-mail)
you can force it by: 
kill -USR1 $Pid

is your node running in embedded node? can you connect to the shell? are you able to send rpc calls from other nodes?

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions




--
Best Regards,
- Ahmed Omar
Follow me on twitter

Kirill Zaborsky

unread,
Oct 26, 2011, 4:51:45 AM10/26/11
to Ahmed Omar, Erlang-Questions (E-mail)
System uses rebar so it's run in embedded mode (runner was constructed rebar's nodetool).
Thanks for pointing to USR1 (didn't notice it in the docs).
With "does not respond to pings" I meant Erlang's net_adm:ping/1 so it looks like Erlang distribution mechanisms stopped working.
Sadly I haven't check epmd status before restarting the system. Looks like turning on epmd loggin my bring some details
Thanks.

Kind regars,
Kirill Zaborsky

2011/10/26 Ahmed Omar <spawn...@gmail.com>

Kirill Zaborsky

unread,
Oct 28, 2011, 4:10:05 AM10/28/11
to Erlang-Questions (E-mail)
Just 2 days passed and Erlang node got stuck once again.
This time I killed it with SIGUSR1 and received a crash dump.
Checking all the logs on host didn't bring any hints where the problem may be.
And in crash dump the only suspicious thing is that user_drv has message queue length equal to 7550. The program counter points to user_drv:server_loop/5 + 48 - is there any way to get info what instruction in the source code it corresponds to?
BTW crash dump viewer says that crash dump was truncated is there any way to get full crash dump?
The system is running R14B03 if it matters.
Any advices are welcomed.

Kind regards,
Kirill Zaborsky 

2011/10/26 Kirill Zaborsky <qri...@gmail.com>

Ahmed Omar

unread,
Oct 28, 2011, 4:30:17 AM10/28/11
to Kirill Zaborsky, Erlang-Questions (E-mail)
Are you able to expand message queue of user_drv process? That might give some information.
Did you check epmd status before dumping?

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Kirill Zaborsky

unread,
Oct 28, 2011, 4:47:30 AM10/28/11
to Ahmed Omar, Erlang-Questions (E-mail)
About message queue crash dump viewer says "The dump is truncated, no data available", so I've got no more infromation :-\
epmd -names showed the node running but I could not contact it.

Kind regards,
Kirill Zaborsky

2011/10/28 Ahmed Omar <spawn...@gmail.com>

Ahmed Omar

unread,
Oct 28, 2011, 7:23:12 AM10/28/11
to Kirill Zaborsky, Erlang-Questions (E-mail)
Maybe providing some information about what your application is doing might help?

Kirill Zaborsky

unread,
Oct 28, 2011, 8:30:55 AM10/28/11
to Ahmed Omar, Erlang-Questions (E-mail)
Nothing worth to mention: net_kernel just in waiting state. In dump there are only 3 processes not in waiting state: user_drv runng, and 2 processes in scheduled state in erlang:apply (1 in unicode:ml_map/3 and other in cl_consumer:consumer_wait/0 called from OSERL)

Kind regards,
Kirill Zaborsky

2011/10/28 Ahmed Omar <spawn...@gmail.com>
I saw a similar behavior but with rex process (rpc server) having a very long queue. For your case i wonder what's loading the user_drv process, but as you mentioned the crash dump was truncated.
Do you see in the crash dump anything about net_kernel process? 


On Fri, Oct 28, 2011 at 1:41 PM, Kirill Zaborsky <qri...@gmail.com> wrote:
It's a server to collect information from GPS trackers (sent by SMS or HTTP/GPRS), uses Yaws to get information by HTTP. Processed information goes to PostgreSQL (epgsql is used). Then this info could be shown with qooxdoo/mapserver, webinterface backend is done with mochiweb.
There is no custom NIFs of  C nodes so I'm not sure how could this application hang Erlang emulator.
On other Erlang system done by our company we have experienced similar behaviour when user was connected to the server shell (using ndetool's attach command i.e. to_erl program) and then ssh session was broken, Long message queue for user_drv may be connected to something like that. The problem is that for the system I'm trying to get a diagnose there was no any shell/console connected. So It's unclear what could trigger such problem.

Kirill Zaborsky

unread,
Nov 28, 2011, 1:44:42 AM11/28/11
to erlang-pr...@googlegroups.com, Erlang-Questions (E-mail)
Trying to fins any workaround to this "stuck node" scenario I've upgraded to R14B04 and turned on "heart".
But recently  the node once again stopped responding. And heart did not assume it to be stuck although I could not contact it.
I've tried to to get a crashdump with 'kill -USR1' but it appeared that once again crash dump was truncated. Does heart kills "dead" erlang node?
And the only thing that could be seen from the crash dump that the only running process was user_drv (just like in previous times) with program counter equal to "user_drv:server_loop/5 + 48". Is it possible to find out what exactly does it stands for?

P.S. It's quite strange to have a not much complex Erlang application with no way to find out what goes wrong or maybe I just don't see such ways?

Kind regards,
Kirill Zaborsky

Dennis Novikov

unread,
Nov 28, 2011, 4:47:36 AM11/28/11
to Kirill Zaborsky, Erlang-Questions (E-mail)
On Mon, 28 Nov 2011 08:44:42 +0200, Kirill Zaborsky <qri...@gmail.com>
wrote:

> Trying to fins any workaround to this "stuck node" scenario I've upgraded
> to R14B04 and turned on "heart".
> But recently the node once again stopped responding. And heart did not
> assume it to be stuck although I could not contact it.
> I've tried to to get a crashdump with 'kill -USR1' but it appeared that
> once again crash dump was truncated. Does heart kills "dead" erlang node?
> And the only thing that could be seen from the crash dump that the only
> running process was user_drv (just like in previous times) with program
> counter equal to "user_drv:server_loop/5 + 48". Is it possible to find
> out
> what exactly does it stands for?

Waiting on receive in that function. And you are observing this on a
32-bit VM.

--
WBR,
DN

Kirill Zaborsky

unread,
Nov 28, 2011, 5:01:17 AM11/28/11
to Dennis Novikov, Erlang-Questions (E-mail)
I'm using halfword emulator on 64bit Ubuntu Server
And the process state is not "waiting" but "running". Previous crash dumps show the same program counter value (and user_drv in running state)

Kind regards,
Kirill Zaborsky


2011/11/28 Dennis Novikov <dennis....@gmail.com>

dennis....@gmail.com

unread,
Nov 28, 2011, 6:51:36 AM11/28/11
to Kirill Zaborsky, Erlang-Questions (E-mail)
+48 does not point to an instruction start on a couple of 32-bit systems I
have access to, so I can not assist you further.

To get instructions dump named "user_drv.dis" in the beam process working
directory you can do

erts_debug:df(user_drv).

Happy bug-hunting.


On Mon, 28 Nov 2011 12:01:17 +0200, Kirill Zaborsky <qri...@gmail.com>
wrote:

dennis....@gmail.com

unread,
Nov 28, 2011, 6:54:27 AM11/28/11
to Kirill Zaborsky, Erlang-Questions (E-mail)
On Mon, 28 Nov 2011 13:51:36 +0200, <dennis....@gmail.com> wrote:

> +48 does not point to an instruction start on a couple of 32-bit systems

Sorry, typo: on 64-bit VMs.

On 32-bit it does point to an instruction start.

Kirill Zaborsky

unread,
Nov 28, 2011, 7:37:49 AM11/28/11
to dennis....@gmail.com, Erlang-Questions (E-mail)
Thanks, dennis,
I have created a crash dump on a test machine (using halfword emulator) and received user_drv in waiting state with
Program counter - 0x0000000002845f00 (user_drv:server_loop/5 + 48)
So it's on the same instruction (but not running)

Disassembly shows:
-----------------
0000000002845E80: i_func_info_IaaI 0 user_drv server_loop 5 
0000000002845EA8: allocate_init_tIy 7 5 y(0) 
0000000002845EC0: init_y y(1) 
0000000002845ED0: move2_xyxy x(4) y(2) x(3) y(3) 
0000000002845EE0: move2_xyxy x(2) y(4) x(1) y(5) 
0000000002845EF0: move_ry x(0) y(6) 
0000000002845F00: i_loop_rec_fr f(0000000002846C80) x(0) 
0000000002845F10: i_select_tuple_arity2_rfAfAf x(0) f(0000000002846C40) 2 f(0000000002845F40) 3 f(0000000002846418) 
0000000002845F40: i_get_tuple_element_rPx x(0) 0 x(1) 
.....
0000000002846C80: wait_f f(0000000002845F00) 
0000000002846C90: badmatch_r x(0) 
-----------------
So it's just a waiting loop. I don'see how the process could be running when the only ouput for some time was "ALIVE" messages every 15 minutes from run_erl.
Loooks like the only way to see what was going on is to get complete crash dump, but it was truncated by heart :-\

P.S. It's quite strange that crash dump shows +48

2011/11/28 <dennis....@gmail.com>

Ahmed Omar

unread,
Nov 28, 2011, 9:54:21 AM11/28/11
to Kirill Zaborsky, Erlang-Questions (E-mail)
Krill, 
How about using some dbg? 
Note : you need to be careful not to generate huge amount of debugging on your node that could actually kill your node. I don't know much about your system and how heavy the load on nodes is.

You can do something like : 
dbg:tracer(port, dbg:trace_port(file, "/tmp/usr_drv.dbg")).
dbg:p(user_drv, [m, c]).
dbg:tpl(user_drv, x).

This way you will have a file that contains tracing of all messages and local function calls done in process user_drv
When the node stuck we can see more details about what happened (note : watch for the file size too)

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Kirill Zaborsky

unread,
Nov 28, 2011, 10:06:43 AM11/28/11
to Ahmed Omar, Erlang-Questions (E-mail)
Thanks Ahmed for you advices but the problem is that I do no know what triggers the problem and when it may happen.
Putting all user_drv messages into a file may result it very huge file (at the moment I'm thinking about decreasing amount of info going to stdout). But maybe it will bring some more details, thanks once again.

Kind regards,
Kirill Zaborsky

2011/11/28 Ahmed Omar <spawn...@gmail.com>

Chandru

unread,
Nov 28, 2011, 10:22:17 AM11/28/11
to Kirill Zaborsky, Erlang-Questions (E-mail)
One option is to receive the trace messages into your own process running on another node perhaps and probably only keeping the last few (the definition of few being the size of hardware available to you).

dbg:tracer(port, dbg:trace_port(ip, PortNumber).

On the receiving side:
dbg:trace_client(ip, {HostName, PortNumber}, HandlerSpec).

cheers
Chandru

Kirill Zaborsky

unread,
Nov 28, 2011, 10:38:54 AM11/28/11
to Ahmed Omar, Erlang-Questions (E-mail)
Ahmed, minor question: what does 'x' mean in dbg:tpl(user_drv, x) ? From the documentation I do not see what this match specification mya mean.

Kind regards,
Kirill Zaborsky

2011/11/28 Ahmed Omar <spawn...@gmail.com>

Ahmed Omar

unread,
Nov 28, 2011, 11:13:17 AM11/28/11
to Kirill Zaborsky, Erlang-Questions (E-mail)
It's kind of cheating, as it's not documented.(AFAIK)
x is euqal to 
calling dbg:fun2ms(fun(_)-> exception_trace() end).  
which generates MatchSpec :  
[{'_',[],[{exception_trace}]}]

from documentation of match specs
exception_trace: Same as return_trace, plus; if the traced function exits due to an exception, anexception_from trace message is generated, whether the exception is caught or not.

example: 
37> dbg:tracer().                              
{ok,<0.85.0>}
38> dbg:tpl(calendar, x).                      
{ok,[{matched,nonode@nohost,39},{saved,x}]}
39> dbg:p(all,c).                              
{ok,[{matched,nonode@nohost,26}]}
40> calendar:local_time_to_universal_time(33).
(<0.83.0>) call calendar:local_time_to_universal_time(33)
(<0.83.0>) exception_from {calendar,local_time_to_universal_time,1} {error,badarg}
** exception error: bad argument
     in function  erlang:localtime_to_universaltime/2
        called as erlang:localtime_to_universaltime(33,undefined)
     in call from erlang:localtime_to_universaltime/1


I think better use the documented form :)
dbg:tpl(usr_drv, dbg:fun2ms(fun(_)-> exception_trace() end)).

Raimo Niskanen

unread,
Nov 28, 2011, 11:33:54 AM11/28/11
to erlang-q...@erlang.org
On Mon, Nov 28, 2011 at 05:13:17PM +0100, Ahmed Omar wrote:
> It's kind of cheating, as it's not documented.(AFAIK)
> x is euqal to
> calling dbg:fun2ms(fun(_)-> exception_trace() end).
> which generates MatchSpec :
> [{'_',[],[{exception_trace}]}]
>
> from documentation of match specs
> http://www.erlang.org/doc/apps/erts/match_spec.html
> *exception_trace:* Same as *return_trace*, plus; if the traced function

> exits due to an exception, anexception_from trace message is generated,
> whether the exception is caught or not.

Well, if you try dbg:ltp() you will see all your saved match specs
along with the "predefined" match specs, and this is actually
documented in dbg.

And, the predefined match specs will most probably not be removed
nor changed. Not without very good reason.

So it is not cheating.

In a recent system you will also find predefined match specs
with the short abbreviations 'c' and 'cx'. Have a look.

> >> @spawn_think <http://twitter.com/#!/spawn_think>


> >>
> >>
> >
>
>
> --
> Best Regards,
> - Ahmed Omar
> http://nl.linkedin.com/in/adiaa
> Follow me on twitter

> @spawn_think <http://twitter.com/#!/spawn_think>

> _______________________________________________
> erlang-questions mailing list
> erlang-q...@erlang.org
> http://erlang.org/mailman/listinfo/erlang-questions


--

/ Raimo Niskanen, Erlang/OTP, Ericsson AB

Ahmed Omar

unread,
Nov 28, 2011, 1:54:14 PM11/28/11
to erlang-q...@erlang.org
That's nice to know, thanks Raimo :)
Reply all
Reply to author
Forward
0 new messages