Hello,
If you ever suspect that the Erlang VM is blocking for some reason,
first make sure that the problem is not in Erlang space, e.g. a
process waiting for a message which never arrives or something like
that.
When you are sure it is the VM that is the problem, the most
informative thing to do (IMO) is to either attach gdb to the emulator
process or dump a core using kill -ABRT.
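As a rough sketch of the core-dump route (not something the emulator
ships with): the function names below are made up for illustration,
and the check assumes a Linux system where /proc/<pid>/limits exists.
Keep in mind that SIGABRT terminates the VM, and that no core is
written if the core size limit of the process is 0.

```shell
#!/bin/sh
# Sketch only: core_limit_ok and abort_for_core are hypothetical helpers.

# Succeeds if the limits file named by $1 (e.g. /proc/<pid>/limits on
# Linux) reports a soft "Max core file size" limit other than 0.
core_limit_ok() {
    awk '/^Max core file size/ { found = 1; ok = ($5 != "0") }
         END { exit (found && ok) ? 0 : 1 }' "$1"
}

# Send SIGABRT to the emulator ($1 = pid) so that the kernel writes a
# core file. This kills the VM, so refuse if no core would be written.
abort_for_core() {
    pid=$1
    if ! core_limit_ok "/proc/$pid/limits"; then
        echo "core size limit is 0 (or unknown) for pid $pid" >&2
        return 1
    fi
    kill -ABRT "$pid"
}
```

Call it as abort_for_core <pid of the emulator>. If the limit turns
out to be 0, it has to be raised (ulimit -c unlimited) in the shell
that starts the VM, before the VM is started.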
Once you have gdb attached, or have loaded the core into gdb, do:
(gdb) info threads
and then for each thread do:
(gdb) thread ${ThreadId}
(gdb) bt
This will give you a bunch of information about what the emulator is doing.
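If you would rather collect the same information in one go, gdb can
be run in batch mode. The following is only a sketch: the function
names and the temporary file are my own choices, and it uses gdb's
"thread apply all bt" instead of visiting each thread by hand.

```shell
#!/bin/sh
# Sketch only: the helper names below are illustrative.

# Write the gdb commands into the file named by $1.
write_bt_script() {
    printf '%s\n' \
        'set pagination off' \
        'info threads' \
        'thread apply all bt' > "$1"
}

# Attach to a running emulator ($1 = pid), print the backtrace of
# every thread, then let gdb exit, which detaches from the process.
dump_live_threads() {
    script=$(mktemp) || return 1
    write_bt_script "$script"
    gdb -batch -x "$script" -p "$1"
    status=$?
    rm -f "$script"
    return "$status"
}

# The same for a core file ($1 = emulator executable, $2 = core file).
dump_core_threads() {
    script=$(mktemp) || return 1
    write_bt_script "$script"
    gdb -batch -x "$script" "$1" "$2"
    status=$?
    rm -f "$script"
    return "$status"
}
```

For example, dump_live_threads <pid> > threads.txt gives you a file
which is easy to attach to a bug report.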
There are also a couple of tools which can help you debug specific
things within the emulator. For instance if you do
(gdb) source $ERL_TOP/erts/etc/unix/etp-commands
(gdb) etp-help
you get a list of helpful commands which can print all sorts of
interesting data. One example is etp-stacktrace, which, given a
Process *, will print the stack trace of that process, e.g.:
(gdb) bt
#0 0x00007ffff6aa19a8 in __GI___poll (fds=0x7ffff67bac08, nfds=2, timeout=<optimised out>) at ../sysdeps/unix/sysv/linux/poll.c:83
#1 0x000000000062c9b0 in check_fd_events (tv=0x7fffffffbf00, ps=0x7ffff67ba100, max_res=<optimised out>) at sys/common/erl_poll.c:1974
#2 erts_poll_wait_nkp (ps=0x7ffff67ba100, pr=0x7fffffffb700, len=0x7fffffffbf10, utvp=<optimised out>) at sys/common/erl_poll.c:2087
#3 0x000000000062f528 in erts_check_io_nkp (do_wait=<optimised out>) at sys/common/erl_check_io.c:1173
#4 0x00000000006259de in erl_sys_schedule (runnable=<optimised out>) at sys/unix/sys.c:2734
#5 0x0000000000551de5 in scheduler_wait (rq=0x7ffff687b080, esdp=0x7ffff687b2c0, fcalls=<synthetic pointer>) at beam/erl_process.c:2195
#6 schedule (p=<optimised out>, calls=<optimised out>) at beam/erl_process.c:6377
#7 0x00000000005c8867 in process_main () at beam/beam_emu.c:1229
#8 0x000000000051334c in erl_start (argc=10, argv=<optimised out>) at beam/erl_init.c:1493
#9 0x00000000004f55f9 in main (argc=<optimised out>, argv=<optimised out>) at sys/unix/erl_main.c:29
(gdb) f 7
#7 0x00000000005c8867 in process_main () at beam/beam_emu.c:1229
(gdb) etp-stacktrace c_p
% Stacktrace (22): <the non-value>.
#Cp<user:do_io_request/5+0x78>.
#Cp<user:server_loop/2+0x5a0>.
#Cp<user:catch_loop/3+0x90>.
#Cp<terminate process normally>.
(gdb) p c_p
$1 = (Process *) 0x7ffff7e87348
(gdb)
If the VM seems to block for a while and then continues to run, it
could be because all schedulers are hitting the same lock at the same
time. Use an emulator built with --enable-lock-counter [1] to figure
out which lock is causing the issue. gprof and oprofile can also be
very useful when used correctly, though their output is at times
quite hard to interpret.
If, when investigating, you find something that seems fishy, try to
limit the scope of the potential bug as much as you can. The more
specific you are in the description of the (mis)behaviour you are
experiencing, the more likely it is that we can help you.
Lukas
[1]: http://www.erlang.org/doc/apps/tools/lcnt_chapter.html