[erlang-questions] Investigate an infinite loop on production servers

Morgan Segalis

unread,

May 22, 2013, 10:00:07 PM5/22/13

to erlang-q...@erlang.org

Hello everyone,

I'm having a bit of an issue with my production servers.

At some point, it seems to enter into an infinite loop that I can't find, or reproduce by myself on the tests servers.

The bug appear completely random, 1 hour, or 10 hour after restarting the Erlang node.

The loop will eat up all my server's memory in no time, and freeze completely the Erlang node without crashing it. (most of the time)

One time I got an crash dump, and tried to investigate it with cdv, but I didn't get much informations about which process or module was eating up all the memory.

I just know that it crashed because of the crash message : "eheap_alloc: Cannot allocate 6801972448 bytes of memory (of type "heap")."

I'm surely too new to Erlang to investigate something like this with cdv, I really would like some pointers on how I can understand this problem and fix it asap.

If you need any informations about the crash dump, let me know what you need, I'll copy/paste…

I'm using Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:true]

Thank you all for your help !

Bob Ippolito

unread,

May 22, 2013, 10:21:44 PM5/22/13

to Morgan Segalis, erlang-questions

This kind of thing tends to happen when you continuously send messages to a process faster than it can handle them. The most common case that I've seen this is where you have a lot of processes communicating with a single gen_server process. If your server has swap enabled, this may appear to make the node "freeze completely but not crash".

In the past I've diagnosed this by monitoring the message_queue_len of registered processes, but I'm sure there are tools that can help do this for you.

_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Morgan Segalis

unread,

May 22, 2013, 10:41:50 PM5/22/13

to Bob Ippolito, erlang-questions

Hi,

Generally, when a module is critical, and a lot solicited, I create a pg2 "pool" of supervised gen_server that will join the group, and get the pid using get_closest_pid in order to have multiple process.

Further more, The server has 16GB of RAM, and when the server is starting to get crazy, it only has 1,5 GB tops of RAM used. it really needs to go crazy a long time before using swap I guess, but I don't see it until another node cluster is telling me that the freeze node is timeout.

However, since we are at it… I may have find something really weird looking at my crash dump.

I'm using the emysql application.

My initialization of the emysql application is pretty basic :

application:start(emysql),

emysql:add_pool(my_db,

30,

"login",

"password",

"my.db-host.com",

3306,

"table",

latin1)

Has you can see, I only have 30 connections asked in the pool. However in the crash dump here's what I have found in the fun table :

Module Uniq Index Address Native_address Refc

emysql_util 8432855 1 0x00007f1d4f9f6f00 3476

emysql_util 8432855 0 0x00007f1d4f9f7218 3476

emysql_util 8432855 3 0x00007f1d4f9f6e48 2

emysql_util 8432855 2 0x00007f1d4f9f6ea8 1

emysql 79898780 0 0x00007f1d4f9b56f8 841

Is that something normal to have with only 30 connections in one pool ?

Thank you all.

Morgan Segalis

unread,

May 22, 2013, 11:26:03 PM5/22/13

to Yogish Baliga, erlang-q...@erlang.org

Hi,

Unfortunately, I'm not using RabbitMQ.

as for the garbage collection, Since the server is having 16GB of RAM, and the erlang node is using 1,5 GB when everything works fine, garbage collection should not be here an issue…

Le 23 mai 2013 à 05:14, Yogish Baliga <bal...@gmail.com> a écrit :

I saw this message in our RabbitMQ server. Investigation lead me to the garbage collection. It happened only once. After restart everything seems to be fine.

Dmitry Kolesnikov

unread,

May 23, 2013, 1:13:44 AM5/23/13

to Morgan Segalis, erlang-questions

Hello,

I would aree with Bob about most probable root cause. You can use entop to check message queue length and used memory per-process.

Best Regards,

Dmitry >-|-|-*>

Vance Shipley

unread,

May 23, 2013, 1:38:34 AM5/23/13

to Morgan Segalis, erlang-q...@erlang.org

On Thu, May 23, 2013 at 04:00:07AM +0200, Morgan Segalis wrote:
} I'm having a bit of an issue with my production servers.

You will find that etop is your friend:

http://www.erlang.org/doc/apps/observer/etop_ug.html

Run etop from the command line and sort on the column you're
interested in. To watch memory usage:

etop -node tiger@durin -sort memory

This will list the processes by memory size in decreasing order.
This shows you the memory hogs. Watch it as it starts to get
into trouble and you should see where the memory is getting used.

As Bob points out the most common problem is that a process's
inbox will start to fill up. Once this starts happening it's
the beginning of the end. Another process may start eating up
memory and the node may crash because it has requested more than
is available bt the root cause was that one process not having
time to service the messages at the rate they are received.

To watch for message queue lengths:

etop -node tiger@durin -sort msg_q

The above will list the processes in decreasing order of inbox
size. They should all be zero, and sometimes one, normally. If
you have a problem you'll see one process stay at the top and it's
message queue length will start to grow over time.

--
-Vance

Morgan Segalis

unread,

May 23, 2013, 5:13:19 AM5/23/13

to Vance Shipley, erlang-q...@erlang.org

I have launch the etop on my computer monitoring the production server… hoping that I will see something wrong !

Thank you for your help so far (to All).

I'll come back to you as soon as I have more information with etop.

Morgan.

Morgan Segalis

unread,

May 23, 2013, 5:23:04 AM5/23/13

to Vance Shipley, erlang-q...@erlang.org

Apparently I'm monitoring my own node…

Does someone know how to monitor a external cluster node with etop ?

Morgan Segalis

unread,

May 23, 2013, 5:35:45 AM5/23/13

to Vance Shipley, erlang-q...@erlang.org

Nevermind I got it…

However I do not get a lot of information…

most of process is proc_lib:ini_p/5

Dmitry Kolesnikov

unread,

May 23, 2013, 5:50:11 AM5/23/13

to Morgan Segalis, erlang-questions@erlang.org Questions

which means that you are using proc_lib heavily...
Are those top process with reductions, message queue size or heap?

Try to connect into node and gather more info about those processes using
erlang:process_info(…) or sys:get_status(…)

- Dmitry

Morgan Segalis

unread,

May 23, 2013, 6:04:11 AM5/23/13

to Dmitry Kolesnikov, erlang-questions@erlang.org Questions

For more information here's what my erlang node is doing :

It is an instant messaging server, each client connected is a process spawn automatically by a supervisor…

Every process spawned is monitored and started by a supervisor…

I have made a little function a while back, getting all processes and removing the processes inited at the beginning…

Here's what it gives me when everything works fine :

Dict: {dict,16,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[[{{connector_serv,init,1},[connector_suprc42,connector,<0.42.0>]}|548]],
[],
[[{{supervisor,connector_sup,1},[connector,<0.42.0>]}|3],
[{{connector_serv,init,1},[connector_supssl,connector,<0.42.0>]}|1460],
[{{supervisor,casserl_sup,1},[connector,<0.42.0>]}|1],
[{{supervisor,pushiphone_sup,1},[connector,<0.42.0>]}|2],
[{{pushiphone,init,1},['pushiphone-lite',connector,<0.42.0>]}|3],
[{{supervisor,clientpool_sup,1},[connector,<0.42.0>]}|1]],
[],
[[{{clientpool,init,1},[clientpool_sup,connector,<0.42.0>]}|1],
[undefined|4]],
[],
[[{{supervisor,connector,1},[<0.42.0>]}|1],
[{{casserl_serv,init,1},[casserl_sup,connector,<0.42.0>]}|50]],
[],[],[],
[[{{connector_serv,init,1},[connector_suprc4,connector,<0.42.0>]}|472],
[{{ssl_connection,init,1},
[ssl_connection_sup,ssl_sup,<0.51.0>]}|
1366],
[{unknown,unknown}|3]],
[],[],
[[{{pushiphone,init,1},['pushiphone-full',connector,<0.42.0>]}|3]],
[],
[[{{pg2,init,1},[kernel_safe_sup,kernel_sup,<0.10.0>]}|1]]}}}
ok

Morgan Segalis

unread,

May 23, 2013, 6:24:47 AM5/23/13

to Dmitry Kolesnikov, erlang-questions@erlang.org Questions

When a client is connecting, yes, I'm using start_child.

The supervising strategy is : simple_one_for_one, temporary, worker

Morgan.

Le 23 mai 2013 à 12:13, Dmitry Kolesnikov <dmkole...@gmail.com> a écrit :

On May 23, 2013, at 1:04 PM, Morgan Segalis <mseg...@gmail.com> wrote:

Every process spawned is monitored and started by a supervisor…

Do you use start_child to spawn a new process? If so, do you clean it up?
What is supervising strategy?

- Dmitry

Dmitry Kolesnikov

unread,

May 23, 2013, 8:19:40 AM5/23/13

to Morgan Segalis, erlang-questions@erlang.org Questions

On May 23, 2013, at 1:04 PM, Morgan Segalis <mseg...@gmail.com> wrote:

I have made a little function a while back, getting all processes and removing the processes inited at the beginning…

Could you please elaborate on that? Why you are not satisfied with supervisor?

- Dmitry

Morgan Segalis

unread,

May 23, 2013, 8:25:22 AM5/23/13

to Dmitry Kolesnikov, erlang-questions@erlang.org Questions

No, I was talking about the function I made to investigate which processes I have created, which gives me this output :

I'm very satisfied with supervisor, and I don't think to have the expertise tweaking it...

Dmitry Kolesnikov

unread,

May 23, 2013, 8:39:06 AM5/23/13

to Morgan Segalis, erlang-questions@erlang.org Questions

Right, you do not have many processes. Same time you goes out of memory…

Unfortunately, I had no time play around with R16B at production…

Could it be some issue with SSL, I re-call there was some complains in the list?

I would use entop to spot the process that has either too much reductions, queue len or heap.

Once you know they pid you can dig more info about them using erlang:process_info(…) and/or sys:get:status(…)

BTW, What erlang:memory() says on you production node?

- Dmitry

Morgan Segalis

unread,

May 23, 2013, 8:46:08 AM5/23/13

to Dmitry Kolesnikov, erlang-questions@erlang.org Questions

So I should go back to R15B ?

erlang:memory() gives me

[{total,1525779584},

{processes,1272881427},

{processes_used,1272789743},

{system,252898157},

{atom,372217},

{atom_used,346096},

{binary,148093608},

{code,8274446},

{ets,1546832}]

But keep in mind that right now, there is no infinite loop, or memory issue at this exact time…

It will be more interesting to have that when the VM is asking for 14GB of memory, but when it does, the console is unresponsive, so I can't get anything then.

Vance Shipley

unread,

May 23, 2013, 10:30:23 AM5/23/13

to Morgan Segalis, Questions erlang-questions

Keep etop running and capture the output to a file (e.g. etop ... | tee stop.log). After it gets into trouble look back and see what was happening beforehand.

Morgan Segalis

unread,

May 23, 2013, 10:34:24 AM5/23/13

to Vance Shipley, Questions erlang-questions

Yeah that what I'm doing right now, but of course, when I'm monitoring it, it won't crash, only when I sleep !!

I get back to the Erlang list as soon as I have more informations about this.

Thank you all !

Morgan.

Morgan Segalis

unread,

May 23, 2013, 5:13:29 PM5/23/13

to Vance Shipley, Questions erlang-questions

Ok, it finally got into the infinite loop…

And of course, the node on which I was running etop could not give me anymore since it got disconnected from the production node.

So back to square one… no way to investigate correctly so far :-/

Morgan.

Dmitry Kolesnikov

unread,

May 23, 2013, 5:20:30 PM5/23/13

to Morgan Segalis, Questions erlang-questions

You system definitely leaking some resources :-/

- Check number of used FD(s) may be you exceeded limit there

- What was overall system memory / cpu utilisation before crash?

- Check how many connections you got before crash, may be you can reproduce it at dev

- Dmitry

Morgan Segalis

unread,

May 23, 2013, 5:30:10 PM5/23/13

to Dmitry Kolesnikov, Questions erlang-questions

Yeah you got that right ! leaking at a huge rate at some point !

- The number of Fd - I don't get close to the max

# cat /proc/sys/fs/file-nr

3264 0 6455368

- On the production server there is only the erlang node, no other service…

The beam.smp was through the roof at 300% CPU and 97% RAM

The weird thing is that it got there in a second, I was looking at it when it happens.

- It has happened with 2000 connections, 4000 connections, and 10000 connections… 5 min after start, 5hours after start.

I really can't find a pattern here…and I'm becoming a little desperate.

Thank you for your help again.

Morgan.

Morgan Segalis

unread,

May 24, 2013, 4:13:46 AM5/24/13

to Vance Shipley, Questions erlang-questions

The problem is that the VM freezes completely, it does not generate a crash dump

Is there a way to limit the memory that a VM can allocate, so the server is not overwhelmed in order to create a crash dump ?

Le 24 mai 2013 à 02:00, Vance Shipley <van...@motivity.ca> a écrit :

Have you analyzed the crash dump file with the crash dump viewer?

Dave Cottlehuber

unread,

May 24, 2013, 4:17:38 AM5/24/13

to erlang-questions

On 24 May 2013 10:13, Morgan Segalis <mseg...@gmail.com> wrote:
> The problem is that the VM freezes completely, it does not generate a crash
> dump
>
> Is there a way to limit the memory that a VM can allocate, so the server is
> not overwhelmed in order to create a crash dump ?

You could use ulimit -v perhaps, I never got ulimit to work right with
resident memory.

A+
Dave

Dmitry Kolesnikov

unread,

May 24, 2013, 4:31:17 AM5/24/13

to Morgan Segalis, Questions erlang-questions

Hello,

I am not aware of a single flag to limit the memory like in Java.

You can try to configure memory allocation

http://www.erlang.org/doc/man/erts_alloc.html

One of the freeze reason might be a huge crash_dump.

See the flags at bottom of page how to tune its behaviour

http://www.erlang.org/doc/man/erl.html

If you switch off a swap it helps to observe OOM.

Would you share to the list app you running applications?

application:which_applications()

- Dmitry

Morgan Segalis

unread,

May 24, 2013, 4:34:37 AM5/24/13

to Dmitry Kolesnikov, Questions erlang-questions

Thank you, I'll look into it.

Heres what application:which_application() gives me :

[{emysql,"Emysql - Erlang MySQL driver","0.2"},

{ssl,"Erlang/OTP SSL application","5.2.1"},

{public_key,"Public key infrastructure","0.18"},

{crypto,"CRYPTO version 2","2.3"},

{stdlib,"ERTS CXC 138 10","1.19.1"},

{kernel,"ERTS CXC 138 10","2.16.1"}]

nothing fancy has you can see...

Morgan Segalis

unread,

May 24, 2013, 4:35:48 AM5/24/13

to masked...@gmail.com, erlang-pr...@googlegroups.com, Questions erlang-questions

Thanks !

I'll try that…

will keep you posted.

Le 24 mai 2013 à 10:30, masked...@gmail.com a écrit :

You can run a little function that writes process info to files every N seconds. Like this:
F = fun(F2, T) -> Seconds=calendar:datetime_to_gregorian_seconds(calendar:now_to_local_time(now())), Fname=lists:flatten(io_lib:format("/tmp/f-~p", [Seconds])), [begin Info=process_info(Pid), Data=io_lib:format("~p:~n~p~n", [Pid, Info]), file:write_file(Fname, Data, [append]) end || Pid <- processes()], timer:sleep(T), F2(F2, T) end.

run it from console with F(F, 5000) and get a bunch of files in /tmp that probably can provide you something useful

Dmitry Kolesnikov

unread,

May 24, 2013, 5:06:03 AM5/24/13

to Morgan Segalis, Questions erlang-questions

Indeed the list of apps is simple, basically this is either ssl, emysql or your integration layer.

If you read a lot of data from mysql then that code might leak a memory by keeping ref to huge binaries.

E.g. you you do select * from xxx then that data is returned as set of binaries.

Whole binary hands in memory even if a first row is used. This is a one way to treat binaries received from somewhere:

case binary:referenced_byte_size(X) of

Large when Large > 2 * byte_size(X) ->

binary:copy(X);

_ ->

X

end

as long as I see emysql does not use binary:copy anywhere but of course they might dereference them in other way around.

If you could monitor a erlang:memory(binary) over a time then it might reveal the case.

BTW, is this specific to R16 or other release if so then some glitches at ssl?

- Dmitry

Vance Shipley

unread,

May 24, 2013, 5:05:43 AM5/24/13

to Morgan Segalis, Questions erlang-questions

Have you set the ERL_CRASH_DUMP_SECONDS environment variable?:
It won't create one unless you set it to a positive value. Set it to 60 or more to be sure it completes.

Vance Shipley

unread,

May 24, 2013, 5:27:43 AM5/24/13

to Morgan Segalis, Questions erlang-questions

It "freezes"? So it hasn't crashed at all. In that case you just need to be more patient and wait for it to either crash, and leave a crash dump, or output to etop. Possibly setting process priorities would help. Give the suspicious ones low priority.
If it's CPU resources which are being depleted you would want to observe which have the most reductions. Use stop to order by reductions and see who's the busiest.

Another way would be to run a debug emulator and interrupt it while it's frozen. Then inspect the backtrace to see what it has been doing.

Jesper Louis Andersen

unread,

May 24, 2013, 7:17:59 AM5/24/13

to Morgan Segalis, Questions erlang-questions

emysql… check your MySQL query log for queries with large return results. Emysql has no cursor-support, so you have to build the result in memory and this can *easily* blow up your heap space if you do a large select.

Jesper Louis Andersen
Erlang Solutions Ltd., Copenhagen

Matthias Lang

unread,

May 27, 2013, 6:50:26 PM5/27/13

to Morgan Segalis, erlang-q...@erlang.org

> On May 24, 2013, at 11:13 AM, Morgan Segalis <mseg...@gmail.com> wrote:

> > Is there a way to limit the memory that a VM can allocate, so the
> > server is not overwhelmed in order to create a crash dump ?

Sorry for the late input, I haven't been following this thread.

Here's how we limit Erlang's memory use (on linux):

static void set_limit_for_memory_consumption()
{
struct rlimit rlim;

if(getrlimit(RLIMIT_AS, &rlim))
{
die("getrlimit failed");
}
else
{
rlim.rlim_cur = MAX_MEM;
if(setrlimit(RLIMIT_AS, &rlim))
{
die("setrlimit failed");
}
syslog(LOG_INFO, "Limiting Erlang memory use to %u\n", MAX_MEM);
}
}

works well for us.

Matt

Matthias Lang

unread,

May 27, 2013, 7:01:29 PM5/27/13

to Morgan Segalis, erlang-q...@erlang.org

Quoting myself:

> Here's how we limit Erlang's memory use (on linux):

[C code cut, only really makes sense if you're starting Erlang from
some sort of heartbeat system]

If you just want to see this work from bash (again, on linux, but
probably works on many *nixes):

> erl
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:2:2] [async-threads:0] [hipe] [kernel-poll:false]
1> lists:seq(1, 123456789).
%% at this point, pressing ^C or ^G doesn't work for me, but 'kill' does.

>ulimit -v 1000000
~ >erl
Erlang R15B03 (erts-5.9.3.1) [source] [64-bit] [smp:2:2] [async-threads:0] [hipe] [kernel-poll:false]
1> lists:seq(1, 123456789).
%% a few seconds pass
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 306104640 bytes of memory (of type "heap").
Aborted

Reply all

Reply to author

Forward