looking for help debugging apparent memory leak

18 views
Skip to first unread message

Allan Streib

unread,
Jul 10, 2012, 12:10:45 PM7/10/12
to erlan...@googlegroups.com
Hi all,

I attended the Erlang camp in Chicago a couple of years ago. I am
finally getting a chance to use some OTP "for real" so I'd consider
myself to be still pretty much at the beginner level. I am running into
a situation where one of my processes seems to be consuming memory.
Eventually I will get a crashdump with a message like:

eheap_alloc: Cannot allocate 50152200 bytes of memory (of type "heap")

Thought I'd throw this out to see if anyone can give me some pointers of
where to hunt for this. Some details of what I've found below.

The application, called stats_collect, has some processes which are
subscribers to AMQP (RabbitMQ) message queues, and forward some message
statistics to a named process (sc_db_serv) which passes them off to
worker processes (simple one-for-one) for storage into a database. This
sc_db_serv process is where the bulk of the memory growth is happening.
Some details...

Shortly after starting the application, memory() returns:

[{total,4967512},
{processes,1096856},
{processes_used,1089936},
{system,3870656},
{atom,354829},
{atom_used,352561},
{binary,107752},
{code,2538745},
{ets,163116}]

... and later:

[{total,51882592},
{processes,41724296},
{processes_used,41717376},
{system,10158296},
{atom,354829},
{atom_used,352594},
{binary,6388912},
{code,2544953},
{ets,163548}]


Running the i() command I found that sc_db_serv showed by far the most
heap growth. This is also supported by my inspection of the crash dump,
which showed the sc_db_serv process in a "Garbing" state.

Pid Initial Call Heap Reds Msgs
Registered Current Function Stack
<0.48.0> sc_db_serv:init/1 4181 29950 0
sc_db_serv gen_server:loop/6 9

(later)...

<0.48.0> sc_db_serv:init/1 10030440 3156134 0
sc_db_serv gen_server:loop/6 9


I'm not seeing anything accumulating in the message queue for sc_db_serv:

56> sys:get_status(sc_db_serv).
{status,<0.48.0>,
{module,gen_server},
[[{'$ancestors',[<0.47.0>,stats_collect_sup,<0.45.0>]},
{'$initial_call',{sc_db_serv,init,1}}],
running,<0.47.0>,[],
[{header,"Status for generic server sc_db_serv"},
{data,[{"Status",running},
{"Parent",<0.47.0>},
{"Logged events",[]}]},
{data,[{"State",
{state,3,<0.47.0>,<0.57.0>,[],
{[],[]},
[<0.49.0>,<0.53.0>,<0.55.0>]}}]}]]}
57> erlang:process_info(list_to_pid("<0.48.0>")).
[{registered_name,sc_db_serv},
{current_function,{gen_server,loop,6}},
{initial_call,{proc_lib,init_p,5}},
{status,waiting},
{message_queue_len,0},
{messages,[]},
{links,[<0.49.0>,<0.53.0>,<0.55.0>,<0.47.0>]},
{dictionary,[{'$ancestors',[<0.47.0>,stats_collect_sup,
<0.45.0>]},
{'$initial_call',{sc_db_serv,init,1}}]},
{trap_exit,true},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.44.0>},
{total_heap_size,10030440},
{heap_size,10030440},
{stack_size,9},
{reductions,3159648},
{garbage_collection,[{min_bin_vheap_size,46368},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,0}]},
{suspending,[]}]


Wondering what else I can do to find out what is causing this heap
growth. Thanks for any suggestions.

Allan
--
Allan Streib

Eric Merritt

unread,
Jul 10, 2012, 12:19:22 PM7/10/12
to erlan...@googlegroups.com
Allen,

the easiest to figure this out would probably to run dbg on that
module and see what calls are happening. I strongly suspect your
problem will be visible really quickly doing that.

Dbg use is detailed pretty reasonably here
stackoverflow.com/questions/6438041/how-to-debug-erlang-code

I have always found scott's helper functions very useful
http://www.snookles.com/erlang/user_default.erl

Eric

Allan Streib

unread,
Jul 10, 2012, 5:53:02 PM7/10/12
to erlan...@googlegroups.com
Thanks Eric,

The helper functions seem most convenient. I'll keep playing with it, but so far nothing is jumping out at me in the dbg call traces. Everything looks like what I would expect. State looks correct, no messages accumulating in mailboxes, etc. Is there any way to inpect the contents of the "heap" space? That's what seems to be growing.

http://stackoverflow.com/questions/5038459/memory-leak-observed-in-erlang-application

The above link is sort of similar to my scenario, though I'm not doing anything with an ETS table, or responding back to the originating process. My "sc_db_serv" process receives messages from the queue subscribers, and passes them along to workers which save information to a database. These messages contain JSON structures which include numerous short binaries, like

{struct,[{<<"username">>,<<"liulab">>},
{<<"queue">>,<<"batch">>},
{<<"tasks">>,1},
{<<"group">>,<<"faculty">>},
{<<"exit_status">>,null},
{<<"requested_mem">>,0},
{<<"eligibletime">>,1341953989},
{<<"start_time">>,1341956534},
{<<"jobid">>,34689},
{<<"project">>,null},
{<<"completion_time">>,1341963734},
{<<"step">>,0},
{<<"submit_time">>,1341953989},
{<<"nodelist">>,[<<"c15">>]},
{<<"filename">>,<<"20120710">>},
{<<"type">>,<<"start">>},
{<<"mem">>,0},
{<<"walltime">>,7200}]}

It's my understanding that small binaries like these are stored in the process heap not in the system binary area. I see both binary and heap usage growing in my node. So I suspect something is hanging on to these binaries, maybe as a result of them being passed around between different processes. On the other hand, I could be wrong :).

Allan

Garrett Smith

unread,
Jul 10, 2012, 6:01:02 PM7/10/12
to erlan...@googlegroups.com
Are you using the timer module? That uses private ets tables and can
eat up memory over time if not carefully managed.

In any case, you can use tv to peruse ets tables to take a quick look.

Are you using any modules/apps that use NIFs? E.g. your json library?

I'd also recommend posting this to the general Erlang questions list.
Reply all
Reply to author
Forward
0 new messages