[erlang-questions] ETS and CPU

Alex Howle

Mar 15, 2016, 6:59:03 AM
to erlang-q...@erlang.org

I've been experiencing an issue and was wondering if anyone else has any experience in this area. I've stripped back the problem to its bare bones for the purposes of this mail.

 

I have an Erlang 18.1 application that uses ETS to store an Erlang map structure. Using erts_debug:flat_size/1 I can approximate the map's size to be 1MB. When the relevant activity trigger fires, the application spawns about 25 short-lived processes to perform the main work of the application. This trigger fires roughly 9 times a second under normal operating conditions. Each of these 25 processes performs one ets:lookup/2 call to read the map.

 

What I've found is that the above implementation has a CPU profile that is quite "expensive": each of the CPU cores (40 in total, from 2 processors with 10 hyperthreaded cores each) frequently runs at 100%. The machine in question also has 32GB RAM, of which about 9GB is used at peak. There is no swap usage whatsoever. Examination shows that copy_shallow is performing the most work.

 

After changing the implementation so that the 25 spawned processes no longer read the map from the ETS table and are instead passed the map on spawn, the CPU usage on the server is considerably lower.

 

Can anyone offer advice as to why I'm seeing the differing CPU profiles?

Sverker Eriksson

Mar 15, 2016, 7:32:34 AM
to Alex Howle, erlang-q...@erlang.org
Each successful ets:lookup call is a copy operation of the entire term
from ETS to the process heap.

If you are comparing ets:lookup of a big map
to sending a big map in a message, then I would expect
ets:lookup to win, as copy_shallow (used by ets:lookup)
is optimized to be faster than copy_struct (used by send).


/Sverker, Erlang/OTP
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Jesper Louis Andersen

Mar 15, 2016, 7:39:09 AM
to Sverker Eriksson, Erlang (E-mail), Alex Howle
If your map is "flat" and you frequently access individual keys, then you can avoid copying by restructuring:

If you have {Key, Map} then store [{{Key, K}, V} || {K, V} <- maps:to_list(Map)] and look up by using {Key, K} where K is the desired map key. This can break the large map into smaller chunks which may be more amenable to copying around.
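A minimal sketch of that restructuring, for illustration only. The module name flat_ets and the helpers store/3 and fetch/3 are hypothetical, and the table is assumed to be a set (or ordered_set) created by the caller. Each map entry becomes its own row, so a lookup copies a single small tuple instead of the whole map.

```erlang
-module(flat_ets).
-export([store/3, fetch/3]).

%% Store every entry of Map as its own row, keyed by {Key, K}.
%% Assumes Tab is a set or ordered_set table the caller may write to.
store(Tab, Key, Map) ->
    Rows = [{{Key, K}, V} || {K, V} <- maps:to_list(Map)],
    true = ets:insert(Tab, Rows),
    ok.

%% Look up one entry; only the matching row is copied to the caller's heap.
fetch(Tab, Key, K) ->
    case ets:lookup(Tab, {Key, K}) of
        [{_, V}] -> {ok, V};
        []       -> error
    end.
```

This only pays off when each process needs a few keys. If every process needs the whole map, reading it row by row still copies all of it.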
--
J.

Alex Howle

Mar 15, 2016, 11:36:28 AM
to erlang-q...@erlang.org

The map is not being sent as a message. It is passed in to the spawned processes by being in scope of the spawned function.

Pseudocode:

A=bigmap,
spawn(fun()-> do_something(A) end).

Philip Clarke

Mar 15, 2016, 12:38:00 PM
to Alex Howle, erlang-q...@erlang.org
Hi Alex,

It would be interesting to see if the CPU is busy with garbage collection when the process is reading from ETS.

I suspect that when the process is started without the map and then reads the map from ETS, it has to increase its heap size several times. In order to increase heap size a full GC has to be run, and I have found this in the past to be quite expensive.

I suspect that when you create the process with spawn(fun()-> do_something(A) end), the process is initialised with the correct heap size, thereby avoiding the GC, hence the better CPU performance.

Perhaps you could rerun the program with a higher default heap size per process (+hms)?
I would be interested to know how it goes.

Regards
Philip

Hynek Vychodil

Mar 15, 2016, 2:13:32 PM
to Alex Howle, erlang-questions
If you start it in this way, your map is copied from the heap of the parent process to the heap of the new process. The difference could be the starting size of the heap. So you could try to start the process as spawn_op(fun()-> do_something(A) end, {}min_heap_size, Size}).
With Size around 2 * erts_debug:flat_size(Map).

Hynek Vychodil

Mar 15, 2016, 2:15:23 PM
to Alex Howle, erlang-questions
I'm sorry, there is a typo. It should be spawn_opt(fun()-> do_something(A) end, [{min_heap_size, Size}]).
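A minimal sketch of the pre-sized spawn, for illustration. The module name heap_presize and the helpers spawn_presized/2 and demo/0 are hypothetical; only spawn_opt/2, the min_heap_size option, erts_debug:flat_size/1 and process_info/2 come from OTP. Both flat_size and min_heap_size are expressed in words, so no unit conversion is needed.

```erlang
-module(heap_presize).
-export([spawn_presized/2, demo/0]).

%% Spawn Fun in a process whose initial heap has room for roughly
%% two copies of Term (both sizes are in words).
spawn_presized(Fun, Term) ->
    Words = 2 * erts_debug:flat_size(Term),
    spawn_opt(Fun, [{min_heap_size, Words}]).

%% Returns {RequestedWords, ActualHeapWords} for a freshly spawned process.
demo() ->
    Map = maps:from_list([{K, []} || K <- lists:seq(1, 10000)]),
    Requested = 2 * erts_debug:flat_size(Map),
    Parent = self(),
    Pid = spawn_presized(
            fun() ->
                    {heap_size, Heap} = process_info(self(), heap_size),
                    Parent ! {self(), Heap}
            end, Map),
    receive
        {Pid, Actual} -> {Requested, Actual}
    end.
```

The runtime rounds the requested size up to one of its own heap sizes, so the actual heap is at least as large as requested. Whether the factor of 2 is the right headroom is a judgment call taken from the suggestion above.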

Robert Virding

Mar 15, 2016, 8:25:53 PM
to Alex Howle, erlang-questions
As others have said, this will copy the map from the spawner to the spawnee process. The interesting thing is what you do with it afterwards. If you just carry it around as process-local data then there is no extra copying except during GC, which you can't do anything about. If, however, you then store the map as a map in an ETS table, you will be copying the map between the process and the table for every access. This will be very inefficient if you just want to access one key in the map, as you will first have to copy the whole map from the ETS table to the process, access the key, and then, if you are writing to the map, copy it all back to the ETS table. It would then be better to split the map into separate {Key,Value} tuples which you store as separate elements in the ETS table.

ETS tables are great but you have to be aware of the copying. Erlang doesn't do shared global data! ETS tables aren't shared, just globally accessible.

Robert

Alex Howle

Mar 16, 2016, 7:33:37 AM
to Sverker Eriksson, erlang-q...@erlang.org

Assuming that when you say "win" you mean that ets:lookup should be more efficient (and less CPU intensive) then I'm seeing the opposite.

Hynek Vychodil

Mar 16, 2016, 12:19:05 PM
to Alex Howle, erlang-questions
Just to be sure, when you wrote that erts_debug:flat_size/1 approximates the map's size to be 1MB, did you mean something around this value?

> erts_debug:flat_size(Map)
131072

Because if you have something like
> erts_debug:flat_size(Map)
1048576

then it is 8MB and your memory IO is around 25 * 9/s * 8MB = 1800MB/s. That is still manageable by your HW but should be considered in your design.

Alex Howle

Mar 16, 2016, 12:20:41 PM
to Hynek Vychodil, erlang-q...@erlang.org

I'm on a 64-bit machine, and flat_size returns the number of words. So I took the result and multiplied by 8 to get the number of bytes.
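The word-to-byte conversion and the copy-rate estimate from this exchange can be written down as a small sketch. The module name copy_cost and both function names are hypothetical; erts_debug:flat_size/1 and erlang:system_info(wordsize) are the only OTP calls used.

```erlang
-module(copy_cost).
-export([term_bytes/1, copy_rate_mb/3]).

%% Size of Term in bytes. flat_size/1 counts words, and wordsize is
%% 8 on a 64-bit emulator.
term_bytes(Term) ->
    erts_debug:flat_size(Term) * erlang:system_info(wordsize).

%% MB copied per second when each of ProcsPerTrigger processes copies
%% Bytes once, and the trigger fires TriggersPerSec times a second.
copy_rate_mb(Bytes, ProcsPerTrigger, TriggersPerSec) ->
    Bytes * ProcsPerTrigger * TriggersPerSec / (1024 * 1024).
```

With the figures from this thread, copy_cost:copy_rate_mb(131072 * 8, 25, 9) gives 225.0 for a 1MB map, and copy_cost:copy_rate_mb(1048576 * 8, 25, 9) gives 1800.0 for the 8MB case. These numbers count one copy per process and ignore any further copying done by garbage collection.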

Sverker Eriksson

Mar 16, 2016, 12:21:04 PM
to Alex Howle, erlang-q...@erlang.org
Well, I would expect copy_shallow (from ETS) to be less CPU intensive
than copy_struct (from process).

However, as indicated by others, ets:lookup on such a big map will probably
trigger a garbage collection on the process, which will lead to
yet another copy of the big map.

The spawn(fun() -> do_something(BigMap) end) on the other hand will
allocate a big enough heap for the process from the start and only do
one copy of the big map.

/Sverker, Erlang/OTP

Hynek Vychodil

Mar 16, 2016, 1:34:04 PM
to Sverker Eriksson, erlang-questions, Alex Howle
I was curious enough to try it:

-module(ets_vs_msg).

-export([start/1]).

-export([ets/2, ets_h/2, msg/2, arg/2]).

-define(Tab, ?MODULE).

-define(MapSize, 100000). %% 100000 is 2.87 MB

start(N) ->
    Map = gen_map(),
    ets_init(Map),
    [[{X, element(1, timer:tc(fun ?MODULE:X/2, [N, Map]))/N}
      || X <- [ets_h, ets, msg, arg]]
     || _ <- lists:seq(1, 3)].

gen_map() ->
    gen_map(?MapSize).

gen_map(N) ->
    maps:from_list([{X, []} || X <- lists:seq(1, N)]).

ets_init(Map) ->
    (catch ets:new(?Tab, [named_table])),
    ets:insert(?Tab, {foo, Map}).

ets(N, _Msg) ->
    Pids = [ spawn_link(fun loop/0) || _ <- lists:seq(1, N) ],
    [ Pid ! {ets, self()} || Pid <- Pids],
    [ receive {ok, Pid} -> ok end || Pid <- Pids ].

ets_h(N, Msg) ->
    Size = 2*erts_debug:flat_size(Msg),
    Pids = [ spawn_opt(fun loop/0, [link, {min_heap_size,Size}]) || _ <- lists:seq(1, N) ],
    [ Pid ! {ets, self()} || Pid <- Pids],
    [ receive {ok, Pid} -> ok end || Pid <- Pids ].

msg(N, Msg) ->
    Pids = [ spawn_link(fun loop/0) || _ <- lists:seq(1, N) ],
    [ Pid ! {msg, self(), Msg} || Pid <- Pids],
    [ receive {ok, Pid} -> ok end || Pid <- Pids ].

arg(N, Msg) ->
    Pids = [ spawn_link(fun() -> init(Msg) end) || _ <- lists:seq(1, N) ],
    [ Pid ! {do, self()} || Pid <- Pids],
    [ receive {ok, Pid} -> ok end || Pid <- Pids ].

init(_) ->
    loop().

loop() ->
    receive
        {ets, From} ->
            ets:lookup(?Tab, foo),
            From;
        {msg, From, _Msg} ->
            From;
        {do, From} ->
            From
    end ! {ok, self()}.

Reading from ets with a prepared heap is the clear winner:

40> ets_vs_msg:start(1000).
[[{ets_h,805.83},{ets,2383.31},{msg,4492.15},{arg,3957.693}],
 [{ets_h,918.221},
  {ets,2379.459},
  {msg,4651.258},
  {arg,4028.799}],
 [{ets_h,927.538},
  {ets,2370.421},
  {msg,4519.885},
  {arg,4057.264}]]

But there is a catch. If I look at CPU utilisation, only ets_h and ets use all cores/schedulers (i7 with 4 HT in my case), which indicates that both the msg and arg versions copy the map from a single process. In my case, sending the message from more processes would lead to at most a 4x speed-up for the msg and arg versions.

Hynek Vychodil

Mar 16, 2016, 1:58:22 PM
to Sverker Eriksson, erlang-questions, Alex Howle
I have tried a parallel version of msg and arg:

msg_p(N, Msg) ->
    do_p(fun msg/2, N, Msg).

arg_p(N, Msg) ->
    do_p(fun arg/2, N, Msg).

do_p(F, N, Msg) ->
    Schedulers = erlang:system_info(schedulers),
    Parent = self(),
    N2 = N div Schedulers,
    Pids = [spawn_link(fun() -> F(N2, Msg), Parent ! {ok, self()} end)
            || _ <- lists:seq(1, Schedulers) ],
    [ receive {ok, Pid} -> ok end || Pid <- Pids].

It performs better but is still worse than ets, though I don't know how it would behave on HW with 40 CPUs/schedulers:

[[{ets_h,787.688},
  {ets,2215.42},
  {msg_p,2525.365},
  {msg,4964.156},
  {arg_p,2780.5},
  {arg,4248.214}],
 [{ets_h,901.369},
  {ets,2343.145},
  {msg_p,2368.203},
  {msg,5062.984},
  {arg_p,2073.172},
  {arg,4260.998}],
 [{ets_h,906.705},
  {ets,2423.889},
  {msg_p,3135.662},
  {msg,5069.39},
  {arg_p,2186.49},
  {arg,4268.753}]]

Setting the initial heap size in msg helps a little bit:

msg(N, Msg) ->
    Size = 2*erts_debug:flat_size(Msg),
    Pids = [ spawn_opt(fun loop/0, [link, {min_heap_size,Size}]) || _ <- lists:seq(1, N) ],
    [ Pid ! {msg, self(), Msg} || Pid <- Pids],
    [ receive {ok, Pid} -> ok end || Pid <- Pids ].

[[{ets_h,823.901},
  {ets,2200.168},
  {msg_p,1974.292},
  {msg,4678.855},
  {arg_p,2082.779},
  {arg,4666.294}],
 [{ets_h,906.677},
  {ets,2033.719},
  {msg_p,2092.892},
  {msg,4665.692},
  {arg_p,2005.953},
  {arg,4707.86}],
 [{ets_h,902.813},
  {ets,2290.883},
  {msg_p,2041.713},
  {msg,4655.373},
  {arg_p,2011.422},
  {arg,4659.18}]]

So I think sending a message could be reasonably faster than the ets version on HW with 40 CPUs. Anyway, storing or sending a map this big doesn't seem like good design.

Alex Howle

Mar 16, 2016, 4:44:33 PM
to Hynek Vychodil, erlang-questions
Thank you very much for taking the time to experiment with this!

Does that mean that splitting the 1MB map into smaller maps would somehow be better, do you think? All parts of the map are required for the processing to be successful.

Hynek Vychodil

Mar 16, 2016, 5:22:55 PM
to Alex Howle, erlang-questions
It's hard to tell because we still don't know much about your use case. Where does the data in the map come from? How often does it change? How big a portion of the keys changes? How big a portion of the keys is read in each process?

For example, if the data in the map changes less than about once per quarter of an hour and is read heavily, I would consider compiling a module with a literal constant. If the data in the map changes often, but only a small portion of the keys changes and only a small portion of the keys is read, I would definitely store the data in ets one key per record, which is what Jesper Louis Andersen suggested, and so on. It depends heavily on your exact use case.
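One way the "module with a literal constant" idea could be sketched is shown below. This is an assumption about how to implement it, not something spelled out in the thread: the module name literal_store, the function publish/2 and the generated accessor term/0 are all hypothetical. It builds the abstract forms by hand, compiles them with compile:forms/1 and loads the result with code:load_binary/3. It only works for terms that erl_parse:abstract/1 can represent, so no pids, ports, references or funs.

```erlang
-module(literal_store).
-export([publish/2]).

%% Compile and load a module Mod whose only function, Mod:term/0,
%% returns Term as a literal embedded in the module.
publish(Mod, Term) when is_atom(Mod) ->
    Forms = [{attribute, 1, module, Mod},
             {attribute, 2, export, [{term, 0}]},
             {function, 3, term, 0,
              [{clause, 3, [], [], [erl_parse:abstract(Term)]}]}],
    {ok, Mod, Bin} = compile:forms(Forms),
    %% Remove any old version first so loading cannot fail with
    %% not_purged. Note that purge kills processes still running old code.
    code:purge(Mod),
    {module, Mod} = code:load_binary(Mod, atom_to_list(Mod) ++ ".erl", Bin),
    ok.
```

After literal_store:publish(big_consts, Map), readers call big_consts:term(). Republishing recompiles and reloads the module, which is why this approach suits data that changes rarely.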

One thing we can tell you for sure: each time you call ets:lookup/2, the whole record from ets is copied in memory. Each time you send a message, the whole message is copied in memory. Each time you spawn a process, the whole initial data is copied in memory.

I made some more measurements. When I use 2 schedulers, sending messages or using spawn is 96% slower than ets:lookup/2. When I use 4 schedulers the difference is only 70%, so there is a possibility that with many more schedulers it could be even faster. I don't have enough data. But this is just out of curiosity, because for your real use case there could be a far better solution.