[erlang-questions] gproc scalability, shared ets, links and using terminate/2

Max Lapshin

Nov 9, 2011, 5:41:23 AM

to Erlang-Questions Questions

As we know, the gen_server terminate/2 callback is not guaranteed to be
called. You should not rely on it being called, but you may hope that
it will be.

It also has a very important property: it runs in the terminating
process itself, separate from any central tracker.

Now a bit about gproc. While pushing erlyvideo toward the 10 Gbit limit,
I ran into scalability problems with using a central tracking process.

I want to claim that the monitor(process, Pid) approach has problems. It
is very convenient in terms of reentrancy and in many other ways. But it
has one big property that is also a drawback:
it is impossible to call demonitor from another process.

Now look at what happens. A request storm begins and lots of user
sessions are created. Then the storm ends, or something else happens,
and they all close within a very short period of time.

Thousands of {'DOWN', ...} messages go to the central tracker and it
stalls, so it is not possible to open new processes.

So, again: the problem is handling lots of DOWN messages in one central
process. We should divide and conquer here.
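
For readers who have not seen the pattern, here is a minimal sketch of the kind of monitor-based central tracker described above. The module name mon_tracker and the functions track/2 and lookup/1 are hypothetical; they are not taken from gproc or erlyvideo. The point to notice is that every client death becomes one 'DOWN' message that the single tracker process must handle, ets cleanup included.

```erlang
-module(mon_tracker).
-behaviour(gen_server).

-export([start_link/0, track/2, lookup/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Ask the tracker to register Key -> Pid and monitor Pid.
track(Key, Pid) ->
    gen_server:call(?MODULE, {track, Key, Pid}).

%% Reads go straight to ets and do not touch the tracker process.
lookup(Key) ->
    case ets:lookup(?MODULE, Key) of
        [{Key, Pid}] -> {ok, Pid};
        [] -> undefined
    end.

init([]) ->
    ets:new(?MODULE, [named_table, protected, set]),
    {ok, #{}}.                       % state: monitor ref => key

handle_call({track, Key, Pid}, _From, Refs) ->
    Ref = erlang:monitor(process, Pid),
    ets:insert(?MODULE, {Key, Pid}),
    {reply, ok, Refs#{Ref => Key}}.

handle_cast(_Msg, Refs) ->
    {noreply, Refs}.

%% One message per dead client, all handled by this single process.
handle_info({'DOWN', Ref, process, _Pid, _Reason}, Refs) ->
    case maps:take(Ref, Refs) of
        {Key, Rest} ->
            ets:delete(?MODULE, Key),
            {noreply, Rest};
        error ->
            {noreply, Refs}
    end;
handle_info(_Other, Refs) ->
    {noreply, Refs}.
```

When thousands of tracked processes exit at once, the handle_info/2 clause above is where the backlog builds up.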

I've changed this scheme a bit: use the good old link/1 function and
undo it (unlink) from the terminate/2 handler:

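my_session.erl:

    %% Runs in the dying session process, not in the tracker.
    terminate(_Reason, #session{session_id = Id}) ->
        gen_tracker:remove_me(flu_sessions, Id).

gen_tracker.erl:

    %% Called by the client: drop the link to the tracker, then delete
    %% the client's own entry.
    remove_me(Zone, Id) when is_pid(Id) ->
        unlink(whereis(Zone)),
        delete_by_pid(Zone, Id).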

This approach makes all modifications in the client process, which is
still alive. gen_tracker only has to clean up after processes that
were killed.
Together with the find_or_open(Key, SpawnFun) approach, it helped me a lot.
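
A fuller sketch of the same idea, under stated assumptions: link_tracker, add_me/1 and lookup/1 are hypothetical names (only remove_me appears in the fragments above), the ets table is public so that clients can edit their own rows, and each client registers at most one key. It is an illustration of the technique, not the actual gen_tracker code.

```erlang
-module(link_tracker).
-behaviour(gen_server).

-export([start_link/0, add_me/1, remove_me/1, lookup/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Called by the client itself: link to the tracker and insert our row.
%% Fails with badarg if the tracker is not running.
add_me(Key) ->
    link(whereis(?MODULE)),
    ets:insert(?MODULE, {Key, self()}),
    ok.

%% Called by the client (for example from terminate/2) while it is
%% still alive: delete our row, then drop the link, so the tracker
%% never hears about this exit.
remove_me(Key) ->
    ets:delete(?MODULE, Key),
    unlink(whereis(?MODULE)),
    ok.

lookup(Key) ->
    case ets:lookup(?MODULE, Key) of
        [{Key, Pid}] -> {ok, Pid};
        [] -> undefined
    end.

init([]) ->
    %% Trap exits so that a dying client does not take the tracker down.
    process_flag(trap_exit, true),
    ets:new(?MODULE, [named_table, public, set]),
    {ok, nostate}.

handle_call(_Req, _From, State) ->
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Only clients that died without calling remove_me/1 (for example
%% after exit(Pid, kill)) end up here.
handle_info({'EXIT', Pid, _Reason}, State) ->
    ets:match_delete(?MODULE, {'_', Pid}),
    {noreply, State};
handle_info(_Other, State) ->
    {noreply, State}.
```

The trade-off is that a crash of the tracker sends an exit signal to every linked client.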
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://erlang.org/mailman/listinfo/erlang-questions

Ulf Wiger

Nov 9, 2011, 7:06:28 AM

to Max Lapshin, Erlang-Questions Questions

It's true that link/1 has that advantage.

The disadvantage is that if the server crashes, and is linked to, say, 100K processes, the EXIT signal will be duplicated that many times, likely forcing an OOM crash.

(Been there, done that.)

BR,
Ulf W

Ulf Wiger, CTO, Erlang Solutions, Ltd.
http://erlang-solutions.com

Max Lapshin

Nov 9, 2011, 9:00:15 AM

to Ulf Wiger, Erlang-Questions Questions

On Wed, Nov 9, 2011 at 1:06 PM, Ulf Wiger
<ulf....@erlang-solutions.com> wrote:
>
> It's true that link/1 has that advantage.
>
> The disadvantage is that if the server crashes, and is linked to, say, 100K processes, the EXIT signal will be duplicated that many times, likely forcing an OOM crash.
>

I understand. But there is nothing I can do except hope that the
server will not crash, because with monitors the system in my
situation freezes under normal load.

And 100K messages are not a reason for OOM. Dumping these messages
is what causes the OOM; the error logger is the main thing that brings
down the VM.

By the way, have you seen my previous claim about non-atomic process startup?
Using the supervisor's {error, {already_started, Pid}} feature is also not
100% reliable because of a race condition.

Ulf Wiger

Nov 9, 2011, 9:14:05 AM

to Max Lapshin, Erlang-Questions Questions

On 9 Nov 2011, at 15:00, Max Lapshin wrote:

> And 100 K of messages is not a reason for OOM. Dumping these messages
> will be an OOM, error logger is the main reason to bring down VM.

Actually, it can well be. Granted, the one time I managed to do this was when I had spawn-linked 100K processes from the shell, and then managed to crash the shell process through a typo (I was going to write length(processes()), but mistyped 'length'). The {undef, {erlang, lenth, [Pids]}} became a pretty large EXIT message, and the VM tried to create 100K copies of it in one atomic operation. It took 10 minutes before I regained use of my workstation...
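
The effect can be reproduced on a small scale. The sketch below uses a hypothetical module, link_fanout, in which a hub process spawn-links N idle workers and then exits with a reason that contains the whole worker list. Each linked worker receives an exit signal carrying that reason, so the cost grows roughly with N times the size of the reason. This is an illustration only, not a reconstruction of the shell session described above.

```erlang
-module(link_fanout).
-export([run/1]).

%% Returns how many of the N linked workers died after the hub crashed.
run(N) ->
    Caller = self(),
    Hub = spawn(fun() -> hub(Caller, N) end),
    Workers = receive {workers, Hub, Pids} -> Pids end,
    Refs = [erlang:monitor(process, W) || W <- Workers],
    Hub ! crash,
    length([R || R <- Refs, went_down(R)]).

hub(Caller, N) ->
    Workers = [spawn_link(fun idle/0) || _ <- lists:seq(1, N)],
    Caller ! {workers, self(), Workers},
    %% The exit reason deliberately includes the whole worker list,
    %% so it grows with N, like the large undef reason in the anecdote.
    receive crash -> exit({boom, Workers}) end.

idle() ->
    receive after infinity -> ok end.

went_down(Ref) ->
    receive
        {'DOWN', Ref, process, _Pid, _Reason} -> true
    after 5000 ->
        false
    end.
```

Keep N modest when trying this; the caller's own 'DOWN' messages carry the same large reason.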

I'll consider your use case, and try to come up with something that does scale without adding overhead to the normal case.

The other reason for disliking links in this case is that the server imposes an EXIT signal on the client that it basically has no reasonable reaction to. I think a logical extension of this is to rig gproc so that it always brings down the whole node if it crashes. Currently, the server can crash and recover, which is not a bad feature.

BR,
Ulf W

Ulf Wiger, CTO, Erlang Solutions, Ltd.
http://erlang-solutions.com

Joseph Norton

Nov 9, 2011, 9:17:02 AM

to Max Lapshin, Erlang-Questions Questions

I've been using a "goodbye" patch for the gproc application to help move some of the cleanup work to the client side and lessen the work of the centralized gproc server.

https://github.com/norton/gproc/commit/e2c4108c2ceae5d86ca78f9f1d5e5c6b45f7309a

Not (quite) sure if this is helpful to your use case or not.

- Joe N.

Max Lapshin

Nov 9, 2011, 9:21:03 AM

to Joseph Norton, Erlang-Questions Questions

Yes, perhaps the problem is that the client process dies and sends a
DOWN message while the tracker is busy cleaning data out of ets.

If the client sends a {demonitor_me, Pid} message instead, the tracker
will know that ets is already cleaned, and it will run much faster.
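
A rough sketch of that idea, with assumptions spelled out: goodbye_tracker, track_me/1, goodbye/1 and lookup/1 are hypothetical names, and the ets table is public so that the client can delete its own row. The tracker still monitors clients, but a client that leaves politely does the ets work itself and only asks the tracker to drop the monitor. This is not the gproc implementation.

```erlang
-module(goodbye_tracker).
-behaviour(gen_server).

-export([start_link/0, track_me/1, goodbye/1, lookup/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Called by the client: the tracker inserts the row and monitors us.
track_me(Key) ->
    gen_server:call(?MODULE, {track, Key, self()}).

%% Called by the client before it exits: clean ets ourselves, then
%% tell the tracker it can drop the monitor.
goodbye(Key) ->
    ets:delete(?MODULE, Key),
    gen_server:cast(?MODULE, {demonitor_me, self()}).

lookup(Key) ->
    case ets:lookup(?MODULE, Key) of
        [{Key, Pid}] -> {ok, Pid};
        [] -> undefined
    end.

init([]) ->
    ets:new(?MODULE, [named_table, public, set]),
    {ok, #{}}.                       % state: pid => {monitor ref, key}

handle_call({track, Key, Pid}, _From, Mons) ->
    Ref = erlang:monitor(process, Pid),
    ets:insert(?MODULE, {Key, Pid}),
    {reply, ok, Mons#{Pid => {Ref, Key}}}.

%% Cheap path: no ets work, and no 'DOWN' message will follow.
handle_cast({demonitor_me, Pid}, Mons) ->
    case maps:take(Pid, Mons) of
        {{Ref, _Key}, Rest} ->
            erlang:demonitor(Ref, [flush]),
            {noreply, Rest};
        error ->
            {noreply, Mons}
    end.

%% Slow path: the client died without saying goodbye.
handle_info({'DOWN', _MRef, process, Pid, _Reason}, Mons) ->
    case maps:take(Pid, Mons) of
        {{_, Key}, Rest} ->
            ets:delete(?MODULE, Key),
            {noreply, Rest};
        error ->
            {noreply, Mons}
    end;
handle_info(_Other, Mons) ->
    {noreply, Mons}.
```

The tracker still receives one message per departing client, but the per-message work is smaller.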

I'll try that, because the problems Ulf describes may be very serious.
100K messages are not an issue; the Erlang VM can deal with them in 2
seconds. Printing them is the problem.

Ulf Wiger

Nov 9, 2011, 9:21:13 AM

to Joseph Norton, Erlang-Questions Questions

Joe, I believe that patch has been merged into the main. At least, gproc:goodbye() does exist there.

It should make some difference, since the server will have much less work to do for each DOWN message.

BR,
Ulf

Ulf Wiger, CTO, Erlang Solutions, Ltd.
http://erlang-solutions.com

Zabrane Mickael

Nov 9, 2011, 9:20:42 AM

to Ulf Wiger, erlang-questions Questions

Hi Ulf,

Any chance of updating the jobs documentation and examples here:

https://github.com/esl/jobs

Regards,

Zabrane

David Mercer

Nov 10, 2011, 11:57:33 AM

to Max Lapshin, Erlang-Questions Questions

On Wednesday, November 09, 2011, Max Lapshin wrote:

> As we know, gen_server:terminate/2 function is called
> non-deterministically. You should not rely that it will be called. But
> you may hope that it will be called.

I did not know that. I thought it got called so long as your server was
trapping exits. Is my assumption not true?

Cheers,

DBM

Max Lapshin

Nov 10, 2011, 12:59:22 PM

to David Mercer, Erlang-Questions Questions

> I did not know that. I thought it got called so long as your server was
> trapping exits. Is my assumption not true?
>

After erlang:exit(Pid, kill), your process will not get a chance to call terminate/2.

Robert Virding

Nov 11, 2011, 6:25:50 AM

to Max Lapshin, Erlang-Questions Questions

Yes, the 'kill' signal is untrappable, so the process which receives it just dies immediately.

Robert
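
The difference discussed in the last three messages can be checked with a small experiment. The module below, term_demo, is a hypothetical sketch: a gen_server that traps exits and reports to an observer from terminate/2. It assumes gen_server:stop/1 is available (OTP 18 or later).

```erlang
-module(term_demo).
-behaviour(gen_server).

-export([start/1, terminated_after/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% Started without a link so that killing it does not affect the caller.
start(Observer) ->
    gen_server:start(?MODULE, Observer, []).

init(Observer) ->
    process_flag(trap_exit, true),
    {ok, Observer}.

handle_call(_Req, _From, Observer) ->
    {reply, ok, Observer}.

handle_cast(_Msg, Observer) ->
    {noreply, Observer}.

handle_info(_Info, Observer) ->
    {noreply, Observer}.

%% Tell the observer that terminate/2 actually ran.
terminate(Reason, Observer) ->
    Observer ! {terminated, self(), Reason},
    ok.

%% How is stop (orderly shutdown) or kill (untrappable exit signal).
%% Returns true if terminate/2 ran before the server died.
terminated_after(How) ->
    {ok, Pid} = start(self()),
    Ref = erlang:monitor(process, Pid),
    case How of
        stop -> ok = gen_server:stop(Pid);
        kill -> exit(Pid, kill)
    end,
    receive {'DOWN', Ref, process, Pid, _} -> ok end,
    receive
        {terminated, Pid, _Reason} -> true
    after 100 ->
        false
    end.
```

With this sketch, term_demo:terminated_after(stop) reports that terminate/2 ran, while term_demo:terminated_after(kill) reports that it did not, even though the server traps exits.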
