You could use the synchronized serialization to generate push-back
behaviour from your system, so that you do not handle a new request
before it's possible to do so. Maybe you are already doing this, maybe
not.
If you really want to find the bottlenecks, you could try profiling
with fprof:
http://www.erlang.org/doc/man/fprof.html
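A minimal sketch of what an fprof run could look like, assuming the
tools application is available on the node. The module name
fprof_sketch and the busy_work/1 workload are made up for illustration;
only the fprof calls themselves come from the manual linked above.

```erlang
-module(fprof_sketch).
-export([run/0, busy_work/1]).

%% Hypothetical workload standing in for a slow code path.
busy_work(N) ->
    lists:sum([X * X || X <- lists:seq(1, N)]).

run() ->
    %% Trace a single call; raw trace data goes to the default
    %% trace file in the current directory.
    Result = fprof:apply(?MODULE, busy_work, [10000]),
    %% Turn the raw trace into profile data held by the fprof server.
    ok = fprof:profile(),
    %% Write the per-function analysis to a file (file name is an
    %% arbitrary choice for this sketch).
    ok = fprof:analyse([{dest, "fprof.analysis"}]),
    Result.
```

To look at an already running gen_server instead of a single call, one
would probably start from fprof:trace([start, {procs, [Pid]}]) and
fprof:trace(stop) around the load test, then run the same profile and
analyse steps.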
/G
> _______________________________________________
> erlang-questions mailing list
> erlang-q...@erlang.org
> http://erlang.org/mailman/listinfo/erlang-questions
>
The obvious, and maybe non-OTP, answer is to hold some of this state information in a public or protected named ETS table that your clients read from directly. A single gen_server can still own and write to that ETS table.
-- Jesper Louis Andersen Erlang Solutions Ltd., Copenhagen, DK
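A rough sketch of the layout described above, under the assumption of
one registered gen_server owning a protected, named table. The module
name cache_sketch and the table name cache_sketch_tab are invented for
this example.

```erlang
-module(cache_sketch).
-behaviour(gen_server).

-export([start_link/0, write/1, read/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Writes are serialized through the process that owns the table.
write(Obj) ->
    gen_server:call(?MODULE, {write, Obj}).

%% Reads run in the caller's own process and never touch the server.
read(Key) ->
    case ets:lookup(cache_sketch_tab, Key) of
        [] -> not_found;
        [_|_] = Objects -> {ok, Objects}
    end.

init([]) ->
    %% protected: any process may read, only the owner may write.
    cache_sketch_tab = ets:new(cache_sketch_tab,
                               [named_table, protected, set,
                                {read_concurrency, true}]),
    {ok, nostate}.

handle_call({write, Obj}, _From, State) ->
    true = ets:insert(cache_sketch_tab, Obj),
    {reply, ok, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Note that the table disappears when its owner dies, so readers need to
cope with ets:lookup/2 failing with badarg during a restart.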
[Replying to multiple replies at once, all quoted text reformatted for
readability (seems people these days can't be bothered!?).]
[Warning: excessively nitpicky content ahead, here and there.]
# God Dang 2012-01-28:
> I'm creating a system where I've ended up with a lot of gen_servers that
> provide a clean interface. When I run this under load I see that the
> gen_server:call's are becoming a bottleneck.
You're probably treating asynchronous things as synchronous somewhere along
the path, inflicting collateral damage on concurrent users? A synchronous
high-level API is fine, but if you know some operation is expensive or
depends on a response from the outside world, you should record the request
context in ETS (a row looking something like {Req_id, Timeout, From[, ...]})
and process requests out of order, OR offload the actual processing to
short-lived worker processes like Matthew Evans says, OR a combination
of both, OR somesuch.
My point being that gen_server:call/N, by itself, is *very* fast in
practice, so chances are you're doing something wrong elsewhere.
Other (unlikely) thing: you're not sending very large data structures in
messages, are you? That could hurt, but there are ways to address that
too if needed.
# Matthew Evans 2012-01-28:
> Another obvious answer is to provide back-pressure of some kind to prevent
> clients from requesting data when it is under load.
On external interfaces (or for global resource usage of some sort): yes, a
fine idea (a clear "must have", actually!); but doing this internally would
seem excessively defensive to me, unless further justification was given.
> You might want to change such an operation from:
>
> handle_call({long_operation,Data},From,State) ->
>     Rsp = do_lengthy_operation(Data),
>     {reply, Rsp, State};
>
> to:
>
> handle_call({long_operation,Data},From,State) ->
>     spawn(fun() -> Rsp = do_lengthy_operation(Data), gen_server:reply(Rsp,From) end),
>     {noreply, State};
1. Why do people bother introducing "one-shot" variables for trivial
expressions they could have inlined? Means less context to maintain
when reading the code...
2. Surely you meant proc_lib:spawn_link/X there, didn't you? SASL logs
and fault propagation are the reason. While there are exceptions to
this, they're extremely rare.
3. The order of arguments to gen_server:reply/2 is wrong.
Regarding the general approach: yes, a fine idea too. Depending on what
do_lengthy_operation/1 does, putting these workers under a supervisor
might be called for.
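For illustration, the three remarks above applied to the quoted clause
might look roughly like this. The module name worker_sketch and the
body of do_lengthy_operation/1 are placeholders made up for the sketch;
supervision of the workers is left out.

```erlang
-module(worker_sketch).
-behaviour(gen_server).

-export([start_link/0, long_operation/1]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

long_operation(Data) ->
    gen_server:call(?MODULE, {long_operation, Data}).

init([]) ->
    {ok, nostate}.

handle_call({long_operation, Data}, From, State) ->
    %% proc_lib gives SASL crash reports; the link propagates worker
    %% faults to the server instead of leaving the caller hanging.
    proc_lib:spawn_link(fun() ->
        %% From goes first, the reply second.
        gen_server:reply(From, do_lengthy_operation(Data))
    end),
    {noreply, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

%% Hypothetical stand-in for the expensive work.
do_lengthy_operation(Data) ->
    timer:sleep(100),
    {done, Data}.
```

Since the server does not trap exits here, a crashing worker takes the
server down with it, and every pending gen_server:call/2 then fails
through its monitor rather than waiting for the timeout.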
# Jesper Louis Andersen 2012-01-28:
> This would be my first idea. Create an ETS table being protected.
> Writes to the table goes through the gen_server,
Yes, a fine idea too -- ETS is one of the less obvious cornerstones
of Erlang programming (but don't tell "purity" fascists)... One
detail: almost all of my ETS tables are public even when many of
them are really treated as private or protected; the reason is to keep
a high degree of runtime tweakability just in case (this might be a
bit superstitious, I admit).
> -export([write/1, read/1]).
>
> write(Obj) ->
>     call({write, Obj}).
>
> call(M) ->
>     gen_server:call(?SERVER, M, infinity).
1. Abstracting trivial functionality such as call/1 above only
obfuscates code for precisely zero gain.
2. Same goes for typing "?SERVER" instead of the actual server
name. Using "?MODULE" is however alright, as long as it's
only referred to from current module (as it should).
3. No infinite timeouts without very good justification! You're
sacrificing a good default protective measure for no good
reason...
> but reads happen in the calling process of the API and does not go
> through the gen_server at all,
>
> read(Key) ->
>     case ets:lookup(?TAB, Key) of
>         [] -> not_found;
>         [_|_] = Objects -> {ok, Objects}
>     end.
(2) from above also applies to "?TAB" here. More to the point, it's
sometimes perfectly OK to do table writes directly from caller's
context too, like:
write(Item) ->
    true = ets:insert_new(actual_table_name, Item).
It can of course be very tricky business and needs good thinking first.
I bet you're aware of this, mentioning it just because it's a handy
trick that doesn't seem to be widely known.
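A small sketch of what the `true =` match buys here, assuming a public
set table. The module and table names are made up for the example.

```erlang
-module(direct_write_sketch).
-export([demo/0]).

demo() ->
    %% public: any process may write, not just the owner.
    Tab = ets:new(direct_write_sketch_tab, [public, set]),
    %% First insert for this key succeeds.
    true = ets:insert_new(Tab, {key, first}),
    %% A second insert with the same key is refused and the existing
    %% row is left untouched; matching on true would crash the caller.
    false = ets:insert_new(Tab, {key, second}),
    [{key, first}] = ets:lookup(Tab, key),
    true = ets:delete(Tab),
    ok.
```

In other words, a concurrent writer that loses the race for a key fails
loudly in its own process instead of silently overwriting the row.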
> Creating the table with {read_concurrency, true} as the option will
> probably speed up reads by quite a bit as well. It is probably going
> to be a lot faster than having all caching reads going through that
> single point of contention. Chances are that just breaking parts of
> the chain is enough to improve the performance of the system.
Well, yes, avoiding central points of contention (such as blessed
processes or, you guessed it, ETS tables) is certainly good engineering
practice, but see first part of this email for other considerations.
BR,
-- Jachym
What I meant was that you could maybe change your current design (if
it is not like this already) so that you do not start to process
another request from an external interface until your system has the
actual resources to do so.
Perhaps my reaction for a cache was too knee-jerky. You might want a call
to fail if it takes too long to process because it will uncover another
problem in the code: namely overload of the cache.
--
Jesper Louis Andersen
Erlang Solutions Ltd., Copenhagen, DK
True.
> Perhaps my reaction for a cache was too knee-jerky. You might want a
> call to fail if it takes too long to process because it will uncover
> another problem in the code: namely overload of the cache.
And it will also release the resources held by the waiting process;
that's my main concern. What came to mind immediately when I saw the
infinite timeout was a typical (IMO) scenario where one might be
tempted to use those:
%% This could be a load-balancer or failover manager process that
%% acts as entry point to a protocol stack, or somesuch thing.
send_req(Pid, Req, Timeout) ->
    gen_server:call(Pid, {send_req, Req, Timeout}, infinity).

handle_call({send_req, Req, Timeout}, From, #state{parties = Ps} = State) ->
    party:send_req(choose_party(Ps), {send_req, Req, From, Timeout}),
    {noreply, State};
Where the party module would perhaps do some more delegation of its own
and eventually the request ends up sent out to an external system, a
timeout gets planned and its reference recorded along with From in an
ETS table. When either the timeout triggers or the response arrives, it
gets correlated against ETS and gen_server:reply/2 is called.
Now if something goes wrong and that ETS table evaporates or a low-level
process explodes and the error, by mistake, isn't propagated correctly
(which would involve faulting all pending requests immediately), one is
left with the client process sitting there forever. Sure, gen_server:call/X
isn't stupid and monitors the server process -- but given the amount of
delegation we have going on, that one may very well still be alive and
doing well. Now over time these zombie processes could add up, and the
whole node crashes.
This is a somewhat elaborate scenario and depends on a suitable bug being
already present somewhere in the system, or perhaps just some unfortunate
timing in an otherwise reasonably designed system. But let's consider a
trivial change, everything else being the same:
send_req(Pid, Req, Timeout) ->
    gen_server:call(Pid, {send_req, Req, scale_down(Timeout)}, Timeout).

scale_down(N) ->
    %% This could involve both low and high internal processing overhead
    %% allowance cutoff if one wanted to be super-correct about this.
    round(N * 0.90).
Bugs or not, and timing or not, we now have a hard deadline on resource
release and can sleep a bit more peacefully at night without nightmares
of zombie apocalypse. :-) Furthermore, this valuable behavioral contract
is immediately apparent during code inspection, making things easier to
reason about.
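To make the moving parts concrete, here is a compressed sketch of the
scheme discussed above, with the whole delegation chain collapsed into
a single gen_server. The module name pending_sketch, the table name,
response/2 and send_out/2 are all invented for illustration; a real
stack would spread this over several processes.

```erlang
-module(pending_sketch).
-behaviour(gen_server).

-export([start_link/0, send_req/2, response/2]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% The client gives up at Timeout; the internal timer fires earlier.
send_req(Req, Timeout) ->
    gen_server:call(?MODULE, {send_req, Req, round(Timeout * 0.90)}, Timeout).

%% Called when the (hypothetical) external system answers ReqId.
response(ReqId, Rsp) ->
    gen_server:cast(?MODULE, {response, ReqId, Rsp}).

init([]) ->
    {ok, ets:new(pending_sketch_tab, [set, private])}.

handle_call({send_req, Req, Timeout}, From, Tab) ->
    ReqId = make_ref(),
    TRef = erlang:start_timer(Timeout, self(), {req_timeout, ReqId}),
    %% Record the request context so the reply can be sent later.
    true = ets:insert_new(Tab, {ReqId, TRef, From}),
    ok = send_out(ReqId, Req),
    {noreply, Tab}.

handle_cast({response, ReqId, Rsp}, Tab) ->
    finish(Tab, ReqId, {ok, Rsp}),
    {noreply, Tab}.

handle_info({timeout, _TRef, {req_timeout, ReqId}}, Tab) ->
    finish(Tab, ReqId, {error, timeout}),
    {noreply, Tab};
handle_info(_Other, Tab) ->
    {noreply, Tab}.

%% Correlate against ETS; whichever of response/timeout comes first wins.
finish(Tab, ReqId, Reply) ->
    case ets:lookup(Tab, ReqId) of
        [{ReqId, TRef, From}] ->
            _ = erlang:cancel_timer(TRef),
            true = ets:delete(Tab, ReqId),
            gen_server:reply(From, Reply);
        [] ->
            ok  % already answered or timed out; drop silently
    end.

%% Stand-in for the external system: answers ping at once and
%% never answers anything else.
send_out(ReqId, ping) -> response(ReqId, pong);
send_out(_ReqId, _Req) -> ok.
```

With this in place, pending_sketch:send_req(ping, 1000) gets its answer
through the response path, while any other request is answered by the
internal timer shortly before the caller's own deadline would hit.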
Hopefully this sort of context clarifies my (possibly overly terse)
response.
BR,
-- Jachym