[erlang-questions] High lock contention on dist_tables

Brian Picciano

Apr 19, 2013, 3:24:12 PM

to erlang-q...@erlang.org

We have a pool of 3 Erlang nodes, all on different servers. Every afternoon, without fail, lots of messages between the nodes start showing really high latency, on the order of tens of seconds. Today we ran lcnt on them to see if anything stood out, and found that on one of the nodes dist_table had a significantly higher lock percentage than anything else, and definitely higher than on the other boxes:

(node@address)8> lcnt:conflicts().
         lock   id    #tries  #collisions  collisions [%]  time [us]  duration [%]
        -----  ---   -------  -----------  --------------  ---------  ------------
   dist_table    1   3468191      1242055         35.8128  153712413      255.2521
    run_queue   24  76969638      4088578          5.3119   14468656       24.0264
process_table    1   2015686       147148          7.3001    3208529        5.3280
  timer_wheel    1  12214948       834737          6.8337    3076638        5.1090
    timeofday    1  18231600       594487          3.2608    1491633        2.4770
...

while on the other boxes it was closer to 3. On the box with the high lock contention we also saw much higher load than on the other boxes.
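The figures in the lcnt output above can be sanity-checked with a little arithmetic. The sketch below assumes that collisions [%] is collisions divided by tries, and that duration [%] is the accumulated wait time divided by the length of the measurement window; the module and function names are hypothetical and used only for illustration.

```erlang
-module(lcnt_math).
-export([collision_pct/2, window_seconds/2]).

%% Share of lock attempts that collided, in percent.
collision_pct(Collisions, Tries) ->
    100 * Collisions / Tries.

%% Measurement window in seconds implied by a row, assuming
%% duration [%] = 100 * wait time / window length (an assumption).
window_seconds(TimeUs, DurationPct) ->
    TimeUs / (DurationPct / 100) / 1.0e6.
```

Under that assumption, both the dist_table row and the run_queue row imply a window of roughly 60 seconds, so a duration above 100% would mean that several threads were waiting on the lock at the same time.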

My question is: what is this lock? We couldn't find much online except that it appears to have to do with communication between nodes, but we're not sure what. Also, what, if anything, could we do to mitigate this problem?

(We're running Erlang R16B)

Lukas Larsson

Apr 22, 2013, 12:06:43 PM

to Brian Picciano, Erlang Questions

The dist_table mutex refers to the rwmutex which is defined here[1]. It is used in a number of different places, so saying exactly what is causing the contention is hard without knowing the code. Generally it should indicate that you are trying to send many messages over distribution while information about remote nodes is changing frequently.

One thing I noticed is that the nodes() BIF takes a rwlock on the mutex. Are you using that BIF a lot?

Lukas

[1]: https://github.com/erlang/otp/blob/maint/erts/emulator/beam/erl_node_tables.c#L802

Brian Picciano

Apr 22, 2013, 4:23:48 PM

to Lukas Larsson, Erlang Questions

We are, actually. Is there an alternative way of easily retrieving which nodes are currently connected?

Lukas Larsson

Apr 23, 2013, 3:30:32 AM

to Brian Picciano, Erlang Questions

I don't know of an alternative way. Before trying to come up with a solution I would verify that it is nodes() which is causing the contention.

If you do lcnt:inspect(dist_table, [{locations, true}]) you should get a list of the actual lock locations in the source code where the contention is happening.
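A minimal sketch of how such a measurement might be wrapped up, assuming the emulator was built with lock counting enabled (on a regular emulator the lcnt calls will fail). The module name lcnt_probe and the function sample/1 are hypothetical; the lcnt calls themselves are the documented ones.

```erlang
-module(lcnt_probe).
-export([sample/1]).

%% Collect lock statistics for WindowMs milliseconds, then print the
%% overall conflict table and the source locations for dist_table.
%% Assumes a lock-counting emulator; not meant for a regular one.
sample(WindowMs) ->
    case lcnt:start() of
        {ok, _Pid} -> ok;
        {error, {already_started, _Pid}} -> ok
    end,
    lcnt:clear(),            % reset the counters so old data is excluded
    timer:sleep(WindowMs),   % let the normal workload run
    lcnt:collect(),          % fetch the counters into the lcnt server
    lcnt:conflicts(),        % same table as shown earlier in the thread
    lcnt:inspect(dist_table, [{locations, true}]).
```

Calling lcnt_probe:sample(60000) during the afternoon slowdown would give a one-minute window that can be compared against a quiet period.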

Scott Lystig Fritchie

Apr 23, 2013, 3:01:27 PM

to Brian Picciano, erlang-q...@erlang.org

Brian Picciano <mediocr...@gmail.com> wrote:

bp> We have a pool of 3 erlang nodes, all on different servers. Every
bp> afternoon, without fail, we start seeing lots of messages between
bp> the nodes start having really high latency, on the order of tens of
bp> seconds. [...]

Brian, it's probably worthwhile to continue chasing the 'lcnt' avenue
as you've been corresponding with Lukas...

... but at the same time, I also wonder about "tens of seconds". My gut
says that such delays would require some amazingly high lock contention
rates. Something that can cause such messaging delays much more easily
is network congestion/packet loss that triggers TCP slow start. Many
Linux kernels have the RTO_min value at one second, which is the amount
of time to wait before entering slow start state.

If network packet loss is a problem, this blog posting can explain one
reason why it's happening:
http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/

-Scott
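One rough way to look for a network-side backlog from inside the VM is to check how many bytes are queued on each distribution socket. This is only a sketch: the module name dist_backlog is hypothetical, it assumes the default TCP carrier where each distribution channel is a port, and a large send_pend value only hints at a slow or congested link rather than proving packet loss.

```erlang
-module(dist_backlog).
-export([pending/0]).

%% Returns [{Node, PendingBytes}] for every connected node whose
%% distribution channel is a port (assumed: default TCP distribution).
pending() ->
    [{Node, Bytes}
     || {Node, Port} <- erlang:system_info(dist_ctrl),
        is_port(Port),
        {ok, [{send_pend, Bytes}]} <- [inet:getstat(Port, [send_pend])]].
```

Sampling dist_backlog:pending() a few times during the slow period and comparing it with a quiet period may help decide whether the network deserves a closer look alongside the lock counters.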

Lukas Larsson

May 16, 2013, 3:43:21 AM

to Brian Picciano, Erlang Questions

Hello Brian,

Just letting you know that I have merged a fix which changes the rwlock I mentioned before to an rlock. This should reduce the contention you are seeing, if it was caused by many calls to erlang:nodes().

Lukas
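For releases without that fix, one possible way to cut down on nodes() calls is to keep a cached copy of the node list and update it from nodeup/nodedown messages. The sketch below is an illustration under stated assumptions, not something proposed in this thread: the module name node_cache and the table name are hypothetical, it tracks visible nodes only, and the cache can briefly lag behind the real connection state.

```erlang
-module(node_cache).
-behaviour(gen_server).

-export([start_link/0, connected/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(TAB, node_cache_tab).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Read the cached node list from ETS instead of calling nodes().
connected() ->
    [Node || {Node} <- ets:tab2list(?TAB)].

init([]) ->
    ?TAB = ets:new(?TAB, [named_table, protected, set,
                          {read_concurrency, true}]),
    %% Subscribe first, then seed the table, so that a node coming up
    %% in between is not missed.
    ok = net_kernel:monitor_nodes(true),
    true = ets:insert(?TAB, [{Node} || Node <- nodes()]),
    {ok, no_state}.

handle_call(_Request, _From, State) ->
    {reply, {error, unsupported}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

handle_info({nodeup, Node}, State) ->
    true = ets:insert(?TAB, {Node}),
    {noreply, State};
handle_info({nodedown, Node}, State) ->
    true = ets:delete(?TAB, Node),
    {noreply, State};
handle_info(_Other, State) ->
    {noreply, State}.
```

With the server started under a supervisor, callers would use node_cache:connected() where they previously called nodes(), so nodes() itself runs only once at startup.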

Brian Picciano

Jun 12, 2013, 8:28:47 PM

to Lukas Larsson, Erlang Questions

Thanks for the heads up Lukas! Sorry I stopped responding, we ended up solving the problem (for now) by drastically cutting down on inter-node communication in another way, and this thread got lost in my inbox, but I really appreciate the follow-up!
