[erlang-questions] High lock contention on dist_tables

Brian Picciano

Apr 19, 2013, 3:24:12 PM
to erlang-q...@erlang.org
We have a pool of 3 Erlang nodes, all on different servers. Every afternoon, without fail, lots of messages between the nodes start having really high latency, on the order of tens of seconds. Today we ran lcnt on them to see if there's anything there, and found that on one of the nodes dist_table had a significantly higher lock percentage than anything else, and definitely higher than on the other boxes:

(node@address)8> lcnt:conflicts().
                               
                 lock     id   #tries  #collisions  collisions [%]  time [us]  duration [%]
                -----    ---  ------- ------------ --------------- ---------- -------------
           dist_table      1  3468191      1242055         35.8128  153712413      255.2521
            run_queue     24 76969638      4088578          5.3119   14468656       24.0264 
        process_table      1  2015686       147148          7.3001    3208529        5.3280 
          timer_wheel      1 12214948       834737          6.8337    3076638        5.1090 
            timeofday      1 18231600       594487          3.2608    1491633        2.4770 
...

while on the other boxes it was closer to 3. On the box with the high lock contention we also saw much higher load than on the other boxes.

My question is: what is this lock? We couldn't find much online except that it appears to have to do with communication between nodes, but we're not sure what. Also, what, if anything, could we do to mitigate this problem?

(We're running Erlang R16B.)

Lukas Larsson

Apr 22, 2013, 12:06:43 PM
to Brian Picciano, Erlang Questions
The dist_table mutex refers to the rwmutex which is defined here[1]. There are a number of different places where it is used, so saying exactly what is causing the contention is hard without knowing the code. Generally it should indicate that you are trying to send many messages over distribution while information about remote nodes is changing frequently.

One thing I noticed is that the nodes() BIF call takes a rwlock on the mutex. Are you using that BIF a lot?


Brian Picciano

Apr 22, 2013, 4:23:48 PM
to Lukas Larsson, Erlang Questions
We are, actually. Is there an alternative way of easily retrieving which nodes are currently connected?

Lukas Larsson

Apr 23, 2013, 3:30:32 AM
to Brian Picciano, Erlang Questions
I don't know of an alternative way. Before trying to come up with a solution I would verify that it is nodes() which is causing the contention.

If you do lcnt:inspect(dist_table,[{locations,true}]) you should get a list of the locations in the source code where the contention is actually happening.
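
Put together, the lcnt workflow discussed so far looks roughly like the shell session below. It is only a sketch: it assumes the emulator was built with lock counting enabled (the lcnt calls fail otherwise), and the 60-second measurement window is an arbitrary choice, not a value from this thread.

```erlang
%% Run in the Erlang shell of the node that shows the contention.
%% Assumption: this is a lock-counting emulator build.
lcnt:start().          % start the lcnt server process
lcnt:clear().          % reset the counters so only the window below is measured
timer:sleep(60000).    % let the normal workload run (60 s chosen arbitrarily)
lcnt:collect().        % fetch the lock statistics from the runtime system
lcnt:conflicts().      % per-lock-class summary, like the table earlier in the thread
lcnt:inspect(dist_table, [{locations, true}]).  % break dist_table down by source location
```

Clearing the counters just before the window keeps contention from node startup or earlier traffic out of the numbers.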

Scott Lystig Fritchie

Apr 23, 2013, 3:01:27 PM
to Brian Picciano, erlang-q...@erlang.org
Brian Picciano <mediocr...@gmail.com> wrote:

bp> We have a pool of 3 erlang nodes, all on different servers. Every
bp> afternoon, without fail, we start seeing lots of messages between
bp> the nodes start having really high latency, on the order of tens of
bp> seconds. [...]

Brian, it's probably worthwhile to continue chasing the 'lcnt' avenue
as you've been corresponding with Lukas...

... but at the same time, I also wonder about "tens of seconds". My gut
says that such delays would require some amazingly high lock contention
rates. Something that can cause such messaging delays much more easily
is network congestion/packet loss that triggers TCP slow start. Many
Linux kernels have the RTO_min value at one second, which is the amount
of time to wait before entering slow start state.

If network packet loss is a problem, this blog posting can explain one
reason why it's happening:
http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/

-Scott
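
One rough way to look at the network side from inside the VM is to read the socket counters of the distribution connections. The sketch below is an assumption-laden illustration, not something proposed in this thread: it assumes the default TCP distribution carrier (so each controlling entity is a port), and DistStats is a hypothetical name. A send_pend value that keeps growing for one peer would suggest the socket is not draining, which points more toward the network than toward lock contention, but it is not proof of packet loss.

```erlang
%% Run in the Erlang shell. DistStats is a hypothetical helper name.
%% erlang:system_info(dist_ctrl) lists {Node, ControllingEntity} per connected node;
%% entries that are not ports (non-TCP carriers) are skipped.
DistStats = fun() ->
    [{Node, inet:getstat(Ctrl, [recv_oct, send_oct, send_pend])}
     || {Node, Ctrl} <- erlang:system_info(dist_ctrl), is_port(Ctrl)]
end.

%% Each entry is {Node, {ok, Counters}} or {Node, {error, Reason}}.
DistStats().
```

Sampling this a few times during the slow period and comparing it with a quiet period gives more signal than a single reading.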

Lukas Larsson

May 16, 2013, 3:43:21 AM
to Brian Picciano, Erlang Questions
Hello Brian,

Just letting you know that I have just merged a fix which changes the rwlock I mentioned before to an rlock. This should reduce the contention which you are seeing if it was caused by many calls to erlang:nodes().

Lukas
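
On releases that predate this fix, one way to take nodes() out of a hot path is to keep a cached copy of the connected-node list and update it from net_kernel:monitor_nodes/1 notifications. The module below is a minimal sketch under that assumption and was not proposed in this thread; the module name node_cache, the table name node_cache_tab, and the function connected/0 are all hypothetical. Whether it helps depends on nodes() actually being the source of the contention, which lcnt:inspect should confirm first.

```erlang
-module(node_cache).
-behaviour(gen_server).

%% Hypothetical API: connected/0 stands in for frequent calls to nodes().
-export([start_link/0, connected/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

-define(TAB, node_cache_tab).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Read path: a plain ETS read in the caller's process, no nodes() call.
connected() ->
    [Node || {Node} <- ets:tab2list(?TAB)].

init([]) ->
    ?TAB = ets:new(?TAB, [named_table, protected, set,
                          {read_concurrency, true}]),
    %% Subscribe before seeding so a nodeup/nodedown between the two
    %% steps is still delivered as a message and applied afterwards.
    ok = net_kernel:monitor_nodes(true),
    true = ets:insert(?TAB, [{Node} || Node <- erlang:nodes()]),
    {ok, no_state}.

handle_info({nodeup, Node}, State) ->
    true = ets:insert(?TAB, {Node}),
    {noreply, State};
handle_info({nodedown, Node}, State) ->
    true = ets:delete(?TAB, Node),
    {noreply, State};
handle_info(_Other, State) ->
    {noreply, State}.

handle_call(_Request, _From, State) ->
    {reply, {error, unsupported}, State}.

handle_cast(_Msg, State) ->
    {noreply, State}.

terminate(_Reason, _State) ->
    ok.

code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```

Callers would then use node_cache:connected() where they previously called nodes(). The cached list can lag the real connection state by the time it takes the nodeup or nodedown message to be handled, and it moves the read cost onto an ETS table rather than removing it, so it is a trade-off and not a free win.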

Brian Picciano

Jun 12, 2013, 8:28:47 PM
to Lukas Larsson, Erlang Questions
Thanks for the heads-up, Lukas! Sorry I stopped responding. We ended up solving the problem (for now) by drastically cutting down on inter-node communication in another way, and this thread got lost in my inbox. I really appreciate the follow-up!