Periodic garbage_collect() in RabbitMQ

Tim Stewart

Dec 19, 2014, 4:48:07 PM
to rabbitm...@googlegroups.com
Hello rabbitmq-users,

During the course of tracking down some latency issues in my system, I
came across background_gc.erl and the fact that a garbage collection of
all waiting processes is performed every 60 seconds. I suspect that
many users never notice this periodic collection, but my system "abuses"
RabbitMQ a bit by opening quite a few connections (several thousand).
The background_gc:gc/0 function takes a nontrivial amount of time to run
on an idle instance of RabbitMQ in my system (i.e. system is up and
running but doing no work). Recon reveals the following:

(rabbit@host)1> recon_trace:calls({background_gc, gc,
                                   [{'_', [], [{return_trace}]}]},
                                  {10, 1000}).
1
16:39:04.115221 <0.260.0> background_gc:gc()
16:39:04.203086 <0.260.0> background_gc:gc/0 --> ok
(rabbit@host)1>

This particular example takes almost 88 milliseconds to run. My
analysis shows that one of my CPU cores is likely pegged during this time.
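
(As an aside, one quick way to confirm that from the node itself, using the
recon library already loaded above, is recon:scheduler_usage/1: it samples
scheduler wall-time over a window and returns [{SchedulerId, Usage}] with
Usage between 0.0 and 1.0, so a single scheduler near 1.0 while gc/0 runs
would match one pegged core. The 1000 ms window below is arbitrary.)

(rabbit@host)1> recon:scheduler_usage(1000).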

I have observed 2+ second message latency in our production environment
at the time this VM-wide garbage collection occurs. I suspect this code
exists to prevent some pathological Erlang memory management issues with
refcounted binaries, but I wanted to get the list's opinion of its purpose.

I'm going to perform some testing using OTP 17.4 with the periodic GC
disabled. I will report my results here.
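
(For anyone who wants to try the same thing on a running node without
patching: assuming the collector is the locally registered background_gc
gen_server, which whereis(background_gc) should confirm, suspending it with
sys:suspend/1 is a minimal way to switch the periodic GC off and back on.
A sketch, not an officially supported knob.)

%% While suspended the process only handles system messages, so its
%% periodic gc trigger just queues up instead of running.
(rabbit@host)1> sys:suspend(background_gc).
ok
%% ... run the latency tests ...
(rabbit@host)2> sys:resume(background_gc).
ok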

-TimS

--
Tim Stewart
t...@stoo.org

Michael Klishin

Dec 19, 2014, 4:50:56 PM
to Tim Stewart, rabbitm...@googlegroups.com
On 20 December 2014 at 00:48:07, Tim Stewart (t...@stoo.org) wrote:
> I'm going to perform some testing using OTP 17.4 with the periodic GC
> disabled. I will report my results here.

Thank you!
--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Simon MacMullen

Jan 5, 2015, 5:30:37 AM
to Tim Stewart, rabbitm...@googlegroups.com
On 19/12/2014 21:48, Tim Stewart wrote:
> This particular example takes almost 88 milliseconds to run. My
> analysis shows that one of my CPU cores is likely pegged during this time.
>
> I have observed 2+ second message latency in our production environment
> at the time this VM-wide garbage collection occurs.

Of course 88ms is not 2s. But there could be a connection.

So background_gc first of all gets a list of all processes, then tells
each waiting one to GC. The actual GCs should not block the whole server
since Erlang processes each have their own heap, but possibly getting
the list of processes blocks the server?
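
In other words the loop is roughly this (paraphrasing; not the exact code in
background_gc.erl):

%% Both the processes/0 walk (plus a process_info/2 status check per
%% process) and the per-process garbage_collect/1 calls run inside the one
%% background_gc process, so with several thousand connections this is a
%% sizeable single-process job.
[garbage_collect(P) || P <- processes(),
                       {status, waiting} =:= process_info(P, status)].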

Note also that we attempt to back off from doing this too frequently if
the GC takes too long:

http://hg.rabbitmq.com/rabbitmq-server/file/e9dcfbf6695a/src/background_gc.erl#l71
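
(The idea of that backoff, very roughly, and with invented names and numbers
rather than the ones in background_gc.erl: if the last run ate more than some
fraction of the current interval, stretch the interval; otherwise let it
drift back down towards the ideal 60 seconds.)

%% Illustrative sketch only; MaxRatio and the doubling/halving are made up.
NextInterval =
    fun(RunMs, CurrentMs, IdealMs, MaxRatio) ->
            case RunMs > MaxRatio * CurrentMs of
                true  -> CurrentMs * 2;                 %% too expensive, back off
                false -> max(IdealMs, CurrentMs div 2)  %% cheap, recover towards ideal
            end
    end,
NextInterval(88, 60000, 60000, 0.001).  %% 88 ms against a 60 ms budget -> 120000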

> I suspect this code
> exists to prevent some pathological Erlang memory management issues with
> refcounted binaries, but I wanted to get the list's opinion of its purpose.

Not exclusively about binaries, but we have seen cases where processes
manage to eat lots of memory while executing relatively few reductions
(large binaries certainly help here) - meaning they need to be prodded to GC.
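
(A cheap way to see that on a live node, sketched with the recon already in
use in this thread: pick the process currently holding the most refc binary
memory, then compare its binary references before and after a forced
collection. The choice of "top 1 by binary_memory" is just for illustration.)

%% {binary, Refs} lists the off-heap refc binaries the process references;
%% anything that disappears after garbage_collect/1 was only being kept
%% alive because the process had not collected yet.
[{Pid, _BinMem, _Info} | _] = recon:proc_count(binary_memory, 1),
{binary, Before} = erlang:process_info(Pid, binary),
erlang:garbage_collect(Pid),
{binary, After} = erlang:process_info(Pid, binary),
{length(Before), length(After)}.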

> I'm going to perform some testing using OTP 17.4 with the periodic GC
> disabled. I will report my results here.

Cool. It would also be useful if you can try just disabling the
garbage_collect/1 while keeping the list comprehension here:
http://hg.rabbitmq.com/rabbitmq-server/file/e9dcfbf6695a/src/background_gc.erl#l78
to see whether the garbage_collect/1 or processes/0 calls are to blame.
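
(If patching the broker is awkward, a rough version of the same split can be
done from a shell: time the processes/0 walk and status filter on its own,
then time only the garbage_collect/1 calls on the processes that walk found.
Not identical to gc/0, but it should show which half dominates the ~88 ms.)

{WalkUs, Waiting} =
    timer:tc(fun () ->
                     [P || P <- processes(),
                           {status, waiting} =:= process_info(P, status)]
             end),
{GcUs, _} = timer:tc(fun () -> [garbage_collect(P) || P <- Waiting] end),
{WalkUs, GcUs}.  %% microseconds spent walking vs. collecting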

Cheers, Simon