management stats db issues on 3.6.0


Noah

unread,
Jan 14, 2016, 11:12:07 AM1/14/16
to rabbitmq-users
Hi,

We had some unscheduled maintenance yesterday thanks to an RMQ 3.5.6 crash, and decided to upgrade to Erlang 18.2 and RMQ 3.6.0 on our 20-node cluster last night.  The upgrade was successful after many hours of reindexing.

Now, however, we notice some differences in the management plugin.  All queues show as idle, with consumer utilization at 0% (although we see consumers getting messages).  Also, the stats db backs up almost immediately to tens of millions of events.  We've restarted the management plugin using 

rabbitmqctl eval 'exit(erlang:whereis(rabbit_mgmt_db), please_terminate).'

... and it restarts on the same node, with the same behavior.
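
A quick way to gauge the backlog is to check the stats db process's mailbox length (a one-liner sketch, relying on the same registered process name as above):

rabbitmqctl eval 'erlang:process_info(erlang:whereis(rabbit_mgmt_db), message_queue_len).'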

This is our config on each node, any suggestions to reduce load on the management db?

[
  { kernel, [ { inet_dist_listen_min, 4096}, 
              { inet_dist_listen_max, 5120 },
              { net_tick_time, 120 } 
            ] 
  },
  { rabbit, [ { cluster_nodes, {[  ... ], disc }},
              { hipe_compile, true },
              { disk_free_limit, 20000000000 },
              { vm_memory_high_watermark_paging_ratio, 0.35},
              { vm_memory_high_watermark, 0.3 },
              { log_levels,  [{mirroring, info}, {connection, error}] },
              { cluster_partition_handling, pause_minority },
              { collect_statistics_interval, 15000 } 
            ]
  },
  {rabbitmq_management,       [ { rates_mode,    none } ] }
].

As an aside, we have tried to contact Pivotal for paid support, but never received a response.  Do they offer paid support for RabbitMQ, and if so, who should we contact?


Thanks,

-Noah

Michael Klishin

unread,
Jan 14, 2016, 12:55:16 PM1/14/16
to rabbitm...@googlegroups.com, Noah
On 14 January 2016 at 19:12:09, Noah (noah....@gmail.com) wrote:
> { collect_statistics_interval, 15000 }

Increase this value. And probably subscribe to https://github.com/rabbitmq/rabbitmq-management/issues/41
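
For example, in rabbitmq.config (60000 ms here is just an illustration; pick whatever matches how fine-grained your charts need to be):

{rabbit, [{collect_statistics_interval, 60000}]}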

> As an aside, we have tried to contact Pivotal for paid support,
> but never received a response. Do they offer paid support for
> RabbitMQ, and if so, who should we contact?

sa...@rabbitmq.com
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Michael Klishin

unread,
Jan 14, 2016, 12:57:24 PM1/14/16
to rabbitm...@googlegroups.com, Noah
On 14 January 2016 at 19:12:09, Noah (noah....@gmail.com) wrote:
> Also the stats db backs up almost immediately to tens of millions
> of events.

Note that this can be perfectly natural for a cluster that's just been brought up:
clients will re-connect and declare queues, etc. at a very high rate initially,
which greatly increases the number of events emitted.

This will be true for any stats emission interval, and the overall rate will go down
(the higher the collection interval, the lower it will be per unit of time). 

Noah

unread,
Jan 14, 2016, 1:18:55 PM1/14/16
to rabbitmq-users, noah....@gmail.com


On Thursday, January 14, 2016 at 9:57:24 AM UTC-8, Michael Klishin wrote:
On 14 January 2016 at 19:12:09, Noah (noah....@gmail.com) wrote:
> Also the stats db backs up almost immediately to tens of millions  
> of events.

Note that this can be perfectly natural for a cluster that's just been brought up:
clients will re-connect and declare queues, etc. at a very high rate initially,
which greatly increases the number of events emitted.

This state remains for several hours, long after all clients have connected; in fact, it does not abate until we discontinue publishing entirely.  It is not something we've seen in 3.5.6.  Queues are all durable; none are being declared.  We also never see consumer utilization return a value other than null.

We are also seeing very poor publish performance compared to 3.5.6 and are planning a downgrade to 3.5.7 later today as a result unless we can resolve these issues.

Michael Klishin

unread,
Jan 14, 2016, 1:50:01 PM1/14/16
to rabbitm...@googlegroups.com, Noah
On 14 January 2016 at 21:18:58, Noah (noah....@gmail.com) wrote:
> This state remains for several hours, long after all clients 
> have connected, in fact it does not abate until we discontinue
> publishing entirely. It is not something we've seen in 3.5.6.
> Queues are all durable, none are being declared.

Publishing does not put any stress on the management database.
Stats are emitted periodically or when certain things are created/declared
(e.g. a new connection or channel is opened, or a queue declared) or closed/deleted.

> We also never
> see consumer utilization return a value other than null.

This can be a legitimate issue, but not necessarily one in the management plugin.

Noah

unread,
Jan 14, 2016, 1:59:52 PM1/14/16
to rabbitmq-users, noah....@gmail.com
On Thursday, January 14, 2016 at 10:50:01 AM UTC-8, Michael Klishin wrote:
On 14 January 2016 at 21:18:58, Noah (noah....@gmail.com) wrote:
> This state remains for several hours, long after all clients 
> have connected, in fact it does not abate until we discontinue  
> publishing entirely. It is not something we've seen in 3.5.6.  
> Queues are all durable, none are being declared.

Publishing does not put any stress on the management database.
Stats are emitted periodically or when certain things are created/declared
(e.g. a new connection or channel is opened, or a queue declared) or closed/deleted.

That is good to know, thank you.  One more variable to eliminate.  The node with the management db, as it is selected at random, does of course host queues and receive connections from publishers, and its load increases along with the publish rate.  Is there anything else we can do to reduce the management db load in this scenario?

 

> We also never  
> see consumer utilization return a value other than null.

This can be a legitimate issue, but not necessarily one in the management plugin.

We also see this via the REST API, if that makes any difference.
 

Michael Klishin

unread,
Jan 14, 2016, 2:01:47 PM1/14/16
to rabbitm...@googlegroups.com, Noah
On 14 January 2016 at 21:59:55, Noah (noah....@gmail.com) wrote:
> Is there anything else we can do to reduce the management db load
> in this scenario?

Increase stats emission interval.

> This can be a legitimate issue but not necessarily in the management
> plugin.
>
> We also see this via the REST API, if that makes any difference.

Management UI is built on the REST API. 
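
For instance, the same field can be read straight from the API (a sketch; substitute real credentials and a real queue name for "my-queue"):

curl -s -u guest:guest 'http://localhost:15672/api/queues/%2F/my-queue'

and check the consumer_utilisation value in the JSON response.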

Michael Klishin

unread,
Jan 14, 2016, 2:11:40 PM1/14/16
to Noah Magram, rabbitm...@googlegroups.com
+rabbitmq-users 

On 14 January 2016 at 22:04:53, Noah Magram (noah....@gmail.com) wrote:
> Would you suggest a value, we have it currently set to 15000 (ms),
> we do chart this information as it relates to queue depth, but
> don't need particularly fine granularity.

60000? (1 minute)

> Also, is there any way to do this at runtime? If not, would this
> be a feature request that you would consider?

rabbitmqctl eval 'application:set_env(rabbit, collect_statistics_interval, 60000).'

It will not have any effect on existing connections, channels, and queues,
because the timers for those are already set up. But it will take effect for new
stats-emitting entities.
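
You can confirm the running value afterwards with:

rabbitmqctl eval 'application:get_env(rabbit, collect_statistics_interval).'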

Noah

unread,
Jan 14, 2016, 2:30:23 PM1/14/16
to rabbitmq-users, noah....@gmail.com
Thanks.

Does this also mean that durable queues will not pick up a new configuration value at server restart, until they are redeclared?

Michael Klishin

unread,
Jan 14, 2016, 2:37:35 PM1/14/16
to rabbitm...@googlegroups.com, Noah
On 14 January 2016 at 22:30:26, Noah (noah....@gmail.com) wrote:
> Does this also mean that durable queues will not pick up a new
> configuration value at server restart, until they are redeclared?

No need to redeclare them. Their timers will be set up with the new config value
on node start, when durable queues are recovered. 

Noah

unread,
Jan 15, 2016, 12:50:40 AM1/15/16
to rabbitmq-users, noah....@gmail.com


On Thursday, January 14, 2016 at 11:37:35 AM UTC-8, Michael Klishin wrote:
On 14 January 2016 at 22:30:26, Noah (noah....@gmail.com) wrote:
> Does this also mean that durable queues will not pick up a new  
> configuration value at server restart, until they are redeclared?  

No need to redeclare them. Their timers will be set up with the new config value
on node start, when durable queues are recovered. 

I would note that after a full and lengthy restart of our cluster, the problem persists: tens of millions of events in the backlog after publishing perhaps a million.

Our config follows:

[
  { kernel, [ { inet_dist_listen_min, 4096}, 
              { inet_dist_listen_max, 5120 },
              { net_tick_time, 120 } 
            ] 
  },
  { rabbit, [ { cluster_nodes, {[ ... ], disc }},
              { hipe_compile, true },
              { disk_free_limit, 20000000000 },
              { vm_memory_high_watermark_paging_ratio, 0.5},
              { vm_memory_high_watermark, 0.4 },
              { log_levels,  [{mirroring, info}, {connection, error}] },
              { cluster_partition_handling, pause_minority },
              { collect_statistics_interval, 120000 } 
            ]
  },
  {rabbitmq_management,       [ { rates_mode,    none } ] }
].

 

Noah

unread,
Jan 15, 2016, 1:51:33 AM1/15/16
to rabbitmq-users, noah....@gmail.com
I should also note that the stats db node constantly hit its high water mark, then eventually suffered an allocation failure and crashed.

-N

Michael Klishin

unread,
Jan 15, 2016, 7:27:32 AM1/15/16
to rabbitm...@googlegroups.com, Noah
 On 15 January 2016 at 09:51:37, Noah (noah....@gmail.com) wrote:
> I should also note that the stats db node constantly hit its high
> water mark then eventually suffered an allocation failure and
> crashed.

Currently the event collector creates a ton of garbage in a single process, continuously,
and if it falls behind, it also falls behind on removing old samples from the in-memory
data store it uses.
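
If you want to see how far behind it is, the collector's mailbox and heap can be inspected at runtime (a diagnostic sketch only):

rabbitmqctl eval 'erlang:process_info(erlang:whereis(rabbit_mgmt_db), [memory, message_queue_len]).'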

Noah Magram

unread,
Jan 15, 2016, 1:32:06 PM1/15/16
to Michael Klishin, rabbitm...@googlegroups.com
We are downgrading to 3.5.7 and Erlang 17.5 today, after two full days of outages, with today making the third.

I would like to reiterate that our usage was no different from previous versions, with the exception of increasing the collect_statistics_interval configuration value.  We even went so far as to move all queue processes off the node with the management console, with the same outcome.

The only thing I can think of is that this may be somehow related to the exponential backoff GC changes in RMQ, or other changes in Erlang 18.2, that get exercised due to the high amount of garbage generated by the event collector?

Michael Klishin

unread,
Jan 15, 2016, 1:35:59 PM1/15/16
to Noah Magram, rabbitm...@googlegroups.com
On 15 January 2016 at 21:32:01, Noah Magram (noah....@gmail.com) wrote:
> The only thing I can think of is that this may be somehow related
> to the exponential backoff GC changes on rmq or other changes
> in Erlang 18.2 that got exercised due to the high amount of garbage
> generated by the events collector?

The event collector generates so much garbage that it has to be GC'ed very often.

And the background GC interval has been *decreased* in 3.6.0 (the interval can no longer
go above a certain number of minutes). 

We have identified an issue that slowed down queries and fixed it in https://github.com/rabbitmq/rabbitmq-management/pull/101,
so hopefully you'll have a build with that to try in the next week or so.

Noah Magram

unread,
Jan 15, 2016, 1:43:43 PM1/15/16
to Michael Klishin, rabbitm...@googlegroups.com
Perhaps, as an unintended consequence of the increased GC frequency, the event collector is being prevented from doing the work needed to get through the events backlog?  We've seen similar behavior in other (JVM-based) applications with stop-the-world GC, where throughput diminishes with increased GC frequency to the point where it becomes impossible to catch up with the workload.

I read through some of the issues related to the stats db performance, and unfortunately it seems like a difficult problem.  As it stands today, it is a significant bottleneck for scaling a cluster.  In the interim, it would be nice to be able to separate the management API from the stats collection, so you could at least keep the REST API up if your load caused the stats collector to become non-functional.


Thanks,

-N

Michael Klishin

unread,
Jan 15, 2016, 1:46:41 PM1/15/16
to Noah Magram, rabbitm...@googlegroups.com
On 15 January 2016 at 21:43:41, Noah Magram (noah....@gmail.com) wrote:
> Perhaps as an unintended consequence of the increased GC frequency,
> it is preventing the events collector from doing the work needed
> to get through the events backlog?

Part of the issue is that the event collector process does too much, including
ETS table cleanup and serving read queries. So when GC kicks in, it
affects all of those things.

This is why we are moving to separate "ETS GC" processes, with one process per table, and
looking into separating reads from event collection. And so on.

> We've seen similar behavior
> in other (JVM based) applications with stop-the-world GC, where
> throughput diminishes with increased GC frequency to the point
> where it becomes impossible to catch up with the workload.

In Erlang, GC is per-process because processes don't share heaps. Plus, background
GC runs once every 10, 30, etc. seconds, which is far too infrequent to affect anything. 
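
As an illustration of the per-process model (not a suggested remedy), any single process can be collected on demand:

rabbitmqctl eval 'erlang:garbage_collect(erlang:whereis(rabbit_mgmt_db)).'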

Noah Magram

unread,
Jan 15, 2016, 1:53:39 PM1/15/16
to Michael Klishin, rabbitm...@googlegroups.com
On Fri, Jan 15, 2016 at 10:46 AM, Michael Klishin <mkli...@pivotal.io> wrote:

Part of the issue is that the event collector process does too much, including
ETS table cleanup and serving read queries. So when GC kicks in, it
affects all of those things.

This is why we are moving to separate "ETS GC" processes, with one process per table, and
looking into separating reads from event collection. And so on.

Certainly sacrificing read performance to handle writes/housekeeping would be preferable in cases of high load.  The performance improvement from the lazy queues was quite substantial for our workload, and we really hoped to get 3.6.0 working.  We went from 25k/s with 3.5.6 to 50k/s with 3.6.0 and lazy queues before we started blocking due to hitting the high water mark on the node hosting the management server.  Unfortunately, we were unable to mitigate this issue, and eagerly look forward to a fix.  



In Erlang, GC is per-process because processes don't share heaps. Plus, background
GC runs once every 10, 30, etc. seconds, which is far too infrequent to affect anything. 

Ah yes of course, no shared heap :)


Thanks,

-N

Michael Klishin

unread,
Jan 15, 2016, 1:55:26 PM1/15/16
to Noah Magram, rabbitm...@googlegroups.com
On 15 January 2016 at 21:53:37, Noah Magram (noah....@gmail.com) wrote:
> Certainly sacrificing read performance to handle writes/housekeeping
> would be preferable in cases of high load. The performance improvement
> from the lazy queues was quite substantial for our workload,
> and we really hoped to get 3.6.0 working. We went from 25k/s with
> 3.5.6 to 50k/s with 3.6.0 and lazy queues before we started blocking
> due to hitting the high water mark on the node hosting the management
> server. Unfortunately, we were unable to mitigate this issue,
> and eagerly look forward to a fix.

Thanks for the data point, that’s good to hear.

Someone is spending most of her time on parallelising the event collector. You’ll be
the first to know outside of Pivotal when that work is merged. 

Noah Magram

unread,
Jan 15, 2016, 2:01:40 PM1/15/16
to Michael Klishin, rabbitm...@googlegroups.com
On Fri, Jan 15, 2016 at 10:55 AM, Michael Klishin <mkli...@pivotal.io> wrote:

Someone is spending most of her time on parallelising the event collector. You’ll be
the first to know outside of Pivotal when that work is merged. 


Excellent, thank you!


-N

Noah

unread,
Jan 15, 2016, 6:15:30 PM1/15/16
to rabbitmq-users
Can confirm this issue does not occur with 3.5.7 and Erlang 17.5 in our environment.

-N

Michael Denny

unread,
Mar 2, 2016, 12:50:12 PM3/2/16
to rabbitmq-users
Sorry to kick in, but I would like to mention that a couple of days ago we also upgraded, from RMQ 3.5.2 (Erlang 17.5) to 3.6.0 (Erlang 18.2.1), and we are experiencing a problem very similar to Noah's. With 3.5.2 we never saw the statistics database backlogging events; just after the upgrade we started seeing, for the first time, the yellow banner (stats db currently has a queue of xxxx events to process......), even though CPU is averaging 40/50%, RAM 33%, and there has been no corresponding increase in usage.

We don't see the banner constantly, and for us it is not a blocking problem so far. If we just look at the Overview UI, no banner is shown; if we also open the Queues page, we start to see the warning. This also happens more frequently when we use the management UI concurrently from multiple users. We use "basic" stats, as was set before in 3.5.2. Here's a video of the behavior:


I have some questions to understand the statistics behavior:

1. How is the statistics database updated, and when are stats emitted?
2. Does it emit some kind of statistics only when you open specific pages?
3. How does it scale with multiple users / multiple page requests to the management UI? Does every user/page have its own emitted statistics? (It seems that the more Queues pages are open, the more statistics events are backlogged, even if I set the page to show at most 100 elements.)


I've subscribed to https://github.com/rabbitmq/rabbitmq-management/issues/41 and https://github.com/rabbitmq/rabbitmq-management/pull/101.

Thank you in advance,
Michael.

Michael Klishin

unread,
Mar 2, 2016, 2:59:57 PM3/2/16
to rabbitm...@googlegroups.com, Michael Denny
 On 2 March 2016 at 20:50:15, Michael Denny (micd...@gmail.com) wrote:
> 1. How is the statistics database updated, and when are stats emitted?
> 2. Does it emit some kind of statistics only when you open specific
> pages?
> 3. How does it scale with multiple users / multiple page requests
> to the management UI? Does every user/page have its own emitted statistics?
> (It seems that the more Queues pages are open, the more statistics events
> are backlogged, even if I set the page to show at most 100 elements.)

The statistics database collects data emitted by nodes, connections, channels, and queues.
The emission happens regardless of what UI page you have opened, if any. The HTTP API
simply queries the database.
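
For example, the data behind the Overview page can be fetched directly (default credentials and port assumed):

curl -s -u guest:guest http://localhost:15672/api/overview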

We have a new collector ready to be merged, and we will publish it in 3.6.2 Milestone 1 later this week.

Michael Denny

unread,
Mar 2, 2016, 3:42:24 PM3/2/16
to Michael Klishin, rabbitm...@googlegroups.com
Thank you! I realized after sending the message that a lot of info about the statistics database and emission is written in the code comments themselves and in issues #41 and #101.

M.
--
Michael Denny
Software Architect, Trainer and Consultant
