Hello.
We are running various experiments with the latest RabbitMQ and
collecting metrics to see how it behaves over time.
One of the tests suggests that there is some steady growth in the
reported memory usage.
(Or it may be that I completely misunderstand how RabbitMQ/Erlang
uses memory.)
The load-test setup:
* a single RabbitMQ 3.6.12 broker (Erlang 19) on an AWS EC2 node
* several load-generating nodes establishing persistent AMQP-over-TLS
connections
* a single consumer that gets messages from the broker and basically
discards them (a rough sketch of such a consumer is below)
Together, the load generators keep 3600 connections established, and
the combined publish rate is about 400 messages per second.
This is well below what a node (a c4.2xlarge, I believe) can handle.
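(For completeness: the consumer is essentially a no-op callback. Below
is a rough sketch of what it does, assuming a Python client (pika);
the host, credentials and queue name are placeholders, not the actual
test values.)

    import pika

    # Placeholder connection details; the real test publishes over TLS.
    params = pika.ConnectionParameters(
        host="broker.example.com",
        credentials=pika.PlainCredentials("guest", "guest"),
    )
    connection = pika.BlockingConnection(params)
    channel = connection.channel()

    def discard(ch, method, properties, body):
        # Intentionally do nothing with the message body.
        pass

    # auto_ack=True so the broker never accumulates unacked messages.
    channel.basic_consume(queue="load-test",
                          on_message_callback=discard,
                          auto_ack=True)
    channel.start_consuming()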
The test has been running for more than 24 hours now with all "input"
parameters held steady.
(All the charts below should have the same time scale, going from
Friday 15, 16:00 to Saturday 16, 23:00.)
[chart: overview object_totals]
[chart: overview message_details]
Now, the observations that caused concern (and prompted this email):
1. The mem_used metric keeps growing even after all these hours - the
growth is not large yet, but the trend is clearly visible.
It is still under mem_limit, so I am not sure whether this is
expected behaviour or what is going to happen when the limit is
reached.
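(For context, we track those two values by polling the management API,
roughly as below; the host, node name and credentials are placeholders.)

    import time
    import requests

    URL = "http://broker.example.com:15672/api/nodes/rabbit@broker"
    AUTH = ("guest", "guest")

    while True:
        node = requests.get(URL, auth=AUTH).json()
        used, limit = node["mem_used"], node["mem_limit"]
        print("mem_used=%.0f MiB (%.1f%% of mem_limit)"
              % (used / 2**20, 100.0 * used / limit))
        time.sleep(60)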
2. In an attempt to explain that growth, we collected memory
breakdown metrics (by passing ?memory=true to the /node/xxx REST
API).
There is a slight (barely visible) upward trend in memory.binary,
and a more clearly visible trend in memory.other_system.
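(The collection itself is nothing fancy - roughly the following, with
placeholder host/node/credentials, and assuming the breakdown comes
back under a "memory" key:)

    import requests

    url = ("http://broker.example.com:15672"
           "/api/nodes/rabbit@broker?memory=true")
    node = requests.get(url, auth=("guest", "guest")).json()

    breakdown = node["memory"]
    # The two categories showing an upward trend on our charts:
    for key in ("binary", "other_system"):
        print(key, breakdown.get(key))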
3. We added binary=true to the /node/xxx REST API call to get more
insight into that binary figure (the exact query is sketched after
point 4 below).
binary.connection_readers grew rather rapidly at the very beginning,
then it hit 100 MB and dropped back down, where it stayed flat for
the rest of the test. I am not sure what that initial quick growth
really means.
binary.connection_channels kept growing from the start of the test
until about 140 MB, then fell, and is now growing again (even beyond
140 MB).
To be honest, I know very little about how Erlang works. Maybe some
sort of garbage collection only kicks in at a certain point, and all
these charts represent expected behaviour.
It just feels a bit strange that a process keeps growing in memory
(even slowly) when there aren't any messages queueing up, etc.
Also, there is very little info on what memory.other_system really
means.
The doc ( https://www.rabbitmq.com/memory-use.html ) says:
    "Other system memory: Other memory used by Erlang. One contributor
    to this value is the number of available file descriptors."
but that does not really explain much.
4. Other observations: the sum of the binary.* values is not the same
as memory.binary - is that expected? I could not find an explanation
of what these metrics really mean.
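(For reference, this is the comparison we do - again with placeholder
host/node/credentials, and assuming the per-category figures come back
under a "binary" key when binary=true is passed:)

    import requests

    url = ("http://broker.example.com:15672"
           "/api/nodes/rabbit@broker?memory=true&binary=true")
    node = requests.get(url, auth=("guest", "guest")).json()

    binary_breakdown = node["binary"]   # connection_readers, connection_channels, ...
    memory_breakdown = node["memory"]

    # Skip any "total"-style key so only individual categories are summed.
    total_binary = sum(v for k, v in binary_breakdown.items()
                       if k != "total" and isinstance(v, int))
    print("sum(binary.*) =", total_binary)
    print("memory.binary =", memory_breakdown["binary"])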
If someone could shed some light on what I am observing in these
charts, that would help a lot.
If there are other metrics that should be collected from the [still
running] RabbitMQ process, I can do that.
Many thanks
Dmitry