RMQ reports incorrect process memory when vm_memory_calculation_strategy is unset


John Baublitz

Oct 17, 2017, 10:24:21 AM
to rabbitmq-users
Hi all,
We have run into a problem with rabbitmq-server/rabbitmq-common 3.6.12 on Erlang 19.3.6, running on Ubuntu 16.04. Under heavy load, RMQ becomes unusable: it stops accepting and ACKing published messages, and throughput drops to close to zero. This started in our testing environment after an upgrade from Ubuntu 14.04, Erlang R16B03, and RMQ 3.5.3; with those versions we do not encounter the issue under the same config. After determining that new tunable options were added between our previous version and this one, we investigated the vm_* options. The only default that appeared to have changed was vm_memory_calculation_strategy, so after ruling out the others we focused on it. We found that even though rabbitmqctl reports the calculation strategy as rss by default when this tunable is unset, the memory usage reported by RMQ's internal functions is actually consistent with a value of erlang. This matches the behavior we see, which is the following:

1. RMQ shows the above failure behavior when vm_memory_calculation_strategy is unset but reports a value of rss.
2. RMQ behaves properly when vm_memory_calculation_strategy is set to rss and reports rss.
3. RMQ shows the above failure behavior when vm_memory_calculation_strategy is set to erlang and reports a value of erlang.

To reproduce this behavior, follow these steps (a consolidated sketch of the commands follows the list):

1. Unset rabbit.vm_memory_calculation_strategy by removing it from any config file that sets it; the strategy should then default to rss, and rabbitmqctl status reports it as such.
2. Run rabbitmqctl eval "erlang:memory(total)."
3. Run rabbitmqctl eval "rabbitmqctl eval "os:cmd(\"ps -p \" ++ os:getpid() ++ \" -o rss=\")." and multiply the result by 1024.
4. Run rabbitmqctl eval "vm_memory_monitor:get_process_memory()."
5. The results of steps 2 and 4 should match, even though rabbitmqctl reports rss, which would imply that the results of steps 3 and 4 match instead.
6. Repeat the steps with rabbit.vm_memory_calculation_strategy explicitly set to rss. Now the results of steps 3 and 4 match, and the reported strategy does not change.
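
For convenience, here are the three commands in one place; what matters is the relationship between the results, not any particular byte counts:

rabbitmqctl eval "erlang:memory(total)."
    (Erlang-reported total, in bytes)
rabbitmqctl eval "os:cmd(\"ps -p \" ++ os:getpid() ++ \" -o rss=\")."
    (kernel RSS in KB; multiply by 1024 for bytes)
rabbitmqctl eval "vm_memory_monitor:get_process_memory()."
    (with the strategy unset, this matches erlang:memory(total) rather than RSS)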

My hypothesis, based on the values being returned, is that RMQ is for some reason falling back on erlang reporting when vm_memory_calculation_strategy is unset. I haven't had time to investigate why, but there appears to be code in v3.6.12 that falls back on Erlang memory reporting if the output of ps is improperly formatted (see get_ps_memory()).
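
To illustrate the kind of fallback I mean, here is a hypothetical sketch, not the actual rabbitmq-common source:

%% Hypothetical sketch of a ps-based reader with a silent fallback.
%% If parsing the ps output fails, the erlang:memory/1 figure is
%% returned, which would make an rss node behave exactly like an
%% erlang one while still reporting its strategy as rss.
get_process_memory_sketch() ->
    Cmd = "ps -p " ++ os:getpid() ++ " -o rss=",
    case string:to_integer(string:strip(os:cmd(Cmd))) of
        {KiloBytes, _Rest} when is_integer(KiloBytes) ->
            KiloBytes * 1024;        %% RSS in bytes
        _Other ->
            erlang:memory(total)     %% silent fallback to the 'erlang' figure
    end.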

While we have found a workaround by setting vm_memory_calculation_strategy to rss explicitly, this seems like a critical issue if RMQ is actually using statistics consistent with erlang memory reporting while simultaneously claiming to use rss. Systems with this value unset may thrash under heavy load, since memory is apparently underreported by roughly 20% with erlang reporting. I'm happy to discuss an alternate solution, but the current state of things seems likely to cause confusion when something does go wrong in the above situation. What are your thoughts?
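
For reference, the workaround we applied, in the classic Erlang-term config format (the path to rabbitmq.config varies by install):

%% /etc/rabbitmq/rabbitmq.config
[
  {rabbit, [
    {vm_memory_calculation_strategy, rss}
  ]}
].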

Thanks,
John

Michael Klishin

Oct 17, 2017, 10:39:48 AM
to rabbitm...@googlegroups.com
When a value isn't set, there is a default. Here it is in 3.6.12:

Here it is in stable (future 3.6.13):

The "allocated" strategy is as precise as the RSS one but does not call out to any external tools:

3.6.12 will fall back to the old strategy if all else fails:

I tried reproducing steps 2 through 4 on an idle node running 64-bit Erlang 19.3.6 with the following config:

Here's a transcript:

rabbitmqctl eval "erlang:memory(total)."
67986848
rabbitmqctl eval 'vm_memory_monitor:get_process_memory().'
79544320

As part of https://github.com/rabbitmq/rabbitmq-common/pull/227, Team RabbitMQ has conducted extensive testing
with various amounts of OS memory (all the way up to 60+ GB) and produced multiple charts that demonstrate
how erlang:memory(total), the RSS reported by the kernel, and the new allocator metric strategy relate to one another.

A multi-hour test produces a chart demonstrating that the new strategy reports values very close to what the kernel reports.

We currently have a little SNAFU in our CI pipeline, but as soon as that is resolved I will upload a snapshot
3.6.13 build for everyone to try.





--
MK

Staff Software Engineer, Pivotal/RabbitMQ

Luke Bakken

Oct 17, 2017, 1:17:24 PM
to rabbitmq-users
Hi John,

This is an interesting report, thanks. Unfortunately, I can't reproduce what you describe. Here's what I did -

* Set up a fresh Ubuntu 16 VM using vagrant

* Install Erlang 19.3.6 and RabbitMQ 3.6.12

* Clone the https://github.com/massemanet/eper.git repository and build. This is to have access to the redbug tracing library.

* Connect to the running RabbitMQ node via remsh, add eper to the code path, and run a trace on vm_memory_monitor:get_ps_memory()
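
The trace itself was along these lines (an illustrative redbug call; the exact options may differ from what is in the transcript):

%% from the remsh, with eper's ebin directory on the code path
%% ("/path/to/eper/ebin" is a placeholder):
code:add_patha("/path/to/eper/ebin").
redbug:start("vm_memory_monitor:get_ps_memory -> return", [{time, 60000}, {msgs, 10}]).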

In the attached file, you can see a transcript of the output from the above commands showing that the output of ps is what is being used to collect memory information.

Did you upgrade your Ubuntu 14 environment via apt or build a new one from scratch?

Thanks,
Luke
ubuntu-16-vm_memory_monitor-trace.txt

Michael Klishin

Oct 17, 2017, 1:46:22 PM
to rabbitm...@googlegroups.com
Also, can `rabbitmqctl environment` and `rabbitmqctl status` output be provided?
They will show the effective configuration and the RabbitMQ and Erlang versions in use.


Michael Klishin

Oct 18, 2017, 1:16:15 PM
to rabbitm...@googlegroups.com
A snapshot 3.6.13 build from earlier today can be obtained from https://dl.bintray.com/rabbitmq/all-dev/rabbitmq-server/3.6.13-alpha.28/.

--
MK

Staff Software Engineer, Pivotal/RabbitMQ

gl...@pivotal.io

Nov 6, 2017, 7:44:10 AM
to rabbitmq-users
Hi John,

I would be interested to know if you can reproduce this in 3.6.13, because we can't. If there is still an issue, please let me know what PerfTest flags [1] would re-create your RabbitMQ workload; an illustrative invocation follows. The RabbitMQ use cases that we've made public might help [2].
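
For example, something along these lines (illustrative flags only; substitute your actual message sizes, rates, and queue names):

# 2 producers, 4 consumers, 4 kB messages, auto-ack, a named queue
bin/runjava com.rabbitmq.perf.PerfTest -x 2 -y 4 -s 4096 -a -u perf-test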

The description of your issue sounds awfully similar to a queue paging to disk [3]. If that's the case, you will observe frequent disk write bursts in 3.6.13 vs a long period of no queue activity followed by a single, large disk write spike.

Thank you, Gerhard.
