Hi all,
We have run into a problem with rabbitmq-server/rabbitmq-common 3.6.12 on Erlang 19.3.6 running on Ubuntu 16.04. Under heavy load, RMQ becomes unusable: it stops accepting and ACKing published messages, and throughput drops to nearly zero. This started in our testing environment after an upgrade from Ubuntu 14.04, Erlang R16B03 and RMQ 3.5.3. With those versions and the same config we do not see the issue, so we seem to have isolated it to the new version.
After determining that new tunable options had been added between our previous version and this one, we investigated the vm_* options. The only default that appeared to have changed was vm_memory_calculation_strategy, so after ruling out the others we focused on that one. We found that although rabbitmqctl reports the calculation strategy as rss by default when this tunable is unset, the memory usage returned by RMQ's internal functions is consistent with a value of erlang. This matches the behavior we see, which is the following:
1. RMQ shows the above failure behavior when vm_memory_calculation_strategy is unset but reports a value of rss.
2. RMQ behaves properly when vm_memory_calculation_strategy is set to rss and reports rss.
3. RMQ shows the above failure behavior when vm_memory_calculation_strategy is set to erlang and reports a value of erlang.
To reproduce this behavior, follow these steps:
1. Unset rabbit.vm_memory_calculation_strategy by removing it from any config files that set it. It should then default to rss, and rabbitmqctl status reports that it does.
2. Run rabbitmqctl eval "erlang:memory(total)."
3. Run rabbitmqctl eval "os:cmd(\"ps -p \" ++ os:getpid() ++ \" -o rss=\")." and multiply the result by 1024 (ps reports rss in kilobytes).
4. Run rabbitmqctl eval "vm_memory_monitor:get_process_memory()."
5. The results of steps 2 and 4 match, even though rabbitmqctl reports rss, which would mean the results of steps 3 and 4 should match instead.
6. Repeat the steps with rabbit.vm_memory_calculation_strategy explicitly set to rss. Now the results of steps 3 and 4 should match, and the reported strategy should still be rss.
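To make the comparison in steps 2 to 4 less error-prone, here is a minimal Python sketch that takes the three numbers and says which reading the monitor value agrees with. The function name, the 5% tolerance and the sample numbers are my own illustrative assumptions, not anything from RabbitMQ. The tolerance is there because the three commands are run at slightly different moments, so the values will never match exactly.

```python
def matching_strategy(erlang_total, ps_rss_kib, monitor_bytes, tolerance=0.05):
    """Say which reading the vm_memory_monitor value is consistent with.

    erlang_total  -- result of erlang:memory(total), in bytes (step 2)
    ps_rss_kib    -- raw rss value printed by ps, in KiB (step 3)
    monitor_bytes -- result of vm_memory_monitor:get_process_memory() (step 4)
    tolerance     -- hypothetical relative tolerance; adjust to taste
    """
    rss_bytes = ps_rss_kib * 1024  # ps reports rss in KiB

    def close(a, b):
        # Relative comparison, since the readings are sampled at different times.
        return abs(a - b) <= tolerance * max(a, b)

    matches_erlang = close(monitor_bytes, erlang_total)
    matches_rss = close(monitor_bytes, rss_bytes)
    if matches_erlang and not matches_rss:
        return "erlang"
    if matches_rss and not matches_erlang:
        return "rss"
    return "inconclusive"  # both or neither matched


if __name__ == "__main__":
    # Made-up numbers for illustration only, not measurements from our system.
    print(matching_strategy(800_000_000, 976_563, 801_000_000))
```

If this returns "erlang" while rabbitmqctl status reports rss, you are seeing the mismatch described above. An "inconclusive" result means the two readings are too close together (or both too far from the monitor value) to tell apart at the chosen tolerance.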
My hypothesis, based on the values being returned, is that for some reason RMQ falls back on erlang reporting when vm_memory_calculation_strategy is unset. I haven’t had time to investigate why yet, but there seems to be some code in v3.6.12 that falls back on Erlang memory reporting if the output of ps is improperly formatted (see get_ps_memory()).
While we have found a workaround by setting vm_memory_calculation_strategy to rss, this seems to be a critical issue if RMQ is actually using statistics consistent with erlang memory reporting while simultaneously reporting that it is using rss. Systems with this value unset could thrash under heavy load, since memory appears to be underreported by approximately 20% with erlang reporting. I’m happy to discuss an alternate solution, but the current state of things seems likely to cause confusion when things do go wrong in the above situation. What are your thoughts on all of this?
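To put the 20% figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes the memory alarm fires when the reported value reaches vm_memory_high_watermark times total RAM, that the watermark is at its default of 0.4, and that the reported value is a flat 80% of the real rss. The flat ratio is a simplification of what we observed, and the 16 GiB host is a made-up example.

```python
def rss_at_alarm(total_ram_bytes, watermark=0.4, underreport=0.20):
    """Estimate the real rss at the moment the memory alarm would fire.

    Assumes reported = (1 - underreport) * rss, which is a simplification:
    the alarm fires when reported >= watermark * total RAM.
    """
    limit = total_ram_bytes * watermark
    return limit / (1.0 - underreport)


if __name__ == "__main__":
    gib = 2 ** 30
    total = 16 * gib  # hypothetical host size
    print("configured limit: %.1f GiB" % (total * 0.4 / gib))
    print("rss when alarm fires: %.1f GiB" % (rss_at_alarm(total) / gib))
```

Under these assumptions the process would be at roughly 8 GiB of real memory before an alarm configured for 6.4 GiB kicks in, i.e. about 25% over the intended limit.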
Thanks,
John