At the most busy time of the day, we have a publish count of around 1300 m/s, with all other numbers (connections, channels, consumers etc.) staying the same.
Last sunday, we had a vm_memory_high_watermark set, so all publishers were blocked from publishing to the cluster.
The memory went up like crazy (from 0.5GB to over 9GB within 10 minutes), but there were not any spikes in other numbers like channels, number of connections, message rate etc..
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
=INFO REPORT==== 3-Jun-2018::16:53:13 ===
=WARNING REPORT==== 3-Jun-2018::17:14:09 ===
client unexpectedly closed TCP connection
=ERROR REPORT==== 3-Jun-2018::17:15:31 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.25283.5808> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
(... About 10 error reports as the one above)
=ERROR REPORT==== 3-Jun-2018::17:19:31 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.20964.5822> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)=INFO REPORT==== 3-Jun-2018::17:19:38 ===
vm_memory_high_watermark set. Memory used:
7125663744 allowed:6871733043
=WARNING REPORT==== 3-Jun-2018::17:19:38 ===
memory resource limit alarm set on node rabbit@osmrabbit03.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
=INFO REPORT==== 3-Jun-2018::17:19:39 ===
vm_memory_high_watermark clear. Memory used:3993567232 allowed:6871733043
=WARNING REPORT==== 3-Jun-2018::17:19:39 ===
memory resource limit alarm cleared on node rabbit@osmrabbit03
=WARNING REPORT==== 3-Jun-2018::17:19:39 ===
memory resource limit alarm cleared across the cluster
=ERROR REPORT==== 3-Jun-2018::17:20:01 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.14631.5824> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
=ERROR REPORT==== 3-Jun-2018::17:20:24 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.22413.5825> exit with reason: [{reason,{timeout,{gen_server,call,[<0.567.0>,{fetch,#Fun<rabbit_mgmt_db.21.104142656>,[]},60000]}}},{mfa,{rabbit_mgmt_wm_connections,to_json,2}},{stacktrace,[{gen_server,call,3,[{file,"gen_server.erl"},{line,214}]},{rabbit_mgmt_db,submit_cached,2,[{file,"src/rabbit_mgmt_db.erl"},{line,707}]},{rabbit_mgmt_wm_connections,augmented,2,[{file,"src/rabbit_mgmt_wm_connections.erl"},{line,55}]},{rabbit_mgmt_wm_connections,to_json,2,[{file,"src/rabbit_mgmt_wm_connections.erl"},{line,43}]},{cowboy_rest,call,3,[{file,"src/cowboy_rest.erl"},{line,976}]},{cowboy_rest,set_resp_body,2,[{file,"src/cowboy_rest.erl"},{line,858}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]},{req,[{socket,#Port<0.2067080>},{transport,ranch_tcp},{connection,close},{pid,<0.22413.5825>},{method,<<"GET">>},{version,'HTTP/1.1'},{peer,{{10,10,0,252},38085}},{host,<<"10.10.0.123">>},{host_info,undefined},{port,15672},{path,<<"/api/connections">>},{path_info,undefined},{qs,<<>>},{qs_vals,[]},{bindings,[]},{headers,[{<<"te">>,<<"deflate,gzip;q=0.3">>},{<<"connection">>,<<"TE, close">>},{<<"authorization">>,<<"Basic bW9uaXRvcmluZzpQN0tLTVo2V2tnSHpkUTF2">>},{<<"host">>,<<"
10.10.0.123:15672">>},{<<"user-agent">>,<<"check_rabbitmq_connections libwww-perl/6.08">>}]},{p_headers,[{<<"if-modified-since">>,undefined},{<<"if-none-match">>,undefined},{<<"if-unmodified-since">>,undefined},{<<"if-match">>,undefined},{<<"accept">>,undefined},{<<"connection">>,[<<"te">>,<<"close">>]}]},{cookies,undefined},{meta,[{media_type,{<<"application">>,<<"json">>,[]}},{charset,undefined}]},{body_state,waiting},{buffer,<<>>},{multipart,undefined},{resp_compress,true},{resp_state,waiting},{resp_headers,[{<<"vary">>,[<<"accept">>,[<<", ">>,<<"accept-encoding">>],[<<", ">>,<<"origin">>]]},{<<"content-type">>,[<<"application">>,<<"/">>,<<"json">>,<<>>]},{<<"vary">>,<<"origin">>}]},{resp_body,<<>>},{onresponse,#Fun<rabbit_cowboy_middleware.onresponse.4>}]},{state,{context,{user,<<"monitoring">>,[monitoring],[{rabbit_auth_backend_internal,none}]},<<"P7KKMZ6WkgHzdQ1v">>,undefined}}]
=ERROR REPORT==== 3-Jun-2018::17:20:31 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.2776.5826> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
=ERROR REPORT==== 3-Jun-2018::17:21:01 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.27697.5827> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
=ERROR REPORT==== 3-Jun-2018::17:21:31 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.15965.5829> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
=ERROR REPORT==== 3-Jun-2018::17:21:36 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.24446.5829> exit with reason: [{reason,{timeout,{gen_server,call,[<0.567.0>,{fetch,#Fun<rabbit_mgmt_db.21.104142656>,[]},60000]}}},{mfa,{rabbit_mgmt_wm_connections,to_json,2}},{stacktrace,[{gen_server,call,3,[{file,"gen_server.erl"},{line,214}]},{rabbit_mgmt_db,submit_cached,2,[{file,"src/rabbit_mgmt_db.erl"},{line,707}]},{rabbit_mgmt_wm_connections,augmented,2,[{file,"src/rabbit_mgmt_wm_connections.erl"},{line,55}]},{rabbit_mgmt_wm_connections,to_json,2,[{file,"src/rabbit_mgmt_wm_connections.erl"},{line,43}]},{cowboy_rest,call,3,[{file,"src/cowboy_rest.erl"},{line,976}]},{cowboy_rest,set_resp_body,2,[{file,"src/cowboy_rest.erl"},{line,858}]},{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]},{req,[{socket,#Port<0.2041008>},{transport,ranch_tcp},{connection,close},{pid,<0.24446.5829>},{method,<<"GET">>},{version,'HTTP/1.1'},{peer,{{10,10,0,252},38680}},{host,<<"10.10.0.123">>},{host_info,undefined},{port,15672},{path,<<"/api/connections">>},{path_info,undefined},{qs,<<>>},{qs_vals,[]},{bindings,[]},{headers,[{<<"te">>,<<"deflate,gzip;q=0.3">>},{<<"connection">>,<<"TE, close">>},{<<"authorization">>,<<"Basic bW9uaXRvcmluZzpQN0tLTVo2V2tnSHpkUTF2">>},{<<"host">>,<<"
10.10.0.123:15672">>},{<<"user-agent">>,<<"check_rabbitmq_connections libwww-perl/6.08">>}]},{p_headers,[{<<"if-modified-since">>,undefined},{<<"if-none-match">>,undefined},{<<"if-unmodified-since">>,undefined},{<<"if-match">>,undefined},{<<"accept">>,undefined},{<<"connection">>,[<<"te">>,<<"close">>]}]},{cookies,undefined},{meta,[{media_type,{<<"application">>,<<"json">>,[]}},{charset,undefined}]},{body_state,waiting},{buffer,<<>>},{multipart,undefined},{resp_compress,true},{resp_state,waiting},{resp_headers,[{<<"vary">>,[<<"accept">>,[<<", ">>,<<"accept-encoding">>],[<<", ">>,<<"origin">>]]},{<<"content-type">>,[<<"application">>,<<"/">>,<<"json">>,<<>>]},{<<"vary">>,<<"origin">>}]},{resp_body,<<>>},{onresponse,#Fun<rabbit_cowboy_middleware.onresponse.4>}]},{state,{context,{user,<<"monitoring">>,[monitoring],[{rabbit_auth_backend_internal,none}]},<<"P7KKMZ6WkgHzdQ1v">>,undefined}}]
=ERROR REPORT==== 3-Jun-2018::17:22:01 ===
Ranch listener rabbit_web_dispatch_sup_15672 had connection process started with cowboy_protocol:start_link/4 at <0.5944.5831> exit with reason: [{reason,{timeout,{gen_server,call,[<0.572.0>,{fetch,#Fun<rabbit_mgmt_db.22.104142656>, ... (all queues and exchanges)
=INFO REPORT==== 3-Jun-2018::17:22:25 ===
vm_memory_high_watermark set. Memory used:10770513920 allowed:6871733043
=WARNING REPORT==== 3-Jun-2018::17:22:25 ===
memory resource limit alarm set on node rabbit@osmrabbit03.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
=INFO REPORT==== 3-Jun-2018::17:22:26 ===
vm_memory_high_watermark clear. Memory used:
6206062592 allowed:6871733043
=WARNING REPORT==== 3-Jun-2018::17:22:26 ===
memory resource limit alarm cleared on node rabbit@osmrabbit033
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
=INFO REPORT==== 3-Jun-2018::17:40:14 ===
Starting RabbitMQ 3.6.15 on Erlang 20.0
Copyright (C) 2007-2018 Pivotal Software, Inc.
=INFO REPORT==== 3-Jun-2018::17:40:14 ===
node : rabbit@osmrabbit03
home dir : C:\Windows
config file(s) : c:/Users/ADMINI~1/AppData/Roaming/RabbitMQ/rabbitmq.config
cookie hash : k4DIRENyx5vP/nEsXKZKKg==
log : C:/Users/ADMINI~1/AppData/Roaming/RabbitMQ/log/RABBIT~1.LOG
sasl log : C:/Users/ADMINI~1/AppData/Roaming/RabbitMQ/log/RABBIT~2.LOG
database dir : c:/Users/ADMINI~1/AppData/Roaming/RabbitMQ/db/RABBIT~1
=INFO REPORT==== 3-Jun-2018::17:40:20 ===
Memory high watermark set to 6553 MiB (6871733043 bytes) of 16383 MiB
(17179332608 bytes) total
=INFO REPORT==== 3-Jun-2018::17:40:20 ===
Enabling free disk space monitoring
=INFO REPORT==== 3-Jun-2018::17:40:20 ===
Disk free limit set to 50MB
=INFO REPORT==== 3-Jun-2018::17:40:20 ===
Limiting to approx 8092 file handles (7280 sockets)
=INFO REPORT==== 3-Jun-2018::17:40:20 ===
FHC read buffering: OFF
FHC write buffering: ON
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Waiting for Mnesia tables for 30000 ms, 9 retries left
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Priority queues enabled, real BQ is rabbit_variable_queue
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Starting rabbit_node_monitor
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Management plugin: using rates mode 'basic'
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
msg_store_transient: using rabbit_msg_store_ets_index to provide index
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
msg_store_persistent: using rabbit_msg_store_ets_index to provide index
=WARNING REPORT==== 3-Jun-2018::17:40:21 ===
msg_store_persistent: rebuilding indices from scratch
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
started TCP Listener on [::]:5672
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
rabbit on node rabbit@osmrabbit01 up
=WARNING REPORT==== 3-Jun-2018::17:40:21 ===
Could not find handle.exe, please install from sysinternals
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
rabbit on node rabbit@osmrabbit02 up
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Management plugin started. Port: 15672
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Statistics database started.
=INFO REPORT==== 3-Jun-2018::17:40:21 ===
Server startup complete; 6 plugins started.
* rabbitmq_management
* rabbitmq_management_agent
* rabbitmq_web_dispatch
* cowboy
* amqp_client
* cowlib
Then, we encountered the next issue. osmrabbit01 tried to sync with osmrabbit02 (to become the new master / slave setup), but this syncing took forever.
I did a 'rabbitmqctl status' before I restarted osmrabbit01 too and this was the result:
After this, publishing worked again, but the queues were corrupt and we did not send any data to the consumers til the next morning (8 a.m.) when we saw no data was coming into our analytical systems.
All queues were corrupt and I had to force delete them and re-create them.
This memory issue happens way too often (average > 1.5 times per month) and is really hurting our business (and my team's reputation).
How is it possible that this node runs out of memory so quick without any clear cause? There are no other processes on that server running.
If I need to provide more information, I'll try to post anything we have. This is really frustrating me. Thanks in advance!