other_proc grows and many CLOSE_WAIT connections after Zabbix checks.


Dmitry Kurbatov

Dec 30, 2015, 5:48:25 AM
to rabbitmq-users
Hi All.

I have a 4-node RabbitMQ 3.5.6 cluster with an uptime of 6 days. All nodes are monitored by Zabbix. Yesterday Zabbix started flapping an alert saying that one node is dead, but that node has the same uptime as the others and works fine. Today I tried to investigate and found a lot of connections in the CLOSE_WAIT state from the local Zabbix agent to the local RabbitMQ (124 of them), and that node's other_proc memory is far too high: 4.7 GB (about 200 MB on the other nodes).

Does anybody know why this happens?
Is Erlang's GC broken?
How can I look more closely inside Rabbit's memory and find out why it keeps growing?

Of course, I can "solve" this problem by rebooting the node, but I want to understand what happened.

# netstat -ntp | grep :5672  | grep CLOSE 
tcp6       1      0 127.0.0.1:5672          127.0.0.1:58233         CLOSE_WAIT  -               
tcp6       1      0 127.0.0.1:5672          127.0.0.1:59063         CLOSE_WAIT  -               
tcp6       1      0 127.0.0.1:5672          127.0.0.1:58493         CLOSE_WAIT  -               
...
total 124 items


# sudo netstat -ntp | grep :5672 | grep -v CLOSE  | wc -l
12


# rabbitmqctl list_queues | wc -l
49


# service rabbitmq-server status
Status of node rabbit@rabbitmq4 ...
[{pid,8797},
 {running_applications,
     [{rabbitmq_auth_backend_ldap,"RabbitMQ LDAP Authentication Backend",
          "3.5.6"},
      {eldap,"Ldap api","1.2"},
      {lager,"Erlang logging framework","2.0.0-rmq3.5.x-git9719370"},
      {rabbitmq_management,"RabbitMQ Management Console","3.5.6"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.5.6"},
      {rabbit,"RabbitMQ","3.5.6"},
      {mnesia,"MNESIA  CXC 138 12","4.13.1"},
      {amqp_client,"RabbitMQ AMQP Client","3.5.6"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.5.6"},
      {webmachine,"webmachine","1.10.3-rmq3.5.6-gite9359c7"},
      {mochiweb,"MochiMedia Web Server","2.7.0-rmq3.5.6-git680dba8"},
      {xmerl,"XML parser","1.3.8"},
      {inets,"INETS  CXC 138 49","6.0.1"},
      {os_mon,"CPO  CXC 138 46","2.4"},
      {sasl,"SASL  CXC 138 11","2.6"},
      {stdlib,"ERTS  CXC 138 10","2.6"},
      {kernel,"ERTS  CXC 138 10","4.1"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 18 [erts-7.1] [source] [64-bit] [smp:4:4] [async-threads:128] [kernel-poll:true]\n"},
 {memory,
     [{total,5080348064},
      {connection_readers,418232},
      {connection_writers,589992},
      {connection_channels,1991272},
      {connection_other,825408},
      {queue_procs,2257080},
      {queue_slave_procs,11272448},
      {plugins,339576},
      {other_proc,4867926208},
      {mnesia,410944},
      {mgmt_db,12216},
      {msg_index,813200},
      {other_ets,1306960},
      {binary,23844248},
      {code,20755419},
      {atom,711569},
      {other_system,146873292}]},
 {alarms,[]},
 {listeners,[{clustering,25672,"::"},{amqp,5672,"::"}]},
 {vm_memory_high_watermark,0.85},
 {vm_memory_limit,14303955148},
 {disk_free_limit,1682818253},
 {disk_free,8359010304},
 {file_descriptors,
     [{total_limit,10140},
      {total_used,41},
      {sockets_limit,9124},
      {sockets_used,11}]},
 {processes,[{limit,1048576},{used,631}]},
 {run_queue,0},
 {uptime,521353}]


-- 
Thanks in advance,
Dmitry 

Michael Klishin

Dec 30, 2015, 5:57:17 AM
to rabbitm...@googlegroups.com
Please read up on what CLOSE_WAIT means. No, the issue is very unlikely to be the runtime's GC. Rather, your clients don't configure a low enough heartbeat value
for stale TCP connections to be noticed and released quickly.

There is no fine-grained breakdown for other_proc.
Erlang's observer app and tools such as Recon can
provide quite a bit of visibility, but I suspect the issue is the not-really-closed TCP connections.
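
For example, something along these lines (a rough sketch using only stock process_info/2, since Recon is not shipped with RabbitMQ) lists the processes that other_proc is mostly made of, largest first:

# rabbitmqctl eval '
    %% all live processes paired with their memory use, top 10 by size
    Procs = [{P, M} || P <- erlang:processes(),
                       {memory, M} <- [erlang:process_info(P, memory)]],
    lists:sublist(lists:reverse(lists:keysort(2, Procs)), 10).'

If Recon happens to be loaded on the node, recon:proc_count(memory, 10) in the same eval call should give an equivalent answer with process names included.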

See rabbitmq.com/heartbeats.html and rabbitmq.com/networking.html.

MK
Message has been deleted

Dmitry Kurbatov

Dec 30, 2015, 7:15:54 AM
to rabbitmq-users
Why has my reply been deleted?

Michael Klishin

Dec 30, 2015, 7:50:50 AM
to rabbitm...@googlegroups.com, Dmitry Kurbatov
On 30 December 2015 at 15:15:59, Dmitry Kurbatov (d...@dimcha.ru) wrote:
> Why has my reply been deleted?

It wasn't deleted, it ended up in the moderator's approval queue. Should be posted soon. 
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Michael Klishin

Dec 30, 2015, 7:56:04 AM
to rabbitm...@googlegroups.com, Dmitry Kurbatov
On 30 December 2015 at 15:50:01, Dmitry Kurbatov (d...@dimcha.ru) wrote:
> As I understand CLOSE_WAIT, it's the state of a TCP connection after the
> client has sent a FIN to the server and the socket is waiting for the
> application to call close(). Can you explain what RabbitMQ is waiting for
> after the FIN? Why can't it close the connection and release the memory?
> I suspect the reason could be unsent data in the send buffer - is that true?

It is a state in the TCP state machine managed by the OS. RabbitMQ is not waiting for anything:
it is up to the OS to consider the connection completely closed and notify the socket owner(s).

See http://blogs.technet.com/b/janelewis/archive/2010/03/09/explaining-close-wait.aspx,
http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/ and
http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html.

Because the Linux TCP stack defaults are from the 90s (as are some other Linux defaults, but hey,
it works for those running Gnome somewhere on the west coast!), the timeout is inadequate for modern systems
with lots of clients, often transient ones.

http://rabbitmq.com/heartbeats.html explains why many messaging protocols introduce their own heartbeat
mechanisms that pretty much exist only to undo this TCP behaviour.

Set a heartbeat timeout of 8 seconds for your clients. I suspect that it will make quite a bit of a difference
because RabbitMQ will have a chance to detect dead connections before the kernel notifies it.
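
For instance, with the Erlang amqp_client that is already running on these nodes (a sketch; the host is a placeholder, and other client libraries expose an equivalent heartbeat or heartbeat_interval option), a connection with an 8 second heartbeat looks roughly like this:

%% needs the record definitions:
%% -include_lib("amqp_client/include/amqp_client.hrl").
Params = #amqp_params_network{host      = "rabbitmq4",  %% placeholder host
                              heartbeat = 8},           %% seconds; 0 disables heartbeats
{ok, Connection} = amqp_connection:start(Params).

The server-side default can also be lowered via the heartbeat key of the rabbit application in rabbitmq.config, but fixing the clients is the more direct route.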

Of course, there are TCP-level settings that help cope with TIME_WAIT on servers with moderate or high
connection churn. The articles above explain several of them better than I can.

We also mention some of them in http://www.rabbitmq.com/networking.html
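
For instance, TCP keepalives on RabbitMQ's listening sockets can be enabled in rabbitmq.config along the lines of the networking guide (a sketch; the kernel's keepalive timers themselves are tuned separately via the net.ipv4.tcp_keepalive_* sysctls):

[{rabbit,
  [{tcp_listen_options, [binary,
                         {packet,        raw},
                         {reuseaddr,     true},
                         {backlog,       128},
                         {nodelay,       true},
                         {keepalive,     true},      %% let the kernel probe idle peers
                         {exit_on_close, false}]}]}].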