other_proc grows and many CLOSE_WAIT connections after Zabbix checks.


Dmitry Kurbatov

Dec 30, 2015, 5:48:25 AM
to rabbitmq-users
Hi All.

I have a 4-node RabbitMQ 3.5.6 cluster with an uptime of 6 days. All nodes are monitored by Zabbix. Yesterday Zabbix started flapping an alert saying that one node is dead, but that node has the same uptime as the others and works fine. Today I tried to investigate and found a lot of connections in the CLOSE_WAIT state from the local Zabbix agent to the local RabbitMQ (124 of them), and that node's other_proc memory is far too high: 4.7 GB (about 200 MB on the other nodes).

Does anybody know why this happens?
Is Erlang's GC broken?
How can I look more closely inside Rabbit's memory and find out why it keeps growing?

Of course, I can "solve" this problem by rebooting the node, but I want to understand what happened.

# netstat -ntp | grep :5672  | grep CLOSE 
tcp6       1      0 127.0.0.1:5672          127.0.0.1:58233         CLOSE_WAIT  -               
tcp6       1      0 127.0.0.1:5672          127.0.0.1:59063         CLOSE_WAIT  -               
tcp6       1      0 127.0.0.1:5672          127.0.0.1:58493         CLOSE_WAIT  -               
...
total 124 items


# sudo netstat -ntp | grep :5672 | grep -v CLOSE  | wc -l
12


# rabbitmqctl list_queues | wc -l
49


# service rabbitmq-server status
Status of node rabbit@rabbitmq4 ...
[{pid,8797},
 {running_applications,
     [{rabbitmq_auth_backend_ldap,"RabbitMQ LDAP Authentication Backend",
          "3.5.6"},
      {eldap,"Ldap api","1.2"},
      {lager,"Erlang logging framework","2.0.0-rmq3.5.x-git9719370"},
      {rabbitmq_management,"RabbitMQ Management Console","3.5.6"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.5.6"},
      {rabbit,"RabbitMQ","3.5.6"},
      {mnesia,"MNESIA  CXC 138 12","4.13.1"},
      {amqp_client,"RabbitMQ AMQP Client","3.5.6"},
      {rabbitmq_web_dispatch,"RabbitMQ Web Dispatcher","3.5.6"},
      {webmachine,"webmachine","1.10.3-rmq3.5.6-gite9359c7"},
      {mochiweb,"MochiMedia Web Server","2.7.0-rmq3.5.6-git680dba8"},
      {xmerl,"XML parser","1.3.8"},
      {inets,"INETS  CXC 138 49","6.0.1"},
      {os_mon,"CPO  CXC 138 46","2.4"},
      {sasl,"SASL  CXC 138 11","2.6"},
      {stdlib,"ERTS  CXC 138 10","2.6"},
      {kernel,"ERTS  CXC 138 10","4.1"}]},
 {os,{unix,linux}},
 {erlang_version,
     "Erlang/OTP 18 [erts-7.1] [source] [64-bit] [smp:4:4] [async-threads:128] [kernel-poll:true]\n"},
 {memory,
     [{total,5080348064},
      {connection_readers,418232},
      {connection_writers,589992},
      {connection_channels,1991272},
      {connection_other,825408},
      {queue_procs,2257080},
      {queue_slave_procs,11272448},
      {plugins,339576},
      {other_proc,4867926208},
      {mnesia,410944},
      {mgmt_db,12216},
      {msg_index,813200},
      {other_ets,1306960},
      {binary,23844248},
      {code,20755419},
      {atom,711569},
      {other_system,146873292}]},
 {alarms,[]},
 {listeners,[{clustering,25672,"::"},{amqp,5672,"::"}]},
 {vm_memory_high_watermark,0.85},
 {vm_memory_limit,14303955148},
 {disk_free_limit,1682818253},
 {disk_free,8359010304},
 {file_descriptors,
     [{total_limit,10140},
      {total_used,41},
      {sockets_limit,9124},
      {sockets_used,11}]},
 {processes,[{limit,1048576},{used,631}]},
 {run_queue,0},
 {uptime,521353}]


-- 
Thanks in advance,
Dmitry 

Michael Klishin

Dec 30, 2015, 5:57:17 AM
to rabbitm...@googlegroups.com
Please read up on what CLOSE_WAIT means. No, the issue is very unlikely to be the runtime's GC. Rather, your clients don't configure a low enough heartbeat value
for stale TCP connections to be noticed and released quickly.

There is no fine-grained breakdown for other_proc.
Erlang's observer app and tools such as Recon can
provide quite a bit of visibility, but I suspect the issue is the not-really-closed TCP connections.
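
For example, something along these lines (a rough sketch using only stock process_info/2, since Recon is not shipped with RabbitMQ) lists the processes that other_proc is mostly made of, largest first:

# rabbitmqctl eval '
    %% all live processes paired with their memory use, top 10 by size
    Procs = [{P, M} || P <- erlang:processes(),
                       {memory, M} <- [erlang:process_info(P, memory)]],
    lists:sublist(lists:reverse(lists:keysort(2, Procs)), 10).'

If Recon happens to be loaded on the node, recon:proc_count(memory, 10) in the same eval call should give an equivalent answer with process names included.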

See rabbitmq.com/heartbeats.html and rabbitmq.com/networking.html.

MK
Message has been deleted

Dmitry Kurbatov

Dec 30, 2015, 7:15:54 AM
to rabbitmq-users
Why has my reply been deleted?

Michael Klishin

Dec 30, 2015, 7:50:50 AM
to rabbitm...@googlegroups.com, Dmitry Kurbatov
On 30 December 2015 at 15:15:59, Dmitry Kurbatov (d...@dimcha.ru) wrote:
> Why has my reply been deleted?

It wasn't deleted, it ended up in the moderator's approval queue. Should be posted soon. 
--
MK

Staff Software Engineer, Pivotal/RabbitMQ


Michael Klishin

Dec 30, 2015, 7:56:04 AM
to rabbitm...@googlegroups.com, Dmitry Kurbatov
On 30 December 2015 at 15:50:01, Dmitry Kurbatov (d...@dimcha.ru) wrote:
> As I understand CLOSE_WAIT, it's the state of a TCP connection after the
> client has sent a FIN to the server and the socket is waiting for the
> application to call close(). Can you explain what RabbitMQ is waiting for
> after the FIN? Why can't it close the connection and release the memory?
> I suspect the reason could be unsent data in the send buffer - is that true?

It is a state in the TCP state machine managed by the OS. RabbitMQ is not waiting for anything:
it is up to the OS to consider the connection completely closed and notify the socket owner(s).

See http://blogs.technet.com/b/janelewis/archive/2010/03/09/explaining-close-wait.aspx,
http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/ and
http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html.

Because the Linux TCP stack defaults are from the 90s (as are some other Linux defaults, but hey,
it works for those running Gnome somewhere on the west coast!), the timeout is inadequate for modern systems
with lots of clients, often transient ones.

http://rabbitmq.com/heartbeats.html explains why many messaging protocols introduce their own heartbeat
mechanisms that pretty much exist only to undo this TCP behaviour.

Set a heartbeat timeout of 8 seconds for your clients. I suspect that it will make quite a bit of a difference
because RabbitMQ will have a chance to detect dead connections before the kernel notifies it.
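
For instance, with the Erlang amqp_client that is already running on these nodes (a sketch; the host is a placeholder, and other client libraries expose an equivalent heartbeat or heartbeat_interval option), a connection with an 8 second heartbeat looks roughly like this:

%% needs the record definitions:
%% -include_lib("amqp_client/include/amqp_client.hrl").
Params = #amqp_params_network{host      = "rabbitmq4",  %% placeholder host
                              heartbeat = 8},           %% seconds; 0 disables heartbeats
{ok, Connection} = amqp_connection:start(Params).

The server-side default can also be lowered via the heartbeat key of the rabbit application in rabbitmq.config, but fixing the clients is the more direct route.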

Of course, there are TCP-level settings that help cope with TIME_WAIT on servers with moderate or high
connection churn. The articles above explain several of them better than I can.

We also mention some of them in http://www.rabbitmq.com/networking.html
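
For instance, TCP keepalives on RabbitMQ's listening sockets can be enabled in rabbitmq.config along the lines of the networking guide (a sketch; the kernel's keepalive timers themselves are tuned separately via the net.ipv4.tcp_keepalive_* sysctls):

[{rabbit,
  [{tcp_listen_options, [binary,
                         {packet,        raw},
                         {reuseaddr,     true},
                         {backlog,       128},
                         {nodelay,       true},
                         {keepalive,     true},      %% let the kernel probe idle peers
                         {exit_on_close, false}]}]}].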