Info
A hanging TCP connection remains on the RabbitMQ broker server after any Celery worker finishes.
I am using preemptible instances in Google Cloud Platform as workers in a
processing pipeline. The number of connections builds up until eventually
the Debian server runs out of memory.
Scenario summary
- Worker boots and connects to RabbitMQ; 2 TCP connections are established
- Worker finishes and the instance is stopped and removed
- Worker is dead, connection A is closed, connection B remains
The same problem appears with two different combinations of RabbitMQ and Erlang versions:
- RabbitMQ 3.7.17 + Erlang 22.0.7-1
- RabbitMQ 3.10.14 + Erlang 25.0.4-1
Scenario
1. Worker boots and connects to RabbitMQ. Two TCP connections, on two different source ports, are established from the worker's IP to the RabbitMQ instance
Listing connections ...
user peer_host peer_port state
epic 10.240.60.56 A running
epic 10.240.60.56 B running
netstat also shows two connections to RabbitMQ (port 5672)
2. Worker finishes and the instance is stopped and removed
Connection on port 36654
tcpdump shows the following:
# 36654: last packet sent by the broker to worker port B
17:05:24.395092 IP 10.240.50.2.5672 > 10.240.60.56.B: Flags [P.], seq 2769:2790, ack 48864, win 273, options [nop,nop,TS val 991690205 ecr 1201716502], length 21
# worker port B ACKs the last message
17:05:24.395252 IP 10.240.60.56.B > 10.240.50.2.5672: Flags [.], ack 2790, win 507, options [nop,nop,TS val 1201716502 ecr 991690205], length 0
# broker retransmitting the last 8 bytes (seq 4232:4240) to worker port A?
17:05:29.922421 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691587 ecr 1201692028], length 8
17:05:30.127621 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691639 ecr 1201692028], length 8
17:05:30.335615 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691691 ecr 1201692028], length 8
17:05:30.771599 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691800 ecr 1201692028], length 8
17:05:31.603593 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991692008 ecr 1201692028], length 8
17:05:33.267555 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991692424 ecr 1201692028], length 8
17:05:36.563603 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991693248 ecr 1201692028], length 8
17:05:43.219601 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991694912 ecr 1201692028], length 8
17:05:56.531566 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991698240 ecr 1201692028], length 8
17:06:23.923626 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991705088 ecr 1201692028], length 8
# broker gives up and resets connection A
17:06:59.920635 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [R.], seq 4240, ack 1324, win 58, options [nop,nop,TS val 991714087 ecr 1201692028], length 0
3. Worker is dead, connection A is closed, connection B remains
RabbitMQ reports that one connection remains, and netstat still shows a connection to port 5672
Listing connections ...
user peer_host peer_port state
epic 10.240.60.56 B running
This connection remains until RabbitMQ or the server is restarted.
I expected RabbitMQ to send heartbeats on the remaining connection,
discover that the peer is no longer there, and then close the
connection. It seems that no heartbeat is sent.
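To make the retransmission pattern on connection A easier to read, here is a minimal sketch that prints the gap between consecutive tcpdump lines. The function name retransmit_gaps is my own, and it assumes every input line starts with an HH:MM:SS.ffffff timestamp from the same day. On the dump above the gaps roughly double, from about 0.2 s up to about 27 s, which looks like ordinary TCP retransmission backoff rather than anything heartbeat related.

```shell
# Print the gap in seconds between consecutive tcpdump lines read from stdin.
# Assumption: each line begins with an HH:MM:SS.ffffff timestamp (same day).
retransmit_gaps() {
  awk '{
    split($1, t, ":")                       # t[1]=HH, t[2]=MM, t[3]=SS.ffffff
    now = t[1] * 3600 + t[2] * 60 + t[3]    # seconds since midnight
    if (NR > 1) printf "%.3f\n", now - prev # gap to the previous packet
    prev = now
  }'
}
```

It would be fed only the lines for one connection, for example by grepping a saved capture for the worker port before piping it in.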
I have tried the following:
- upgrading RabbitMQ and Erlang => no effect
- lowering the kernel TCP keepalive time (net.ipv4.tcp_keepalive_time) from 60 seconds to 5 => no effect
- lowering the RabbitMQ heartbeat interval from 60s to 10s => no effect
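One thing I have not ruled out: as far as I understand, the kernel keepalive settings only apply to sockets that have SO_KEEPALIVE enabled, and the time before a dead peer is dropped also depends on net.ipv4.tcp_keepalive_intvl and net.ipv4.tcp_keepalive_probes, which I did not change. Below is a rough sketch of the worst-case arithmetic. The function names are my own, and whether the broker's sockets actually have keepalive enabled is an assumption I have not verified.

```shell
# Worst-case seconds before the kernel drops an idle connection to a dead peer:
#   tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
# Only relevant for sockets that have SO_KEEPALIVE enabled (unverified here).
keepalive_detect_secs() {
  echo $(( $1 + $2 * $3 ))
}

# Same calculation using the live values of a Linux host.
keepalive_detect_secs_live() {
  keepalive_detect_secs \
    "$(cat /proc/sys/net/ipv4/tcp_keepalive_time)" \
    "$(cat /proc/sys/net/ipv4/tcp_keepalive_intvl)" \
    "$(cat /proc/sys/net/ipv4/tcp_keepalive_probes)"
}
```

With a keepalive time of 5 and the usual Linux defaults of 75 for the interval and 9 for the probe count, keepalive_detect_secs 5 75 9 gives 680 seconds, so lowering only the keepalive time would still leave a wait of more than eleven minutes.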
Debugging tools
To see connections I use:
sudo rabbitmqctl list_connections
and (RabbitMQ runs on port 5672)
sudo netstat -ntpo | grep -E ':5672\>'|wc -l
To see what packets are sent I use tcpdump, with the IP and port to
identify the two different connections. For readability I have replaced the
two worker ports with A and B.
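A further check that might help: rabbitmqctl list_connections accepts a timeout column, which reports the negotiated heartbeat interval in seconds, and 0 there means heartbeats are disabled for that connection. The sketch below filters that output for such connections. The function name no_heartbeat_peers is my own, and it assumes the five columns are requested in exactly this order and that user names contain no whitespace.

```shell
# Read the output of
#   rabbitmqctl list_connections user peer_host peer_port state timeout
# on stdin and print peer_host:peer_port for every connection whose
# negotiated heartbeat timeout is 0 (heartbeats disabled).
# Header and "Listing connections ..." lines are skipped by the filter.
no_heartbeat_peers() {
  awk 'NF == 5 && $5 == "0" { print $2 ":" $3 }'
}
```

Usage would be along the lines of piping sudo rabbitmqctl list_connections user peer_host peer_port state timeout into no_heartbeat_peers. If the hanging connection B shows up in the result, the heartbeat setting I lowered on the broker may not be what the client actually negotiated.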
Appreciate any help,
Erik