Connection leak - TCP kept open even with heartbeat and low TCP keepalive


Erik K

Feb 6, 2023, 12:18:21 PM
to rabbitmq-users
Hi all, looking for any advice regarding this situation:

Info
A hanging TCP connection remains on the RabbitMQ broker server after a Celery worker finishes.

We use preemptible instances in Google Cloud Platform as workers in a processing pipeline. The number of connections builds up until the Debian broker server eventually runs out of memory.

Scenario summary

  1. Worker boots and connects to RabbitMQ; two TCP connections are established
  2. Worker finishes and the instance is stopped and removed
  3. Worker is dead; connection A is closed, connection B remains

The same problem appears with two different combinations of RabbitMQ and Erlang versions:

RabbitMQ 3.7.17 + Erlang 22.0.7-1
RabbitMQ 3.10.14 + Erlang 25.0.4-1

Scenario

1. Worker boots and connects to RabbitMQ; two TCP connections are established from the worker's IP to the RabbitMQ instance, on two different source ports

Listing connections ...
user peer_host peer_port state
epic 10.240.60.56 A running
epic 10.240.60.56 B running

netstat also shows two connections to RabbitMQ (port 5672)

2. Worker finishes and the instance is stopped and removed

Connection on port 36654

tcpdump shows the following

# 36654: last data packet on the worker connection using port B (broker -> worker)
17:05:24.395092 IP 10.240.50.2.5672 > 10.240.60.56.B: Flags [P.], seq 2769:2790, ack 48864, win 273, options [nop,nop,TS val 991690205 ecr 1201716502], length 21

# worker (port B) ACKs that last message
17:05:24.395252 IP 10.240.60.56.B > 10.240.50.2.5672: Flags [.], ack 2790, win 507, options [nop,nop,TS val 1201716502 ecr 991690205], length 0

# broker to worker on port A, repeatedly trying to resend the last 8 bytes (seq 4232:4240) with no response
17:05:29.922421 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691587 ecr 1201692028], length 8
17:05:30.127621 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691639 ecr 1201692028], length 8
17:05:30.335615 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691691 ecr 1201692028], length 8
17:05:30.771599 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991691800 ecr 1201692028], length 8
17:05:31.603593 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991692008 ecr 1201692028], length 8
17:05:33.267555 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991692424 ecr 1201692028], length 8
17:05:36.563603 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991693248 ecr 1201692028], length 8
17:05:43.219601 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991694912 ecr 1201692028], length 8
17:05:56.531566 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991698240 ecr 1201692028], length 8
17:06:23.923626 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [P.], seq 4232:4240, ack 1324, win 58, options [nop,nop,TS val 991705088 ecr 1201692028], length 8

# broker closes connection A (RST) after giving up
17:06:59.920635 IP 10.240.50.2.5672 > 10.240.60.56.A: Flags [R.], seq 4240, ack 1324, win 58, options [nop,nop,TS val 991714087 ecr 1201692028], length 0

3. Worker is dead, connection A is closed, connection B remains

RabbitMQ says one connection remains, and netstat also still shows a connection to 5672:

Listing connections ...
user peer_host peer_port state
epic 10.240.60.56 B running

This connection remains until RabbitMQ or the server is restarted.

I would expect RabbitMQ to send heartbeats on the remaining connection, discover that the peer is no longer there, and close the connection. It seems no heartbeat is sent.
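
One way to sanity-check which heartbeat was actually negotiated on the remaining connection is the timeout column of rabbitmqctl (the negotiated heartbeat in seconds; 0 means the client asked for heartbeats to be disabled). A minimal check, assuming a stock rabbitmqctl:

sudo rabbitmqctl list_connections name user timeout state

If the leaked connection reports timeout 0, heartbeats are effectively off for that connection regardless of the server-side setting.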

Tried the following:

  • upgrading the RabbitMQ and Erlang versions; the same problem remained => no effect
  • lowering the kernel TCP keepalive (net.ipv4.tcp_keepalive_time) from 60 seconds to 5 (see the keepalive note after this list) => no effect
  • lowering the RabbitMQ heartbeat interval from 60 s to 10 s => no effect
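
On the keepalive point: as far as I understand, lowering net.ipv4.tcp_keepalive_time only affects sockets that actually enable SO_KEEPALIVE, which RabbitMQ's listeners don't do by default. A rough sketch of what that would look like, assuming the new-style rabbitmq.conf format (values are examples only):

# rabbitmq.conf: enable SO_KEEPALIVE on RabbitMQ's listen sockets, shorten the suggested heartbeat
tcp_listen_options.keepalive = true
heartbeat = 10

# kernel-side keepalive tuning on the broker host
sudo sysctl -w net.ipv4.tcp_keepalive_time=5
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=5
sudo sysctl -w net.ipv4.tcp_keepalive_probes=3

The listener options presumably only take effect after the broker is restarted.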

Debugging tools

To see connections I use:

sudo rabbitmqctl list_connections

and (RabbitMQ runs on port 5672)

sudo netstat -ntpo | grep -E ':5672\>'|wc -l

To see what packets are sent I use tcpdump, with the IP and port identifying the two different connections. For readability I replace the two worker ports with A and B in the output above.
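
The capture itself is along these lines (a sketch; interface and addresses will differ in your setup):

sudo tcpdump -nn -i any 'tcp port 5672 and host 10.240.60.56'

-nn skips name and port resolution so the raw worker ports (shown as A and B above) stay visible.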


Appreciate any help,

Erik

Luke Bakken

Feb 6, 2023, 12:33:35 PM
to rabbitmq-users
Hi Erik -

What is logged by RabbitMQ when the worker completes? I would expect to see a log message indicating that the connection is closed. If you don't see the message, something is keeping it open. This could be Celery itself or a load balancer.
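
For example, something like this should show the accept/close pairs for that worker's IP (a sketch; the log file name depends on your node name and install, on Debian typically under /var/log/rabbitmq/):

sudo grep -E 'accepting AMQP connection|closing AMQP connection' /var/log/rabbitmq/rabbit@$(hostname -s).log | grep 10.240.60.56

Two "accepting" lines with only one matching "closing" line would confirm that the broker never saw the second connection go away.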

Thanks,
Luke

Luke Bakken

Feb 6, 2023, 12:34:51 PM
to rabbitmq-users
If you see that Celery opens two connections (via the RabbitMQ log) but only one of those is closed when the worker completes, then the issue lies with Celery or some intermediary.

Niklas Bivald

Feb 6, 2023, 6:22:34 PM
to rabbitmq-users
(I'm helping Erik debug this)

But given that we've enabled heartbeats in RabbitMQ, shouldn't RabbitMQ still close the connections eventually? There is no load balancer between Celery and the broker, and the Celery instances are completely terminated. Since we use preemptible (spot) instances, we can't guarantee that Celery closes the connection as cleanly as we would like; workers can be killed at any moment.

I guess the ugly solution is to restart RabbitMQ daily, but then all connections (even the healthy ones) get closed, which is obviously not great :)
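
A less drastic stopgap might be to script closing just the stuck connections instead of restarting the broker, e.g. (a sketch; the connection pid comes from list_connections and the value shown is only illustrative):

sudo rabbitmqctl list_connections pid peer_host peer_port state
sudo rabbitmqctl close_connection "<rabbit@broker.1.23456.0>" "cleaning up leaked worker connection"

But we'd still prefer to understand why the connections leak in the first place.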

Luke Bakken

Feb 6, 2023, 6:37:29 PM
to rabbitmq-users
Apparently RabbitMQ considers the connection to still be alive and responding to heartbeats. I have no idea how.

Can you check to see what RabbitMQ logs with regard to these connections? I'm assuming that two connections are logged when your worker starts, but only one of them is logged as closed.

The best way to help us figure this out is to come up with a way we can easily reproduce the issue. An ideal way is to provide a docker compose project I can clone and start via "docker compose up".

Luke Bakken

Feb 6, 2023, 6:41:27 PM
to rabbitmq-users
There is also a chance that Celery is disabling heartbeats.

You point this out in the original message:

"Rabbit MQ says one connection remains, we also see a netstat showing connection to 5672

Listing connections ...
user peer_host peer_port state
epic 10.240.60.56 B running"

If the worker is truly terminated, how can the RabbitMQ host consider a TCP connection to still be open if there are no intermediary network devices? Something is fishy here.

Luke Bakken

Feb 6, 2023, 6:51:14 PM
to rabbitmq-users
See this - https://github.com/celery/celery/issues/7250

I'm pretty sure this is the issue you're seeing. Try the fix that involves setting broker_transport_options.

Just FYI, if a client application requests that heartbeats be disabled, that's what RabbitMQ goes with.