Hi everyone,
We are experiencing a bizarre connection issue with some of our federated nodes. Maybe someone has been through this before and can point us in the right direction for troubleshooting :)
The setup is quite simple: there is a master exchange which fans out every message to ~40 "slave" exchanges, each on its own geographically distributed RabbitMQ node, through a VPN tunnel over more or less shitty ADSL lines (Italy, yay!). Each "slave" then has its own replica set of queues, Symfony2 consumers, etc. Every server that hosts a "slave" node is a mirror of the others, in both hardware and software.
The load is not much: about 40k to 70k messages per day, with higher volume in the evenings and on weekends but no real peak, and almost no traffic between late night and early morning.
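To put those daily volumes in perspective, here is a quick back-of-the-envelope sketch. It assumes the traffic is spread evenly over the day, which it is not, so take the result as an average rather than an evening rate.

```python
# Rough average message rate implied by the daily volumes quoted above.
# Assumption: traffic is spread evenly over 24 hours (it is not in practice).
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(messages_per_day: int) -> float:
    """Return the average number of messages per second."""
    return messages_per_day / SECONDS_PER_DAY


for daily in (40_000, 70_000):
    print(f"{daily} msgs/day ~ {average_rate(daily):.2f} msgs/s on average")
```

So each federation link has to move less than one message per second on average.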
About 35 slaves work smoothly and steadily (without any whipping), as expected, while ~5 of them can't keep up. Messages start piling up in their internally created federation queues on the master node, and while some messages go through, in the long run the queues keep growing. During the night (when there's no traffic) the queues shrink, but veeeeery slowly: not even 2k messages per queue.
We tried fiddling with prefetch counts, heartbeats and other config options. Our sysadmins checked network settings, tweaked MTUs and watched netstat ad nauseam, to no avail. Note that while we have access to each server that hosts a RabbitMQ node, we did not set up the VPN and don't know the whole network topology (firewalls and such) at each node. The IT person who manages the network swears it's the same in every location, including the ~5 failing nodes.
Enough with the prologue, here's the bizarre connection behaviour: watching the connection that serves the federation queue of an unhealthy slave, the chart in the management panel shows sporadic bursts of outgoing traffic and then nothing else (apart from the heartbeat every 30 seconds). Said heartbeat tells us the connection is alive and well... but idle the whole time. Meanwhile, there are ~90k messages in the queue.
In the attached picture you can see this on the left, while on the right there is one of the healthy slaves.
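In case someone wants the same numbers without a screenshot, this is roughly how they could be pulled from the management HTTP API on the master node. It is only a sketch: the host, credentials and queue name in the usage comment are placeholders, not our real ones, and it assumes the management plugin is listening on its default port.

```python
import base64
import json
import urllib.parse
import urllib.request


def queue_url(base_url: str, vhost: str, queue: str) -> str:
    """Build the management API URL for one queue (vhost and name percent-encoded)."""
    def quote(text: str) -> str:
        return urllib.parse.quote(text, safe="")
    return f"{base_url}/api/queues/{quote(vhost)}/{quote(queue)}"


def summarize(info: dict) -> dict:
    """Pick the backlog and rate fields out of a queue description."""
    stats = info.get("message_stats", {})
    return {
        "ready": info.get("messages_ready", 0),
        "unacked": info.get("messages_unacknowledged", 0),
        "publish_rate": stats.get("publish_details", {}).get("rate", 0.0),
        "deliver_rate": stats.get("deliver_get_details", {}).get("rate", 0.0),
    }


def fetch_queue(base_url: str, vhost: str, queue: str, user: str, password: str) -> dict:
    """GET the queue description as JSON, using HTTP basic auth."""
    request = urllib.request.Request(queue_url(base_url, vhost, queue))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Usage (placeholder host, credentials and queue name):
# info = fetch_queue("http://master.example:15672", "/",
#                    "federation: events -> slave-07", "guest", "guest")
# print(summarize(info))
```

Sampling this every minute or so for a healthy and an unhealthy slave should show whether the backlog sits in "ready" (nothing is being handed to the link) or in "unacked" (messages were sent but acknowledgements are not coming back), which might narrow down where things stall.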
Any idea what could cause this? Could it just be down to a shitty network on the downstream node? Are some of the servers cursed?
Thanks in advance,
Daniele