Hey,
The distributed failure detection does require that ALL nodes can communicate with each other over port 8301.
This means not just client <-> server, but client <-> client as well. This is so that the servers don’t have the burden
of all the health checking.
In the case of having QA / Prod environments, we typically recommend that they each have their own datacenter,
and not to overload them into the same datacenter. That way you get a clean namespace separation, and you
don’t run into strange networking issues when separating the clusters.
Edwin, the behavior you are describing is typical when the TCP port is open but not
the UDP port. The gossip system uses both TCP+UDP, so the UDP failure detector may be failing all the
time, and the TCP anti-entropy is able to recover, causing a constant flapping scenario.
Best Regards,
Armon Dadgar