We are running some torture tests to see how a NATS cluster behaves when a node or two goes down, and we are running into issues.
We have 3 nodes, deployed as StatefulSets on an Azure Kubernetes cluster. The streams use work queue retention and are replicated across all nodes; no explicit limits are set.
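For anyone trying to reproduce this, a stream set up as described would look roughly like the sketch below. The field names follow the JetStream stream configuration JSON; the stream name and subject are placeholders, and the unlimited values are an assumption about what "no explicit limits" amounts to.

```python
import json

# Sketch of a stream configuration matching the description above.
# "ORDERS" and "orders.>" are hypothetical placeholders, not the real names.
stream_config = {
    "name": "ORDERS",            # hypothetical stream name
    "subjects": ["orders.>"],    # hypothetical subject filter
    "retention": "workqueue",    # work queue retention
    "num_replicas": 3,           # replicated on all 3 nodes
    "max_msgs": -1,              # assumed: -1 means no message-count limit
    "max_bytes": -1,             # assumed: -1 means no size limit
    "max_age": 0,                # assumed: 0 means messages never expire by age
}

print(json.dumps(stream_config, indent=2))
```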
When we take a node down, the cluster stops recognizing that messages have been acknowledged.
What I mean by that: our application logic explicitly acks a message after 60 redeliveries (that is, 60 times we could not process the message and nacked it so that it would be redelivered). Under normal conditions, an acknowledged message is not redelivered and its delivery count does not increase any further. But when a node is down, or the nodes are syncing changes, we occasionally see more than 60 deliveries, which means the server has not registered that we already acknowledged that message.
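The ack/nak rule above, together with a simple check for deliveries that arrive after an ack, can be sketched as follows. This is a library-independent illustration in Python, not our PHP code: `decide` and `AckAudit` are hypothetical names, and it assumes the client exposes the stream sequence and delivery count of each message, and that the forced ack happens on the 60th delivery.

```python
from dataclasses import dataclass, field

MAX_DELIVERIES = 60  # threshold from the description above


def decide(num_delivered: int, processed_ok: bool,
           max_deliveries: int = MAX_DELIVERIES) -> str:
    """Return "ack" or "nak" for one delivery attempt.

    Ack when processing succeeded, or when the delivery count has reached
    the threshold (assumed here to mean the 60th delivery); otherwise nak
    so the server redelivers the message.
    """
    if processed_ok or num_delivered >= max_deliveries:
        return "ack"
    return "nak"


@dataclass
class AckAudit:
    """Remembers acked stream sequences and records any later delivery."""
    acked: set = field(default_factory=set)
    violations: list = field(default_factory=list)

    def on_delivery(self, stream_seq: int, num_delivered: int,
                    processed_ok: bool) -> str:
        # A delivery of a sequence we already acked is the symptom reported.
        if stream_seq in self.acked:
            self.violations.append((stream_seq, num_delivered))
        action = decide(num_delivered, processed_ok)
        if action == "ack":
            self.acked.add(stream_seq)
        return action
```

Logging the contents of `violations` together with the time of the node outage would show which messages were delivered again after being acked, and at what delivery count.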
We are using PHP and the only library available for it. NATS is updated to 2.10.22, but similar behaviour was observed with 2.10.19.