I'm the maintainer of the radix Go library (https://github.com/fzzy/radix), and I'm working on adding a cluster sub-package. You can find that code here:
The support for simple commands is pretty solid, imo: the client transparently follows MOVED and ASK redirects and keeps its slot mapping updated internally. However, I'm wondering what the best course of action is for certain edge cases.
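For reference, the redirect handling boils down to parsing the error reply and retrying against the node it names. A minimal sketch in Go (not the actual radix code; parseRedirect is a hypothetical helper):

    // Sketch of parsing a redirect error reply such as
    // "MOVED 3999 127.0.0.1:6381" or "ASK 3999 127.0.0.1:6381".
    package cluster

    import (
        "strconv"
        "strings"
    )

    // parseRedirect reports whether errMsg is a MOVED/ASK redirect and, if
    // so, returns the kind, the slot number, and the address to redirect to.
    func parseRedirect(errMsg string) (kind string, slot uint16, addr string, ok bool) {
        parts := strings.SplitN(errMsg, " ", 3)
        if len(parts) != 3 || (parts[0] != "MOVED" && parts[0] != "ASK") {
            return "", 0, "", false
        }
        n, err := strconv.ParseUint(parts[1], 10, 16)
        if err != nil {
            return "", 0, "", false
        }
        return parts[0], uint16(n), parts[2], true
    }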
1) For a single command, on a MOVED I currently update only the slot that the command's key occupied; I don't call CLUSTER SLOTS and re-discover the full topology. My reasoning is that if a large slot change is in progress (happening one slot at a time), I don't want to call CLUSTER SLOTS before the full change has propagated, since I'd just end up calling it again and again until the topology change has actually completed, which might be expensive. The way I have it now costs one extra round-trip per slot that has moved, but given the relatively small, fixed number of slots (16384) I think that hit is negligible, and it's a cleaner solution. Am I wrong in this thinking?
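To make that concrete, the per-slot update is essentially a one-entry write into the slot table (a sketch; the Cluster fields here are hypothetical, not radix's actual internals):

    // Sketch of repointing a single slot on MOVED instead of re-fetching
    // the whole topology with CLUSTER SLOTS. Field names are hypothetical.
    package cluster

    import "sync"

    type Cluster struct {
        mu    sync.Mutex
        slots [16384]string // slot number -> address of the node serving it
        addrs []string      // all known node addresses
    }

    // handleMoved repoints just the slot named in a MOVED reply. During a
    // migration that moves many slots one at a time this costs one extra
    // round-trip per moved slot, rather than a CLUSTER SLOTS call that
    // would have to be repeated until the migration finishes anyway.
    func (c *Cluster) handleMoved(slot uint16, newAddr string) {
        c.mu.Lock()
        c.slots[slot] = newAddr
        c.mu.Unlock()
    }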
2) This one is somewhat related to the previous one. My question is: what do I do when a node becomes unreachable? Or, more specifically, at what point do I stop trying to handle errors and return an error to the user? For example, if the master of a set of slots segfaults and isn't restarted, a slave will take over those slots, but my client still has all of those slots pointing at the old, dead master. When a command hits one of those slots it will get a connection-closed error or the like. At that point I could:

a) just return the error to the user and let them call Reset;
b) call Reset implicitly, which will call CLUSTER SLOTS and re-create the topology; or
c) try the command on a different connection, knowing it won't work there, but hoping to get a MOVED back pointing at the correct node.

If there were some kind of pubsub system that sent out alerts about topology changes, like there is with sentinel, that would be ideal, but as far as I can find that doesn't exist for cluster.
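For what it's worth, option c) could look roughly like this, building on the Cluster sketch above (the helpers here are likewise hypothetical, not radix's API):

    // Sketch of option c): on a transport error, retry once on some other
    // known node and let its MOVED reply repoint the slot at the new master.
    func (c *Cluster) tryCmd(slot uint16, run func(addr string) error) error {
        c.mu.Lock()
        addr := c.slots[slot]
        c.mu.Unlock()

        err := run(addr)
        if err == nil || !isConnError(err) {
            return err // success, or a redis-level error to surface as-is
        }
        // The node mapped to this slot is unreachable. Ask any other known
        // node: it will almost certainly answer MOVED, and the redirect
        // handling then fixes the slot table for us.
        return run(c.anyOtherAddr(addr))
    }

    // anyOtherAddr returns some known node other than dead, if one exists.
    func (c *Cluster) anyOtherAddr(dead string) string {
        c.mu.Lock()
        defer c.mu.Unlock()
        for _, a := range c.addrs {
            if a != dead {
                return a
            }
        }
        return dead
    }

    // isConnError reports whether err looks like a transport failure rather
    // than an error reply from redis. (Needs "net" imported.)
    func isConnError(err error) bool {
        _, ok := err.(net.Error)
        return ok
    }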
3) This one is unrelated to the previous two. I have some minimal support for pipelining requests in a cluster in the driver right now: it sends the whole set of commands to the node chosen by the first command that has a key. For example, MULTI, SET foo bar, SET bar baz, EXEC will all go to the node chosen by the key "foo". Right now I haven't implemented any MOVED or ASK handling for pipelining, though.

My reasoning is that if two different parts of the pipe get MOVED to different nodes, things could get messy real quick. What if it was a transaction? I can't just split one transaction into two. What if all the keys are on the same slot, but that slot is in the middle of being migrated? Half the keys could get ASK'd, which again breaks transactions. And if command order matters at the application level and a command in the middle of the pipe gets ASK'd, I can't just stop sending commands after that point, since they all get sent at once, so the commands would be processed in a different order than intended. That can be solved with a transaction, but then we hit the problems mentioned before. I guess I just want to make sure I'm not way overthinking this problem and missing some simple solution, or that I'm right and it's better to let the user decide for themselves what they want to do about these errors.
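Just to make the routing concrete: picking the node for the pipe comes down to Redis Cluster's key hashing, which per the spec is CRC16 (XModem) of the key modulo 16384, with {hash tag} support. A sketch (again illustrative, not necessarily how radix spells it):

    package cluster

    import "strings"

    const numSlots = 16384

    // keySlot returns the cluster slot a key belongs to. If the key has a
    // non-empty {hash tag}, only the tag is hashed, so e.g. "{user1}.foo"
    // and "{user1}.bar" are guaranteed to land on the same slot.
    func keySlot(key string) uint16 {
        if i := strings.IndexByte(key, '{'); i >= 0 {
            if j := strings.IndexByte(key[i+1:], '}'); j > 0 {
                key = key[i+1 : i+1+j]
            }
        }
        return crc16([]byte(key)) % numSlots
    }

    // crc16 is the XModem variant (poly 0x1021, init 0) that the cluster
    // spec mandates for key hashing.
    func crc16(b []byte) uint16 {
        var crc uint16
        for _, c := range b {
            crc ^= uint16(c) << 8
            for i := 0; i < 8; i++ {
                if crc&0x8000 != 0 {
                    crc = crc<<1 ^ 0x1021
                } else {
                    crc <<= 1
                }
            }
        }
        return crc
    }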
Sorry this is so lengthy; I just want to make sure I'm doing the right things before I push the code to master and am stuck with whatever poor decisions I've made.