Hey friends,
I've got a couple of questions about the way we drain our complicated system. My confusion might cause me to ask the wrong questions all-together. I've read the
RFC and I know some things about draining of SQL session, but I still don't really understand much.
Basically I'm interested in the prereqs for calling `Stopper.Stop()`.
So, a node wants to drain. It goes through a series of "
drain modes", and then
stopper.Stop() is called. The first one (
DrainMode_CLIENT) drains the SQL connections and descriptor leases, the second one (
DrainMode_LEASES) relinquishes range leases.
DrainMode_CLIENT starts by going through a dance waiting for all SQL connections to be terminated. If that doesn't happen within a sequence of timeouts (e.g. a rogue session didn't react to cancelation), that part of draining
returns an error. If that error is returned, then we don't proceed to draining the
range leases. However, it looks to me like
we proceed with `stopper.Stop()` just fine.
My first question is whether it makes any sense to skip the transferring of leases in this situation. Seems to me that this can have bad consequences - a SQL session causing unavailability once the node goes down until the leases expire. Conversely, my second question is if it's safe to call stopper.Stop() while SQL queries are still running on that node - e.g. while they might be sending KV requests through the client interface. Note that at the moment, sql connections are not stopper tasks, so they don't directly block the stopper. I don't have a good idea about what might go wrong - but of course all sorts of processes are stopped by the stopper.
The reason why I ask is because I want to introduce some async operations triggered by SQL - I want client.Txn.Rollback() to always do the cleanup (send the EndTxn) even if the operations ctx (the SQL session's ctx) has been cancelled (at the moment, calling txn.Rollback() within a canceled session is ludicrously a no-op. The proposal is to send the EndTxn async if the ctx is canceled. And so I'm wondering what the right model is. Should I worry about making async rollbacks stopper tasks? The TxnCoordSender, which does somewhat analogous async cleanup, does it in a task indeed. Does it matter?
Any and all thoughts on the subject would be most useful.
- a_m