draining semantics

8 views
Skip to first unread message

Andrei Matei

unread,
May 4, 2018, 4:26:48 PM5/4/18
to Tobias Schottdorf, Ben Darnell, Alfonso Subiotto Marques, CockroachDB
Hey friends,

I've got a couple of questions about the way we drain our complicated system. My confusion might cause me to ask the wrong questions all-together. I've read the RFC and I know some things about draining of SQL session, but I still don't really understand much.
Basically I'm interested in the prereqs for calling `Stopper.Stop()`. 

So, a node wants to drain. It goes through a series of "drain modes", and then stopper.Stop() is called. The first one (DrainMode_CLIENT) drains the SQL connections and descriptor leases, the second one (DrainMode_LEASES) relinquishes range leases.
DrainMode_CLIENT starts by going through a dance waiting for all SQL connections to be terminated. If that doesn't happen within a sequence of timeouts (e.g. a rogue session didn't react to cancelation), that part of draining returns an error. If that error is returned, then we don't proceed to draining the range leases. However, it looks to me like we proceed with `stopper.Stop()` just fine.

My first question is whether it makes any sense to skip the transferring of leases in this situation. Seems to me that this can have bad consequences - a SQL session causing unavailability once the node goes down until the leases expire. Conversely, my second question is if it's safe to call stopper.Stop() while SQL queries are still running on that node - e.g. while they might be sending KV requests through the client interface. Note that at the moment, sql connections are not stopper tasks, so they don't directly block the stopper. I don't have a good idea about what might go wrong - but of course all sorts of processes are stopped by the stopper.


The reason why I ask is because I want to introduce some async operations triggered by SQL - I want client.Txn.Rollback() to always do the cleanup (send the EndTxn) even if the operations ctx (the SQL session's ctx) has been cancelled (at the moment, calling txn.Rollback() within a canceled session is ludicrously a no-op. The proposal is to send the EndTxn async if the ctx is canceled. And so I'm wondering what the right model is. Should I worry about making async rollbacks stopper tasks? The TxnCoordSender, which does somewhat analogous async cleanup, does it in a task indeed. Does it matter?

Any and all thoughts on the subject would be most useful.

- a_m

Ben Darnell

unread,
May 6, 2018, 4:41:03 PM5/6/18
to Andrei Matei, Tobias Schottdorf, Alfonso Subiotto Marques, cockroach-db
On Fri, May 4, 2018 at 4:26 PM Andrei Matei <and...@cockroachlabs.com> wrote:

My first question is whether it makes any sense to skip the transferring of leases in this situation. Seems to me that this can have bad consequences - a SQL session causing unavailability once the node goes down until the leases expire.

I agree. If the error is bad enough (I'm not sure what that would be) we should just abort the whole shutdown process and exit immediately, but in general we should continue with the next phase of draining on schedule. 
 
Conversely, my second question is if it's safe to call stopper.Stop() while SQL queries are still running on that node - e.g. while they might be sending KV requests through the client interface. Note that at the moment, sql connections are not stopper tasks, so they don't directly block the stopper. I don't have a good idea about what might go wrong - but of course all sorts of processes are stopped by the stopper.

I think connections should be stopper tasks, although nothing *too* bad would happen without that. You just might see some weird errors from everything shutting down around you. The goal of the stopper is that it's safe to call stop at any point, and the worst thing that can happen is that operations in flight get a "shutting down" error (but only if they deliberately check for the stopper shutting down. Tasks are normally allowed to run to completion)
 


The reason why I ask is because I want to introduce some async operations triggered by SQL - I want client.Txn.Rollback() to always do the cleanup (send the EndTxn) even if the operations ctx (the SQL session's ctx) has been cancelled (at the moment, calling txn.Rollback() within a canceled session is ludicrously a no-op. The proposal is to send the EndTxn async if the ctx is canceled. And so I'm wondering what the right model is. Should I worry about making async rollbacks stopper tasks? The TxnCoordSender, which does somewhat analogous async cleanup, does it in a task indeed. Does it matter?

Yes, we should make the async rollback a task. The reason things like this should be tasks is that the existence of active tasks prevents the shutdown of workers, and in the absence of workers things can deadlock because there's no one reading on a channel. Goroutines that we "own" should nearly always be either tasks or workers. 

-Ben

Tobias Schottdorf

unread,
May 6, 2018, 4:51:38 PM5/6/18
to Ben Darnell, Andrei Matei, Alfonso Subiotto Marques, cockroach-db
I have nothing to add to what Ben said.
--

-- Tobias
Reply all
Reply to author
Forward
0 new messages