Adding distributed transactions to rqlite (Raft-based SQLite database)

oto...@google.com

unread,

Mar 31, 2018, 5:07:38 PM3/31/18

to raft-dev

Hello all,

For those of you that are not familiar with it, rqlite is a distributed relational database built on SQLite and Raft (Hashicorp's pre-1.0 implementation). Right now it has a flavour of transaction support -- a set of SQL statements can be wrapped in a single Raft command, and the leader then (transparently) wraps the SQL statements in BEGIN and COMMIT when executing the Raft command. As a result either all the SQL commands will take effect or none.

I am considering adding full transaction support to rqlite -- basically I would allow users to send explicit BEGIN and COMMIT (or ROLLBACK) commands to the leader. The issue with this has always been what to do if the COMMIT (or ROLLBACK) is never received because (for example) the client crashed. Until a transaction is closed the rqlite system could not be an weird state (due to the nature of SQLite). A crash during transactions might leave the SQLite with a dangling transaction, even after a restart.

I am considering adding a timeout to deal with this. If the leader does not receive a COMMIT or ROLLBACK command within, say, 30 seconds of the last command it received, it would issue ROLLBACK Itself. This ROLLBACK would be sent through the Raft system, guaranteeing that only node would issue (in the case that a leader election takes place during a transaction). The timeout would be configurable of course.

Anyone see any problems with this approach? Is it solid (assuming it is implemented bug-free, of course)? Anyone familiar with how other distributed relational databases handle this case?

Philip

oto...@google.com

unread,

Mar 31, 2018, 5:10:12 PM3/31/18

to raft-dev

On Saturday, March 31, 2018 at 5:07:38 PM UTC-4, oto...@google.com wrote:

guaranteeing that only node would issue (in the case that a leader election takes place during a transaction).

...guaranteeing that only ONE node would issue the ROLLBACK.

jordan.h...@gmail.com

unread,

Mar 31, 2018, 5:47:11 PM3/31/18

to raft...@googlegroups.com

This is certainly one of the proposals to dealing with coordinator failures in two phase commit. There is a lot of literature on handling coordinator/participant failures in 2PC.l that would be useful to you.

We implemented transactions for cross-shard operations in ONOS/Atomix using two-phase commit. To handle coordinator (or client) failures, we use failure detectors and an atomic check-and-set operation to determine which node will take over as coordinator and complete the transaction after a failure is detected. To do this, some transaction state also has to be stored in Raft:

• When a transaction is started, store a transaction ID, coordinator ID, and transaction state, e.g. PREPARING

• When a transaction is committed or rolled back, update the state to COMMITTING/ROLLING_BACK prior to actually performing the operation

• After a failure, a new coordinator takes over by updating the coordinator ID using an atomic check-and-set (through Raft) and the new coordinator uses the transaction state to complete or rollback the transaction

Granted, this is a much different use case because the architecture is so much different. We use two-phase commit in a sharded Raft cluster and multiplex many state machines on a single Raft log within each partition, and these are basically clients taking over transactions for other clients. But the solution is still the same. A timeout is nothing but a rudimentary failure detector. So yes, I think it’s a fine solution, you just have to handle the case where a healthy client’s transaction is rolled back. But this seems the right approach as long as it’s done through Raft as you said.

But even better would be to do this through sessions. Sessions already have a deterministic approximation of time associated with them, and when a session is expired (the client disconnected) the state machine can simply rollback all the transactions for that session.

--
You received this message because you are subscribed to the Google Groups "raft-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Etienne de L'Amour

unread,

Mar 31, 2018, 5:59:33 PM3/31/18

to raft-dev

Many thanks for pursuing this issue, Philip (@oto).

Again, for those who don't know, this issue came up for me in a node.js app because I need to manually BEGIN a transaction, carry out a select query and, if appropriate, insert a new record and grab a row id from that. If the insert failed, I'd ROLLBACK; if the insert succeeded, I'd carry out some more processing based on the retrieved ID. Finally, depending on the outcome of that, I'd either COMMIT or ROLLBACK.

This works fine with a single instance of node.js's better-sqlite3 (which can also tell me if I'm already in a transaction, to avoid nesting), but further down the line I'm sure that I'll need a distributed database setup like rqlite.

On Saturday, 31 March 2018 22:07:38 UTC+1, oto...@google.com wrote:

Hello all,

... I am considering adding full transaction support to rqlite -- basically I would allow users to send explicit BEGIN and COMMIT (or ROLLBACK) commands to the leader ...

Philip O'Toole

unread,

Apr 1, 2018, 3:55:51 PM4/1/18

to raft...@googlegroups.com

Thanks for your response. Some more inline.

On Sat, Mar 31, 2018 at 5:47 PM jordan.h...@gmail.com <jordan.h...@gmail.com> wrote:

This is certainly one of the proposals to dealing with coordinator failures in two phase commit. There is a lot of literature on handling coordinator/participant failures in 2PC.l that would be useful to you.

We implemented transactions for cross-shard operations in ONOS/Atomix using two-phase commit. To handle coordinator (or client) failures, we use failure detectors and an atomic check-and-set operation to determine which node will take over as coordinator and complete the transaction after a failure is detected. To do this, some transaction state also has to be stored in Raft:
• When a transaction is started, store a transaction ID, coordinator ID, and transaction state, e.g. PREPARING

Makes sense. I'd basically write the SQLite command BEGIN to the Raft long (along with some other metadata).

• When a transaction is committed or rolled back, update the state to COMMITTING/ROLLING_BACK prior to actually performing the operation

Is the order important? I was assuming it goes as follows: COMMIT/ROLLBACK as needed, and then write to the log -- the same way any other change to the underlying SQLite store is handled. Why the COMMITTING/ROLLING_BACK as opposed to COMMITTED/ROLLED_BACK?

• After a failure, a new coordinator takes over by updating the coordinator ID using an atomic check-and-set (through Raft) and the new coordinator uses the transaction state to complete or rollback the transaction

Exactly what scenario do you have in mind here? The failures I am thinking about are where the client, or leader, crashes before the client itself issued COMMIT or ROLLBACK. I need some mechanism that deals with dangling transactions.

Granted, this is a much different use case because the architecture is so much different. We use two-phase commit in a sharded Raft cluster and multiplex many state machines on a single Raft log within each partition, and these are basically clients taking over transactions for other clients. But the solution is still the same. A timeout is nothing but a rudimentary failure detector. So yes, I think it’s a fine solution, you just have to handle the case where a healthy client’s transaction is rolled back.

I am not 100% clear what this case is? Are you talking about a healthy client losing its connection to rqlite?

Oren Eini (Ayende Rahien)

unread,

Apr 4, 2018, 1:57:26 AM4/4/18

to raft...@googlegroups.com

Is there a reason you can't treat the transaction as a single block?

So instead of doing this as a separate commands, treat it as a batch?

Alternatively, have the leader send a timer message via Raft and have explicit timeout mechanism via this manner?

So you send BEGIN TX, then have it timeout if there are 15 timer messages or something like that?

Hibernating Rhinos Ltd

Oren Eini l CEO l Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

--

You received this message because you are subscribed to the Google Groups "raft-dev" group.

To unsubscribe from this group and stop receiving emails from it, send an email to raft-dev+unsubscribe@googlegroups.com.

Philip O'Toole

unread,

Apr 4, 2018, 8:41:29 AM4/4/18

to raft...@googlegroups.com

On Wed, Apr 4, 2018 at 5:57 AM Oren Eini (Ayende Rahien) <aye...@ayende.com> wrote:

Is there a reason you can't treat the transaction as a single block?
So instead of doing this as a separate commands, treat it as a batch?

rqlite already supports exactly that functionality: https://github.com/rqlite/rqlite/blob/master/DOC/DATA_API.md#transactions

However users have been requesting more interactive transaction support. The currently supported transaction support is also insufficient to build a fully-featured ORM on top of rqlite.

Alternatively, have the leader send a timer message via Raft and have explicit timeout mechanism via this manner?
So you send BEGIN TX, then have it timeout if there are 15 timer messages or something like that?

Yes, something like that might work.

oto...@google.com

unread,

Apr 23, 2018, 2:01:19 PM4/23/18

to raft-dev

FWIW, the design I am now proposing is outlined in detail here: https://github.com/rqlite/rqlite/issues/266#issuecomment-383665547

Feedback, as always, is welcome.

Philip

Reply all

Reply to author

Forward