[ANN] rqlite v2 -- SQLite replication with Hashicorp Raft

121 views
Skip to first unread message

philip...@yahoo.com

unread,
Apr 10, 2016, 2:27:02 AM4/10/16
to raft...@googlegroups.com
Hello,

About 18 months ago I let the list know about rqlite -- a system that replicates SQLite databases using the Raft consensus protocol. v1 was built on go-raft, which is no longer maintained. So I recently released v2, which now runs on Hashicorp's Raft consensus module. I also improved the rqlite API significantly, and added support for hot backups. You can also choose read-consistency when querying the system. This new version may be of interest to people out there.


https://github.com/otoolep/rqlite

Cheers,

Philip
 
-----------------------------------------
http://www.philipotoole.com

Henrik Ingo

unread,
Apr 10, 2016, 1:31:21 PM4/10/16
to raft...@googlegroups.com
Hi Philip

It seems you've managed to replicate a transactional relational
database using Raft. From knowing both Raft and something about RDBMS
transactions, it is not at all obvious how this is even possible. If
you'd like to take the time to explain more, I'd love to hear it.

To be a bit more specific (but don't let this restrict you in
explaining what you've done), Raft describes a system where
replciation happens first and only at a much later time are the
replicated changes committed to the "state machine", in your casel
sqlite. It assumes that such commit to state machine (sqlite) will
always be possible, since from a cluster point of view the transaction
is already committed. Otoh this is not how a multi-client RDBMS is
working, you might have other clients holding locks to the same
resources that are being replicated.

It's of course possible to replicate also RDBMSes, but it seems to me
that Raft doesn't directly address complex transactions, rather
describes more or less a key-value database.

henrik
> --
> You received this message because you are subscribed to the Google Groups
> "raft-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to raft-dev+u...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.



--
henri...@avoinelama.fi
+358-40-5697354 skype: henrik.ingo irc: hingo
www.openlife.cc

My LinkedIn profile: http://fi.linkedin.com/pub/henrik-ingo/3/232/8a7

Archie Cobbs

unread,
Apr 10, 2016, 2:45:46 PM4/10/16
to raft-dev, henri...@avoinelama.fi
On Sunday, April 10, 2016 at 12:31:21 PM UTC-5, Henrik Ingo wrote:
It seems you've managed to replicate a transactional relational
database using Raft. From knowing both Raft and something about RDBMS
transactions, it is not at all obvious how this is even possible. If
you'd like to take the time to explain more, I'd love to hear it.

I'll let Philip speak for himself, but this is indeed possible.

I did a similar thing with RaftKVDatabase, using an optimistic scheme that only communicates (and detects conflicts) on commit. This takes advantage of the fact that in Raft every node has a copy of the entire database.

Mine is a key/value database rather than a relational database, but the underlying principles are the same.

Details were described in this post.

-Archie

Oren Eini (Ayende Rahien)

unread,
Apr 10, 2016, 3:02:17 PM4/10/16
to raft...@googlegroups.com
I'm not sure about the details in this case, but this is easily done if what you are actually doing is running a transaction, then only doing the "commit" via Raft.
On commit to the Raft log, you apply it to sqlite, and that is the current state (including tx rollback).
You might need to wait until the tx is actually applied (vs. committed to the log) to get a meaningful reply back.

Hibernating Rhinos Ltd  

Oren Eini l CEO Mobile: + 972-52-548-6969

Office: +972-4-622-7811 l Fax: +972-153-4-622-7811

 

philip...@yahoo.com

unread,
Apr 10, 2016, 5:28:26 PM4/10/16
to raft...@googlegroups.com
Hello -- thanks for the all the feedback.

I think some of the confusion comes from assuming that rqlite is more complex than it actually is. The SQLite code has not been touched, and rqlite is a Raft-based layer i.e. log on top of SQLite. Each rqlite node (and therefore SQLite database) gets exactly the same commands, in exactly the same order (thanks to Raft, as we all know). Whether the actual database mutation represented by each log entry actually succeeds, or not, *when it hits the database* doesn't matter. The state machine (again, the SQLite database) remains the same under each node.

Within rqlite the Raft commit is done *first*, then the committed log entry is applied to the SQLite database. Whether the SQLite mutation represented by that log entry succeeds or fails doesn't matter -- it may matter to the end-user, but not as far as keeping the SQLite file identical under each of the rqlite nodes (or quorum in the event of some nodes failing). Once you understand this, you realise it is key that all changes go through rqlite HTTP API -- directly modifying the SQLite file underneath a node would break the system

Thanks to the Hashicorp Raft implementation, which blocks requests until they are actually applied to the state machine (again, the SQLite file), it all works very nicely. In other words, it works the way Archie said.

Let me know if this doesn't make sense, or if I am missing something.

Philip
 
-----------------------------------------
http://www.philipotoole.com

Diego Ongaro

unread,
Apr 10, 2016, 5:35:54 PM4/10/16
to raft...@googlegroups.com
Philip,

What about time and randomness used in queries? Couldn't that cause the sqlite state machines to diverge?

-Diego

philip...@yahoo.com

unread,
Apr 10, 2016, 5:43:16 PM4/10/16
to raft...@googlegroups.com
Diego -- can you explain further what you mean? Give me an example? I'm not sure I follow you.

Philip
 
-----------------------------------------
http://www.philipotoole.com

Caleb Spare

unread,
Apr 10, 2016, 5:57:25 PM4/10/16
to raft...@googlegroups.com
sqlite> create table t (n integer not null);
sqlite> insert into t (n) values (random());
sqlite> select * from t;
-4969508820637493894

Apply the exact same statement to different nodes yields different DB state.

philip...@yahoo.com

unread,
Apr 10, 2016, 6:01:12 PM4/10/16
to raft...@googlegroups.com
Ah -- OK, yeah, I was wondering if you meant something like that. I guess I better add some more caveats to this section:


Something like "only statements that would result in deterministic outcome in the database should be used". Anything else I am missing?

Philip
 
-----------------------------------------
http://www.philipotoole.com

Alex Bligh

unread,
Apr 10, 2016, 6:36:36 PM4/10/16
to raft...@googlegroups.com, Alex Bligh

On 10 Apr 2016, at 22:26, philip.otoole via raft-dev <raft...@googlegroups.com> wrote:

> I think some of the confusion comes from assuming that rqlite is more complex than it actually is. The SQLite code has not been touched, and rqlite is a Raft-based layer i.e. log on top of SQLite. Each rqlite node (and therefore SQLite database) gets exactly the same commands, in exactly the same order (thanks to Raft, as we all know). Whether the actual database mutation represented by each log entry actually succeeds, or not, *when it hits the database* doesn't matter. The state machine (again, the SQLite database) remains the same under each node.

That's interesting. I read someone's (possibly your) account of doing this before.

Do you support restoration from a raft snapshot? If so how does that work?

--
Alex Bligh




philip...@yahoo.com

unread,
Apr 10, 2016, 6:45:54 PM4/10/16
to raft...@googlegroups.com, Alex Bligh
>Do you support restoration from a raft snapshot? If so how does that work?

Yes, in the sense it's all taken care of by the Hashicorp system, though rqlite must provide a snapshot of the underlying SQLite file (if this is what you mean). rqlite also supports hot backups.


Philip
 
-----------------------------------------
http://www.philipotoole.com



Henrik Ingo

unread,
Apr 11, 2016, 5:11:16 AM4/11/16
to raft...@googlegroups.com
Philip

I remember now that sqlite only allows a single write transaction at
the time, locking the entire database when you do "begin;". In this
case of course the sequence of transactions serializes trivially, and
what you're doing is fine. (Apart from the limitation on deterministic
transactions, etc. MySQL statement based replication has developed an
entire science around doing that correctly.)

Which is great, I think what you've put together should definitively
have some market out there.

henrik

On Mon, Apr 11, 2016 at 12:26 AM, philip.otoole via raft-dev

Henrik Ingo

unread,
Apr 11, 2016, 5:14:54 AM4/11/16
to raft...@googlegroups.com
Archie, Oren

Replicating on commit is what I'd expect such a "single master
synchronous replication system" to do as well. Thanks for sharing.

As this is not covered by the vanilla Raft paper, I think there's
certainly room for a separate paper covering that alteration to cover
multi-statement transactional systems. As Raft is becoming very
popular, I don't think it's safe to assume every implementor
understands this difference.

henrik
Reply all
Reply to author
Forward
0 new messages