questions about raft+rqlite


Ozan hacıbekiroğlu

Aug 17, 2020, 5:18:12 PM8/17/20
to rqlite
Hello,

I am new here and I have a couple of questions about HashiCorp's Raft package. Philip told me to share my questions here so more people can discuss them, so here I am.
Recently I have been obsessed with building a distributed key/value store using Raft and the Badger database. I hope to embed it in my applications :). There aren't many packages built on HashiCorp's Raft package to learn from; I think rqlite is the best one for learning it. I have reviewed many other packages, but most of them are abandoned. Luckily we have rqlite.

Q1: Raft replays logs to the FSM from the last snapshot after restarting or after joining a cluster. Since rqlite saves data to SQLite persistently, how does it handle replayed logs at startup or after joining a cluster? If the database is a key/value store, it is not a big deal, because re-setting or re-deleting values in the same order is harmless on a persistent store during replay. However, I don't understand how rqlite handles this with statements like INSERT/UPDATE/DELETE.

Q2: I remember reading that rqlite can insert about 100-1000 records per second, but I cannot find it in the docs now. When I first saw those numbers, I imagined this could be improved by using a key/value store as the backend, but I was wrong. My simple prototype with Badger DB can set a value in 3-5 milliseconds, so it seems I cannot reach 1000 sets/second. Metrics on the FSM and other functions show that more than 2 ms is spent in raft.Apply. Of course Raft must get acknowledgements from nodes to commit, but that is a lot for write-heavy applications. I prototyped two applications: one sent commands to the leader over HTTP using the fastest encoding libraries, and the other used gRPC. Nothing changed timing-wise; it may even have gotten a bit slower. Are there any best practices to speed things up a little?

Q3: This question is related to question 1. When a client sends an INSERT statement to rqlite, the node redirects the query to the leader and returns the result to the client. Suppose, for example, that the query executes successfully at the FSM, but the success message cannot be delivered to the client due to a network failure. The client then either retries the same request or gives up. This may cause inconsistency between server and client state. I really wonder how rqlite handles such scenarios. I'll be glad if you can guide me. I naively think that clients could send queries/commands with a request ID, and at the FSM the database could save the transaction result to a separate table or key under that request ID, so the client can check the result if it doesn't get a proper response. That prevents pushing the same command again. It has overhead, but it gives the client a guarantee. I cannot imagine another way other than storing transaction results in a distributed database. How does this sound? I am fully open to any suggestions. Designing everything to be retryable is not easy: retrying the same command can update data already updated by another client. I will use key versioning in my key/value store to prevent accidental overwrites; I hope that helps.

Q4: I need to implement optimistic locking for some database keys for consistent processing among nodes. This makes expiration times very important. How can I handle clock differences between nodes? When the leader changes in the cluster, some locks could appear expired although they should not be. Let me know if there is a robust solution for this.

Well, I don't want to re-invent rqlite, etcd, or Consul :) I just need the basics of an embedded solution. I would appreciate any help.

Cheers.
Ozan

Philip O'Toole

Aug 17, 2020, 8:58:16 PM8/17/20
to rql...@googlegroups.com
Answers inline.

On Mon, Aug 17, 2020 at 5:18 PM Ozan hacıbekiroğlu <ozan...@gmail.com> wrote:
Hello,

I am new here and I have a couple of questions about HashiCorp's Raft package. Philip told me to share my questions here so more people can discuss them, so here I am.
Recently I have been obsessed with building a distributed key/value store using Raft and the Badger database. I hope to embed it in my applications :). There aren't many packages built on HashiCorp's Raft package to learn from; I think rqlite is the best one for learning it. I have reviewed many other packages, but most of them are abandoned. Luckily we have rqlite.

Q1: Raft replays logs to the FSM from the last snapshot after restarting or after joining a cluster. Since rqlite saves data to SQLite persistently, how does it handle replayed logs at startup or after joining a cluster? If the database is a key/value store, it is not a big deal, because re-setting or re-deleting values in the same order is harmless on a persistent store during replay. However, I don't understand how rqlite handles this with statements like INSERT/UPDATE/DELETE.

I'm not following the question. Why is replaying log entries at startup an issue? The whole point is that the log contains every SQLite command that has taken place, in an unchanging order. The database is wiped on startup, so the log is always replayed into a brand-new database. Are you missing that point?
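The determinism behind that answer is visible in the shape of hashicorp/raft's FSM contract: every committed entry is handed to Apply in log order, so replaying the same log into a freshly wiped store always reproduces the same state. A toy sketch of the idea (the command format and the map-backed store are illustrative, not rqlite's actual types):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// command mirrors the kind of entry a Raft log might carry.
// The field names here are illustrative, not rqlite's real format.
type command struct {
	Op    string `json:"op"` // "set" or "delete"
	Key   string `json:"key"`
	Value string `json:"value,omitempty"`
}

// fsm is a toy state machine: a map standing in for the database.
type fsm struct {
	data map[string]string
}

// apply plays one committed log entry into the state machine.
// hashicorp/raft calls the real Apply(*raft.Log) in log order.
func (f *fsm) apply(entry []byte) error {
	var c command
	if err := json.Unmarshal(entry, &c); err != nil {
		return err
	}
	switch c.Op {
	case "set":
		f.data[c.Key] = c.Value
	case "delete":
		delete(f.data, c.Key)
	}
	return nil
}

// replay starts from an empty state and re-applies the whole log.
// This is why wiping the database first is safe: same log, same state.
func replay(log [][]byte) *fsm {
	f := &fsm{data: make(map[string]string)}
	for _, entry := range log {
		f.apply(entry)
	}
	return f
}

func main() {
	log := [][]byte{
		[]byte(`{"op":"set","key":"a","value":"1"}`),
		[]byte(`{"op":"set","key":"a","value":"2"}`),
		[]byte(`{"op":"delete","key":"a"}`),
		[]byte(`{"op":"set","key":"b","value":"3"}`),
	}
	first := replay(log)
	second := replay(log) // replaying again yields the identical state
	fmt.Println(first.data, second.data)
}
```

Because replay never runs against leftover state, ordered INSERT/UPDATE/DELETE entries cannot double-apply.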

Q2: I remember reading that rqlite can insert about 100-1000 records per second, but I cannot find it in the docs now. When I first saw those numbers, I imagined this could be improved by using a key/value store as the backend, but I was wrong. My simple prototype with Badger DB can set a value in 3-5 milliseconds, so it seems I cannot reach 1000 sets/second. Metrics on the FSM and other functions show that more than 2 ms is spent in raft.Apply. Of course Raft must get acknowledgements from nodes to commit, but that is a lot for write-heavy applications. I prototyped two applications: one sent commands to the leader over HTTP using the fastest encoding libraries, and the other used gRPC. Nothing changed timing-wise; it may even have gotten a bit slower. Are there any best practices to speed things up a little?

It's due to the way the Hashicorp Raft system works. I haven't dug into the details, but if you look at the numbers for Consul (which uses the same code), they're about the same. It's hard to get high throughput when you need a quorum of nodes to respond, and every committed log entry written to the database.

I suspect that the Hashicorp implementation could be improved, and made faster. But I haven't looked into it.

If you can execute statements in batches, it will be quicker. At the end of the day, though, rqlite is not a high-performance database. It is definitely a fault-tolerant one.
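On batching: rqlite's /db/execute endpoint accepts a JSON array of statements, so many inserts can ride on a single round of Raft consensus instead of one round per statement. A minimal sketch of building such a request body (the endpoint and port in the comment are rqlite's documented defaults; the request is only constructed here, not sent):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// buildBatch encodes multiple SQL statements as a single JSON array,
// the form rqlite's /db/execute endpoint accepts. One request then
// costs one round of consensus instead of one per statement.
func buildBatch(stmts []string) (*bytes.Buffer, error) {
	b, err := json.Marshal(stmts)
	if err != nil {
		return nil, err
	}
	return bytes.NewBuffer(b), nil
}

func main() {
	body, err := buildBatch([]string{
		`INSERT INTO foo(name) VALUES('fiona')`,
		`INSERT INTO foo(name) VALUES('declan')`,
	})
	if err != nil {
		panic(err)
	}
	// POST this body to http://localhost:4001/db/execute with
	// Content-Type: application/json (rqlite's default HTTP port).
	fmt.Println(body.String())
}
```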


Q3: This question is related to question 1. When a client sends an INSERT statement to rqlite, the node redirects the query to the leader and returns the result to the client. Suppose, for example, that the query executes successfully at the FSM, but the success message cannot be delivered to the client due to a network failure. The client then either retries the same request or gives up. This may cause inconsistency between server and client state. I really wonder how rqlite handles such scenarios.

It doesn't, in particular. This is an intrinsic problem in systems like these; it's not specific to rqlite. Any system responding to a client can suffer such a failure.
 
I'll be glad if you can guide me. I naively think that clients could send queries/commands with a request ID, and at the FSM the database could save the transaction result to a separate table or key under that request ID, so the client can check the result if it doesn't get a proper response. That prevents pushing the same command again. It has overhead, but it gives the client a guarantee.

Yes, that is a good way to solve it. For example: the client generates request IDs and attaches one to each request. If the client doesn't receive an ack for a given request, it sends the request again with the same ID. If the system sees an ID it has already processed, it assumes it's a retransmit and just responds OK. rqlite doesn't support anything like that right now, however. You could probably do something via SQL itself, using a WHERE clause and some special table. Primary database keys may also be helpful here.
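A minimal in-memory sketch of that request-ID scheme (all names are illustrative; in a real system the results table would itself live in the replicated state machine, so every node can answer a retry):

```go
package main

import (
	"fmt"
	"sync"
)

// dedupStore remembers the result of each request ID it has processed,
// so a retransmitted request gets the stored answer instead of being
// applied a second time.
type dedupStore struct {
	mu      sync.Mutex
	results map[string]string
	applied int // how many requests actually mutated state
}

func newDedupStore() *dedupStore {
	return &dedupStore{results: make(map[string]string)}
}

// execute runs fn only the first time a given request ID is seen;
// retries with the same ID return the cached result.
func (s *dedupStore) execute(requestID string, fn func() string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if res, ok := s.results[requestID]; ok {
		return res // retransmit: respond with the earlier result
	}
	res := fn()
	s.results[requestID] = res
	s.applied++
	return res
}

func main() {
	s := newDedupStore()
	insert := func() string { return "OK" }
	fmt.Println(s.execute("req-1", insert)) // applied
	fmt.Println(s.execute("req-1", insert)) // duplicate: cached, not re-applied
	fmt.Println("applied:", s.applied)
}
```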
 
I cannot imagine another way other than storing transaction results in a distributed database. How does this sound? I am fully open to any suggestions. Designing everything to be retryable is not easy: retrying the same command can update data already updated by another client. I will use key versioning in my key/value store to prevent accidental overwrites; I hope that helps.

Q4: I need to implement optimistic locking for some database keys for consistent processing among nodes.

Are you talking about consistent processing among *rqlite* nodes? If so, this question doesn't make sense: rqlite nodes are fully consistent with each other, assuming they are all caught up with the log. That is the point of Raft.
 
This makes expiration times very important. How can I handle clock differences between nodes? When the leader changes in the cluster, some locks could appear expired although they should not be. Let me know if there is a robust solution for this.

Well, I don't want to re-invent rqlite, etcd, or Consul :) I just need the basics of an embedded solution. I would appreciate any help.

Cheers.
Ozan

--
You received this message because you are subscribed to the Google Groups "rqlite" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rqlite+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/rqlite/2160d566-f148-4643-b198-4d2b9a580ad8o%40googlegroups.com.

Ozan hacıbekiroğlu

Aug 18, 2020, 9:46:00 PM8/18/20
to rqlite
Thank you for enlightening me. Here are my comments:
- I missed the point that rqlite removes the database before creating the Raft object in its open() method.
- I'll have to live with hashicorp/raft's performance. It is probably the most understandable Go implementation.
- For SQL it is not difficult to write queries with primary keys, unique indexes, and WHERE conditions to make requests retryable, but since I have been working on a key/value store with Badger DB, it looks harder to me. I found a solution, though: using Badger's "managed mode" with key versioning, and using Raft log indexes as the Badger timestamps.
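A rough sketch of that version-check idea, using the Raft log index as the version number (an in-memory map stands in for Badger here; in Badger's managed mode these versions would become the commit timestamps):

```go
package main

import (
	"errors"
	"fmt"
)

// entry holds a value plus the version at which it was last written,
// e.g. the Raft log index of the entry that wrote it.
type entry struct {
	value   string
	version uint64
}

// versionedKV sketches optimistic locking over a key/value store:
// a write states which version it expects, so a retried or stale
// command cannot silently overwrite a newer write.
type versionedKV struct {
	data map[string]entry
}

var errConflict = errors.New("version conflict: key was modified")

// set succeeds only if the caller's expected version matches the
// stored one; new keys have version 0.
func (kv *versionedKV) set(key, value string, expect, logIndex uint64) error {
	if kv.data[key].version != expect {
		return errConflict
	}
	kv.data[key] = entry{value: value, version: logIndex}
	return nil
}

func main() {
	kv := &versionedKV{data: make(map[string]entry)}
	fmt.Println(kv.set("a", "1", 0, 10))      // first write, at log index 10
	fmt.Println(kv.set("a", "2", 10, 11))     // expected version matches
	fmt.Println(kv.set("a", "stale", 10, 12)) // conflict: version is now 11
}
```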
Cheers.

On Tuesday, August 18, 2020 at 03:58:16 UTC+3, Philip O'Toole wrote:
Answers inline.


Philip O'Toole

Aug 19, 2020, 4:24:01 PM8/19/20
to rql...@googlegroups.com
OK, great, good luck with your work.
