Hello,
I am new here and I have a couple of questions about HashiCorp's Raft package. Philip told me to share my questions here so I could discuss them with more people, so here I am.
Recently I have been obsessed with building a distributed key/value store using Raft and the Badger database. I hope to embed it into my applications :). There aren't many projects that use HashiCorp's raft package and are good to learn from; I have reviewed many, but most are abandoned. I think rqlite is the best one to study. Luckily we have rqlite.
Q1: Raft replays logs from the last snapshot into the FSM after a restart or after joining a cluster. Since rqlite persists its data to SQLite, how does it handle those replayed logs at startup or after joining a cluster? If the database is a key/value store it is not a big deal, because re-setting or re-deleting the same keys in the same order is idempotent on the persistent store, but I don't understand how rqlite handles this with statements like INSERT/UPDATE/DELETE.
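One common way to make replay safe on a persistent store (a sketch of the general pattern, not rqlite's actual implementation; `kvStore` and `applyCommand` are names I made up) is to persist the raft log index of the last applied command atomically alongside the data, and have the FSM skip any replayed entry at or below that index:

```go
package main

import "fmt"

// kvStore is a toy persistent store whose FSM records the raft log index
// of the last command it applied. In a real store, lastAppliedIndex would
// be written in the same transaction as the data.
type kvStore struct {
	data             map[string]string
	lastAppliedIndex uint64
}

// applyCommand applies a SET at a given raft log index.
// Replayed entries (index <= lastAppliedIndex) become no-ops.
func (s *kvStore) applyCommand(index uint64, key, value string) bool {
	if index <= s.lastAppliedIndex {
		return false // already applied before the restart; skip replay
	}
	s.data[key] = value
	s.lastAppliedIndex = index
	return true
}

func main() {
	s := &kvStore{data: map[string]string{}}
	fmt.Println(s.applyCommand(1, "a", "1")) // first application succeeds
	fmt.Println(s.applyCommand(1, "a", "1")) // replayed entry is skipped
}
```

This makes the FSM idempotent even for non-idempotent statements, because each log entry is applied to durable state exactly once.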
Q2: I remember reading that rqlite can insert roughly 100-1000 records per second, though I cannot find the numbers in the docs now. When I first saw them, I imagined they could be improved by using a key/value store as the backend, but I was wrong. My simple prototype with Badger can set a value in 3-5 ms, so it seems I cannot reach 1000 sets/second. Metrics around the FSM and other functions show that more than 2 ms per write is spent in raft.Apply. Of course Raft must get acknowledgements from the other nodes before committing, but that cost is huge for write-heavy applications. I prototyped two applications: one sent commands to the leader over HTTP using the fastest encoding libraries I could find, the other used gRPC. The timing didn't change; if anything it got slightly slower. Are there any best practices to speed things up a little?
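Since the ~2 ms is per consensus round rather than per key, the usual lever is batching: buffer several commands and submit them as a single raft log entry, so one Apply round-trip covers many writes. A minimal sketch, assuming a JSON-encoded batch (`setCmd` and this wire format are my own illustration, not rqlite's):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// setCmd is one key/value write inside a batch.
type setCmd struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

// encodeBatch packs many commands into one payload, suitable for a
// single raft Apply call instead of one call per command.
func encodeBatch(cmds []setCmd) ([]byte, error) {
	return json.Marshal(cmds)
}

// applyBatch is what the FSM would do with that single log entry:
// decode the batch and apply every command in order.
func applyBatch(store map[string]string, payload []byte) error {
	var cmds []setCmd
	if err := json.Unmarshal(payload, &cmds); err != nil {
		return err
	}
	for _, c := range cmds {
		store[c.Key] = c.Value
	}
	return nil
}

func main() {
	store := map[string]string{}
	cmds := make([]setCmd, 0, 100)
	for i := 0; i < 100; i++ {
		cmds = append(cmds, setCmd{Key: fmt.Sprintf("k%d", i), Value: "v"})
	}
	payload, _ := encodeBatch(cmds)
	_ = applyBatch(store, payload)
	fmt.Println(len(store)) // 100 keys written with one consensus round
}
```

The trade-off is latency for the individual write (it waits for the batch to fill or a timer to fire) in exchange for much higher aggregate throughput.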
Q3: This question is related to question 1. When a client sends an INSERT statement to rqlite, the request is redirected to the leader and the result is returned to the client. Suppose the statement executes successfully at the FSM, but the success message cannot be delivered to the client because of a network failure. The client must then either retry the same request or give up, and either way client and server state can end up inconsistent. I really wonder how rqlite handles such scenarios; I'd be glad if you could guide me. My naive idea is that clients send queries/commands with a request id, and at the FSM the database saves the transaction result in a separate table or key under that request id, so a client that never received the response can look up the outcome instead of re-sending the command. This has overhead, but it gives the client a guarantee. I cannot imagine another approach besides storing transaction results in the distributed database itself. How does this sound? I am fully open to suggestions. Designing everything to be retryable is not easy: retrying the same command can overwrite data that another client has already updated. I plan to use key versioning in my key/value store to prevent accidental overwrites; I hope that helps.
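The request-id idea described above can be sketched like this (all names here, `fsm` and `execute`, are illustrative, not rqlite's API): the FSM keeps each command's outcome under its client-supplied request id, so a retried command returns the stored result instead of being applied a second time.

```go
package main

import "fmt"

// result is the stored outcome of a command.
type result struct {
	Value string
	OK    bool
}

// fsm replicates both the data and the per-request outcomes, so every
// node agrees on which request ids have already been applied.
type fsm struct {
	data    map[string]string
	results map[string]result // request id -> outcome
}

// execute applies a SET exactly once per request id. A duplicate request
// (a client retry after a lost response) returns the original outcome.
func (f *fsm) execute(requestID, key, value string) result {
	if r, seen := f.results[requestID]; seen {
		return r // duplicate: don't re-apply, just report what happened
	}
	f.data[key] = value
	r := result{Value: value, OK: true}
	f.results[requestID] = r
	return r
}

func main() {
	f := &fsm{data: map[string]string{}, results: map[string]result{}}
	f.execute("req-1", "counter", "1")
	// The client times out and retries with the same request id:
	r := f.execute("req-1", "counter", "1")
	fmt.Println(r.OK, len(f.data)) // the command was applied exactly once
}
```

In practice the results table needs a retention policy (e.g. evict entries once the client acknowledges, or after a TTL), otherwise it grows without bound.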
Q4: I need to implement optimistic locking for some database keys so that processing stays consistent across nodes, which makes expiration times very important. How can I handle clock differences between nodes? When the leader changes, some locks could be considered expired even though they must not be. Let me know if there is a robust solution for this.
Well, I don't want to re-invent rqlite, etcd, or Consul :) I just need the basics of an embedded solution. I'd appreciate any help.
Cheers.
Ozan