Hi José!

Thanks for your reply. I notice the Google Group is not cc'ed; not sure if that was intentional or not.

In any case, let me give you some context. The database I have in mind would bring several existing pieces together, and the main data storage layer would be MVStore from the H2 embedded SQL database. So I do indeed have my own WAL layer and file handling, and the Raft layer would mainly need to handle the replication log management, I think. A change would come to the Raft leader, be synchronized through the protocol to all cluster members, and then be applied to the MVStore data storage.

MVStore does have backup features that let me get the binary data in a safe way for replication, so I think you are right: it would make the most sense for the "snapshot" to be a metadata descriptor, and on read the new node would fetch the data from another cluster member through some other means, such as a plain HTTP request. That would be very efficient and fairly easy to implement. As long as it's OK that readSnapshot can take a long time on the new node, it's fine. Would it block the leader while this is happening, or only the new node that is joining the cluster?

I agree that the simplicity of the single thread makes sense. I'm planning to use the single-writer pattern for the database anyway, so that would align pretty well. At the end of the day I would have to do a test integration to see what happens, and go from there. But knowing that the snapshot does not need to contain all the data is a great start.
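To make the restore path concrete, here is roughly what I picture readSnapshot doing on the joining node. The names and signatures are hypothetical (not the actual jgroups-raft API), just a sketch of the HTTP fetch idea:

    // Hypothetical sketch: the snapshot carries only a small descriptor, and the
    // joining node streams the actual MVStore backup from another member over HTTP.
    import java.io.DataInput;
    import java.io.IOException;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    class SnapshotRestore {
        // Called on the new node; may take a long time for a large database.
        void readSnapshot(DataInput in, Path targetDbFile) throws IOException, InterruptedException {
            String sourceUrl = in.readUTF();       // descriptor: where the backup lives
            long lastAppliedIndex = in.readLong(); // descriptor: Raft index the backup covers

            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create(sourceUrl)).GET().build();
            // Stream the backup straight to disk instead of buffering 50GB in memory.
            client.send(request, HttpResponse.BodyHandlers.ofFile(targetDbFile));
            // After this, replay Raft log entries past lastAppliedIndex as usual.
        }
    }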
cheers,
Rickard

On Tue, Aug 26, 2025 at 1:42 AM José Bolina <jbo...@redhat.com> wrote:

Hey, Rickard

That sounds interesting. Currently we have an "event loop" in JGroups Raft: a single thread processes requests in batches, writes to disk, and so on; basically, it runs the algorithm. This same thread is also responsible for triggering a snapshot once the log size exceeds a certain threshold.
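To sketch the idea (simplified, not the actual jgroups-raft implementation):

    // Simplified sketch of the event-loop idea: one thread drains requests in
    // batches, writes them to disk, and snapshots past a log-size threshold.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class RaftEventLoop implements Runnable {
        private final BlockingQueue<byte[]> requests = new LinkedBlockingQueue<>();
        private static final int SNAPSHOT_THRESHOLD = 10_000; // illustrative log size
        private long logSize;

        void submit(byte[] command) { requests.add(command); }

        @Override public void run() {
            List<byte[]> batch = new ArrayList<>();
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    batch.add(requests.take()); // block until the first request arrives
                    requests.drainTo(batch);    // then grab everything else queued
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                appendAndFlush(batch);          // one disk write for the whole batch
                logSize += batch.size();
                if (logSize >= SNAPSHOT_THRESHOLD) { // the same thread snapshots,
                    snapshot();                      // so a slow snapshot delays requests
                    logSize = 0;
                }
                batch.clear();
            }
        }

        private void appendAndFlush(List<byte[]> batch) { /* append entries, fsync */ }
        private void snapshot() { /* write state machine content, truncate log */ }
    }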
The current implementation is pretty much the one described in the paper: it writes/loads the full state machine content into the snapshot and truncates Raft's log. I guess the answer would be: it depends.
If you are aiming to implement your own buffer manager and WAL, I believe you wouldn't necessarily need to keep the full contents in memory, or even write the full content to the Raft snapshot. The snapshot itself could be only the metadata needed to locate the 50GB in the database files. So it should behave fine in this case.
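For example, the write side could boil down to something like this (a rough sketch with made-up names, not our actual API):

    // Sketch: the snapshot is a small descriptor pointing at the database files,
    // instead of the full 50GB of state machine content. Names are illustrative.
    import java.io.DataOutput;
    import java.io.IOException;

    class MetadataOnlySnapshot {
        private final String backupUrl;      // where peers can fetch the MVStore backup
        private final long lastAppliedIndex; // Raft index included in that backup

        MetadataOnlySnapshot(String backupUrl, long lastAppliedIndex) {
            this.backupUrl = backupUrl;
            this.lastAppliedIndex = lastAppliedIndex;
        }

        // Writing the snapshot is now a few bytes of metadata, so the event-loop
        // thread is barely blocked, and the log can still be truncated afterwards.
        void writeSnapshot(DataOutput out) throws IOException {
            out.writeUTF(backupUrl);
            out.writeLong(lastAppliedIndex);
        }
    }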
However, if you are using Raft's log as the WAL and need to flush the full database contents during the snapshot, then you could run into issues: the time spent reading/writing the snapshot would delay requests. It shouldn't trigger leader changes or disrupt the cluster, though.
Does that help? I still need to research how to make the process more effective in this case. I would like to keep the simplicity of the single thread, which also helps when implementing the state machine.
--
José Bolina