jgroups-raft and large snapshots?


Rickard Öberg

Aug 24, 2025, 8:16:47 PM
to jgroups-raft
Hi!

I am investigating building a new type of database (I know, I know..), and am looking for Raft implementations for the clustering. My first instinct was to check whether JGroups had one, and so here I am. The API seems to be going roughly in the direction I need, but the main question is how it would behave when integrated with an actual database. Mainly, how would it behave if the snapshot to read/write were, say, 50 GB? Any thoughts on this would be appreciated!

cheers, Rickard

José Bolina

Aug 26, 2025, 11:33:40 AM
to Rickard Öberg, jgroups-raft
Hey, Rickard

Thanks for pointing that out. I missed the reply-all button. I've added the group again.

That sounds like a good fit. Using the snapshot only for metadata should be fine, and it should be quick enough to write and transfer over the network.
Regarding the recovery of new members: Raft has some known situations where adding a new (empty) member can cause unavailability, given bad timing.
For example, adding a new empty node increases the quorum size needed to commit an entry (2 -> 3). The new node can't acknowledge new entries until it catches up, either by replaying the log or by transferring the snapshot. If something happens to one of the other nodes in the meantime, the remaining quorum won't be sufficient to commit new entries.
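The quorum arithmetic behind this failure mode can be sketched in plain Java (the class and method names here are illustrative, not jgroups-raft API):

```java
// Sketch (plain Java, no jgroups-raft dependency): why adding an empty
// node can hurt availability. Raft commits an entry once a majority of
// the *current* membership has stored it.
public class QuorumDemo {
    // Majority for a cluster of n members: floor(n/2) + 1.
    static int majority(int members) {
        return members / 2 + 1;
    }

    public static void main(String[] args) {
        // 3-node cluster: 2 acks commit an entry.
        System.out.println(majority(3)); // 2
        // Add a fourth, empty node: now 3 acks are required, but the
        // newcomer cannot ack until it has caught up. If one of the
        // original nodes fails meanwhile, only 2 acks are reachable
        // and the cluster cannot commit.
        System.out.println(majority(4)); // 3
    }
}
```

With 3 members, 2 acks commit an entry; adding a 4th (empty) member raises the requirement to 3 acks, so until the newcomer catches up, losing a single original node is enough to stall commits.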

We have addressed this in JGroups Raft using learner nodes. This should be available in the next release.
Using a learner, you can spin up a new node; it will receive the updates, but it won't count towards the commit quorum. You can include the new node in the Raft membership once it is up to date, so it is ready to commit new entries right away.

I'll see about a new release this week or next, so you can already experiment with learner nodes.


Cheers,

On Tue, Aug 26, 2025 at 2:21 AM Rickard Öberg <rickar...@gmail.com> wrote:
Hi Jose!

Thanks for your reply. I notice the Google Group isn't cc'ed; not sure if that was intentional.

In any case, let me give you some context. The database I have in mind would bring several existing pieces together, and the main data storage layer would be MVStore from the H2 embedded SQL database. I already have my own WAL layer and file handling, so the Raft layer would mainly need to provide the replication log management, I think. A change would go to the Raft leader, be synchronized through the protocol to all cluster members, and then be applied to the MVStore data storage. MVStore has backup features that let me get the binary data in a safe way for replication, so I think you are right: it would make the most sense for the "snapshot" to be a metadata descriptor, and on read the new node would fetch the data from another cluster member through some other means, like a simple HTTP request. That would be very efficient and fairly easy to implement. As long as it's OK that readSnapshot can take a long time on the new node, it's fine. Would it block the leader while this is happening, or only the new node that is joining the cluster?
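The out-of-band transfer described above could be sketched with just the JDK's HttpClient: the Raft snapshot carries only a URL (metadata), and the joining node pulls the actual MVStore backup from a peer over HTTP. Class and method names here are hypothetical, not jgroups-raft or MVStore API:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Sketch: download the backup referenced by snapshot metadata to a
// local file; the caller would then restore MVStore from that file.
public class SnapshotFetcher {
    private final HttpClient client = HttpClient.newHttpClient();

    public Path fetch(String backupUrl, Path target) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(backupUrl)).GET().build();
        // Stream the response body straight to disk; a 50 GB backup
        // never needs to be held in memory.
        HttpResponse<Path> response =
            client.send(request, HttpResponse.BodyHandlers.ofFile(target));
        if (response.statusCode() != 200) {
            throw new IllegalStateException("fetch failed: " + response.statusCode());
        }
        return response.body();
    }
}
```

Since `readSnapshot` runs on the joining node, a blocking download like this would only stall that node's catch-up, which matches the question above.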

I agree that the simplicity of the single thread makes sense. I'm planning on using the single writer pattern anyway for the database, so that would align pretty well. At the end of the day I would have to do a test integration to see what happens, and go from there. But knowing that the snapshot does not need to contain all the data is a great start.

cheers, Rickard

On Tue, Aug 26, 2025 at 1:42 AM José Bolina <jbo...@redhat.com> wrote:
Hey, Rickard

That sounds interesting. Currently, we have an “event loop” in JGroups Raft: a single thread processes requests in batches, writes to disk, etc., basically running the algorithm. This single thread is also responsible for triggering a snapshot once the log grows past a certain size threshold.
The current implementation is pretty much the one described in the paper, so it writes/loads the full state machine content into the snapshot and truncates Raft's log. I guess the answer would be: it depends.

If you are aiming to implement your own buffer manager and WAL, I believe you wouldn't necessarily need to keep the full contents in memory, or even write the full content to the Raft snapshot. The snapshot itself could contain only the metadata needed to locate the 50 GB in the database files. So it should behave fine in this case.
However, if you are using Raft's log as the WAL and need to flush the full database contents during the snapshot, then you could run into issues: the time spent reading/writing the snapshot would delay requests. It shouldn't trigger leader changes or disrupt the cluster, though.
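A metadata-only snapshot along these lines could be serialized roughly as follows, assuming snapshot read/write hooks that hand you a DataOutput/DataInput pair (as jgroups-raft's StateMachine does); the field set is illustrative, not prescribed by the library:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Sketch: instead of streaming the full 50 GB state into the snapshot,
// write just enough for a joiner to locate and verify the database
// backup. Fields here are hypothetical examples.
public class SnapshotMetadata {
    String backupPath;     // where a peer can serve the database backup
    long lastAppliedIndex; // Raft index the backup corresponds to
    long checksum;         // integrity check for the transferred file

    void writeTo(DataOutput out) throws IOException {
        out.writeUTF(backupPath);
        out.writeLong(lastAppliedIndex);
        out.writeLong(checksum);
    }

    static SnapshotMetadata readFrom(DataInput in) throws IOException {
        SnapshotMetadata m = new SnapshotMetadata();
        m.backupPath = in.readUTF();
        m.lastAppliedIndex = in.readLong();
        m.checksum = in.readLong();
        return m;
    }
}
```

Writing a few dozen bytes like this keeps the snapshot step on the single event-loop thread cheap, regardless of how large the database files are.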

Does that help? I still need to research how to make the process more efficient in this case. I would like to keep the simplicity of the single thread, which also helps when implementing the state machine.

--
You received this message because you are subscribed to the Google Groups "jgroups-raft" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jgroups-raft...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/jgroups-raft/c02acce2-7389-4cdd-818b-f94bab634687n%40googlegroups.com.


--
José Bolina


--
José Bolina