Suggestion about hot-swap application version upgrade

Thiago Vidal

Nov 10, 2015, 9:50:10 AM
to raft-dev
Hi,

I'm working on an implementation of Raft in Java, and we are now thinking about a mechanism for hot, on-the-fly application version upgrades, to minimize downtime.
To ensure a deterministic, idempotent upgrade, we need to be sure that all nodes switch their state-machine application version at the same logical time.

To achieve this, we are considering a mechanism similar to the "joint consensus" approach for cluster membership changes: add the new-version nodes to the cluster as new nodes, then remove the old ones, one at a time, as follows:

1. The leader adds a control entry to the log (one that is not passed to the application) with instructions to download and install the new version's binaries.

2. Once this entry is committed, each node processes the download/installation instructions and spawns a new instance of itself, containing both the old and new versions, on a different TCP port. This new instance then joins the cluster as a read-only, non-voting follower. Assuming a 5-node cluster, there should now be 10 active nodes: 5 on the old version, and 5 with both the old and new versions.

3. The leader appends an entry to add one of the new nodes to the cluster and then another to remove one of the old ones, changing the cluster size from 5 to 6 and then back to 5 again. This is repeated until all of the new nodes are part of the cluster and all of the old-version nodes have been removed. The previous leader then removes itself and shuts down, and a new election should take place among the nodes with the new version.

4. The new leader then appends a "switch version" control entry to the log, which the nodes process once it is committed. From that point on, all subsequent entries are processed using the new version of the state-machine application.
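
A minimal sketch of the control entries in steps 1 and 4, assuming each node applies committed entries strictly in log order. The entry kinds, the Node class, and the "v1"/"v2" labels are hypothetical names chosen for illustration; they are not part of any Raft library. Control entries are handled by the node itself, and only application entries reach the state machine, tagged here with the version that was active when they were applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: names and entry layout are assumptions for illustration.
class ControlEntryDemo {

    enum Kind { APPLICATION, INSTALL_VERSION, SWITCH_VERSION }

    /** A committed log entry: an application command or a control entry. */
    static final class Entry {
        final long index;
        final Kind kind;
        final String payload; // command text, or a version label for control entries

        Entry(long index, Kind kind, String payload) {
            this.index = index;
            this.kind = kind;
            this.payload = payload;
        }
    }

    /** Applies committed entries in log order; control entries never reach the application. */
    static final class Node {
        String activeVersion = "v1";
        final List<String> installed = new ArrayList<>();
        final List<String> applied = new ArrayList<>();

        void apply(Entry e) {
            switch (e.kind) {
                case INSTALL_VERSION:
                    // Stand-in for download/install; the active version is unchanged.
                    installed.add(e.payload);
                    break;
                case SWITCH_VERSION:
                    // Every node sees this entry at the same log index.
                    activeVersion = e.payload;
                    break;
                default:
                    // Application entry: processed by whichever version is active.
                    applied.add(activeVersion + ":" + e.payload);
            }
        }
    }

    public static void main(String[] args) {
        List<Entry> log = Arrays.asList(
            new Entry(1, Kind.APPLICATION, "a"),
            new Entry(2, Kind.INSTALL_VERSION, "v2"),
            new Entry(3, Kind.APPLICATION, "b"),
            new Entry(4, Kind.SWITCH_VERSION, "v2"),
            new Entry(5, Kind.APPLICATION, "c"));
        Node node = new Node();
        for (Entry e : log) {
            node.apply(e);
        }
        System.out.println(node.applied); // [v1:a, v1:b, v2:c]
    }
}
```

Because the switch is itself a log entry, replaying the same log on any node (or in a development environment) changes versions at exactly the same index.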

The advantages would be that all nodes change to the new version at the same logical point (with minimal downtime), and that the log can be replayed in a development environment to simulate the exact moment when the state-machine rules changed.

I'm curious about how other implementations are doing this.
Has anybody considered implementing a hot-swap application version mechanism using Raft? What problems have you faced?
Any other suggestions on how this could be achieved?

We are planning to use this Raft implementation as a streaming journaling system to replace Kafka. A lot of downstream systems depend on the outputs of the state machine, so minimal downtime is a very important requirement.

Thanks for any feedback.

Archie Cobbs

Nov 10, 2015, 1:18:53 PM
to raft-dev
On Tuesday, November 10, 2015 at 8:50:10 AM UTC-6, Thiago Vidal wrote:
I'm working on an implementation of Raft in Java, and we are now thinking about a mechanism for hot, on-the-fly application version upgrades, to minimize downtime.
To ensure a deterministic, idempotent upgrade, we need to be sure that all nodes switch their state-machine application version at the same logical time.

Interesting problem and proposed solution...

My gut tells me, however, that you might be unnecessarily combining two things that could instead be handled independently, and therefore with lower complexity.

Let's assume software version 1.0 uses state machine X1, and X1 accepts transition events (i.e., log entries) from the set E1, and software version 2.0 uses state machine X2, and X2 accepts transition events (i.e., log entries) from the set E2.

While in practice it's likely the case that E2 is a superset of E1, and X2 is therefore "backward compatible" with X1, let's not assume any compatibility for the sake of argument.

From Raft's perspective, it doesn't need to know or care anything about X1 or X2, and the events in E1 and E2 are just opaque binary blobs that get appended to a consensus log. Similarly X1 and X2 are opaque binary blobs that Raft itself only encounters in passing, as the payloads of InstallSnapshot messages. So Raft itself is not affected by any of what follows.

So now you can transform this problem by first creating a new state machine X3 which contains X1 + X2 + Z, where Z is a boolean state variable that means "if false, use X1, otherwise use X2", then upgrading the nodes one at a time from X1 -> X3, and finally flipping Z from false to true with a special new log entry E3. This final Z flip then becomes your atomic, point-in-time upgrade point.
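
A minimal sketch of such an X3, assuming toy stand-ins for X1 and X2 and a string-encoded log. The class names, the "FLIP" marker standing in for E3, and in particular the way X2 is seeded from X1's state at the flip are assumptions made for illustration; how X2 actually obtains its initial state depends on the application.

```java
// Hypothetical sketch: X1 and X2 are toy state machines, "FLIP" stands in for E3.
class CompositeStateMachineDemo {

    /** Stand-in for X1: keeps a running sum of integer events (E1). */
    static final class X1 {
        long sum;

        void apply(String event) {
            sum += Long.parseLong(event);
        }
    }

    /** Stand-in for X2: keeps the sum and also counts events (E2). */
    static final class X2 {
        long sum;
        long count;

        void apply(String event) {
            sum += Long.parseLong(event);
            count++;
        }
    }

    /** X3 = X1 + X2 + Z. */
    static final class X3 {
        static final String FLIP = "FLIP";
        final X1 x1 = new X1();
        final X2 x2 = new X2();
        boolean z; // false: use X1, true: use X2

        void apply(String event) {
            if (FLIP.equals(event)) {
                if (!z) {
                    // Assumed migration: seed X2 from X1's state at the flip point.
                    x2.sum = x1.sum;
                    z = true;
                }
            } else if (z) {
                x2.apply(event);
            } else {
                x1.apply(event);
            }
        }
    }

    public static void main(String[] args) {
        X3 machine = new X3();
        for (String event : new String[] {"1", "2", "FLIP", "3"}) {
            machine.apply(event);
        }
        // prints: z=true x1.sum=3 x2.sum=6 x2.count=1
        System.out.println("z=" + machine.z + " x1.sum=" + machine.x1.sum
            + " x2.sum=" + machine.x2.sum + " x2.count=" + machine.x2.count);
    }
}
```

A repeated flip entry is ignored here, so replaying the log remains deterministic.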

Each node, immediately after restarting after the upgrade, would contain one-time upgrade logic that looks at its state machine, and if it is in the form X1, converts X1 into X3 (just like any automated "schema update" would do).
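
As a rough illustration of that one-time conversion, assume the persisted state is a string carrying a format tag ("X1;" or "X3;"). The tags, the field layout, and the method name are all hypothetical; a real implementation would use whatever snapshot format it already has.

```java
// Hypothetical sketch: the snapshot layout and tags are assumptions for illustration.
class SnapshotUpgradeDemo {

    static final String X1_TAG = "X1;";
    static final String X3_TAG = "X3;";

    /** Converts an X1-format snapshot to X3 format; X3 snapshots pass through unchanged. */
    static String upgradeOnStartup(String snapshot) {
        if (snapshot.startsWith(X3_TAG)) {
            return snapshot; // already upgraded, so running this twice is harmless
        }
        if (snapshot.startsWith(X1_TAG)) {
            String x1State = snapshot.substring(X1_TAG.length());
            // Z starts out false: the node keeps behaving exactly like X1.
            return X3_TAG + "z=false;x1." + x1State;
        }
        throw new IllegalArgumentException("unknown snapshot format: " + snapshot);
    }

    public static void main(String[] args) {
        System.out.println(upgradeOnStartup("X1;sum=3")); // X3;z=false;x1.sum=3
    }
}
```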

With this approach, you just iterate over the nodes, doing: remove from cluster, upgrade, add back to cluster.
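
That loop could be sketched as follows. ClusterAdmin is a hypothetical interface, not the API of any particular Raft implementation, and the in-memory cluster exists only to make the example runnable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: ClusterAdmin is an assumed interface, not a real library API.
class RollingUpgradeDemo {

    interface ClusterAdmin {
        void removeServer(String id);
        void installVersion(String id, String version);
        void addServer(String id);
    }

    /** Upgrades one node at a time: remove from cluster, upgrade, add back. */
    static void rollingUpgrade(ClusterAdmin admin, List<String> nodes, String version) {
        for (String id : nodes) {
            admin.removeServer(id);
            admin.installVersion(id, version);
            admin.addServer(id);
        }
    }

    /** In-memory stand-in that tracks membership and installed versions. */
    static final class InMemoryCluster implements ClusterAdmin {
        final List<String> members = new ArrayList<>();
        final Map<String, String> versions = new LinkedHashMap<>();
        int minMembersSeen;

        InMemoryCluster(List<String> ids) {
            for (String id : ids) {
                members.add(id);
                versions.put(id, "X1");
            }
            minMembersSeen = members.size();
        }

        @Override
        public void removeServer(String id) {
            members.remove(id);
            minMembersSeen = Math.min(minMembersSeen, members.size());
        }

        @Override
        public void installVersion(String id, String version) {
            versions.put(id, version);
        }

        @Override
        public void addServer(String id) {
            members.add(id);
        }
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("n1", "n2", "n3", "n4", "n5");
        InMemoryCluster cluster = new InMemoryCluster(ids);
        rollingUpgrade(cluster, ids, "X3");
        System.out.println(cluster.versions);       // {n1=X3, n2=X3, n3=X3, n4=X3, n5=X3}
        System.out.println(cluster.minMembersSeen); // 4
    }
}
```

With five nodes, the cluster never drops below four members, so a majority stays available throughout.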

While you are upgrading nodes, some will have state machine X3 while others will still have X1, but all nodes will continue to send only E1 messages, and their state machines will agree on the X1 part.

When you flip the Z bit, everybody starts sending E2 messages, all at the same time.

Finally, you can "garbage collect" X1 later at your convenience by doing a second upgrade, this time going from X3 -> X2 by discarding X1.

Not sure if this is simpler overall, but at least it has the property that you don't have to muck with the mechanics of Raft itself to solve the problem.

-Archie


Oren Eini (Ayende Rahien)

Nov 10, 2015, 4:58:35 PM
to raft...@googlegroups.com
This seems like a very complex way to do this.
In particular, the number of failure points along the way is pretty high.

What if a node can't access the relevant URL? What if the installation fails?

Since you already require the new version to handle the old version's messages, why not do a rolling upgrade of all the servers?
Once the admin has done that, send a command through the log that switches them to the new state machine version.

Diego Ongaro

Nov 10, 2015, 6:24:34 PM
to raft...@googlegroups.com
Hi Thiago,

Before I let this continue, let me get everyone caught up on prior discussions. I brought this topic up back in April in the thread "state machine rolling upgrades" https://groups.google.com/d/msg/raft-dev/ZWV7OI0U9-4/luRrxB9arXgJ and in LogCabin issue #125, linked within.

At that time, Kijana suggested leveraging cluster membership changes during the upgrade, but I argued against it:
that's quite heavyweight for what's really needed (a marker in the log) and assumes you have the spare capacity to start up a new cluster. 

You're right about needing a log entry to control the switch and about needing the new code to emulate the old code for a while. But once you've implemented these things, what additional benefit do membership changes bring?

Best,
Diego

Thiago Vidal

Nov 11, 2015, 9:30:31 AM
to raft-dev
Hi,

I had read some past messages, but it looks like I did not go far enough back. Apologies for duplicating the discussion.

My team and I started thinking about application version upgrades in the same week we were implementing the cluster membership part of the algorithm. Maybe that made us think the two were related, but after reading all your messages and the other thread, it became clearer to us that this is not necessarily true.

The idea of having a version of the application that is backwards compatible, both protocol- and application-wise, plus a single entry in the journal just to flip to the next version once all the nodes have been upgraded, sounds the most reasonable.
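
One way to guard the "once all the nodes have been upgraded" condition is a check that runs before the flip entry is appended. This is only a sketch: the version numbers and the way nodes report them are assumptions, not something specified in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: how nodes report their installed version is an assumption.
class FlipGateDemo {

    /** True only when every node reports a version that understands the flip entry. */
    static boolean readyToFlip(Map<String, Integer> reportedVersions, int requiredVersion) {
        if (reportedVersions.isEmpty()) {
            return false;
        }
        for (int version : reportedVersions.values()) {
            if (version < requiredVersion) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> reported = new LinkedHashMap<>();
        reported.put("n1", 2);
        reported.put("n2", 2);
        reported.put("n3", 1); // still running the old binaries
        System.out.println(readyToFlip(reported, 2)); // false
        reported.put("n3", 2);
        System.out.println(readyToFlip(reported, 2)); // true
    }
}
```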

Thank you all for the support, that was really helpful.

Best regards.
Thiago.

Diego Ongaro

Nov 12, 2015, 4:50:41 PM
to raft...@googlegroups.com
Hey Thiago,

No worries. It's an interesting topic, and new people thinking about
it from first principles is a good thing.

Best,
Diego