Hi all,
I am glad to announce that the remaining issues with 2.15 have been resolved for GerritHub.io and we are ready to migrate!
I am sharing the plan publicly because the associated discussion could be useful for other large-scale installations that plan to migrate in the near future.
(near)-Zero-downtime migration to Gerrit 2.15 / NoteDb
As we've done for the past two years (see [1]), we want to keep upgrading with near-zero downtime, even when a large data migration is involved.
When a schema migration is involved, we typically switch between DataCenters (DCs):
DC1. Primary DC - Canada: currently running 2.14.7-34
DC2. Failover DC - Germany: currently running 2.15-55 (+ read-only plugin)
Each DC is an HA configuration (2x nodes) with active failover, pretty similar to Hugo's setup :-)
There are 4 nodes in total, 2 per DC.
The time-gap between DCs is 30' at the moment, but as we approach the cutover we lower that to a few minutes.
During the switch, the read-only plugin is injected in DC1 and, after a maximum of two minutes, removed from DC2.
From a user's perspective, nothing changes. If you try to make a change during the two minutes of read-only, you'll get a kind notice asking you to retry a bit later.
All the read operations are still served as normal. Stream events are blocked for the two minutes of the switch.
So far, so good: it has been like that for two years and we have 99.99% availability, which isn't bad considering that we run on standard Cloud infrastructure and with *zero* forked code from mainstream Gerrit Code Review.
Challenges with NoteDb migration
When migrating to Gerrit 2.15, the switch may take a lot longer: we cannot afford to switch from a healthy and fast 2.14.7 to a freshly started 2.15 that still has two expensive tasks to perform:
- Migration to NoteDb (takes around 40' in our tests)
- Online Reindex (takes around 20' in our tests)
GerritHub numbers are:
- 14k users
- 500k changes
- 40k repositories, for 2 TB of data overall
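To put the durations above in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes the two tasks would run back to back after a naive switch and that a 99.99% target is measured over a 365-day year; both are assumptions for illustration, not measurements, and the function names are made up for this sketch.

```python
# Rough arithmetic only. The durations are the test figures quoted above;
# running them back to back and using a 365-day year are assumptions.
NOTEDB_MIGRATION_MIN = 40   # NoteDb migration, from our tests
ONLINE_REINDEX_MIN = 20     # online reindex, from our tests
SWITCH_WINDOW_MIN = 2       # read-only window we want to keep


def naive_cutover_minutes() -> int:
    """Minutes of expensive work left if both tasks run after the switch."""
    return NOTEDB_MIGRATION_MIN + ONLINE_REINDEX_MIN


def yearly_downtime_budget_minutes(availability: float, days: int = 365) -> float:
    """Minutes per year that may be lost while still meeting the target."""
    return (1.0 - availability) * days * 24 * 60


if __name__ == "__main__":
    print("naive cutover (min):", naive_cutover_minutes())
    print("99.99% yearly budget (min):",
          round(yearly_downtime_budget_minutes(0.9999), 2))
    print("switch window (min):", SWITCH_WINDOW_MIN)
```

If that hour were to count as degraded service, it would be of the same order as the whole yearly budget, which is why the two tasks need to happen outside the switch window.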
Proposal: NoteDb online upgrade with blue/green
To keep the switch window to a maximum of two minutes, I thought about splitting the two expensive tasks into two phases:
Phase-1: Online Reindex
Phase-2: NoteDb migration
The problem is that I cannot trigger a NoteDb migration (if I am not mistaken) when a Gerrit node has already started.
The plan is then to use the Gerrit HA configuration to activate Phase-2, switching the active and passive nodes.
Phase-1:
Active Node: Online Reindex
Passive Node: Failover
When Online Reindex is done, set noteDb.changes.autoMigrate = true in notedb.config
Phase-2:
Active Node: Normal traffic with the new Index
Passive Node: Restart and trigger the NoteDb migration
Once the passive node is done, a rolling restart of the Active Node will allow using NoteDb on both active and passive nodes.
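The ordering of the two phases can be sketched as a small Python model. This is only an illustration of the sequence proposed above, not Gerrit code: the Node class, its fields and run_plan are hypothetical names, and the sketch does not model traffic, index sharing between nodes, or failures.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for one Gerrit node in an HA pair."""
    name: str
    role: str                 # "active" or "passive"
    new_index: bool = False   # online reindex completed on this node
    notedb: bool = False      # node is using NoteDb for changes


def run_plan(active: Node, passive: Node) -> List[str]:
    """Walk through the proposed phases and return the steps in order."""
    steps: List[str] = []

    # Phase-1: the active node reindexes online, the passive node is failover only.
    active.new_index = True
    steps.append(f"phase-1: {active.name} online reindex done, "
                 f"{passive.name} on standby")

    # Only after the reindex is the migration flag switched on.
    auto_migrate = True
    steps.append("set noteDb.changes.autoMigrate = true")

    # Phase-2: the passive node restarts and migrates while the active node
    # keeps serving normal traffic with the new index.
    if auto_migrate:
        passive.notedb = True
        steps.append(f"phase-2: {passive.name} restarted, NoteDb migration done")

    # Rolling restart of the active node so that both nodes use NoteDb.
    active.notedb = True
    steps.append(f"rolling restart: {active.name} now on NoteDb")
    return steps


if __name__ == "__main__":
    for step in run_plan(Node("node-1", "active"), Node("node-2", "passive")):
        print(step)
```

The point of the ordering is that each expensive task runs on a node that is not inside the two-minute switch window.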
What do you think? Can you think of a better way to do it? Proposals and other ideas are welcome.
Would it be dangerous to have the Active Node serving ReviewDb (still active during the online NoteDb migration) and the Passive re-generating the NoteDb data?
What about giving the ability to trigger and switch the NoteDb modes with a new SSH command?
--- * ---
Any feedback and the associated discussion are *more than welcome* :-)
Luca.
References: