GerritHub.io and zero-downtime migration plan to Gerrit 2.15 / NoteDB

lucamilanesio

Apr 11, 2018, 6:53:44 PM
to Repo and Gerrit Discussion
Hi all,
I am glad to announce that the remaining issues with 2.15 have been resolved for GerritHub.io and we are ready to migrate!
I am sharing the plan publicly because the associated discussion could be useful for other large-scale installations that plan to migrate in the near future.

(near)-Zero-downtime migration to Gerrit 2.15 / NoteDb

As with every upgrade over the past two years (see [1]), we want to do this one with near-zero downtime, even though a large data migration is involved.

When a schema migration is involved, we typically switch between DataCenters (DCs):
DC1. Primary DC - Canada: currently running 2.14.7-34
DC2. Failover DC - Germany: currently running 2.15-55 (+ read-only plugin)

Each DC is an HA configuration (2 nodes) with active failover, pretty similar to Hugo's setup :-)
There are 4 nodes in total, 2 per DC.

The time gap between DCs is 30 minutes at the moment, but as we approach the cutover we lower that to a few minutes.
During the switch, the read-only plugin is injected into DC1 and, after at most two minutes, removed from DC2.

From a user's perspective, nothing changes. If you try to make a change during the two minutes of read-only, you'll get a polite notice asking you to retry a bit later.
All the read operations are still served as normal. Stream events are blocked for the two minutes of the switch.

So far, so good: it has been like that for two years and we have 99.99% availability, which isn't bad considering that we are running on standard Cloud infrastructure and with *zero* forked code from mainstream Gerrit Code Review.
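
To put the 99.99% figure in perspective, here is a small back-of-the-envelope sketch of the yearly downtime budget it implies. It assumes a 365-day year and, purely for illustration, counts each two-minute read-only window as if it were full downtime (the post does not say whether it is counted that way).

```python
# Rough downtime budget implied by an availability target.
# Assumption: a 365-day year; leap years are ignored.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Minutes of unavailability allowed per period at the given availability."""
    return period_minutes * (1.0 - availability)


budget = downtime_budget_minutes(0.9999)

# Illustration only: how many two-minute switch windows would fit in the
# budget if each one were counted as full downtime.
switches = int(budget // 2)

print(f"{budget:.2f} minutes/year, about {switches} two-minute windows")
```

So even under that pessimistic assumption, a handful of switches per year stays well within the budget.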

Challenges with NoteDb migration

When migrating to Gerrit 2.15, the switch may take a lot longer: we cannot afford to switch from a healthy and fast 2.14.7 to a freshly upgraded 2.15 with two expensive tasks still to perform:
- Migration to NoteDb (takes around 40 minutes in our tests)
- Online Reindex (takes around 20 minutes in our tests)

GerritHub numbers are:
- 14k users
- 500k changes
- 40k repositories, 2 TB of data overall

Proposal: NoteDb online upgrade with blue/green

To keep the switch window to a maximum of two minutes, I thought about splitting the two expensive tasks into separate phases:

Phase-1: Online Reindex
Phase-2: NoteDb migration

The problem is that I cannot trigger a NoteDb migration (if I am not mistaken) on a Gerrit node that has already started.
The plan is then to use the Gerrit HA configuration to activate Phase-2, switching the active and passive nodes.

Phase-1:
Active Node: Online Reindex
Passive Node: Failover

When Online Reindex is done, set noteDb.changes.autoMigrate = true in notedb.config.
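
For reference, in the git-config syntax that Gerrit's configuration files use, the setting above would look roughly like the fragment below. The file name follows the post; whether your installation reads the option from that file is an assumption to verify against the 2.15 NoteDb documentation.

```
[noteDb "changes"]
	autoMigrate = true
```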

Phase-2:
Active Node: Normal traffic with the new Index
Passive Node: Restart and trigger the NoteDb migration

Once the passive node is done, a rolling restart of the Active Node will allow using NoteDb on both the active and passive nodes.

What do you think? Can you think of a better way to do it? Other proposals or ideas?
Would it be dangerous to have the Active Node serving ReviewDb (still active during the online NoteDb migration) and the Passive re-generating the NoteDb data?
What about giving the ability to trigger and switch the NoteDb modes with a new SSH command?

--- * ---

Any feedback and the associated discussion are *more than welcome* :-)

Luca.


References:

lucamilanesio

Apr 12, 2018, 7:05:32 AM
to Repo and Gerrit Discussion
@Dave I would be curious to get your point of view on the plan below :-)

The critical point I see is the online NoteDb migration on the passive node: while one of the two nodes is converting, is it able to pick up the modifications coming from the other node as well?

Example:
- Change 123 is migrated by the failover node
- Change 123 is updated on the primary node, but *after* it has been migrated by the failover node

How do we notify the failover node that Change 123 needs to be migrated again?
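
To make the race concrete, here is a toy Python model of it. Everything in it is hypothetical: `row_version` is an illustrative counter bumped on every ReviewDb write, and `migrated_versions` is an illustrative record of what the migrating node saw. It is not how Gerrit tracks this; it only shows the kind of check that would flag Change 123.

```python
from dataclasses import dataclass


@dataclass
class ReviewDbChange:
    change_id: int
    row_version: int  # hypothetical counter, bumped on every ReviewDb write


def find_stale(reviewdb, migrated_versions):
    """Return ids of changes that were written to after they were migrated.

    A change that was never migrated is reported as well.
    """
    stale = []
    for change in reviewdb:
        seen = migrated_versions.get(change.change_id)
        if seen is None or seen < change.row_version:
            stale.append(change.change_id)
    return sorted(stale)


# Change 123 was migrated at version 4, then updated on the primary (version 5).
# Change 124 has not been touched since its migration.
reviewdb = [ReviewDbChange(123, 5), ReviewDbChange(124, 2)]
migrated = {123: 4, 124: 2}

stale_ids = find_stale(reviewdb, migrated)
print(stale_ids)
```

In this model only Change 123 is flagged, which is exactly the case the migrating node would have to notice and redo.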

Luca.

lucamilanesio

Apr 15, 2018, 7:17:26 PM
to Repo and Gerrit Discussion


On Thursday, April 12, 2018 at 12:05:32 PM UTC+1, lucamilanesio wrote:
> @Dave I would be curious to get your point of view on the plan below :-)

Ping ... I am relying on your jet lag ... so you are possibly still awake ;-)

> The critical point I see is the online NoteDb migration on the passive node: while one of the two nodes is converting, is it able to pick up the modifications coming from the other node as well?
>
> Example:
> - Change 123 is migrated by the failover node
> - Change 123 is updated on the primary node, but *after* it has been migrated by the failover node
>
> How do we notify the failover node that Change 123 needs to be migrated again?

I did a bit of code inspection: the ReviewDb vs. NoteDb state is actually stored per change in ReviewDb, in the note_db_state field.
So, in theory, as soon as a change moves to NoteDb, the primary node should detect that and stop using ReviewDb for it.

When all the changes have been migrated, every note_db_state should effectively be set to 'N' and the active node would serve only NoteDb calls.
The rolling restart is just an "act of courtesy" to allow the in-memory refresh of the global NoteDbMigrationState.
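
A minimal sketch of the "are we done yet?" check this implies, assuming the per-change note_db_state values have already been read into a Python dict (how they are fetched is left out, and the sample values are made up). The sketch treats anything other than 'N' as not yet migrated, which is a simplification: the thread does not describe the other possible values.

```python
def pending_changes(states):
    """Ids of changes whose note_db_state is not 'N' (simplified reading)."""
    return sorted(cid for cid, state in states.items() if state != "N")


def migration_complete(states):
    """True once every change reports 'N'."""
    return not pending_changes(states)


# Made-up sample: change 124 has not been migrated yet.
states = {123: "N", 124: None, 125: "N"}

pending = pending_changes(states)
print(pending, migration_complete(states))
```

Running a check like this before the rolling restart would confirm that the active node has nothing left to serve from ReviewDb.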

I will do some more testing next week and, if all goes well, GerritHub.io will migrate to 2.15 :-)

Luca.