Cutover plan for a Gerrit upgrade in Continuous Deployment style (100% uptime)


Luca Milanesio

Feb 19, 2016, 4:54:08 PM
to Repo and Gerrit Discussion
Hi all,
I clearly remember how Shawn described the way they upgrade Gerrit with 100% uptime ... and I was *amazed* :-)
I re-published the post on [1].

At GerritHub.io we will soon need to move to a new infrastructure and I wanted to provide a 100% uptime experience during the upgrade/move of data and DBs over the new servers.
Bitbucket and GitHub have both failed at this and instead asked users to accept a "planned outage", which seems to me quite unacceptable for services that are crucial for companies developing worldwide, seven days a week.

As we clearly cannot rely on Google's infrastructure (first of all on BigTable) I would need to accept some compromises:
a) Read-only access for a limited period of time (5-10 minutes max)
b) Pre-migration and sync before roll-out
c) Temporary inconsistency on the new server due to on-line reindexing

Here is my draft cutover plan: I would definitely appreciate feedback to highlight potential problems and pitfalls :-)

--- * ---

Gerrit upgrade cutover plan
===========================
Basic principle: Blue/Green deployment (see http://martinfowler.com/bliki/BlueGreenDeployment.html)

Blue is old-Gerrit, Green is new-Gerrit

Phase-0: Pre-upgrade
Blue-Gerrit replicates repos to Red-Gerrit
Blue-Gerrit DB is transferred to Red-Gerrit every 10 mins
Red-Gerrit index gets updated for new changes every 10 mins

Phase-1: Blue-Gerrit/Git operations set to read-only
Blocks all POST/PUT/DELETE REST-API on Blue-Gerrit resources
Install a plugin script on Blue-Gerrit to block all incoming Git pushes (receive-pack) with a friendly message ("we are migrating to Red-Gerrit, please retry later")

Phase-2: Wait for all pending replication actions to complete
Whilst Blue-Gerrit is in read-only, transfer the Blue-Gerrit DB over the Red-Gerrit for the last time
Wait until the Blue-Gerrit replication queue is empty
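To make the "queue is empty" check repeatable rather than eyeballed, something along these lines could poll Gerrit's task queue over SSH. The parsing of the `show-queue` summary line is an assumption about the output format and may need adjusting for your Gerrit version:

```python
import re
import subprocess
import time

def count_pending_tasks(queue_output):
    """Count tasks reported by `gerrit show-queue -w`.

    Assumes a summary line like '  2 tasks' (format may vary across
    Gerrit versions -- adjust the pattern to match your output).
    """
    match = re.search(r"^\s*(\d+)\s+tasks?\s*$", queue_output, re.MULTILINE)
    return int(match.group(1)) if match else 0

def wait_for_empty_queue(host, port=29418, poll_secs=10):
    """Poll the Gerrit SSH task queue until no tasks remain."""
    while True:
        output = subprocess.run(
            ["ssh", "-p", str(port), host, "gerrit", "show-queue", "-w"],
            capture_output=True, text=True, check=True,
        ).stdout
        if count_pending_tasks(output) == 0:
            return
        time.sleep(poll_secs)
```

A cutover script could call `wait_for_empty_queue("blue-gerrit.example.com")` between the read-only flip and the final DB transfer.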

Phase-3: Red-Gerrit upgrade
Red-Gerrit is upgraded to the new version
Red-Gerrit started
Trigger online-reindex and cache warm-up

Phase-4: Redirect traffic to Red-Gerrit
Flip the switch and redirect all traffic from Blue-Gerrit to Red-Gerrit

--- * ---

Feedback is more than welcome :-)

Luca.


Matthias Sohn

Feb 19, 2016, 5:27:56 PM
to Luca Milanesio, Repo and Gerrit Discussion
so Green == Red ?
Looks like you invented a new deployment principle ;-)

I think you should also consider the Gerrit importer plugin [1], which we implemented to enable a smooth
migration from one Gerrit server to another, one project at a time. We actually
used it to merge two Gerrit servers into one, since we wanted to reduce the number of systems
we have to maintain.

This makes it possible to avoid a global downtime entirely. Developers can continue to work on the
old system until almost all changes are replicated. Then the project is made read-only on both systems for
a short time to allow the replication to complete. After the switch the team can continue
working on the new system.

Out of the box this isn't transparent to the users; they need to explicitly use the new URL
after their project has been switched over. This could be fixed by adding a smart reverse proxy
that routes each request to the system where the request's project is currently editable.
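As a sketch of this smart-reverse-proxy idea (hostnames here are hypothetical examples), the routing decision itself is tiny: keep a set of already-migrated projects and pick the backend per request:

```python
# Hypothetical routing table for a per-project migration proxy:
# requests for already-migrated projects go to the new server,
# everything else stays on the old one.
OLD_BACKEND = "https://gerrit-old.example.com"
NEW_BACKEND = "https://gerrit-new.example.com"

def backend_for(project, migrated):
    """Return the backend where the given project is currently editable."""
    return NEW_BACKEND if project in migrated else OLD_BACKEND
```

The real work would be extracting the project name from each incoming URL and keeping the `migrated` set in sync with the importer's progress.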


-Matthias

Luca Milanesio

Feb 19, 2016, 5:36:50 PM
to Matthias Sohn, Repo and Gerrit Discussion
Hi Matthias,
good point about the importer plugin, I need to experiment with it for a differential upgrade.

I guess it may take a while to import over 20K projects, so I need to make sure that when the import is completed it is able to "add/amend" already-imported changes.
... and then set Blue (old Gerrit) to read-only, allow the importer to finalise the remaining bits (hopefully just a couple of minutes) and then flip the switch.

How does it behave in terms of:
a) Workload produced on the Blue (old Gerrit) system?
b) Ability to work in a differential way, adding or amending current changes?

Luca.

Luca Milanesio

Feb 19, 2016, 5:50:28 PM
to Matthias Sohn, Repo and Gerrit Discussion
... and forgot to ask: does the migration plugin preserve the original change numbers?

Matthias Sohn

Feb 19, 2016, 6:25:53 PM
to Luca Milanesio, Repo and Gerrit Discussion
On Fri, Feb 19, 2016 at 11:50 PM, Luca Milanesio <luca.mi...@gmail.com> wrote:
... and forgot to ask: does the migration plugin preserve the original change numbers?


I think it doesn't. This wasn't a goal, since we needed the plugin to migrate a few thousand projects
from a smaller server into a larger Gerrit instance which already had 10k projects.
On 19 Feb 2016, at 22:36, Luca Milanesio <Luca.Mi...@gmail.com> wrote:

Hi Matthias,
good point about the importer plugin, I need to experiment it for differential upgrade.

I guess it may take a while to import over 20K projects, so I need to make sure that when the import is completed it is able to "add/amend" already-imported changes.
Yes, it may be slower than migrating everything in one big step.

But I think migrating with the importer plugin reduces the total downtime experienced per project
and reduces the risk of disaster since after migrating a project the affected team can test the result
before the switch over. If anything went wrong the migration can be resumed or redone.
If it turns out after the switch-over that something is wrong, there is still a good chance to fix the
problem until the original project in the old system is deleted.
... and then set Blue (old Gerrit) to read-only, allow the importer to finalise the remaining bits (hopefully just a couple of mins) and then flip the switch.

How does it behave in terms of:
a) Workload produced on the Blue (old Gerrit) system ?
AFAIR it doesn't create a big load, though the migration was spread across several weeks, since
after the initial migration teams first tested their changed build jobs before switching over to the new location.
b) Ability to work in differential way, adding or amending current changes ?
You can stop and resume replication any number of times; the replication state is persisted.
When the migration of a project is completed, this persisted state is deleted.

-Matthias

Matthias Sohn

Feb 19, 2016, 6:27:56 PM
to Luca Milanesio, Repo and Gerrit Discussion
one more note:

our servers use LDAP authentication, so maybe for other authentication types
some piece is missing.

Luca Milanesio

Feb 21, 2016, 5:09:58 PM
to Matthias Sohn, Repo and Gerrit Discussion
Thanks for the hint, but unfortunately we cannot afford to change the Change numbers: all existing users would have their hyperlinks broken :-(
Did you have similar problems @SAP when migrating from one instance to another? How did you manage the hyperlink redirections?

Luca.

Matthias Sohn

Feb 21, 2016, 5:33:56 PM
to Luca Milanesio, Repo and Gerrit Discussion
in our case, merging two Gerrit servers into one, we saw no other simple option
than accepting that we have to break hyperlinks for the users of the smaller server.
There were no complaints about that.

-Matthias

Luca Milanesio

Feb 24, 2016, 6:09:48 AM
to Matthias Sohn, Repo and Gerrit Discussion
I see your point; in your case people had no choice!

I believe that sometime in the future Gerrit should drop the Change number from the URL; that would help a lot when migrating / upgrading / merging instances.

Example:

Current URL for a change:

it should become in the future:

... and *when* this is achieved ... we will not have these sorts of issues anymore :-)

It doesn't seem to be a difficult thing to implement in Gerrit, does it?

Luca.

Edwin Kempin

Feb 24, 2016, 7:09:02 AM
to Luca Milanesio, Matthias Sohn, Repo and Gerrit Discussion
On Wed, Feb 24, 2016 at 12:09 PM, Luca Milanesio <luca.mi...@gmail.com> wrote:
I see your point; in your case people had no choice!

I believe that sometime in the future Gerrit should drop the Change number from the URL; that would help a lot when migrating / upgrading / merging instances.

Example:

Current URL for a change:

it should become in the future:
Ic6c599ee035d7231c5796204e8edd98778d7e872 may not be unique, so you would need the project and branch in addition.
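For reference, the Gerrit REST API already addresses a change unambiguously by accepting a `<project>~<branch>~<Change-Id>` triplet as the change identifier (each component URL-encoded). A minimal helper to build that form could look like:

```python
from urllib.parse import quote

def change_triplet(project, branch, change_id):
    """Build the <project>~<branch>~<Change-Id> identifier accepted by
    the Gerrit REST API, URL-encoding each component (the '~' separators
    stay literal)."""
    return "~".join(quote(part, safe="") for part in (project, branch, change_id))
```

A URL scheme based on such a triplet would survive a server migration, since none of its components depend on a server-local sequence number.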
 



Luca Milanesio

Feb 24, 2016, 7:24:06 AM
to Edwin Kempin, Matthias Sohn, Repo and Gerrit Discussion
True, you may have other projects with the same Change-Id.
However *within* the same project it should be *reasonably* unique, shouldn't it?

Luca.

Edwin Kempin

Feb 24, 2016, 7:26:54 AM
to Luca Milanesio, Matthias Sohn, Repo and Gerrit Discussion
On Wed, Feb 24, 2016 at 1:23 PM, Luca Milanesio <luca.mi...@gmail.com> wrote:
True, you may have other projects with the same Change-Id.
However *within* the same project it should be *reasonably* unique, shouldn't it?
No, e.g. the Change-Id stays the same when you cherry-pick a change between branches.
This is intended so that you can easily find all branches to which a bug-fix was cherry-picked.
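This behaviour is easy to see in code: the Change-Id travels in the commit message footer, so a cherry-pick carries it to the new branch unchanged. A minimal extractor for the standard `Change-Id: I<40 hex chars>` footer line:

```python
import re

def extract_change_id(commit_message):
    """Pull the Change-Id footer from a commit message, or None.

    Because cherry-picking preserves the commit message (and thus this
    footer), the same Change-Id legitimately appears on every branch a
    fix was cherry-picked to.
    """
    match = re.search(r"^Change-Id: (I[0-9a-f]{40})\s*$",
                      commit_message, re.MULTILINE)
    return match.group(1) if match else None
```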

luca.mi...@gmail.com

Feb 24, 2016, 8:03:08 AM
to Edwin Kempin, Matthias Sohn, Repo and Gerrit Discussion
Ah, true !!! So we need the target branch as well ;-)

Luca


lucamilanesio

Feb 26, 2016, 6:51:29 AM
to Repo and Gerrit Discussion
I've been testing the Gerrit Blue-Green approach for a week and it seems to work :-)

See below my feedback on the execution times.

Gerrit upgrade cutover plan
===========================
Basic principle: Blue/Green deployment (see http://martinfowler.com/bliki/BlueGreenDeployment.html)

Blue is old-Gerrit, Green is new-Gerrit

Phase-0: Pre-upgrade
Blue-Gerrit replicates repos to Green-Gerrit

It typically works with a 15s delay; however, for new repos it may take longer, as Gerrit's replication logic is:
a) Create a new empty repo using the adminUrl 
b) Re-schedule a full replication in the queue

The replication queue can at times be quite long ... and you end up with an empty repo on the other side :-(

Even worse, when replication auto-reload is enabled, the replication queue is cleared and the repo on the replication side will remain empty.
This isn't nice and needs to be fixed in the replication plugin's reload mechanism.
 
Blue-Gerrit DB is transferred to Green-Gerrit every 10 mins
Green-Gerrit index gets updated for new changes every 10 mins

It worked fine; the entire DB transfer + upgrade + reindex can actually be completed in less than 3 minutes!
Our new infrastructure is massively parallel and the reindex with parallel threads works nicely.
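The 10-minute sync cycle can be sketched as three commands. Hosts and paths below are hypothetical, and the `reindex --threads` option should be double-checked against the target Gerrit version:

```python
# Sketch of one sync cycle: ship the DB dump, upgrade the schema,
# rebuild the index in parallel. Hosts/paths are illustrative only.
def sync_cycle_commands(db_dump, new_host, site, threads=8):
    """Return the command lines for one Blue->Green sync cycle."""
    return [
        # 1. ship the latest DB dump to the new (Green) server
        ["scp", db_dump, f"{new_host}:{site}/db-import/"],
        # 2. run the schema upgrade non-interactively on the new server
        ["ssh", new_host, "java", "-jar", f"{site}/bin/gerrit.war",
         "init", "--batch", "-d", site],
        # 3. rebuild the secondary index with parallel threads
        ["ssh", new_host, "java", "-jar", f"{site}/bin/gerrit.war",
         "reindex", "--threads", str(threads), "-d", site],
    ]
```

A cron entry every 10 minutes could feed these to `subprocess.run`, matching the cadence described above.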
 

Phase-1: Blue-Gerrit/Git operations set to read-only
Blocks all POST/PUT/DELETE REST-API on Blue-Gerrit resources
Install a plugin script on Blue-Gerrit to block all incoming Git pushes (receive-pack) with a friendly message ("we are migrating to Green-Gerrit, please retry later")

The Git side is easy; the REST-API side isn't, as the resulting Gerrit UX could be quite unexpected.
I need to do some more testing and see if a compensation logic can be found.
 

Phase-2: Wait for all pending replication actions to complete
Whilst Blue-Gerrit is in read-only, transfer the Blue-Gerrit DB over to Green-Gerrit for the last time
Wait until the Green-Gerrit replication queue is empty

Done and it should take no more than 3 minutes.
 

Phase-3: Green-Gerrit upgrade
Green-Gerrit is upgraded to the new version
Green-Gerrit started 
Trigger online-reindex and cache warm-up

This phase can actually be reduced to only the "cache warm-up", as I managed to include the upgrade + index update in Phase-0 (pre-upgrade).
The most critical parts of the cache warm-up are:
- project cache
- groups cache

Of the two, the groups cache is the most critical, as it relies on the GitHub API, which is (unfortunately) quite slow and unstable.
Additionally, GitHub may consider the "cache warm-up" a possible DoS attack (they are quite paranoid about that nowadays) and throttle the group lookups.

I may consider doing a "phased" warm-up and allowing some traffic through anyway.
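A phased warm-up could be as simple as batching the group lookups with a pause between batches, so the backing GitHub API calls don't look like a DoS attempt. Here `lookup` stands in for whatever call populates one cache entry (an assumption, not an actual Gerrit API):

```python
import time

def phased_warmup(groups, lookup, batch_size=20, pause_secs=5.0):
    """Warm a cache in small batches, pausing between batches to stay
    under the upstream API's rate limits."""
    for start in range(0, len(groups), batch_size):
        for group in groups[start:start + batch_size]:
            lookup(group)
        if start + batch_size < len(groups):
            time.sleep(pause_secs)
```

Tuning `batch_size` and `pause_secs` against the observed GitHub throttling thresholds would be the experimental part.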
 

Phase-4: Redirect traffic to Green-Gerrit
Flip the switch and redirect all traffic from Blue-Gerrit to Green-Gerrit

Flipping the switch is easy for the HTTP part (just redirect from the old IP to the new IP), whilst the SSH traffic needs the new IP to propagate through DNS worldwide.
On the SSH side I will have to use port-forwarding in the meantime, which isn't ideal as the traffic will still go through the old server and then be redirected to the new one.

As the DNS change propagates, the number of calls through this "forwarded path" will drop, and I expect it to go down to zero in around 24h.

Any more feedback or suggestions?

Luca. 
 

lucamilanesio

Feb 26, 2016, 6:09:33 PM
to Repo and Gerrit Discussion
Last bit and piece resolved, see below.


On Friday, February 26, 2016 at 11:51:29 AM UTC, lucamilanesio wrote:
I've been testing the Gerrit Blue-Green approach for a week and seems to work :-)

See below my feedback on the execution times.

Gerrit upgrade cutover plan
===========================
Basic principle: Blue/Green deployment (see http://martinfowler.com/bliki/BlueGreenDeployment.html)

Blue is old-Gerrit, Green is new-Gerrit

Phase-0: Pre-upgrade
Blue-Gerrit replicates repos to Green-Gerrit

It typically works with a 15s delay; however, for new repos it may take longer, as Gerrit's replication logic is:
a) Create a new empty repo using the adminUrl 
b) Re-schedule a full replication in the queue

The replication queue can at times be quite long ... and you end up with an empty repo on the other side :-(

Even worse, when replication auto-reload is enabled, the replication queue is cleared and the repo on the replication side will remain empty.
This isn't nice and needs to be fixed in the replication plugin's reload mechanism.
 
Blue-Gerrit DB is transferred to Green-Gerrit every 10 mins
Green-Gerrit index gets updated for new changes every 10 mins

It worked fine; the entire DB transfer + upgrade + reindex can actually be completed in less than 3 minutes!
Our new infrastructure is massively parallel and the reindex with parallel threads works nicely.
 

Phase-1: Blue-Gerrit/Git operations set to read-only
Blocks all POST/PUT/DELETE REST-API on Blue-Gerrit resources
Install a plugin script on Blue-Gerrit to block all incoming Git pushes (receive-pack) with a friendly message ("we are migrating to Green-Gerrit, please retry later")

The Git side is easy; the REST-API side isn't, as the resulting Gerrit UX could be quite unexpected.
I need to do some more testing and see if a compensation logic can be found.

I've contributed a simple Groovy script to set all projects read-only and reject all incoming commits:

Whilst for blocking all REST-API writes, a couple of Apache rewrite rules do the job:

RewriteCond %{REQUEST_METHOD} =PUT
RewriteRule ^(.*) "-" [F]

RewriteCond %{REQUEST_METHOD} =POST
RewriteRule ^(.*) "-" [F]

RewriteCond %{REQUEST_METHOD} =DELETE
RewriteRule ^(.*) "-" [F]


The only drawback is that the user will get a generic "Forbidden" message, without the possibility of saying "Hey! we are upgrading the system, please come back later".
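One possible way around the generic message (a sketch, not tested against GerritHub's actual config): answer with 503 plus a custom maintenance page instead of a bare 403. The `/maintenance.html` path is an example, and non-3xx codes on the `R` flag require Apache 2.4:

```apache
# Serve a friendly "we are upgrading" page for write requests,
# instead of a bare Forbidden response.
ErrorDocument 503 /maintenance.html
RewriteCond %{REQUEST_URI} !=/maintenance.html
RewriteCond %{REQUEST_METHOD} ^(PUT|POST|DELETE)$
RewriteRule ^ - [R=503,L]
```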

 
 

Phase-2: Wait for all pending replication actions to complete
Whilst Blue-Gerrit is in read-only, transfer the Blue-Gerrit DB over to Green-Gerrit for the last time
Wait until the Green-Gerrit replication queue is empty

Done and it should take no more than 3 minutes.
 

Phase-3: Green-Gerrit upgrade
Green-Gerrit is upgraded to the new version
Green-Gerrit started 
Trigger online-reindex and cache warm-up

This phase can actually be reduced to only the "cache warm-up", as I managed to include the upgrade + index update in Phase-0 (pre-upgrade).
The most critical parts of the cache warm-up are:
- project cache
- groups cache

Of the two, the groups cache is the most critical, as it relies on the GitHub API, which is (unfortunately) quite slow and unstable.
Additionally, GitHub may consider the "cache warm-up" a possible DoS attack (they are quite paranoid about that nowadays) and throttle the group lookups.

I may consider doing a "phased" warm-up and allowing some traffic through anyway.

Cache warm-up can be implemented in Groovy as well ... will post it in the next few days.

lucamilanesio

Mar 21, 2016, 5:50:20 PM
to Repo and Gerrit Discussion
We actually implemented the zero-downtime cut-over plan discussed here ... and it worked 100%!
No outages reported by Pingdom or by users.

I have published a full report on:

The experience could be useful for others (non-Googlers) who need a zero-downtime roll-out and failover without necessarily having to implement multi-master.

Hope this helps other people!

Luca.

Saša Živkov

Mar 22, 2016, 6:43:35 AM
to lucamilanesio, Repo and Gerrit Discussion
On Mon, Mar 21, 2016 at 10:50 PM, lucamilanesio <luca.mi...@gmail.com> wrote:
We actually implemented the zero-downtime cut-over plan discussed here ... and it worked 100%!

Cool!

 
No outages reported by PingDom or by users.

I have published a full report on:

This looks like a report for your users.
The technical side of it is only available in this discussion thread, right?
Do you plan to summarize the discussion done here?

 


The experience could be useful for others (non-Googlers) that need a ZeroDowntime roll-out and failover without necessarily having to implement multi-master.

Hope this helps other people !

Definitely a very interesting topic.
 


Luca Milanesio

Mar 22, 2016, 6:54:56 AM
to Saša Živkov, Repo and Gerrit Discussion
On 22 Mar 2016, at 10:42, Saša Živkov <ziv...@gmail.com> wrote:



On Mon, Mar 21, 2016 at 10:50 PM, lucamilanesio <luca.mi...@gmail.com> wrote:
We actually implemented the zero-downtime cut-over plan discussed here ... and it worked 100%!

Cool!

 
No outages reported by PingDom or by users.

I have published a full report on:

This looks like a report for your users.
The technical side of it is only available in this discussion thread, right?
Do you plan to summarize the discussion done here?

Good point, I will definitely follow-up with the associated discussions, including the option of using the project migration tool :-)

Luca.