Some insights from successful Gerrit 2.15 -> 3.2 upgrade at Wikimedia

Christian Aistleitner

Jun 29, 2020, 9:50:23 AM
to repo-d...@googlegroups.com
Hi,

Wikimedia recently successfully upgraded their Gerrit at

https://gerrit.wikimedia.org/

from Gerrit 2.15 to Gerrit 3.2. Since this list sees reports and
questions from time to time about what worked and what did not,
I figured I'd post what we did. Maybe it helps someone.

Wikimedia is running two Gerrit servers in production. One master
node, and one read-only replica.

It was OK for us to have a bit of downtime on the weekend. So we went
for a simpler upgrade procedure, where we could take all nodes offline
for a bit on a Saturday and not have to worry about keeping the nodes
consistent and the like.

We took both nodes offline, and migrated the master node to
* latest 2.15, then
* latest 2.16, then
* NoteDB (without reindexing), then
* latest 3.0, then
* latest 3.1, then
* latest 3.2, and finally
* reindexed changes.
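
For readers who want the sequence above spelled out, here is a minimal
dry-run sketch. It assumes one war file per release in the current
directory; the gerrit-X.Y.war names and the SITE path are hypothetical
placeholders, not Wikimedia's actual layout. It only prints the commands
unless DRY_RUN=0 is set. The init, migrate-to-note-db, and reindex
subcommands are Gerrit's offline programs; everything else is
illustration.

```shell
#!/bin/sh
# Dry-run sketch of an offline 2.15 -> 3.2 upgrade chain.
# SITE and the war file names are hypothetical placeholders.
SITE="${SITE:-/var/gerrit}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode; otherwise execute it and
    # stop the whole chain on the first failure.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@" || exit 1
    fi
}

upgrade_chain() {
    run java -jar gerrit-2.15.war init --batch -d "$SITE"
    run java -jar gerrit-2.16.war init --batch -d "$SITE"
    # NoteDB migration without the reindex it would do by default.
    run java -jar gerrit-2.16.war migrate-to-note-db -d "$SITE" --reindex false
    for v in 3.0 3.1 3.2; do
        run java -jar "gerrit-$v.war" init --batch -d "$SITE"
    done
    # One offline reindex at the very end.
    run java -jar gerrit-3.2.war reindex -d "$SITE"
}

upgrade_chain
```

Running it as is prints seven commands in order, which makes it easy to
review the plan before a maintenance window.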

Then we rsynced repos, lfs data, and indices (!) from the master to
the replication node.
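
A rough sketch of that copy step, printing the rsync calls rather than
running them. The site path, the replica host name, and the directory
names (git, data/lfs, index) are assumptions for illustration and may
not match a given installation. Copying the index directory along with
the repositories presumably spares the replica its own offline reindex.

```shell
#!/bin/sh
# Dry-run sketch: print the rsync calls that would copy site data to a replica.
# SITE, REPLICA, and the directory names are assumptions, not Wikimedia's paths.
SITE="${SITE:-/var/gerrit}"
REPLICA="${REPLICA:-gerrit-replica.example.org}"

sync_cmds() {
    # git = repositories, data/lfs = LFS objects, index = search indices
    for dir in git data/lfs index; do
        echo "rsync -a --delete $SITE/$dir/ $REPLICA:$SITE/$dir/"
    done
}

sync_cmds
```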


The six most relevant items for us were:

1. index.batchThreads affects not only online reindexing (as the
documentation suggested back then), but also offline reindexing. So we
want that value to be low in production, but high when doing offline
reindexing. Setting it to the number of available CPUs during the
migration had a drastic effect.

2. Don't omit flags for the NoteDB migration and reindexing steps. For
example, set the permitted memory (e.g. -Xmx32g) and the number of
threads (e.g. --threads 32).

3. Reindexing kept our CPUs busy only initially.
After a while, all but 2 CPUs idled. Reworking the reindexing step in
Ic7b36b5b8badab502370d79085f329f9b8c70d9d made that problem go
away. This cut down the overall time considerably again.
So if you have more than one CPU available for reindexing, make sure
you include that commit in your build because, if I'm not mistaken, it
is not yet in a released war file.

4. Since the upgrade to Gerrit 3.2 requires a reindex anyway, there
is no need to reindex during the NoteDB migration. It was not fully
clear beforehand that all the intermediate upgrades would pass fine
without reindexing after the NoteDB migration, but this worked out
nicely for us. (Use '--reindex false' in migrate-to-note-db.)

5. During test migrations, GC-ing repos aggressively took (for us)
more time than it saved in the end. For example, GC-ing our two biggest
repos took 18 minutes, but saved only 7 minutes during the
migration. So when migrating in production, we skipped the repo GC
to save about 10 minutes of downtime. YMMV. GC of course comes
with many benefits, but our focus was on keeping downtime short.

6. If you run in an environment that does not allow 3rd party
resources, make sure to include
Ifac1b798ce87f3125d48bab643047113581acd9a
I491435ad6295c3df0fc6d41f91df16e2fc5cea5d
in your gerrit build and if you use gitiles, also
I4f4a0b7dd575cbc27641063e05d13b8a43a51d8b
(and maybe
I7322c2762f6bce1a6fc6b0de1a2a3674092ef8a1)
to keep policy makers happy :-)
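
To make items 1, 2, and 4 concrete, here is a small sketch that prints
the relevant commands with explicit heap and thread settings. SITE and
gerrit.war are hypothetical placeholders (gerrit.war stands for
whichever release is being run at that step), and the values 32 and 32g
are just the examples from item 2, not a recommendation.

```shell
#!/bin/sh
# Sketch: print offline migration commands with explicit heap and thread settings.
# SITE and gerrit.war are hypothetical placeholders.
SITE="${SITE:-/var/gerrit}"

offline_cmds() {
    threads="$1"   # e.g. the number of available CPUs
    heap="$2"      # JVM heap size, e.g. 32g
    # Item 1: raise index.batchThreads for the offline run
    # (set it back to a low value before returning to production).
    echo "git config -f $SITE/etc/gerrit.config index.batchThreads $threads"
    # Items 2 and 4: explicit flags, and no reindex during the NoteDB migration.
    echo "java -Xmx$heap -jar gerrit.war migrate-to-note-db -d $SITE --threads $threads --reindex false"
    # Item 2 again: the final reindex gets the same heap and thread settings.
    echo "java -Xmx$heap -jar gerrit.war reindex -d $SITE --threads $threads"
}

offline_cmds 32 32g
```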

----------------

Judging from a test setup, the initial estimate for the overall
downtime was ~40 hours. With all of the above and a few other minor
improvements, Gerrit was in the end inaccessible for only one hour.
(There were some networking issues on the Gerrit host, so you'll see a
longer downtime in the Wikimedia IRC logs, but it would be unfair to
count these unrelated issues against Gerrit.)

The biggest chunk of the original estimate was offline
reindexing. With all the improvements the reindexing took only 27
minutes in the end, but was still the dominating item.

Big thanks to the Gerrit community and maintainers for their work and
their help. Special thanks go to David Pursehouse, whose quick,
friendly, and spot-on reviews and merges of patches helped a lot.

Needless to say that Wikimedia's Gerrit upgrade was of course a team
effort on Wikimedia's end. mutante, paladox, hashar, thcipriani and
many others invested countless hours to make the (up to now smooth)
upgrade happen for Wikimedia.

Have fun,
Christian



--
---- quelltextlich e.U. ---- \\ ---- Christian Aistleitner ----
Companies' registry: 360296y in Linz
Christian Aistleitner
Kefermarkterstrasze 6a/3 Email: chri...@quelltextlich.at
4293 Gutau, Austria Phone: +43 7946 / 20 5 81
Fax: +43 7946 / 20 5 81
Homepage: http://quelltextlich.at/
---------------------------------------------------------------

Luca Milanesio

Jun 29, 2020, 10:16:08 AM
to Repo and Gerrit Discussion, Luca Milanesio, Christian Aistleitner
Thanks, Christian, for sharing your experience.
Also thanks for all the contributions to the Gerrit Code Review project: during the migration exercise you found and fixed *lots* of issues.

That is a fantastic example of positive and constructive collaboration and the beauty of Open-Source.

Luca.

Sven Selberg

Jun 29, 2020, 10:49:13 AM
to Repo and Gerrit Discussion


On Monday, June 29, 2020 at 3:50:23 PM UTC+2, Christian Aistleitner wrote:
> Hi,
>
> Wikimedia recently successfully upgraded their Gerrit at
>
>   https://gerrit.wikimedia.org/
>
> from Gerrit 2.15 to Gerrit 3.2. And since there are from time to time
> reports/questions on this list about what worked and did not work,
> I figured I'll post what we did. Maybe it helps someone.

Thanks a lot for sharing this. It will certainly help a lot of people.

> Wikimedia is running two Gerrit servers in production. One master
> node, and one read-only replica.
>
> It was ok for us, to have a bit of downtime on the weekend. So we went
> for a simpler upgrade procedure, where we can take all nodes offline
> for a bit on a Saturday and not have to worry about keeping
> consistency and some such.
>
> We took both nodes offline, and migrated the master node to
> * latest 2.15, then
> * latest 2.16, then
> * NoteDB (without reindexing), then
> * latest 3.0, then
> * latest 3.1, then
 
Why did you choose not to omit the upgrade-step to v3.1? 
Did you find that it was necessary?

Edwin Kempin

Jun 29, 2020, 10:56:32 AM
to Luca Milanesio, Repo and Gerrit Discussion, Christian Aistleitner
On Mon, Jun 29, 2020 at 4:16 PM Luca Milanesio <luca.mi...@gmail.com> wrote:
> Thanks, Christian, for sharing your experience.
> Also thanks for all the contributions to the Gerrit Code Review project: during the migration exercise you found and fixed *lots* of issues.

+1

Thanks a lot for sharing these insights! I find this very helpful for others and uploaded some kudos for you :)

Nasser Grainawi

Jun 29, 2020, 2:35:07 PM
to Christian Aistleitner, repo-d...@googlegroups.com

> On Jun 29, 2020, at 7:50 AM, 'Christian Aistleitner' via Repo and Gerrit Discussion <repo-d...@googlegroups.com> wrote:
>
> Hi,
>
> Wikimedia recently successfully upgraded their Gerrit at
>
> https://gerrit.wikimedia.org/
>
> from Gerrit 2.15 to Gerrit 3.2. And since there are from time to time
> reports/questions on this list about what worked and did not work,
> I figured I'll post what we did. Maybe it helps someone.
>
> Wikimedia is running two Gerrit servers in production. One master
> node, and one read-only replica.
>

Thanks for all this great detail Christian! Would it also be possible to give some rough stats on the size of the Wikimedia Gerrit instance? How many users/projects/changes? Any especially big projects with many changes (if yes, how many)?

Also, are you using Lucene for your index or ElasticSearch?

David Ostrovsky

Jun 29, 2020, 3:41:06 PM
to Repo and Gerrit Discussion
They are using Lucene: [1]. IIRC, a recent poll revealed that only a
single user is using ES in production.


Christian Aistleitner

Jul 1, 2020, 5:24:09 PM
to Edwin Kempin, Repo and Gerrit Discussion
Hi Edwin,

On Mon, Jun 29, 2020 at 04:55:49PM +0200, 'Edwin Kempin' via Repo and Gerrit Discussion wrote:
> Thanks a lot for sharing these insights! I find this very helpful for
> others and uploaded some kudos for you :)
> https://gerrit-review.googlesource.com/c/homepage/+/273695

Wow :-D Thanks!

Christian Aistleitner

Jul 1, 2020, 5:31:20 PM
to Sven Selberg, Repo and Gerrit Discussion
Hi,

On Mon, Jun 29, 2020 at 07:49:13AM -0700, Sven Selberg wrote:
> On Monday, June 29, 2020 at 3:50:23 PM UTC+2, Christian Aistleitner wrote:
> > [...]
> > * latest 3.0, then
> > * latest 3.1, then
> >
>
> Why did you choose not to omit the upgrade-step to v3.1?
> Did you find that it was necessary?

I have actually never tried without it.
I did not even check if the 3.1 init does anything.

The 3.1 upgrade got included, because it is often said to upgrade to
the latest version of a stable branch before hopping to the next
stable branch. And the v3.1 init took 16 seconds.

So it was quicker for us to just follow common practice and do it
than to look at the code to find out what it is actually doing, or
whether it is doing anything at all.

Yes, there might certainly be things to optimize here :-)

Christian Aistleitner

Jul 1, 2020, 6:01:50 PM
to Nasser Grainawi, repo-d...@googlegroups.com
Hi,

On Mon, Jun 29, 2020 at 12:34:24PM -0600, Nasser Grainawi wrote:
> Would it also be possible to give some rough stats on the size of the Wikimedia Gerrit instance?

Sure. WMF's gerrit instance is open to the world, so anyone can have a
look at

https://gerrit.wikimedia.org/

The main server has 32 CPUs and 64GB RAM.



> How many users/projects/changes?

On WMF's Gerrit, there are about 6K users, 2.4K projects, and 600K changes.
Many of the projects are small.



> Any especially big projects with many changes (if yes, how many)?

WMF's "biggest" repos are operations/puppet (71K changes) and
mediawiki/core (41K changes, but ~100K commits)



> Also, are you using Lucene for your index or ElasticSearch?

(Confirming davido's response that WMF is using Lucene)



Have fun,
Christian


P.S.: Just in case anyone would be concerned about me sharing the
above information, I only gathered it from public sources.

Antoine Musso

Jul 2, 2020, 3:41:35 AM
to Christian Aistleitner, Nasser Grainawi, repo-d...@googlegroups.com
On 02/07/2020 at 00:01, 'Christian Aistleitner' via Repo and Gerrit Discussion wrote:
> > How many users/projects/changes?
>
> On WMF's Gerrit, there are about 6K users, 2.4K projects, and 600K changes.
> Many of the projects are small.
>
> > Any especially big projects with many changes (if yes, how many)?
>
> WMF's "biggest" repos are operations/puppet (71K changes) and
> mediawiki/core (41K changes, but ~100K commits)

Those have 190k and 240k references, respectively. The introduction of git protocol v2 nicely speeds up fetches for those repositories. If someone reading this was involved in the new wire protocol: you have my eternal gratitude!

Some more metrics that are not public, based on yesterday's traffic:

About a million HTTP requests on the master, or ~11.2 per second. Of those, 366k are git-upload-pack, primarily from the CI systems. The CI systems have a local read-only copy of the repositories, which we clone from before hitting the master; that helps offload it.

The replica served ~500k git-upload-pack requests. We have pointed a search indexer at it, along with whatever tools like to fetch every single repository for analysis and audits.

In the recent past we had multiple issues, such as Java GC going wild, Gerrit magically deadlocking, and the indexer lagging. Those issues entirely disappeared after we switched to more powerful hardware and to the latest Debian distribution (10, Buster). We never quite found the root cause; maybe it was a contention issue in one of the system libraries. So yes, do upgrade the system and hardware :]

Besides that, Gerrit is a walk in the park. It barely has any load and serves several hundred users and millions of fetches just fine.


-- 
Antoine "hashar" Musso
Release Engineering

Sven Selberg

Jul 2, 2020, 4:48:10 AM
to Repo and Gerrit Discussion


On Wednesday, July 1, 2020 at 11:31:20 PM UTC+2, Christian Aistleitner wrote:
> Hi,
>
> On Mon, Jun 29, 2020 at 07:49:13AM -0700, Sven Selberg wrote:
> > On Monday, June 29, 2020 at 3:50:23 PM UTC+2, Christian Aistleitner wrote:
> > > [...]
> > > * latest 3.0, then
> > > * latest 3.1, then
> > >
> >
> > Why did you choose not to omit the upgrade-step to v3.1?
> > Did you find that it was necessary?
>
> I have actually never tried without it.
> I did not even check if the 3.1 init does anything.

Thanks for the explanation.
We have chosen to skip v3.1 in our upgrade tests and it seems to work fine so far.
I was just worried that I had missed some important detail that was going to bite us when we upgrade production. :-)