Re: Upgrade from Gerrit 2.15 to 3.


Elijah Newren

May 20, 2020, 6:51:45 PM
to AF, Repo and Gerrit Discussion
On Wed, May 20, 2020 at 3:58 AM AF <are...@gmail.com> wrote:
>
> Hello, I would like to upgrade 2.15 (standalone CentOS installation) to the latest 3.1.4 from the Docker image.
>
> Do I have to upgrade to 2.16 first? Or just start container with new version and try migrate data ?
>
> Is there any guide how to migrate Gerrit between versions and avoid common pitfalls if any ?
>
> Best regards

I just did this same upgrade last week, with about half a day of
downtime. Somewhat amusingly, I was wondering the same thing and did
some searches only to find a post by Luca agreeing that the upgrade
instructions were somewhat lacking and that he'd give a talk on it at
the 2019 Gerrit Summit. While I found links that mentioned the talk,
I couldn't find slides or video for the talk anywhere. :-(

Anyway, ignoring plugin-specific stuff, here's some things I couldn't
find in one place (or at all) that would have saved me some time in
going from 2.15->3.1:

1) Yes, you have to upgrade to 2.16 first, then to 3.x. Although 3.x
documents the migrate-to-note-db subcommand and makes it look like you
can run it, it only exists in gerrit-2.16.x. (Some places do document
this, but some places don't and make it look like the command also
exists in 3.x. I got unlucky when first looking for migration
instructions.)

2) The instructions do not specify the order of steps to upgrade
between init, reindex, and migrate-to-note-db. These are the ones I
used:

cd $GERRIT_HOME
java -Xmx12g -jar bin/gerrit-2.16.18.war init --batch -d $(pwd)
java -Xmx12g -jar bin/gerrit-2.16.18.war reindex --index projects -d $(pwd)
java -Xmx12g -jar bin/gerrit-2.16.18.war reindex --index groups -d $(pwd)
java -Xmx12g -jar bin/gerrit-2.16.18.war migrate-to-note-db -d $(pwd)

3) The migrate-to-note-db step spewed a huge number of stacktraces,
spending at least 10 minutes doing nothing but that at the end of its
process. As best I can tell, this was related to historical database
corruption we had (people deleting projects on disk without updating
the database to match -- before the days of the delete-project plugin,
random downtimes in the really old days of gerrit causing us to have
some repositories with missing objects -- though only defunct
repositories still have such problems). It probably would have been
better for us to install the delete-project plugin first and delete
the old projects before doing the migration, but I instead did extra
checking after. *shrug*. Hopefully this doesn't happen to you too,
but despite the scare it was actually okay for us.

4) The migrate-to-note-db step is not quite enough. After it's done
and before you try upgrading to 3.x, you also need to run

git config -f etc/notedb.config noteDb.changes.disableReviewDb true

5) You can skip the migrate-to-note-db step and #4 if you do online
conversion to note-db instead. If you're happy running on gerrit-2.16
for a while, that might be an easier option for you. I tried it out
in testing and it worked well (still spewing the same stack traces to
the error log over a longer period of time), but we really wanted to
jump to 3.1.

6) Found in some issues somewhere that it's important to repack the
All-Users.git repo after the NoteDb migration. So, before continuing
on to the 3.1 upgrade, you'll want to run

git -C git/All-Users.git/ gc --aggressive

I also added --prune=now, since it seemed to leave me with a lot of
loose objects otherwise, and the loose objects all seemed to be
ancient draft comments that I preferred to trash.

Also, Luca suggested in another thread that post-NoteDB migration, you
should really consider an aggressive gc of all your repositories. It
shrunk one of our big ones by nearly 33%. So we actually did that
post-upgrade but it would have been better to do that before
continuing with the 3.1 upgrade. (We did do a gc without the
--aggressive flag during the upgrade on our most important repo, but
adding the --aggressive flag would have been better.)

7) Proceed with the 3.1 upgrade

java -jar bin/gerrit-3.1.4.war init -d $(pwd)
java -Xmx32g -jar bin/gerrit-3.1.4.war reindex --threads 12 --index changes # Takes forever...

8) Cleanup

You no longer need the [database] section of gerrit.config, you can
stop mysql/postgres/whatever and make sure your gerrit startup scripts
don't trigger it to start up.
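
For the cleanup step, one way to drop the section without hand-editing is `git config --remove-section`, since gerrit.config uses Git's config syntax. This is a minimal sketch, assuming the site layout used above (`etc/gerrit.config` under the site directory). The function name `drop_database_section` and the `.pre-notedb` backup suffix are made up for illustration.

```shell
#!/bin/sh
# Sketch: remove the [database] section from a Gerrit site's gerrit.config.
# drop_database_section is a hypothetical helper name, not a Gerrit command.

drop_database_section() {
    config="$1/etc/gerrit.config"
    [ -f "$config" ] || { echo "no such file: $config" >&2; return 1; }

    # Keep a copy so the old database settings can be restored if needed.
    cp -p "$config" "$config.pre-notedb" || return 1

    # Only try to remove the section if it actually has entries.
    if git config -f "$config" --get-regexp '^database\.' >/dev/null; then
        git config -f "$config" --remove-section database
    fi
}

# Usage, with Gerrit stopped:
#   drop_database_section "$GERRIT_HOME"
```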


I'm not a gerrit developer or expert though, so if someone else chimes
in to point out I'm doing something less than optimally, listen to
them. A lot of this was pieced together by finding various emails,
issues, tying together different bits of documentation, or trial and
error, and I can say no more than it happened to work for me.

Hope that helps,
Elijah

Luca Milanesio

May 20, 2020, 7:13:26 PM
to Elijah Newren, Luca Milanesio, AF, Repo and Gerrit Discussion
Hi Elijah,
Thanks for answering, see my comments inline.

On 20 May 2020, at 23:51, Elijah Newren <new...@gmail.com> wrote:

On Wed, May 20, 2020 at 3:58 AM AF <are...@gmail.com> wrote:

Hello, I would like to upgrade 2.15 (standalone CentOS installation) to the latest 3.1.4 from the Docker image.

Do I have to upgrade to 2.16 first? Or just start container with new version and try migrate data ?

Is there any guide how to migrate Gerrit between versions and avoid common pitfalls if any ?

Best regards

I just did this same upgrade last week, with about half a day of
downtime.  Somewhat amusingly, I was wondering the same thing and did
some searches only to find a post by Luca agreeing that the upgrade
instructions were somewhat lacking and that he'd give a talk on it at
the 2019 Gerrit Summit.  While I found links that mentioned the talk,
I couldn't find slides or video for the talk anywhere.  :-(

Apologies, yes I did a talk and yes I am super-late in post-production and publishing :-(

Anyway, ignoring plugin-specific stuff, here's some things I couldn't
find in one place (or at all) that would have saved me some time in
going from 2.15->3.1:

1) Yes, you have to upgrade to 2.16 first, then to 3.x.  Although 3.x
documents the migrate-to-note-db subcommand and makes it look like you
can run it, it only exists in gerrit-2.16.x.  (Some places do document
this, but some places don't and make it look like the command also
exists in 3.x.  I got unlucky when first looking for migration
instructions.)

Yes, v2.16 is a mandatory intermediate step. And please migrate to v2.16 ReviewDb, then settle, then migrate ReviewDb to NoteDb, then move to the next step.


2) The instructions do not specify the order of steps to upgrade
between init, reindex, and migrate-to-note-db.  These are the ones I
used:

I would actually recommend splitting this into two migrations:


cd $GERRIT_HOME
java -Xmx12g -jar bin/gerrit-2.16.18.war init --batch -d $(pwd)

First migration

java -Xmx12g -jar bin/gerrit-2.16.18.war reindex --index projects -d $(pwd)
java -Xmx12g -jar bin/gerrit-2.16.18.war reindex --index groups -d $(pwd)

Reindex of projects and groups isn’t needed: the online migration would do the job

java -Xmx12g -jar bin/gerrit-2.16.18.war migrate-to-note-db -d $(pwd)

Second migration.
But I personally prefer the online migration to NoteDb.


3) The migrate-to-note-db step spewed a huge number of stacktraces,
spending at least 10 minutes doing nothing but that at the end of its
process.  As best I can tell, this was related to historical database
corruption we had (people deleting projects on disk without updating
the database to match -- before the days of the delete-project plugin,
random downtimes in the really old days of gerrit causing us to have
some repositories with missing objects -- though only defunct
repositories still have such problems).  It probably would have been
better for us to install the delete-project plugin first and delete
the old projects before doing the migration, but I instead did extra
checking after.  *shrug*.  Hopefully this doesn't happen to you too,
but despite the scare it was actually okay for us.

*Before* running any migration, the current Gerrit system should work fine and be healthy.
That means no change inconsistencies, no corrupted Git repos, no super-big binaries or super-fragmented repos.

Of course, you could migrate an inconsistent and partially unstable Gerrit setup, but it is going to be challenging.


4) The migrate-to-note-db step is not quite enough.  After it's done
and before you try upgrading to 3.x, you also need to run

git config -f etc/notedb.config noteDb.changes.disableReviewDb true

That should not be needed: if the migration is successful, it will be set automatically.
If the migration IS NOT successful, setting this would actually risk leaving changes behind and creating inconsistencies in the change metadata.


5) You can skip the migrate-to-note-db step and #4 if you do online
conversion to note-db instead.  If you're happy running on gerrit-2.16
for a while, that might be an easier option for you.  I tried it out
in testing and it worked well (still spewing the same stack traces to
the error log over a longer period of time), but we really wanted to
jump to 3.1.

+1, that’s my preferred option.


6) Found in some issues somewhere that it's important to repack the
All-Users.git repo after the NoteDb migration.  So, before continuing
on to the 3.1 upgrade, you'll want to run

git -C git/All-Users.git/ gc --aggressive

NOT just once, but continuously from the migration to v2.16 onwards.
All draft comments are stored in All-Users.git, for *ALL REPOS* and *ALL USERS*. That makes the repo grow indefinitely and always be very fragmented.

We do an aggressive gc of All-Users.git every 15 minutes !


I also added --prune=now, since it seemed to leave me with a lot of
loose objects otherwise, and the loose objects all seemed to be
ancient draft comments that I preferred to trash.

That’s fine if you do it with Gerrit inactive. However, *never* do that with Gerrit up and running, as you would confuse the JGit cache *A LOT*.


Also, Luca suggested in another thread that post-NoteDB migration, you
should really consider an aggressive gc of all your repositories.  It
shrunk one of our big ones by nearly 33%.  So we actually did that
post-upgrade but it would have been better to do that before
continuing with the 3.1 upgrade.  (We did do a gc without the
--aggressive flag during the upgrade on our most important repo, but
adding the --aggressive flag would have been better.)

7) Proceed with the 3.1 upgrade

java -jar bin/gerrit-3.1.4.war init -d $(pwd)
java -Xmx32g -jar bin/gerrit-3.1.4.war reindex --threads 12 --index changes  # Takes forever...

Why do an offline reindex? There are *NO SCHEMA CHANGES* in v3.1.4.


8) Cleanup

You no longer need the [database] section of gerrit.config, you can
stop mysql/postgres/whatever and make sure your gerrit startup scripts
don't trigger it to start up.
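
The suggestion quoted above to run an aggressive gc on every repository after the NoteDb migration could be scripted roughly like this. It is only a sketch: it assumes repositories live under `git/` in the site directory as bare `*.git` directories (as `git/All-Users.git` does above) and that Gerrit is stopped. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: aggressive gc over every bare repository under <site>/git.
# list_repos and gc_all_repos are hypothetical helper names.

list_repos() {
    # Print each directory named *.git, without descending into it.
    find "$1/git" -type d -name '*.git' -prune -print | sort
}

gc_all_repos() {
    list_repos "$1" | while IFS= read -r repo; do
        echo "gc: $repo"
        # Keep going on failure so one bad repository does not stop the run.
        git -C "$repo" gc --aggressive --quiet </dev/null ||
            echo "gc failed: $repo" >&2
    done
}

# Usage, with Gerrit stopped:
#   gc_all_repos "$GERRIT_HOME"
```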


I'm not a gerrit developer or expert though, so if someone else chimes
in to point out I'm doing something less than optimally, listen to
them.  A lot of this was pieced together by finding various emails,
issues, tying together different bits of documentation, or trial and
error, and I can say no more than it happened to work for me.

Hope that helps,
Elijah

--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/CABPp-BGPPxQ2s4rNFspsxcJma7bsgTqcWyukN5S6gpe%3DuNji9w%40mail.gmail.com.

Elijah Newren

May 20, 2020, 10:03:05 PM
to Luca Milanesio, AF, Repo and Gerrit Discussion
Hi Luca,

Thanks for the extra comments and pointers. Answers to your questions
inline, plus a few extra comments of my own.
Is asking people to settle on 2.16 ReviewDb, then after a while
migrating and settling on 2.16 NoteDb, then after another while moving
on to a new version, all consistent with marking 2.16 EOL in a week or
two? What if 2.16 has issues that affect folks? I'll also comment on
the "go through all versions" at the end of the email...

> 2) The instructions do not specify the order of steps to upgrade
> between init, reindex, and migrate-to-note-db. These are the ones I
> used:
>
>
> I would actually recommend to split into two migrations:
>
>
> cd $GERRIT_HOME
> java -Xmx12g -jar bin/gerrit-2.16.18.war init --batch -d $(pwd)
>
>
> First migration
>
> java -Xmx12g -jar bin/gerrit-2.16.18.war reindex --index projects -d $(pwd)
> java -Xmx12g -jar bin/gerrit-2.16.18.war reindex --index groups -d $(pwd)
>
>
> Reindex of projects and groups isn’t needed: the online migration would do the job
_if_ you do an online migration. :-)

>
> java -Xmx12g -jar bin/gerrit-2.16.18.war migrate-to-note-db -d $(pwd)
>
>
> Second migration.
> But I personally prefer to the online migration to NoteDb.
>
>
> 3) The migrate-to-note-db step spewed a huge number of stacktraces,
> spending at least 10 minutes doing nothing but that at the end of its
> process. As best I can tell, this was related to historical database
> corruption we had (people deleting projects on disk without updating
> the database to match -- before the days of the delete-project plugin,
> random downtimes in the really old days of gerrit causing us to have
> some repositories with missing objects -- though only defunct
> repositories still have such problems). It probably would have been
> better for us to install the delete-project plugin first and delete
> the old projects before doing the migration, but I instead did extra
> checking after. *shrug*. Hopefully this doesn't happen to you too,
> but despite the scare it was actually okay for us.
>
>
> *before* running any migration, the current Gerrit system should work fine and be healthy.
> That means, no changes inconsistencies, no corrupted Git repos, no super-big binaries or super-fragmented repos.
>
> Of course, you could migrate an inconsistent and partially unstable Gerrit setup, but it is going to be challenging.
>
>
> 4) The migrate-to-note-db step is not quite enough. After it's done
> and before you try upgrading to 3.x, you also need to run
>
> git config -f etc/notedb.config noteDb.changes.disableReviewDb true
>
>
> That should not be needed: if the migration is successful, it will be set automatically.
> If the migration IS NOT successful, setting this would actually risk to leave changes behind and inconsistencies in the changes meta-data.

I don't think the offline migration spewed any more errors than the
online migration wrote out to the error_log, though I didn't carefully
check and compare. The only difference (other than online being
slower and writing to error_log rather than stdout) was that at the
end the online migration not only set primaryStorage to "note db" but
also marked disableReviewDb as true. The offline migration only set
primaryStorage to "note db" while leaving disableReviewDb as false.
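
To see which state a migration left behind, the two settings can be read back from `etc/notedb.config`. This is a minimal sketch: `noteDb.changes.disableReviewDb` is the key from step 4, while `noteDb.changes.primaryStorage` is assumed here to be the full name of the primaryStorage setting mentioned above. `show_notedb_state` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: report the NoteDb migration state recorded in <site>/etc/notedb.config.
# show_notedb_state is a hypothetical helper name, not a Gerrit command.

show_notedb_state() {
    config="$1/etc/notedb.config"
    [ -f "$config" ] || { echo "no such file: $config" >&2; return 1; }

    # Unset keys are reported as "(unset)" instead of an empty string.
    storage=$(git config -f "$config" --get noteDb.changes.primaryStorage) ||
        storage="(unset)"
    disabled=$(git config -f "$config" --bool --get noteDb.changes.disableReviewDb) ||
        disabled="(unset)"

    echo "primaryStorage=$storage"
    echo "disableReviewDb=$disabled"
}

# Usage:
#   show_notedb_state "$GERRIT_HOME"
```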

Also, while both spewed errors (due to our defunct repositories that
were either known to be missing or were known to have missing objects
-- so not a bug in the migration code), I'm pretty sure I remember
both the online and offline migrations printing a message saying that
the migration was successful.
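
A rough way to find such repositories ahead of a migration is to run `git fsck` over everything under `git/` and list the failures. This is a sketch, assuming bare `*.git` directories under the site's `git/` directory. `find_broken_repos` is a made-up name, and `--connectivity-only` is chosen to shorten the run at the cost of a less thorough check.

```shell
#!/bin/sh
# Sketch: print every repository under <site>/git that fails `git fsck`.
# find_broken_repos is a hypothetical helper name.

find_broken_repos() {
    find "$1/git" -type d -name '*.git' -prune -print | sort |
    while IFS= read -r repo; do
        # A non-zero exit means fsck reported errors or the
        # directory is not a usable repository at all.
        if ! git --git-dir="$repo" fsck --connectivity-only \
                >/dev/null 2>&1 </dev/null; then
            echo "$repo"
        fi
    done
}

# Usage:
#   find_broken_repos "$GERRIT_HOME"
```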

> 5) You can skip the migrate-to-note-db step and #4 if you do online
> conversion to note-db instead. If you're happy running on gerrit-2.16
> for a while, that might be an easier option for you. I tried it out
> in testing and it worked well (still spewing the same stack traces to
> the error log over a longer period of time), but we really wanted to
> jump to 3.1.
>
>
> +1, that’s my preferred option.
>
>
> 6) Found in some issues somewhere that it's important to repack the
> All-Users.git repo after the NoteDb migration. So, before continuing
> on to the 3.1 upgrade, you'll want to run
>
> git -C git/All-Users.git/ gc --aggressive
>
>
> NOT just once, but continuously after migrating to v2.16 onwards.
> All draft comments are stored on All-Users.git, for *ALL REPOS* and *ALL USERS*. That makes the repo growing indefinitely and being always very fragmented.
>
> We do an aggressive gc of All-Users.git every 15 minutes !

Good to know. I think you have way more users than we do, but I'll
put something in place to do this periodically.
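
One possible shape for that periodic job is a small script run from cron. This is a sketch only: the 15-minute interval is the one Luca mentions, and the paths in the crontab comment and the function name `gc_all_users` are placeholders. `--prune=now` is deliberately left out because the job is meant to run while Gerrit is up.

```shell
#!/bin/sh
# Sketch: aggressive gc of All-Users.git, intended to be called from cron.
# Example crontab entry (both paths are placeholders):
#   */15 * * * * /path/to/gc-all-users.sh /path/to/gerrit/site

gc_all_users() {
    repo="$1/git/All-Users.git"
    [ -d "$repo" ] || { echo "not found: $repo" >&2; return 1; }

    # No --prune=now here: pruning loose objects under a running
    # Gerrit is what the advice above warns against.
    git -C "$repo" gc --aggressive --quiet
}

# Usage:
#   gc_all_users "$GERRIT_HOME"
```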

> I also added --prune=now, since it seemed to leave me with a lot of
> loose objects otherwise, and the loose objects all seemed to be
> ancient draft comments that I preferred to trash.
>
>
> That’s good if you do with Gerrit inactive. However, *never* do that with Gerrit up and running, as you would confuse *A LOT* the JGit cache.

Yeah, there was no way I was going to do that on an active Gerrit. Good point.

> Also, Luca suggested in another thread that post-NoteDB migration, you
> should really consider an aggressive gc of all your repositories. It
> shrunk one of our big ones by nearly 33%. So we actually did that
> post-upgrade but it would have been better to do that before
> continuing with the 3.1 upgrade. (We did do a gc without the
> --aggressive flag during the upgrade on our most important repo, but
> adding the --aggressive flag would have been better.)
>
> 7) Proceed with the 3.1 upgrade
>
> java -jar bin/gerrit-3.1.4.war init -d $(pwd)
> java -Xmx32g -jar bin/gerrit-3.1.4.war reindex --threads 12 --index
> changes # Takes forever...
>
>
> Why doing offline reindex? There are *NO SCHEMA CHANGES* in v3.1.4.
> (See https://www.gerritcodereview.com/3.1.html#schema-changes)

Because it was mandatory; 3.1.4 would not start until I did. Perhaps
if you're upgrading from 3.0.x to 3.1.4 you don't need to reindex, but
going from 2.15->2.16->3.1 (with 2.16 only intermediate and gerrit
still offline), it refused to start and told me I needed the changes
index built before it would run. I spent a week or two trying to find
ways around the offline index, or trying to reduce the amount of
downtime needed while waiting for an offline reindex, all to no avail.
It sounds like I was close with my idea of reindexing an old version
of the data and copying it over based on Matthias' comments in another
thread, but he has some secret sauce that I was missing (and still
am).

Yes, I know I could have run every intermediate version of Gerrit in
prod for a while, but I vastly preferred a half day outage on a
weekend than dragging out the upgrade over multiple weeks and playing
roulette with whether I could get all the plugins working on all
intermediate Gerrit versions. (Huge thanks to David for fixing up
find-owners and saml on 3.1 for me, by the way.)

I know _you_ can't take a half day outage and even a 15 second outage
is huge; but while a half day outage on a weekend was big enough to
make us squirm very uncomfortably, for _me_ the really huge thing was
worrying about getting plugins to function on that many different
versions of Gerrit. Maybe I shouldn't be so scared of that, but
experience has just made me feel that way even though we don't even
have any in-house custom plugins anymore. (Things like finding the
heartbeat plugin broken earlier this week and you mentioning that it
was something in core affecting all plugins that you'd fix up by end
of week is just one issue in a sea of many.)

Anyway, hope that helps explain the "why" a little bit and the
tradeoffs I'm working with.

David Pursehouse

May 20, 2020, 10:47:21 PM
to Elijah Newren, Luca Milanesio, AF, Repo and Gerrit Discussion
This is something we've discussed.  We need to balance the workload of having 4 active stable branches (2.16, 3.0, 3.1, and 3.2) with the fact that there are likely many people in the situation that you find yourself in now, migrating from pre 2.16 and wanting to get onto 3.x.

IIRC what we settled on is that we'll still make 2.16 EOL but in addition to the "critical and security fixes" that we would normally allow for EOL branches, we will also allow fixes related to the note db migration.  We should formalize this decision and update the information on the homepage.

 

David Ostrovsky

May 21, 2020, 4:28:34 AM
to Repo and Gerrit Discussion

On Thursday, May 21, 2020 at 04:03:05 UTC+2, Elijah Newren wrote:
Hi Luca,

Thanks for the extra comments and pointers.  Answers to your questions
inline, plus a few extra comments of my own.


In 2.16.x the change index schema version is 50.
In 3.1.x the change index schema version is 57.

You can always check the sources: [1].

In a Gerrit site, the currently active index is defined in:
<gerrit_site>/index/gerrit_index.config

On 2.16 site the content for changes index type is:

[index "changes_0050"]
ready = true

While on 3.1 the content is:

[index "changes_0057"]
        ready = true

Apparently, you only copied the changes_0057 directory
with the pre-populated index data that you created on the staging
site, but you did not actually activate the new change index
schema version in gerrit_index.config by changing it from:

[index "changes_0050"]
ready = true

to:

[index "changes_0050"]
ready = false
[index "changes_0057"]
        ready = true
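
Because gerrit_index.config uses Git's config syntax, that change can also be made with `git config` rather than by hand. This is a sketch, assuming the `index/gerrit_index.config` path and the `changes_0050`/`changes_0057` section names shown above, and assuming Gerrit is stopped while the file is edited. `activate_index_version` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: mark one index version as not ready and another as ready
# in <site>/index/gerrit_index.config.
# activate_index_version is a hypothetical helper name.

activate_index_version() {
    config="$1/index/gerrit_index.config"
    old=$2   # e.g. changes_0050
    new=$3   # e.g. changes_0057
    [ -f "$config" ] || { echo "no such file: $config" >&2; return 1; }

    git config -f "$config" "index.$old.ready" false &&
        git config -f "$config" "index.$new.ready" true
}

# Usage:
#   activate_index_version "$GERRIT_HOME" changes_0050 changes_0057
```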

I think gerrit could benefit from a small site program: index-admin.
It should print the currently active schema version per index type and
also be able to activate a new schema version for a specific index type.
Note that such a program should be index-backend agnostic and
work for Lucene and Elasticsearch index backends.
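
Until something like that exists, the first half (printing the active version per index type) can be approximated by reading gerrit_index.config directly. This is a rough sketch, not the proposed index-admin program: it only reads the file described above and assumes the section layout shown there. `list_ready_indexes` is a made-up name.

```shell
#!/bin/sh
# Sketch: print the index sections marked "ready = true"
# in <site>/index/gerrit_index.config, one per line.
# list_ready_indexes is a hypothetical helper name.

list_ready_indexes() {
    awk '
        # Section header such as: [index "changes_0057"]
        /^[ \t]*\[index "/ {
            name = $0
            sub(/^[ \t]*\[index "/, "", name)
            sub(/"\].*$/, "", name)
            next
        }
        # Any other section header ends the current index section.
        /^[ \t]*\[/ { name = ""; next }
        # ready = true inside an index section.
        name != "" && /^[ \t]*ready[ \t]*=[ \t]*true[ \t]*$/ { print name }
    ' "$1/index/gerrit_index.config"
}

# Usage:
#   list_ready_indexes "$GERRIT_HOME"
```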



Luca Milanesio

May 21, 2020, 4:41:00 AM
to David Pursehouse, Elijah Newren, Luca Milanesio, AF, Repo and Gerrit Discussion
+1 to that.

v2.16 is still the *most popular* Gerrit release to date, because it allows people to have a fully featured PolyGerrit side by side with GWT *AND* allows ReviewDb to be migrated to NoteDb. The problem with supporting the migration from ReviewDb to NoteDb in any release >= 3.0 is that the source code is gone and no longer exists in the code base.

That sounds like a bug to me, but I need to check whether the documentation mentions anything about “manually setting disableReviewDb to true”.
Ah, I missed the fact that you skipped 3.0. Yes, of course if you skip steps, you need to go through the pain of off-line reindexing :-(

it refused to start and told me I needed the changes
index built before it would run.  I spent a week or two trying to find
ways around the offline index,

Did you write to the mailing list and we did not answer for a week?
There are also companies offering Gerrit Enterprise Support with a support SLA, which would definitely have saved you time.

or trying to reduce the amount of
downtime needed while waiting for an offline reindex, all to no avail.
It sounds like I was close with my idea of reindexing an old version
of the data and copying it over based on Matthias' comments in another
thread, but he has some secret sauce that I was missing (and still
am).

Yes, I know I could have run every intermediate version of Gerrit in
prod for a while, but I vastly preferred a half day outage on a
weekend than dragging out the upgrade over multiple weeks and playing
roulette with whether I could get all the plugins working on all
intermediate Gerrit versions.  (Huge thanks to David for fixing up
find-owners and saml on 3.1 for me, by the way.)

We typically make sure that all the most popular plugins are in a workable state in the past supported releases.
The issue I believe with plugins is that sometimes we allow breaking changes on stable branches (we shouldn’t do that !!!) and that breaks plugins on stable branches.
Those breakages go unnoticed because we don’t rebuild 100s of plugins for every single Gerrit change.


I know _you_ can't take a half day outage and even a 15 second outage
is huge; but while a half day outage on a weekend was big enough to
make us squirm very uncomfortably, for _me_ the really huge thing was
worrying about getting plugins to function on that many different
versions of Gerrit.

What are the plugins that gave you trouble?

  Maybe I shouldn't be so scared of that, but
experience has just made me feel that way even though we don't even
have any in-house custom plugins anymore.  (Things like finding the
heartbeat plugin broken earlier this week and you mentioning that it
was something in core affecting all plugins that you'd fix up by end
of week is just one issue in a sea of many.)

Anyway, hope that helps explain the "why" a little bit and the
tradeoffs I'm working with.

Thanks, yes, that explains it.

Luca.

David Pursehouse

May 21, 2020, 5:06:31 AM
to Luca Milanesio, Elijah Newren, AF, Repo and Gerrit Discussion
For the plugins that have a standalone build we (meaning me and Marco) have been making an effort to ensure that they still build and have passing tests after updating to every major and minor release. This is mostly an automated activity. Some 20+ plugins are included here for stable-2.16 [1], with the number decreasing on later stable branches either because the plugins don't build any more or they are not relevant on that branch. Plugins that don't have a standalone build are not covered, nor are plugins that have standalone builds but have been explicitly requested to be omitted from this activity.

Matthias Sohn

May 21, 2020, 5:14:12 AM
to David Pursehouse, Luca Milanesio, Elijah Newren, AF, Repo and Gerrit Discussion
Maybe it would make sense to mark these plugins in some way so that this is visible on the plugin matrix [2]


Luca Milanesio

May 21, 2020, 5:14:15 AM
to David Pursehouse, Elijah Newren, Luca Milanesio, AF, Repo and Gerrit Discussion
I typically try to cover those as well, so between David, Marco and me, we have quite good coverage on that :-)
But as I said, give us the list and we’ll find out which ones fell through the cracks :-(

Luca.

Luca Milanesio

May 21, 2020, 5:18:58 AM
to Matthias Sohn, Luca Milanesio, David Pursehouse, Elijah Newren, AF, Repo and Gerrit Discussion
If a plugin builds and works, what’s the purpose of this marking? Is it a “quality check guarantee”?
My understanding is that those plugins are built and their tests are passing, exactly like the others.

WDYT?

Luca.

David Ostrovsky

May 21, 2020, 5:19:40 AM
to Repo and Gerrit Discussion

On Thursday, May 21, 2020 at 10:41:00 UTC+2, lucamilanesio wrote:


On 21 May 2020, at 03:47, David Pursehouse <david.p...@gmail.com> wrote:

On Thu, May 21, 2020 at 11:03 AM Elijah Newren <new...@gmail.com> wrote:
Hi Luca,

Thanks for the extra comments and pointers.  Answers to your questions
inline, plus a few extra comments of my own.


+1 to that.

He reported that at least two plugins were broken: saml and find-owners
plugin. I fixed both plugins, but I had to conduct a custom release for
find-owners plugin: [1], because the CI is broken: [2], and this issue
is even reported on gerrit issue tracker: [3].

So it's a pain for gerrit users to differentiate between breakages: plugin
is broken? CI is broken? Both?

Anyway until the CI is fixed, ping me and I will conduct yet another build of
that plugin if needed.


Luca Milanesio

May 21, 2020, 5:23:21 AM
to David Ostrovsky, Luca Milanesio, Repo and Gerrit Discussion
Right, and yes, the SAML plugin is missing intermediate releases, as it was unsupported for a while, until you and I took it on board and started modernising it.

Find-owners is used and maintained by Google, isn’t it?

I fixed both plugins, but I had to conduct a custom release for
find-owners plugin: [1], because the CI is broken: [2], and this issue
is even reported on gerrit issue tracker: [3].

So it's a pain for gerrit users to differentiate between breakages: plugin
is broken? CI is broken? Both?

Hard to say: if the build is broken, it could be that the code is broken or the build script is broken. You do not really know until you open the box, do you?

Thanks for fixing both of them.

Luca.


Elijah Newren

May 21, 2020, 2:17:47 PM
to David Ostrovsky, Repo and Gerrit Discussion
Actually, I was at version 48. I did notice when I did an online
migration from ReviewDB to NoteDB in a practice upgrade, that once the
NoteDB migration was complete that it then printed a message about
"Performing deferred schema upgrade" and something about going from
version 48 to 50. That also took quite a while. Since I ended up not
using the online notedb migration, though, when it came time to
upgrade to 3.1 I had to jump from change index schema version 48 to
57.

> In gerrit site the place where the current active index is defined:
> <gerrit_site>/index/gerrit_index.config
>
> On 2.16 site the content for changes index type is:
>
> [index "changes_0050"]
> ready = true
>
> While on 3.1 the content is:
>
> [index "changes_0057"]
> ready = true
>
> Apparently, you have only copied the changes_0057 directory
> with pre-populated index data, that you have created on staging
> site, but you missed to actually activate the new change index
> schema version in gerrit_index.config, to change it from:
>
> [index "changes_0050"]
> ready = true
>
> to:
>
> [index "changes_0050"]
> ready = false
> [index "changes_0057"]
> ready = true

Oh, man, I was _that_ close to having it work? Thanks for the tip,
that may yet come in handy in the future!

> I think gerrit could benefit from a small site program: index-admin.
> It should print currently active schema version pro index type and
> also be able to activate new schema version for specific index type.
> Note, that such program should be index backend agnostic and
> work for Lucene and Elasticsearch index backends.

Sounds great. :-)

Elijah Newren

May 21, 2020, 4:23:14 PM
to Luca Milanesio, David Pursehouse, AF, Repo and Gerrit Discussion
Hi Luca,

On Thu, May 21, 2020 at 1:40 AM Luca Milanesio <luca.mi...@gmail.com> wrote:
>
> On 21 May 2020, at 03:47, David Pursehouse <david.pu...@gmail.com> wrote:
>
> On Thu, May 21, 2020 at 11:03 AM Elijah Newren <new...@gmail.com> wrote:
>>

>> > Why doing offline reindex? There are *NO SCHEMA CHANGES* in v3.1.4.
>> > (See https://www.gerritcodereview.com/3.1.html#schema-changes)
>>
>> Because it was mandatory; 3.1.4 would not start until I did. Perhaps
>> if you're upgrading from 3.0.x to 3.1.4 you don't need to reindex, but
>> going from 2.15->2.16->3.1 (with 2.16 only intermediate and gerrit
>> still offline),
>
>
> Ah, I missed the fact that you skipped 3.0. Yes, of course if you skip steps, you need to go through the pain of off-line reindexing :-(
>
>> it refused to start and told me I needed the changes
>> index built before it would run. I spent a week or two trying to find
>> ways around the offline index,
>
> Did you write to the mailing list and we did not answer for a week?

No, I had read in several places over the years that upgrading through
individual versions was required and that offline reindexing was
mandatory if you skip major versions. I saw it so much, I just
assumed the answer without asking.

(In the past, I once successfully did a test upgrade where I started
the offline reindexing, almost immediately aborted it, marked the
super-incomplete index as active, and then manually fired off a lot of
indexing jobs. In fact, it was the only way I could get the upgrade
to work since the offline reindexing never completed and the plugin
story was an even bigger mess back then. But some other backward
incompatible change, I think with the hooks, prevented us from upgrading
at the time, and then Gerrit-2.15 made the offline reindexing
dramatically faster and actually complete before being terminated. I
didn't take good notes on how I did that
kinda-sorta-bypass-the-reindexing step, though, and didn't know how to
replicate it or if it was even possible.)

> There are also companies offering Gerrit Enterprise Support with a support SLA, which would definitely have saved you time.

Yes; currently there's a long ongoing battle between GHE and Gerrit
proponents within our company (Stash and Gitolite lost already years
ago, though there was a time when all four were active); nearly all
repositories have moved to GHE, but one big super-important repo
hasn't and has lots of developers that do not want it to move. The
GHE side seems to have the upper hand, at least among leadership
(among developers it's probably more of a draw, although maybe that's
just my proximity to fellow proponents showing through). The "cost of
conversion" is probably what has kept a mandate from coming down, but
any question about putting resources into Gerrit (e.g. "we need to
update off an EOL version") or even announcements of downtimes for
upgrades is often met with "Would it make sense to just switch to GHE
now?"

I might try to play the GerritForge/GerritHub "Gerrit+GHE" card if
we're forced to make GHE the source of truth and even ask for approval
for Enterprise Support (they did so with Reviewable.io on top of GHE
after all), but for the most part I like to avoid political battles as
much as I possibly can.

Your point is well taken, though, and I am certainly keeping this in mind.

>> or trying to reduce the amount of
>> downtime needed while waiting for an offline reindex, all to no avail.
>> It sounds like I was close with my idea of reindexing an old version
>> of the data and copying it over based on Matthias' comments in another
>> thread, but he has some secret sauce that I was missing (and still
>> am).
>>
>> Yes, I know I could have run every intermediate version of Gerrit in
>> prod for a while, but I vastly preferred a half day outage on a
>> weekend than dragging out the upgrade over multiple weeks and playing
>> roulette with whether I could get all the plugins working on all
>> intermediate Gerrit versions. (Huge thanks to David for fixing up
>> find-owners and saml on 3.1 for me, by the way.)
>
>
> We typically make sure that all the most popular plugins are in workable state in the past supported releases.
> The issue I believe with plugins is that some times we allow breaking changes on stable branches (we shouldn’t do that !!!) and that breaks plugins on stable branches.
> Those breakages are unnoticed because we don’t rebuild 100s of plugins for every single Gerrit change.
>
>>
>> I know _you_ can't take a half day outage and even a 15 second outage
>> is huge; but while a half day outage on a weekend was big enough to
>> make us squirm very uncomfortably, for _me_ the really huge thing was
>> worrying about getting plugins to function on that many different
>> versions of Gerrit.
>
>
> What are the plugins that gave you trouble?

1) Auth has been a historical painpoint, across multiple auth systems.

In the Gerrit-2.5 era, the changes to be strict about capitalization
with LDAP as the backend (not noted in the release notes IIRC) caused
us a fair amount of debugging work (sadly someone had capitalized a
domain component somewhere). Granted, that wasn't a plugin, but kind
of falls under auth.

Around the 2.10/2.11 era I was given a dictate to move Gerrit to the
cloud and to not use LDAP anymore. I don't think the saml plugin
existed back then; I certainly didn't find it in searches. We used
the GitHub plugin against an internal GHE, but _only_ for auth. At
the time, the plugin was on some random github repo or something, not
closely associated with the Gerrit project. The documentation was bad
and assumed you'd use github.com, and also only showed how to build
against 2.10 (and I think used pre-bazel instructions in contrast to
2.11 talking about bazel). And it seemed to remain that way for a few
years, making me concerned about upgrades. Of course, there were also
the breakages in the hooks API (some of which I thought made things
better so I was sympathetic, but meant a lot of work for us to convert
our hooks over) around that time, and the massive reindexing problems
trying to skip from 2.11 to 2.13 or further.

When we switched to 2.15, the saml plugin existed and we picked it up
(most Gerrit outages had been caused by GHE outages, and it felt weird
using GitHub OAuth). Good ol'
com.thesamet.gerrit.plugins.saml.SamlWebFilter, which was also on some
random github repo, had horrible documentation, didn't even build, and
took approximately a full day for some ADFS engineer I suckered in to
helping out to figure out how to configure the ADFS servers to talk to
this plugin.

With 2.16, the plugin story was getting better. The github and saml
plugins were both on gerrit-ci.gerritforge.com. The saml plugin was
now com.googlesource.gerrit.plugins.saml.SamlWebFilter. Was very
encouraging, but upgrading to it broke auth. Had to get another ADFS
engineer involved and he spent a couple hours figuring out what
changes were needed on the server.

With Gerrit-3.1, I wanted to try the gerrit-ci.gerritforge.com saml
plugin too. Too bad the build was red and had been. David Ostrovsky
helped me out. Ran with that plugin, but auth broke. Roped in the
same ADFS engineer as helped with 2.16. Found out it was the plugin
that was broken this time rather than that more server config needed
to be changed; the plugin itself on the master branch was missing an
important change that had been included in the 2.16 branch. David
Ostrovsky again helped me out, merging that change in to master and
giving me a new build. That one worked perfectly for us.

2) Owners was a huge problem.

We first installed the owners plugin with 2.15, where it was just for
a specific user with a special usecase. Discovered that the plugin
literally made CRs inaccessible with nasty stacktraces in the
error_log if you used certain features of the plugin. Filed some bugs
in the gerrit tracker (never got a response). Ended up using the
plugin anyway because of the importance of the usecase after first
verifying that the plugin would only be used in a very specific manner
that wouldn't trigger the bugs I saw on staging. Almost immediately
saw the plugin being adopted by dozens of folks for other cases, and
had to repeatedly warn folks to avoid certain constructs. People
complained about the owners plugin for years, especially the default
+2 requirement (and how unclear it was in the docs to attempt to allow
just a +1 requirement). The docs weren't the best either, assuming
that you'd use a submit_rule instead of a submit_filter. Had lots of
stacktraces in the error_log over the past couple years from this
plugin, but things seemed to mostly work. I was super uneasy with
this plugin the whole time.

gerrit-2.16 made this worse. I wanted to upgrade shortly after
release, but that release broke something with the owners plugin so it
needed to be updated. Watched and checked frequently at first, then
stopped at some point and 8-10 months later noticed there was an email
from Luca to this list saying there was now a CR which updated it. I
went to the CR and saw that it was abandoned with no link to any new
CRs. (I think it did get fixed not long after that specific time I
checked, but it was fixed differently and in a different CR. At the
very least, I didn't find it at that time.)

There were a couple times I started investigating an upgrade again (I
typically attempt a gerrit upgrade like 2-3 times for every time we
actually do the upgrade, but just run into issues and abort), and the
plugins as found on gerrit-ci.gerritforge.com suggested to me that it
wasn't building for various stable branches (or there was only a
master branch that I'd have to try my luck with). Of course, by the
time we were finally ready to upgrade, and the building of the owners
plugin had long since been sorted out upstream, I was at the point
where I wanted very badly to not ever touch that plugin again.

With the upgrade from 2.15->3.1, we decided to switch from owners to
find-owners. People have been *extremely* happy with the change, but
find-owners didn't have releases that worked with 3.1 on
gerrit-ci.gerritforge.com when I went looking. David Ostrovsky built
one for me when I asked; super helpful.

3) The typical fact that gerrit-ci.gerritforge.com will often have
plugins without a version for a given branch, and will often have red
builds, some dating back as much as a year. (And when I attempt to
"just use master" like I did with find-owners, I find it's been
adapted to some API break and thus will only work with development
versions of Gerrit, not with the stable releases I am trying.)

4) gitiles was once upon a time a concern. The configuration needed
to get it or gitblit or cgit going was kinda painful once upon a time,
but having gitiles pulled in as a core plugin was hugely beneficial in
reducing upgrade concerns.

5) The movement of hooks into a plugin, but more so the various
backward incompatible changes made to the hooks at the time delayed
previous upgrades. (On the positive side, those backward incompatible
changes did allow me to make the hooks cleaner and easier to maintain
in some cases). I think we found a bug somewhere that still causes
hooks to fire when they shouldn't (e.g. the hooks say that one type of
hook fires for direct pushes and a different hook fires for pushes to
refs/for/, but the actual behavior I observed was the refs/for/ hook
fired in both cases), but I didn't record the details and we just
temporarily worked around it at the time by disabling the hook for a
brief window. I know, lame report and I should dig up the details but
I'm just trying to answer "What plugins gave you trouble?"


Anyway, hope that helps somehow.

Matthias Sohn

May 21, 2020, 5:10:38 PM
to Elijah Newren, Luca Milanesio, David Pursehouse, AF, Repo and Gerrit Discussion
Are you aware of the plugins page on the gerrit home page [1]? We added that a couple of weeks back
in order to help users find plugins and their availability for different gerrit versions.
If you have any proposals how to improve Gerrit's plugin story let us know in the backlog
of the plugin working group [2].

[1] https://www.gerritcodereview.com/plugins.html
[2] https://bugs.chromium.org/p/gerrit/issues/list?q=label%3AHotlist-Plugin-Working-Group


-Matthias

Elijah Newren

May 21, 2020, 5:54:04 PM
to Matthias Sohn, Luca Milanesio, David Pursehouse, AF, Repo and Gerrit Discussion
On Thu, May 21, 2020 at 2:10 PM Matthias Sohn <matthi...@gmail.com> wrote:

> Are you aware of the plugins page on the gerrit home page [1] ? We added that a couple of weeks back
> in order to help users find plugins and their availability for different gerrit versions.
> If you have any proposals how to improve Gerrit's plugin story let us know in the backlog
> of the plugin working group [2].
>
> [1] https://www.gerritcodereview.com/plugins.html
> [2] https://bugs.chromium.org/p/gerrit/issues/list?q=label%3AHotlist-Plugin-Working-Group
>
> -Matthias

I think I saw it once, and noticed that it gives approximately the
same information that a stroll over to
https://gerrit-ci.gerritforge.com/ will provide. I couldn't find any
links to download from the page you provide whereas
gerrit-ci.gerritforge.com provides that very handy service. This page
does have three things that the jenkins build page doesn't provide:
number of changes, descriptions, and maintainers. So it's a useful
resource, and I'd probably utilize both going forward.

Also, that page just made me aware of the code-owners plugin, which
has a description that sounds identical to the reason find-owners was
created. I guess owners and find-owners aren't enough. ;-)

This page as well as gerrit-ci.gerritforge.com tell me that several
plugins are master-only, have red builds, are "maintained to varying
degrees, therefore the Gerrit project does not guarantee their
reliability"; all of which generally reinforce "use at your own risk".
That's fair and fine, but "use at your own risk" does make upgrades
difficult.

I really hope I'm not coming across as complaining. The fact that
you're building these pages, documenting who owns them, noting which
branches exist, building them in CI (and even exposing it on
gerrit-ci.gerritforge.com), responding to issues that are reported,
are all really big efforts that you all are making that I appreciate a
lot. It's a really huge service you all provide to the community.


The primary improvements I can think of (which obviously require
resources and have to be weighed against other endeavors) would be:
* create stable branches for plugins more proactively (e.g. saml and
find-owners) so I don't have to worry about development API breaks
preventing these plugins from working on stable releases
* address red builds of plugins more actively (e.g.
plugin-find-owners-bazel-master has no green builds since March of
last year, with a single red build this year).
* merge important fixes across relevant support branches (e.g. the
saml plugin on master was missing an important fix that had been
merged to the 2.16 branch).

Luca Milanesio

May 21, 2020, 6:21:54 PM
to Elijah Newren, Luca Milanesio, David Pursehouse, AF, Repo and Gerrit Discussion
Yes, I know. Whenever you tell the upper management that you need to upgrade, they come up with “why don’t we go with GHE, GitLab or BitBucket”?

The problem is: the upper management doesn’t know (or knows but pretends to ignore?) that *every* product needs upgrading, including GHE, GitLab and BitBucket.
We work with clients not only on Gerrit, but also on BitBucket and GHE: upgrading is never easy, even with commercial products.

With regards to Gerrit, have you ever seen a message on the mailing list saying “we have scheduled a maintenance window for Gerrit on gerrit-review.googlesource.com for upgrading from v3.0 to v3.1: it is planned to last for 1h and we will let you know on twitter on the status of the upgrade”? I do not remember ever having seen one since its inception 11 years ago.

Upgrading shouldn’t be an “extraordinary thing” to do in Gerrit, but a “normal maintenance operation” that we should do continuously, like we do exercise every day to keep ourself healthy and in shape.


> I might try to play the GerritForge/GerritHub "Gerrit+GHE" card if
> we're forced to make GHE the source of truth and even ask for approval
> for Enterprise Support (they did so with Reviewable.io on top of GHE
> after all), but for the most part I like to avoid political battles as
> much as I possibly can.

I have to say that the GitHub plugin in Gerrit is very popular *exactly* because it marries GitHub with Gerrit very nicely :-)
It’s OpenSource, you don’t need any contract to use it.


> Your point is well taken, though, and I am certainly keeping this in mind.

+1
I am feeling very guilty about that: yes, the documentation needs improving.
With regards to the build, I am also guilty: it is still done with Maven, because I used Lombok, which isn’t supported by Bazel yet.

But at least it is built and available on Gerrit-CI :-)

> Of course, there were also
> the breakages in the hooks API (some of which I thought made things
> better so I was sympathetic, but meant a lot of work for us to convert
> our hooks over) around that time, and the massive reindexing problems
> trying to skip from 2.11 to 2.13 or further.
>
> When we switched to 2.15, the saml plugin existed and we picked it up
> (most Gerrit outages had been caused by GHE outages, and it felt weird
> using GitHub OAuth).  Good ol'
> com.thesamet.gerrit.plugins.saml.SamlWebFilter, which was also on some
> random github repo, had horrible documentation, didn't even build, and
> took approximately a full day for some ADFS engineer I suckered in to
> helping out to figure out how to configure the ADFS servers to talk to
> this plugin.

Yeah, then DavidO and I have done quite a lot of modernisation on that: it is now in a much better shape I believe :-)


> With 2.16, the plugin story was getting better.  The github and saml
> plugins were both on gerrit-ci.gerritforge.com.  The saml plugin was
> now com.googlesource.gerrit.plugins.saml.SamlWebFilter.  Was very
> encouraging, but upgrading to it broke auth.  Had to get another ADFS
> engineer involved and he spent a couple hours figuring out what
> changes were needed on the server.
>
> With Gerrit-3.1, I wanted to try the gerrit-ci.gerritforge.com saml
> plugin too.  Too bad the build was red and had been.  David Ostrovsky
> helped me out.  Ran with that plugin, but auth broke.  Roped in the
> same ADFS engineer as helped with 2.16.  Found out it was the plugin
> that was broken this time rather than that more server config needed
> to be changed; the plugin itself on the master branch was missing an
> important change that had been included in the 2.16 branch.  David
> Ostrovsky again helped me out, merging that change in to master and
> giving me a new build.  That one worked perfectly for us.

I have to say that certain “classes” of plugin should be in a priority list, like the ones related to authentication.

Example: SAML and OAuth plugins are *fundamental* for making Gerrit work, they aren’t really an “optional” component. However, they aren’t core plugins yet.


> 2) Owners was a huge problem.
>
> We first installed the owners plugin with 2.15, where it was just for
> a specific user with a special usecase.  Discovered that the plugin
> literally made CRs inaccessible with nasty stacktraces in the
> error_log if you used certain features of the plugin.  Filed some bugs
> in the gerrit tracker (never got a response).  Ended up using the
> plugin anyway because of the importance of the usecase after first
> verifying that the plugin would only be used in a very specific manner
> that wouldn't trigger the bugs I saw on staging.  Almost immediately
> saw the plugin being adopted by dozens of folks for other cases, and
> had to repeatedly warn folks to avoid certain constructs.  People
> complained about the owners plugin for years, especially the default
> +2 requirement (and how unclear it was in the docs to attempt to allow
> just a +1 requirement).  The docs weren't the best either, assuming
> that you'd use a submit_rule instead of a submit_filter.  Had lots of
> stacktraces in the error_log over the past couple years from this
> plugin, but things seemed to mostly work.  I was super uneasy with
> this plugin the whole time.

Yes, the plugin was developed initially by VMware and hosted on a GitHub repository.
We onboarded it and did a lot of work in the past few years, but yes, it isn’t easy to use at all.
Mostly, in my opinion, because it is based on complex Prolog rules and, when it fails, it’s almost impossible to trace what’s going on.


> gerrit-2.16 made this worse.  I wanted to upgrade shortly after
> release, but that release broke something with the owners plugin so it
> needed to be updated.  Watched and checked frequently at first, then
> stopped at some point and 8-10 months later noticed there was an email
> from Luca to this list saying there was now a CR which updated it.  I
> went to the CR and saw that it was abandoned with no link to any new
> CRs.  (I think it did get fixed not long after that specific time I
> checked, but it was fixed differently and in a different CR.  At the
> very least, I didn't find it at that time.)

Support for v2.16 was indeed delayed, mostly because some parts of the plugin broke and because its way of creating “artificial labels” doesn’t work very well with PolyGerrit.
I believe a redesign based on the custom submit rules system in Gerrit v2.16 would make it easier to use in the future.


> There were a couple times I started investigating an upgrade again (I
> typically attempt a gerrit upgrade like 2-3 times for every time we
> actually do the upgrade, but just run into issues and abort), and the
> plugins as found on gerrit-ci.gerritforge.com suggested to me that it
> wasn't building for various stable branches (or there was only a
> master branch that I'd have to try my luck with).  Of course, by the
> time we were finally ready to upgrade, and the building of the owners
> plugin had long since been sorted out upstream, I was at the point
> where I wanted very badly to not ever touch that plugin again.
>
> With the upgrade from 2.15->3.1, we decided to switch from owners to
> find-owners.  People have been *extremely* happy with the change, but
> find-owners didn't have releases that worked with 3.1 on
> gerrit-ci.gerritforge.com when I went looking.  David Ostrovsky built
> one for me when I asked; super helpful.

I believe find-owners is maintained by Google, that means mostly on master.


> 3) The typical fact that gerrit-ci.gerritforge.com will often have
> plugins without a version for a given branch, and will often have red
> builds, some dating back as much as a year.  (And when I attempt to
> "just use master" like I did with find-owners, I find it's been
> adapted to some API break and thus will only work with development
> versions of Gerrit, not with the stable releases I am trying.)

Yes, see above the reason why.


> 4) gitiles was once upon a time a concern.  The configuration needed
> to get it or gitblit or cgit going was kinda painful once upon a time,
> but having gitiles pulled in as a core plugin was hugely beneficial in
> reducing upgrade concerns.
>
> 5) The movement of hooks into a plugin, but more so the various
> backward incompatible changes made to the hooks at the time delayed
> previous upgrades.  (On the positive side, those backward incompatible
> changes did allow me to make the hooks cleaner and easier to maintain
> in some cases).  I think we found a bug somewhere that still causes
> hooks to fire when they shouldn't (e.g. the hooks say that one type of
> hook fires for direct pushes and a different hook fires for pushes to
> refs/for/, but the actual behavior I observed was the refs/for/ hook
> fired in both cases), but I didn't record the details and we just
> temporarily worked around it at the time by disabling the hook for a
> brief window.  I know, lame report and I should dig up the details but
> I'm just trying to answer "What plugins gave you trouble?"

Yes, that is really useful, thanks a lot for sharing your experience.
I believe that my idea of defining a “priority list” of plugins could actually help prevent painful experiences.

Another *BIG TOPIC* that we are tackling is also the E2E testing of the “whole platform”, where we install and test Gerrit in a production-like setup and plugins on top.

We are currently working on that through the following:
- aws-gerrit (a real-life E2E setup of Gerrit in the cloud)
- gatling-git (a Gatling plugin to use Gerrit and Git API)
- E2E tests inside Gerrit and plugins

That would allow us to test the “whole integrated thing” and we are applying this methodology for the certification of Gerrit v3.2.0, due to be released on the 1st of June.
It should help alleviate future “pain-points” like the ones you have highlighted so far.

Thanks again for your comprehensive and honest feedback.

Luca.

Nasser Grainawi

May 21, 2020, 6:44:28 PM
to Repo and Gerrit Discussion
I don't think it's realistic to use gerrit-review as an example here, for several reasons. Chief among them: 1) it never uses stable branches and therefore doesn't do upgrades like anyone else, and 2) it runs on a custom Google backend and so, again, doesn't do upgrades like anyone else.
 

> Upgrading shouldn’t be an “extraordinary thing” to do in Gerrit, but a “normal maintenance operation” that we should do continuously, like we do exercise every day to keep ourself healthy and in shape.

I think that's what we'd all like, but there have been enough posts (and user summit talks ;-)) that I think it's clear it's not the current reality. I think improvements have been made, but there are likely more possible.
I completely second this. Feedback like this helps the community grow to support everyone. Thank you.
 

> Luca.

Luca Milanesio

May 21, 2020, 6:46:20 PM
to Elijah Newren, Luca Milanesio, David Pursehouse, AF, Repo and Gerrit Discussion
FYI, I’ve created this issue for the Gerrit Community, as a constructive way to improve the experience with plugins and their impact on the upgrade process:

Feel free to put your contributions to the issue.
The Gerrit Community Managers will review and take this action forward, with the cooperation of the entire Gerrit Community.

Again, thanks for speaking up :-)

Luca.

Elijah Newren

May 23, 2020, 5:16:34 PM
to Luca Milanesio, David Pursehouse, AF, Repo and Gerrit Discussion
On Thu, May 21, 2020 at 3:21 PM Luca Milanesio <luca.mi...@gmail.com> wrote:

> Upgrading shouldn’t be an “extraordinary thing” to do in Gerrit, but a “normal maintenance operation” that we should do continuously, like we do exercise every day to keep ourself healthy and in shape.

So, the expected state of affairs is that everyone knows it's
something we should do regularly, but we don't? Most people only
expend energy in the form of feeling regret and guilt that we aren't
doing what we are supposed to do?

Sorry, couldn't resist... ;-)

Luca Milanesio

May 23, 2020, 5:35:01 PM
to Elijah Newren, Luca Milanesio, David Pursehouse, AF, Repo and Gerrit Discussion


> On 23 May 2020, at 22:16, Elijah Newren <new...@gmail.com> wrote:
>
> On Thu, May 21, 2020 at 3:21 PM Luca Milanesio <luca.mi...@gmail.com> wrote:
>
>> Upgrading shouldn’t be an “extraordinary thing” to do in Gerrit, but a “normal maintenance operation” that we should do continuously, like we do exercise every day to keep ourself healthy and in shape.
>
> So, the expected state of affairs is that everyone knows it's
> something we should do regularly, but we don't?

;-)

> Most people only
> expend energy in the form of feeling regret and guilt that we aren't
> doing what we are supposed to do?

I believe the future would be … you shouldn’t worry about it at all: Gerrit should just auto-upgrade daily, without any downtime at all.
We’re working towards that, there is still a long way to go though, as you’ve noticed.

Anyway, feedback like yours allows us to make baby steps towards it, so thanks again.

Luca.

Nuno Costa

Aug 27, 2020, 1:15:08 PM
to Repo and Gerrit Discussion
Hi All,

Really informative post here. Congrats to everyone involved!! :)

I'm posting on this thread because it does have a lot of useful information for Upgrade procedures and I hope I can add some additional info.

We are working on our upgrade from 2.16.17 to 3.2.3 (2.16.22 >> 2.16.22 NoteDB >> 3.0.12 >> 3.1.8 >> 3.2.3) and have some questions related with below subjects.
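
The stepwise path above (the latest patch release of each minor series, in order) can be worked out mechanically. Below is a small Python sketch, assuming the rule is simply "stop at every minor series between the current and target version"; upgrade_path is a hypothetical helper, not part of Gerrit, and the list of releases has to be supplied by the caller. It does not model the NoteDB migration, which is an extra step while on 2.16.

```python
def parse(version: str) -> tuple:
    """Turn '3.1.8' into (3, 1, 8) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def upgrade_path(current: str, target: str, releases: list) -> list:
    """Return the newest release of each minor series after `current`,
    up to and including `target`, in upgrade order."""
    cur, tgt = parse(current), parse(target)
    latest = {}  # (major, minor) -> newest patch release seen in range
    for release in map(parse, releases):
        if cur < release <= tgt:
            series = release[:2]
            latest[series] = max(latest.get(series, release), release)
    return [".".join(map(str, latest[series])) for series in sorted(latest)]
```

With a release list containing 2.16.22, 3.0.12, 3.1.8 and 3.2.3 (plus any older patch releases), upgrade_path("2.16.17", "3.2.3", releases) gives the same sequence as listed above.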

>> I don't think the offline migration spewed any more errors than the
>> online migration wrote out to the error_log, though I didn't carefully
>> check and compare.  The only difference (other than online being
>> slower and writing to error_log rather than stdout) was that at the
>> end the online migration not only set primaryStorage to "note db" but
>> also marked disableReviewDb as true.  The offline migration only set
>> primaryStorage to "note db" while leaving disableReviewDb as false.
>
> That sounds like a bug to me, but I need to check the documentation if we mentioned anything to “set manually disableReviewDb to true”.

Is manually setting disableReviewDb = true really needed?
The 2.16.22 documentation [1] does not mention it.

We were also "hit" with orphan changes that are being detected during migration but according to change 235928 [2], those are being skipped.

[2020-08-26 20:21:46,397] [RebuildChange-24] WARN  com.google.gerrit.server.notedb.rebuild.NoteDbMigrator : Change 17589 previously failed to rebuild; skipping primary storage migration
com.google.gerrit.server.notedb.PrimaryStorageMigrator$NoNoteDbStateException: change 17589 has no note_db_state; rebuild it first
    at com.google.gerrit.server.notedb.PrimaryStorageMigrator$1.update(PrimaryStorageMigrator.java:299)
    at com.google.gerrit.server.notedb.PrimaryStorageMigrator$1.update(PrimaryStorageMigrator.java:289)
    at com.google.gwtorm.server.AbstractAccess.atomicUpdate(AbstractAccess.java:80)
    at com.google.gerrit.server.notedb.PrimaryStorageMigrator.setReadOnlyInReviewDb(PrimaryStorageMigrator.java:287)
    at com.google.gerrit.server.notedb.PrimaryStorageMigrator.migrateToNoteDbPrimary(PrimaryStorageMigrator.java:254)
    at com.google.gerrit.server.notedb.rebuild.NoteDbMigrator.lambda$setNoteDbPrimary$2(NoteDbMigrator.java:658)
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
    at com.google.gerrit.server.logging.LoggingContextAwareRunnable.run(LoggingContextAwareRunnable.java:83)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:646)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


We also saw a bunch of database errors after the ones above:

[2020-08-26 20:21:53,209] [RebuildChange-31] ERROR com.google.gerrit.server.notedb.rebuild.NoteDbMigrator : Error migrating primary storage for 67941
com.google.gwtorm.server.OrmException: Cannot open database connection
    at com.google.gwtorm.jdbc.Database.newConnection(Database.java:130)
    at com.google.gwtorm.jdbc.JdbcSchema.<init>(JdbcSchema.java:43)
    at com.google.gerrit.reviewdb.server.ReviewDb_Schema_GwtOrm$$6.<init>(Unknown Source)
    at com.google.gerrit.reviewdb.server.ReviewDb_Schema_GwtOrm$$6_Factory_GwtOrm$$7.open(Unknown Source)
    at com.google.gwtorm.jdbc.Database.open(Database.java:122)
    at com.google.gerrit.server.schema.NotesMigrationSchemaFactory.open(NotesMigrationSchemaFactory.java:64)
    at com.google.gerrit.server.schema.NotesMigrationSchemaFactory.open(NotesMigrationSchemaFactory.java:25)
    at com.google.gerrit.server.util.ManualRequestContext.<init>(ManualRequestContext.java:45)
    at com.google.gerrit.server.util.ManualRequestContext.<init>(ManualRequestContext.java:36)
    at com.google.gerrit.server.notedb.rebuild.NoteDbMigrator$ContextHelper.open(NoteDbMigrator.java:1049)
    at com.google.gerrit.server.notedb.rebuild.NoteDbMigrator.lambda$setNoteDbPrimary$2(NoteDbMigrator.java:656)
    at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
    at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
    at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
    at com.google.gerrit.server.logging.LoggingContextAwareRunnable.run(LoggingContextAwareRunnable.java:83)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:646)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure


In our case it seems we are not fully migrated, and maybe that's why we are not seeing disableReviewDb = true set in the notedb.config file.
Could the orphan changes also still be treated as errors and be "voting" against disableReviewDb = true being set?
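
A quick way to check the outcome of a migration run is to read the flag straight from notedb.config. The following is only a minimal sketch: it assumes the file is at $GERRIT_SITE/etc/notedb.config and that the flag is written as a "disableReviewDb = true" line, and the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if notedb.config marks ReviewDb as disabled.
# Assumes the flag appears as a "disableReviewDb = true" line (spacing may vary).
review_db_disabled() {
    # $1: path to notedb.config (assumed: $GERRIT_SITE/etc/notedb.config)
    grep -Eiq '^[[:space:]]*disableReviewDb[[:space:]]*=[[:space:]]*true[[:space:]]*$' "$1"
}

# Usage after a migration run:
#   review_db_disabled "$GERRIT_SITE/etc/notedb.config" && echo "ReviewDb disabled"
```

If the function fails right after migrate-to-note-db, that would match the situation described above, where the migration did not run to completion.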

We are preparing for a new migration test to see if we still have DB access issues during migration.
It should not have happened since DB was online all the time and it is located in the same instance as Gerrit. We are using default mysql connection timeout (8h).
We noticed that we only allowed 150 connections to DB. We will use the same value as PROD (900) for the new test.

Is there any other setting we should look into related to the "Cannot open database connection" error?

We also noticed that the status field in the DB is changed from A to n.
The command to get this information is: mysql> select * from changes where change_id='your_change_number_here' \G;
The documentation[3] mentions that this action is rolled back:
"Due to an implementation detail, writes to Changes or related tables still result in write calls to the database layer, but they are inside a transaction that is always rolled back"
Is it expected that the DB is written to and that the modification persists there after migration?

Does this mean that the notedb rollback script[4] is not enough for a successful rollback, and that a DB restore is also needed?

Regarding the notedb rollback script, it does not handle repositories that have a space in their name. I created issue[5] in the bug tracker.

> 6) Found in some issues somewhere that it's important to repack the
> All-Users.git repo after the NoteDb migration.  So, before continuing
> on to the 3.1 upgrade, you'll want to run
>
> git -C git/All-Users.git/ gc --aggressive
>
>
> NOT just once, but continuously after migrating to v2.16 onwards.
> All draft comments are stored in All-Users.git, for *ALL REPOS* and *ALL USERS*. That makes the repo grow indefinitely and always be very fragmented.
>
> We do an aggressive gc of All-Users.git every 15 minutes !
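
The periodic repack described in the quote above could be scripted roughly as follows. This is a sketch, not the setup used by the poster: the function name, the script path, and the /var/gerrit site directory are assumptions.

```shell
#!/bin/sh
# Sketch of a periodic repack of All-Users.git, meant to be called from cron.
# The function name and the paths in the comments below are assumptions.
gc_all_users() {
    # $1: Gerrit site directory containing git/All-Users.git
    repo="$1/git/All-Users.git"
    [ -d "$repo" ] || { echo "no such repository: $repo" >&2; return 1; }
    git -C "$repo" gc --aggressive --quiet
}

# Example crontab entry (every 15 minutes), assuming this file is saved as
# /usr/local/bin/gc-all-users.sh and ends with the call: gc_all_users /var/gerrit
# */15 * * * * /usr/local/bin/gc-all-users.sh
```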

Matthias Sohn

Aug 27, 2020, 6:06:28 PM8/27/20
to Nuno Costa, Repo and Gerrit Discussion
On Thu, Aug 27, 2020 at 7:15 PM Nuno Costa <nunoco...@gmail.com> wrote:
Hi All,

Really informative post here. Congrats to everyone involved!! :)

I'm posting on this thread because it does have a lot of useful information for Upgrade procedures and I hope I can add some additional info.

We are working on our upgrade from 2.16.17 to 3.2.3 (2.16.22 >> 2.16.22 NoteDB >> 3.0.12 >> 3.1.8 >> 3.2.3) and have some questions related to the subjects below.

I don't think the offline migration spewed any more errors than the
online migration wrote out to the error_log, though I didn't carefully
check and compare.  The only difference (other than online being
slower and writing to error_log rather than stdout) was that at the
end the online migration not only set primaryStorage to "note db" but
also marked disableReviewDb as true.  The offline migration only set
primaryStorage to "note db" while leaving disableReviewDb as false.

That sounds like a bug to me, but I need to check the documentation if we mentioned anything to “set manually disableReviewDb to true”.

Is manually setting disableReviewDb = true really needed?

If the migration didn't set disableReviewDb=true then it didn't finish successfully due to errors during the migration.
Corrupt changes which definitely cannot be migrated since some data is missing (e.g. change exists in reviewdb but corresponding ref is missing)
should not cause this but should be skipped while logging an error.
That's a bug. The migration creates one database connection per change and closes it again in a very rapid sequence.
If the operating system can't keep pace reclaiming resources this works only for a while until opening new connections.
We observed up to 27k concurrent database connections using netstat when errors occurred. This is fixed in [6] by
slicing the set of all changes into chunks and reusing one database connection per chunk of 1000 changes.

Also the other changes in this patch series might be interesting for you, they improve performance of the notedb migration.

Matthias Sohn

Aug 27, 2020, 6:07:40 PM8/27/20
to Nuno Costa, Repo and Gerrit Discussion
until opening new connections fails.

Nuno Costa

Aug 28, 2020, 2:20:06 PM8/28/20
to Repo and Gerrit Discussion
Hi Matthias, Thanks for your feedback.
You can check my replies inline.

On Thursday, 27 August 2020 at 23:06:28 UTC+1 Matthias Sohn wrote:
Is manually setting disableReviewDb = true really needed?

If the migration didn't set disableReviewDb=true then it didn't finish successfully due to errors during the migration.
Corrupt changes which definitely cannot be migrated since some data is missing (e.g. change exists in reviewdb but corresponding ref is missing)
should not cause this but should be skipped while logging an error.

Yes, I'm assuming the errors are related to the DB connection errors you mention below.
We ran a new test while checking the MySQL connections and never saw more than ~110 connections, and we have mysql max_connections set to 900. We will take a look with netstat.
We probably need to take a look at wait_timeout and interactive_timeout, since they are using the default (8h). Maybe setting these to 5m could help free connections the OS is not closing correctly.
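
For the netstat check, something like the following could summarize the sockets by state. It is a rough sketch that assumes Gerrit talks to MySQL over TCP on the default port 3306 and that netstat prints the usual Linux column layout; the function name is made up. Only the client side of each connection (foreign address on the DB port) is counted, so local connections are not counted twice.

```shell
#!/bin/sh
# Count client-side TCP sockets to the database, grouped by state.
# Reads `netstat -ant`-style lines on stdin; the function name and the
# default port 3306 are assumptions for this sketch.
count_db_sockets() {
    port="${1:-3306}"
    awk -v port="$port" '
        # Fields: proto recv-q send-q local-address foreign-address state
        $1 ~ /^tcp/ && $5 ~ (":" port "$") { n[$6]++ }
        END { for (s in n) print s, n[s] }
    ' | LC_ALL=C sort
}

# Usage on a live host: netstat -ant | count_db_sockets 3306
```

A large TIME_WAIT count next to a small ESTABLISHED count might explain why the MySQL-side connection count stays low while opening new connections still fails.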

Thanks for pointing out that bug. I'm already following it.

Another thing I was wondering about is the note_db_state field in the changes DB table, which has different values.
From what I understand:
* N means migrated to NoteDB.
* NULL means orphan or deleted change. No migration is done.

The only value I don't understand is the one containing a SHA-1 hash. I noticed that the hash is the same one that is written to the refs/changes/XX/XXXXX/meta file.
I would assume that after writing to the file, note_db_state would be changed to N, but it is not.

I tested deleting the hash from a particular change and rerunning the migration only for that change, but it created a different hash and did not change to N.
I also did a migration of that entire project and the hashes are still there. I see some warnings related to changes already having note_db_state = N, but that is expected.

Can you clarify the possible values of the note_db_state field, their meaning, and whether my assumptions above are correct?
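
To see how the values are distributed across all changes, a tally along these lines may help. It is a sketch only: the category names are made up, "sha1" just means the value starts with 40 hex digits, and the database name reviewdb in the usage comment is an assumption.

```shell
#!/bin/sh
# Tally note_db_state values, one value per input line.
# Category names are made up; "sha1" only means "starts with 40 hex digits".
tally_note_db_state() {
    awk '
        $0 == "N"    { n["N"]++; next }
        $0 == "NULL" { n["NULL"]++; next }
        # 40 leading characters, none of them outside 0-9a-f
        length($0) >= 40 && substr($0, 1, 40) !~ /[^0-9a-f]/ { n["sha1"]++; next }
        { n["other"]++ }
        END { for (k in n) print k, n[k] }
    ' | LC_ALL=C sort
}

# Usage (mysql batch mode prints SQL NULL as the text NULL):
#   mysql -N -B -e "SELECT note_db_state FROM changes" reviewdb | tally_note_db_state
```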

Have a great weekend!!

Thanks,

Nuno Costa

Sep 3, 2020, 7:25:17 AM9/3/20
to Repo and Gerrit Discussion
Some updates:

Setting wait_timeout did not help. We tried with 10 and got "prepare SQL" errors:

[2020-08-31 16:53:57,429] [RebuildChange-35] ERROR com.google.gerrit.server.notedb.rebuild.NoteDbMigrator : Error migrating primary storage for 137103
com.google.gwtorm.server.OrmException: prepare SQL
SELECT T.change_key,T.created_on,T.last_updated_on,T.owner_account_id,T.dest_project_name,T.dest_branch_name,T.status,T.current_patch_set_id,T.subject,T.topic,T.original_subject,T.submission_id,T.assignee,T.is_private,T.work_in_progress,T.review_started,T.revert_of,T.note_db_state,T.row_version,T.change_id FROM changes T WHERE T.change_id=?
 failure on changes
        at com.google.gwtorm.schema.sql.SqlDialect.convertError(SqlDialect.java:162)
        at com.google.gwtorm.schema.sql.DialectMySQL.convertError(DialectMySQL.java:232)
        at com.google.gwtorm.jdbc.JdbcAccess.convertError(JdbcAccess.java:489)
        at com.google.gwtorm.jdbc.JdbcAccess.prepareStatement(JdbcAccess.java:96)
        at com.google.gerrit.reviewdb.client.Change_Access_changes_GwtOrm$$1.get(Unknown Source)
        at com.google.gerrit.reviewdb.client.Change_Access_changes_GwtOrm$$1.get(Unknown Source)
        at com.google.gwtorm.server.AbstractAccess.atomicUpdate(AbstractAccess.java:76)
        at com.google.gerrit.server.notedb.PrimaryStorageMigrator.setReadOnlyInReviewDb(PrimaryStorageMigrator.java:287)
        at com.google.gerrit.server.notedb.PrimaryStorageMigrator.migrateToNoteDbPrimary(PrimaryStorageMigrator.java:254)
        at com.google.gerrit.server.notedb.rebuild.NoteDbMigrator.lambda$setNoteDbPrimary$2(NoteDbMigrator.java:658)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
        at com.google.gerrit.server.logging.LoggingContextAwareRunnable.run(LoggingContextAwareRunnable.java:83)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:646)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed.
        at sun.reflect.GeneratedConstructorAccessor54.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
        at com.mysql.jdbc.Util.getInstance(Util.java:408)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:918)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:897)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:886)
        at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:860)
        at com.mysql.jdbc.ConnectionImpl.throwConnectionClosedException(ConnectionImpl.java:1187)
        at com.mysql.jdbc.ConnectionImpl.checkClosed(ConnectionImpl.java:1182)
        at com.mysql.jdbc.ConnectionImpl.prepareStatement(ConnectionImpl.java:4068)
        at com.mysql.jdbc.ConnectionImpl.prepareStatement(ConnectionImpl.java:4037)
        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at net.bull.javamelody.JdbcWrapper$ConnectionInvocationHandler.invoke(JdbcWrapper.java:202)
        at net.bull.javamelody.JdbcWrapper$DelegatingInvocationHandler.invoke(JdbcWrapper.java:300)
        at com.sun.proxy.$Proxy18.prepareStatement(Unknown Source)
        at com.google.gwtorm.jdbc.JdbcAccess.prepareStatement(JdbcAccess.java:94)
        ... 18 more
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

The last packet successfully received from the server was 47,341 milliseconds ago.  The last packet sent successfully to the server was 1 milliseconds ago.
        at sun.reflect.GeneratedConstructorAccessor45.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at com.mysql.jdbc.Util.handleNewInstance(Util.java:425)
        at com.mysql.jdbc.SQLError.createCommunicationsException(SQLError.java:989)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3559)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3459)
        at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3900)
        at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2527)
        at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2680)
        at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2494)
        at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1858)
        at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1966)
        at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at net.bull.javamelody.JdbcWrapper.doExecute(JdbcWrapper.java:422)
        at net.bull.javamelody.JdbcWrapper$StatementInvocationHandler.invoke(JdbcWrapper.java:142)
        at net.bull.javamelody.JdbcWrapper$DelegatingInvocationHandler.invoke(JdbcWrapper.java:300)
        at com.sun.proxy.$Proxy83.executeQuery(Unknown Source)
        at com.google.gwtorm.jdbc.JdbcAccess.queryOne(JdbcAccess.java:119)
        ... 18 more
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
        at com.mysql.jdbc.MysqlIO.readFully(MysqlIO.java:3011)
        at com.mysql.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:3469)
        ... 33 more

We did not test with higher values and went back to the MySQL default.

From then on I monitored the DB connections, and the count never went over ~101 connections.

After some research that led nowhere, and based on the change Matthias shared in the previous comment, I looked into the Gerrit database configuration and tested setting
database.connectionPool = true.
According to the documentation[7], it defaults to false when using MySQL.

Our team had the impression that this setting wouldn't work with MySQL and that would be the reason for it to be disabled by default.

After using the rollback script for repository cleanup and restoring the DB, we started a new migration process (without reindex). We did not have any issues with database connections, and the migration process finished successfully (after ~1h20m) with disableReviewDb set to true.

Is connectionPool = true supposed to work with MySQL?

In our case, it was a workaround until change 278937 (and all related ones) is merged and available.
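
For anyone wanting to try the same workaround, the setting can be written with git config, since gerrit.config uses git-config syntax. This is a minimal sketch; the function name is hypothetical and the site directory is whatever your installation uses.

```shell
#!/bin/sh
# Sketch: enable the database connection pool in gerrit.config before migrating.
# gerrit.config uses git-config syntax, so `git config -f` can edit it.
# The function name is hypothetical.
enable_connection_pool() {
    # $1: Gerrit site directory containing etc/gerrit.config
    git config -f "$1/etc/gerrit.config" database.connectionPool true
}

# Usage, with Gerrit stopped: enable_connection_pool "$GERRIT_SITE"
```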

Regards,
Nuno

[7] https://gerrit-documentation.storage.googleapis.com/Documentation/2.16.22/config-gerrit.html#database.connectionPool