Qualcomm 2.7 -> 3.3+ Data Upgrade Status, Issues, and Plans


Martin Fick

unread,
Jan 15, 2021, 11:34:56 PM
to repo-d...@googlegroups.com

In an effort to get Qualcomm's primary Gerrit instance off of our internal Gerrit 2.7 fork, we have been investigating what it would look like to perform the data upgrade to a recent stable version of Gerrit.

 

While we are not ready yet to get off our fork, we wanted to get ahead and see if the Gerrit upgrade process is ready for us. As you might suspect, it is not, and we will be working to improve this. However, we have had some good testing success, and I want to share that. During our upcoming journey to improve things, I wanted to engage the community and keep you updated on our status, objectives, investigations, and thoughts for improvements. So here it goes!

 

 

 

--- Status ---

 

 

We have successfully upgraded an old copy of our data (3/4 the current size, ~2.4M changes, ~5M patchsets, ~11K projects) all the way to Gerrit 3.3.x, and it seems to run fine! There are some issues during the upgrades, but in the end we are able to query data, run the WUI, and see changes. We were even able to install and run some of our custom upgraded plugins! Seeing our data run in the beautiful new WUI felt like a major milestone, and I am very excited about it. I am hoping that this will light a bit of a fire under us to get there sooner rather than later.

 

Having successfully upgraded our real data set gives us a lot more confidence that the upgrade will be possible, and it helped identify the most important areas to start working on to make the upgrade something we can eventually perform on our production system. One of our initial fears was that, since our data set is so old (likely the oldest Gerrit data set in the world), some of the data would cause serious upgrade issues. It is nice to know that this does not appear to be the case, a testament to the quality of the upgrade code!

 

Of course, I did start this email by saying that the process is not ready for us. As you can probably guess, that is because it takes too long (over a week), which will not work for us in production. We are focusing our near-term efforts on performance.

 

 

 

--- Objective ---

 

 

To be able to upgrade our primary Gerrit server data from version 2.7 to the latest stable release of Gerrit with a maximum of 4 hours of offline data processing, using the standard upgrade processes documented in the Gerrit docs.

 

 

We want to be able to achieve this without:

 

1) Running any versions other than our 2.7 fork and the latest stable version (currently 3.3) of Gerrit in production. We specifically do NOT want to have to run Gerrit 2.16 in production, as we cannot port our internal plugins to 2.16. This means we cannot perform the online NoteDb migration.

 

Because the online re-indexing can only be done from the previous version, and because we don't even have a version with the index, we must perform offline indexing during our downtime.

 

 

2) Running Gerrit in production in a read-only mode (as we would consider that downtime).

 

 

3) Requiring complicated custom steps for our Gerrit admins to perform in order to make things go faster than what Gerrit can do out of the box.

 

 

 

--- Current Downtime Estimate ---

 

 

Current rough estimates indicate that, using a normal upgrade process, it would likely take over a week (but under two weeks) of downtime to upgrade our data set to Gerrit 3.3.

 

 

We took the data and upgraded it to 2.16, one major version at a time, then completed a NoteDb migration, and then continued to upgrade to 3.3 one version at a time, and finally we indexed the data.

 

 

Here are the approximate times each step took:

upgrade to gerrit-2.9.5.war         # < 5 min
upgrade to gerrit-2.10.8.war        # < 5 min
upgrade to gerrit-2.11.12.war       # ~ 5 min
upgrade to gerrit-2.12.9.war        # ~25 min
upgrade to gerrit-2.13.14.war       # ~ 1 hour
upgrade to gerrit-2.14.22.war       # ~20 min
upgrade to gerrit-2.15.21.war       # ~ 3 hours
upgrade to gerrit-2.16.25.war       # ~10 min
NoteDb migration gerrit-2.16.25.war # ~ 3 days (32 threads)
upgrade to gerrit-3.0.15.war        # < 1 min
upgrade to gerrit-3.1.11.war        # < 1 min
upgrade to gerrit-3.2.6.war         # < 1 min
upgrade to gerrit-3.3.1.war         # < 1 min
indexing gerrit-3.3.1.war           # ~ 2 days (32 threads)


This adds up to around 5 days, and would likely be just over a week with our current full data set, which will grow further by the time we are actually ready to upgrade. Not all of these numbers (especially the largest ones) have been reproduced yet, so they may vary.
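The sequence above can be sketched as a shell loop. The site path is an assumption, and the actual `java -jar ... init` / `migrate-to-note-db` / `reindex` invocations are shown commented out so the sketch can be dry-run safely:

```shell
#!/bin/sh
# Sketch of the stepwise offline upgrade order described above.
# SITE is an illustrative path; uncomment the java lines to run for real.
SITE=/srv/gerrit/review_site
WARS="gerrit-2.9.5.war gerrit-2.10.8.war gerrit-2.11.12.war \
gerrit-2.12.9.war gerrit-2.13.14.war gerrit-2.14.22.war \
gerrit-2.15.21.war gerrit-2.16.25.war"
for WAR in $WARS; do
  echo "upgrading with $WAR"
  # java -jar "$WAR" init --batch -d "$SITE"
done
# After 2.16: offline NoteDb migration, then 3.0 -> 3.3, then a full reindex:
# java -jar gerrit-2.16.25.war migrate-to-note-db -d "$SITE"
# java -jar gerrit-3.3.1.war reindex --threads 32 -d "$SITE"
```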

 

 

 

--- Errors/Issues ---

 

 

1) The default groups column length for the patch_sets table in 2.12 is not long enough; it requires manual setup to make it longer.
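A hypothetical shape of the manual fix, assuming a PostgreSQL ReviewDb; the new column length and the psql invocation are illustrative assumptions, not the values actually used:

```shell
# Hypothetical manual widening of patch_sets.groups before the 2.12 schema
# upgrade. The length 1023 and the psql call are illustrative only; check
# your own ReviewDb dialect and required length first.
SQL='ALTER TABLE patch_sets ALTER COLUMN groups TYPE VARCHAR(1023);'
echo "$SQL"
# psql reviewdb -c "$SQL"
```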

 

 

2) The PatchListLoader times out on over 13K changes in our data. I did not try altering diff.timeout.

 

 

WARN com.google.gerrit.server.patch.PatchListLoader : 5000 ms timeout reached for Diff loader in project ...
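A sketch of raising that timeout, assuming the 5000 ms in the warning corresponds to diff.timeout in gerrit.config; gerrit.config uses git-config syntax, so `git config -f` can edit it (the site path and the 60 s value are illustrative):

```shell
# Raise the diff timeout before migration/indexing. Path and value are
# illustrative assumptions; gerrit.config is git-config format.
SITE=./review_site
mkdir -p "$SITE/etc"
git config -f "$SITE/etc/gerrit.config" diff.timeout "60 s"
git config -f "$SITE/etc/gerrit.config" --get diff.timeout
```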

 

 

3) The NoteDb leases time out during migration; is there some way to tweak that?

 

 

com.google.gwtorm.server.OrmRuntimeException: read-only lease on change 1031444 expired at 2020-12-24

 

 

4) The 3.0.15 upgrade tries to index our data automatically; there doesn't seem to be a switch to prevent this?

 

 

 

--- Efforts To Reduce Downtime ---

 

 

Since the migration and indexing currently take much longer than everything else, we will focus on improving those times first.

 

 

1) Since both the NoteDb migration and the indexing resulted in timeouts from the PatchListLoader, I realized it might be possible to preload the diff cache using a Gerrit query with --patch-sets and --files. I am hoping this will not only eliminate the timeouts but also drastically reduce the migration and indexing times. Has anyone tried this? I think others have mentioned it before.
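A hedged sketch of that warm-up idea: asking for file lists on every patch set of every change should force Gerrit to compute (and persist) each diff once, ahead of the downtime window. The host, port, and status filter below are assumptions for illustration:

```shell
# Hypothetical diff-cache warm-up via the Gerrit SSH query interface.
# Host, port, and the status filter are assumptions; run against a test
# copy first. The command is built as a string and only echoed here.
GERRIT_QUERY="ssh -p 29418 admin@gerrit.example.com gerrit query"
WARM_CMD="$GERRIT_QUERY --format=JSON --patch-sets --files status:merged"
echo "$WARM_CMD"
# eval "$WARM_CMD" > /dev/null
```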

 

 

2) Offline NoteDb does a lot of DB work that is likely not needed since the DB is "thrown away" after the migration. We are investigating if we can avoid this: https://gerrit-review.googlesource.com/c/gerrit/+/291162

 

However, in order to reach our sub 4 hour goal, some of the other steps will likely also need to be sped up eventually.

 

 

 

Well, there you go, that's what we have for now! I will keep you updated with any new milestones we reach,

 

 

 

-Martin

 

--

--

The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

 

Edwin Kempin

unread,
Jan 18, 2021, 3:01:13 AM
to Martin Fick, Repo and Gerrit Discussion
Thanks, Martin, for sharing this!
Very interesting to read, and I'm happy that you have already made some good progress on your upgrade.

--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/repo-discuss/3001721.A5hEmuvmuJ%40mfick-lnx.

Saša Živkov

unread,
Jan 18, 2021, 8:44:02 AM
to Martin Fick, Repo and Gerrit Discussion
On Sat, Jan 16, 2021 at 5:34 AM Martin Fick <mf...@codeaurora.org> wrote:
From 2.14 we introduced the diff_summary cache, specifically to help with faster reindexing, whose performance is mostly dominated by JGit operations.
When reindexing a change, we need to compute a diff in JGit only to find the list of affected paths and the number of changed lines.
While this info can also be extracted from the diff cache, the idea behind the diff_summary cache was to make it stable, in the sense that it ideally never has to be invalidated and thus recomputed.
If you pre-populate diff_summary (and make diskLimit large enough to never evict entries from it), then reindexing should be significantly faster.
Maybe the simplest way of doing that is to create a copy of your production server and run reindexing once; then the diff_summary cache will be populated.
For the production system upgrade, copy the diff_summary cache there before starting reindexing.
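A sketch of the suggested setting, assuming gerrit.config's `[cache "diff_summary"]` section syntax; the site path and the 50g value are illustrative assumptions:

```shell
# Make the diff_summary disk cache large enough that entries are never
# evicted, per the suggestion above. Path and size are illustrative.
SITE=./review_site
mkdir -p "$SITE/etc"
git config -f "$SITE/etc/gerrit.config" cache.diff_summary.diskLimit 50g
git config -f "$SITE/etc/gerrit.config" --get cache.diff_summary.diskLimit
```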

 


Martin Fick

unread,
Jan 19, 2021, 2:24:52 PM
to Saša Živkov, Repo and Gerrit Discussion

Sasa, thanks for your thoughts!

 

On Monday, January 18, 2021 2:43:20 PM MST Saša Živkov wrote:
> On Sat, Jan 16, 2021 at 5:34 AM Martin Fick <mf...@codeaurora.org> wrote:
> > 1) Since both the NoteDb migration and the indexing resulted in timeouts
> > for the PatchListLoader it got me to realize that it might be possible to
> > preload the diff cache using a gerrit query with --patch-sets and --files.
> > I am hoping that not only should this eliminate these timeouts, but that it
> > likely may drastically reduce the migration and indexing times. Has anyone
> > tried this, I think others have mentioned this before?
>
> From 2.14 we introduced the diff_summary cache, specifically to help with
> faster reindexing whose performance is mostly dominated by the JGit
> operations.
> When reindexing a change, we need to compute diff in JGit only in order to
> find: the list of affected paths and the number of changed lines.
> While these infos can also be extracted from the diff cache, the idea
> behind the diff_summary cache was to make it stable in the sense that it,
> ideally, never has to get invalidated and thus recomputed again.
> If you pre-populate the diff_summary (make diskLimit large enough to never
> evict entries from it) then the reindexing should be significantly faster.

 

Does this mean that the diff_summary cache is required, or is the diff_summary cache smart enough to use/fall back to the diff cache if it exists?

 

> Maybe the simplest way of doing that is to create a copy of your production
> server and run reindexing once. Then the diff_summary cache will be
> populated.

 

I suspect this would also require first upgrading our copy of the data to 2.14 before we could run the indexing?

 

 

-Martin

Saša Živkov

unread,
Jan 19, 2021, 4:41:45 PM
to Martin Fick, Repo and Gerrit Discussion
On Tue, Jan 19, 2021 at 8:24 PM Martin Fick <mf...@codeaurora.org> wrote:
> Does this mean that the diff_summary cache is required, or is the
> diff_summary cache smart enough to use/fallback to the diff cache if it
> exists?


Take a look at the DiffSummaryLoader. It will always load via the diff cache, so yes, it will use/fall back to it.
However, the diff cache got invalidated many times, whenever the PatchListKey and/or the PatchList (the value) changed.
Just check PatchListKey.serialVersionUID; it is far larger than 1. So relying on the diff cache for faster reindexing
didn't always work, as the whole cache often got invalidated on the next major release.

 

> > Maybe the simplest way of doing that is to create a copy of your production
> > server and run reindexing once. Then the diff_summary cache will be
> > populated.
>
> I suspect this would also require first upgrading our copy of the data to
> 2.14 before we could run the indexing?

 
You can also upgrade all the way to 3.3 on a copy of your production server, and reindex there. This will build the diff_summary (and the diff)
caches. Then copy these two caches from there to your production server during the upgrade, i.e. when you reach the 3.3 version and just
before running reindexing. It will not be 100% complete, but it will cover close to 99% of the changes.
 

 

 


Joe Hicks

unread,
Jan 19, 2021, 11:41:13 PM
to Repo and Gerrit Discussion
Thanks Martin!

I am just wondering what hardware you are running the upgrade on, and if there are different hardware or cloud instances that would make it faster?

Thanks again,

Joe

Martin Fick

unread,
Jan 20, 2021, 5:48:56 PM
to Saša Živkov, Repo and Gerrit Discussion

OK, I see that now; the extra context and explanation help me understand it, thanks!

 

> > I suspect this would also require first upgrading our copy of the data to
> > 2.14 before we could run the indexing?
>
> You can also upgrade all the way to 3.3, on a copy of your production
> server, and reindex there. This will build the diff_summary (and the diff)
> caches. Then copy these two caches from there to your production server
> during the upgrade i.e. when you reach the 3.3 version and just
> before running reindexing. It will not be 100% complete but it will cover
> close to 99% of the changes.
 

I will be trying this in testing to see how much it helps. Unfortunately, making a copy of the data from a live system takes close to a day, and it is hard to get a copy of both the repos and the DB from a live server that are in sync with each other. Then performing the upgrade takes a week, so as a final solution this would not fit my criterion below of not:

 

> 3) Having complicated custom steps for our Gerrit admins to have to perform
> in order to make things go faster than what Gerrit can do out of the box.

If the results show this to be valuable (as it should be), then I will look for a way to make it something Gerrit can do out of the box, though I am not sure how. Since the diff cache has not changed since 2.16, we are toying with the idea of backporting the diff cache from 2.16 to our fork so that we can create the needed diffs on the live production data. However, perhaps it makes more sense to backport the diff_summary cache? Maybe this would make sense for any release prior to 2.16, as a way to make it possible for anyone on 2.7 or higher to upgrade in a timely manner? What do you think?

Saša Živkov

unread,
Jan 21, 2021, 8:01:08 AM
to Martin Fick, Repo and Gerrit Discussion
Can't you use file system snapshot(s) to create a copy of the live system, with only a tiny downtime?
What our admins do is:
1. stop the production server
2. create a file system snapshot
3. start the production server
4. copy the snapshot created in step 2 to another machine

Steps 1-3 take only a few minutes. I don't really know all the technical details of snapshot creation, but if necessary I can check and provide more info here.
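The four steps above can be outlined as a script. The volume names are assumptions, and the real commands (which need root) are shown as comments so the outline can be dry-run:

```shell
# Sketch of the four-step snapshot copy described above. Volume and site
# names are illustrative assumptions; real commands are in the comments.
snapshot_copy() {
  echo "1. stop gerrit"    # $SITE/bin/gerrit.sh stop
  echo "2. snapshot"       # lvcreate --snapshot -L 50G -n gerrit-snap /dev/vg0/gerrit
  echo "3. start gerrit"   # $SITE/bin/gerrit.sh start
  echo "4. copy snapshot"  # mount the snapshot and rsync it to the upgrade host
}
snapshot_copy
```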


 

> If the results show this to be valuable (as it should), then I will be
> looking for a way to make this something that Gerrit can do out of the box
> somehow. [...] However, perhaps it makes more sense to backport the
> diff_summary cache?

Yes, if you go that direction then I also believe that back-porting the diff_summary cache to 2.7 is a better path.

> Maybe this would make sense to be able to do for any release prior to 2.16
> as a way to make it possible for anyone with 2.7 or higher to upgrade in a
> timely manner? What do you think?

 

 


Oswald Buddenhagen

unread,
Jan 22, 2021, 5:23:27 AM
to repo-d...@googlegroups.com
On Thu, Jan 21, 2021 at 02:00:26PM +0100, Saša Živkov wrote:
>Can't you use file system snapshot(s) to create a copy of the live
>system,
>
provided the files do actually live on a copy-on-write fs or volume
manager that supports the feature.

>with only a tiny downtime.
>
why downtime at all? if gerrit is properly transactional by now (i
remember times when it definitely wasn't, so ...), then you can just
snapshot a running system.

Saša Živkov

unread,
Jan 22, 2021, 6:14:43 AM
to Repo and Gerrit Discussion
2.7 is "still" using reviewdb. We never had atomic updates of git + reviewdb. 



Martin Fick

unread,
Jan 25, 2021, 3:50:28 PM
to repo-d...@googlegroups.com, Joe Hicks

On Tuesday, January 19, 2021 5:36:44 PM MST 'Joe Hicks' via Repo and Gerrit Discussion wrote:
> I am just wondering what hardware you are running the upgrade on and if
> there is some different hardware / cloud instances that would make it
> faster?

 

My current test host is a 32-CPU host with around 130GB of RAM, using LVM snapshots. It's a somewhat outdated machine, yet still quite beefy. A more modern machine can probably scale a bit better with more CPUs, and possibly with more RAM. Using SSDs would likely help too. We will be using NFS for production, which could increase IO latency but potentially also increase throughput over a local disk.

 

I have not had any time yet to reproduce these numbers with different settings to see how much more parallelism things might be able to handle; I was using 32 threads for both migration and indexing, which should be pushing things quite a bit already. I will probably try a few variations eventually, but as you can imagine, permutations take a while to perform, so that approach is a slow one. It would also be useful to try these on a decent NFS system; I will pursue some of these ideas too. For now I am hoping to focus on improving the codebase and the techniques, to make things faster on all hardware.

 

We will also try to take advantage of any benefits that a cloud can provide. Currently I don't believe the Gerrit upgrade processes are designed to take advantage of a cloud. Since the migration and indexing are currently the two slowest pieces, does anyone know if it would be possible to make them run cooperatively across multiple nodes at the same time?

 

1) It seems that the NoteDb migration code is meant to handle concurrent operation with a live system; does that mean the offline migration could also be run on more than one host at the same time? Would that result in duplicated work, or would the code just skip already-migrated changes?

 

2) Is it possible to slice up indexing across multiple nodes and then re-merge the indexes from those nodes using a tool like this: https://lucene.apache.org/solr/guide/6_6/merging-indexes.html

ravirajk...@gmail.com

unread,
Jan 29, 2021, 3:36:30 AM
to Repo and Gerrit Discussion
Hi Martin,

If your data has more recent commits than your DB thinks it has, shouldn't that be a problem for Gerrit in the test environment?

Kaushik Lingarkar

unread,
Mar 30, 2021, 3:01:47 PM
to Repo and Gerrit Discussion
I wanted to post an update here: we tried doing a NoteDb migration with the diff and diff_summary
caches populated. We were expecting the migration to go faster with those populated; however,
they didn't impact the migration time. The caches were populated by doing a reindex on 2.16.

Han-Wen Nienhuys

unread,
Mar 30, 2021, 3:34:38 PM
to Kaushik Lingarkar, Repo and Gerrit Discussion
I guess you'd have to do some profiling / timing to see where the time goes.

Random related question, though: do you have many changes with large binary files? We found a few places where we mistakenly do rename detection, which slows things down by a lot.


--
Han-Wen Nienhuys - Google Munich
I work 80%. Don't expect answers from me on Fridays.
--

Google Germany GmbH, Erika-Mann-Strasse 33, 80636 Munich

Registergericht und -nummer: Hamburg, HRB 86891

Sitz der Gesellschaft: Hamburg

Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Nasser Grainawi

unread,
Mar 30, 2021, 3:37:06 PM
to Han-Wen Nienhuys, Kaushik Lingarkar, Repo and Gerrit Discussion
We probably do have quite a few in some repos. What qualifies as a large binary for this question? Can I run something to detect these?




Han-Wen Nienhuys

unread,
Mar 30, 2021, 3:43:37 PM
to Nasser Grainawi, Kaushik Lingarkar, Repo and Gerrit Discussion
We shortcut diffing for files over 50M. https://git.eclipse.org/r/c/jgit/jgit/+/177553 adds this behavior for renames too, and you could reduce the limit at which it kicks in.

We have an outstanding bug to also skip rename detection for APK files (or any other binary files) regardless of size, which should be fixed in O(weeks).

I don't know if rename detection triggers in the code path for indexing/upgrading, though. You'd have to check by instrumenting Gerrit.
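One way to answer the "can I run something to detect these?" question above is to list the biggest blobs in a repo with stock git plumbing; the 20-entry cutoff is arbitrary, and the threshold for "large" is up to you:

```shell
# List the largest blobs in a repo (candidates for the >50M diff shortcut
# mentioned above). Takes the repo path as $1; prints "size path" pairs,
# biggest first, top 20.
largest_blobs() {
  git -C "$1" rev-list --objects --all |
    git -C "$1" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print $2, $3 }' |
    sort -rn | head -20
}
```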
