Important info for upgrading to 2.12 and online reindexing


Saša Živkov

May 2, 2016, 7:55:27 AM
to repo-d...@googlegroups.com
If you are running a large Gerrit server with high load, are considering upgrading to 2.12
*and* rely on online reindexing then this info is important to you.

We did the upgrade last week and experienced some critical issues which we didn't see
when testing the upgrade on a test Gerrit server.

Due to an issue mentioned in [1], discussed further in [2] and also [3] we had to restart Gerrit
every 2-3 hours because an "NRT open/closed" thread died due to a StackOverflowError.
The problem was caused by deep nesting of Lucene IndexReaders, which seems to
happen only under high load (which is why we didn't hit this issue on our test Gerrit server)
and only when Lucene 5 has to read/write Lucene version 4 indexes.

The Gerrit server was slow most of the time and our users kept complaining.
After every restart the online reindexing started from scratch and couldn't finish before the next restart, so we had to look for other options.

We resolved the issue in the following way:
* Create a copy of the production server on another machine at time T.
* Run offline reindexing on the copy.
* When the offline reindexing is done:
  * Shut down the production Gerrit server.
  * Copy the Lucene changes_0025 index from the copy to the production instance, but
    keep index version 14 active.
  * Start the production instance.
  * Find all changes which were updated (or created) since time T:
    "select change_id from changes where last_updated_on >= T"
    and reindex these changes via REST API calls. We ran 30 parallel requests the whole time
    and it took about 10 minutes.
  * Stop the production Gerrit server.
  * Activate index version 25 and deactivate version 14.
  * Start the production instance.
* Done: the main problem is gone and Gerrit is smooth and fast again.
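
The "reindex these changes via REST API calls" step could be scripted roughly as follows. This is a minimal sketch, not the script used in this thread: it assumes the ids returned by the SQL query are fed to Gerrit's `POST /changes/{change-id}/index` endpoint under the authenticated `/a/` prefix, and that the server accepts HTTP Basic authentication (an assumption; some setups require digest authentication instead). The helper names `make_rest_reindexer` and `reindex_all` are hypothetical.

```python
import base64
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def make_rest_reindexer(base_url, user, http_password):
    """Return a function that asks Gerrit to reindex one change over REST.

    Assumption: the server accepts HTTP Basic auth on the /a/ prefix.
    """
    token = base64.b64encode(f"{user}:{http_password}".encode()).decode()

    def reindex_one(change_id):
        quoted = urllib.parse.quote(str(change_id), safe="")
        url = f"{base_url.rstrip('/')}/a/changes/{quoted}/index"
        req = urllib.request.Request(url, data=b"", method="POST")
        req.add_header("Authorization", f"Basic {token}")
        # urlopen raises HTTPError for 4xx/5xx responses.
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.status

    return reindex_one


def reindex_all(change_ids, reindex_one, workers=30):
    """Call reindex_one for every id with at most `workers` requests in flight.

    Returns a list of (change_id, error) pairs for the ids that failed,
    so they can be retried.
    """
    def attempt(change_id):
        try:
            reindex_one(change_id)
            return None
        except Exception as exc:  # collect failures instead of aborting the run
            return (change_id, repr(exc))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [r for r in pool.map(attempt, change_ids) if r is not None]
```

A run would look like `failed = reindex_all(ids, make_rest_reindexer(url, user, password))`, where `ids` is the result of the query above; any ids left in `failed` can be passed through `reindex_all` again.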

Another option would be to run offline reindexing on the production instance but we
couldn't afford 10 hours of downtime.

Ideally we should find and fix the issue with the deep nesting of IndexReaders. Until then,
this info may save you quite some trouble.


Doug Kelly

May 2, 2016, 5:58:55 PM
to Repo and Gerrit Discussion
Thanks Sasa, we're actually just coming up on our upgrade to 2.12, so it sounds like taking the time for an offline reindex is in our best interest.  Thankfully, our instance isn't nearly as large... a full reindex on our development VM (much less powerful than the prod server) only took about four hours. :)

Luca Milanesio

May 3, 2016, 4:00:25 AM
to Repo and Gerrit Discussion, Doug Kelly, Zivkov, Sasa
We are using a different roll-out strategy, which uses the workaround described by Saša as our "standard rollout" procedure :-)
On-line reindexing can be somewhat painful as it consumes precious CPU cycles and, at peak times, may even be the indirect cause of an outage.

We used the zero-downtime roll-out described at:

We have over 21K projects and around 80K changes, which means the *full reindex* would take around 30 minutes for us. That is less than Saša's 10h or Doug's 4h, but still too much for GerritHub.io.
That's why we elaborated our "zero downtime" upgrade strategy, which does NOT include the on-line reindexing.

The *delta reindex* could be a very easy feature to implement as a "scripting plugin" on:

 ... I'll try to put something up so that people can reuse it in the future :-)

Luca.


Saša Živkov

May 3, 2016, 4:43:58 AM
to Doug Kelly, Repo and Gerrit Discussion
On Mon, May 2, 2016 at 11:58 PM, Doug Kelly <doug...@gmail.com> wrote:
Thanks Sasa, we're actually just coming up on our upgrade to 2.12, so it sounds like taking the time for an offline reindex is in our best interest.  Thankfully, our instance isn't nearly as large... a full reindex on our development VM (much less powerful than the prod server) only took about four hours. :)

Actually, our offline reindexing reached 90% in just about 1.5 hours (using all available CPUs on that machine).
However, the last 10% took more than 8 hours.
The reason is that there was one (or two) project(s) for which reindexing was very slow and
Gerrit runs reindexing utilizing one thread per project. Therefore, when everything but the last project
was reindexed we had only one thread running and indexing 2-3 changes per second.

This is something that could be improved, i.e. instead of using one thread per project, use one thread
per change.
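
The straggler effect described above can be illustrated with a small scheduling calculation. This is only a sketch under stated assumptions: the per-project times and the thread count in the usage note are hypothetical, the per-project schedule uses a simple longest-first greedy assignment rather than Gerrit's actual scheduler, and the per-change figure is an idealized lower bound that ignores any cache or pack-locality cost of splitting a project across threads.

```python
import heapq


def makespan_per_project(project_seconds, threads):
    """Elapsed time when each project is indexed by exactly one thread.

    Projects are assigned longest-first to the least loaded thread.
    """
    loads = [0.0] * threads
    heapq.heapify(loads)
    for seconds in sorted(project_seconds, reverse=True):
        # Give the next project to the thread that becomes free first.
        heapq.heappush(loads, heapq.heappop(loads) + seconds)
    return max(loads)


def makespan_per_change_ideal(project_seconds, threads):
    """Idealized lower bound if work could be split evenly across threads."""
    return sum(project_seconds) / threads
```

With hypothetical inputs such as `makespan_per_project([8, 1, 1, 1, 1], 4)`, the total elapsed time is bounded below by the slowest project (8 units), while `makespan_per_change_ideal([8, 1, 1, 1, 1], 4)` gives 3 units. The gap is the time during which only one thread is still working.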
 

Luca Milanesio

May 3, 2016, 5:14:18 AM
to Saša Živkov, Doug Kelly, Repo and Gerrit Discussion
Having multiple threads for the same repo would help ... we need to make sure that the threads would not block each other though.
We had the same issue with our repos: some of them were eating up 90% of the overall elapsed time :-(

Luca.

Saša Živkov

May 3, 2016, 6:54:55 AM
to Luca Milanesio, Doug Kelly, Repo and Gerrit Discussion
On Tue, May 3, 2016 at 11:14 AM, Luca Milanesio <luca.mi...@gmail.com> wrote:
Having multiple threads for the same repo would help ... we need to make sure that threads would not block each other though.

I don't think that they would block each other. Reindexing one change consists of loading its data,
creating a Lucene document from this data and writing (replacing) that document in the Lucene index.

Dave Borowitz

May 3, 2016, 10:02:35 AM
to Saša Živkov, Doug Kelly, Repo and Gerrit Discussion
On Tue, May 3, 2016 at 4:43 AM, Saša Živkov <ziv...@gmail.com> wrote:

This is something that could be improved i.e. instead of using one thread per project use one thread
per change.

This was very much an intentional decision. If you use one thread per change, computing the changed files is *extremely* thrashy on your JVM and kernel buffer cache. AllChangesIndexer walks the graph in topo order to get better pack locality.

I mean, you can try rewriting AllChangesIndexer to use a thread per change. I just don't think you're going to like the results.

Saša Živkov

May 3, 2016, 4:29:01 PM
to Dave Borowitz, Doug Kelly, Repo and Gerrit Discussion
On Tue, May 3, 2016 at 4:02 PM, Dave Borowitz <dbor...@google.com> wrote:
This was very much an intentional decision. If you use one thread per change, computing the changed files is *extremely* thrashy on your JVM and kernel buffer cache. AllChangesIndexer walks the graph in topo order to get better pack locality.

I see, makes sense.
 

I mean, you can try rewriting AllChangesIndexer to use a thread per change. I just don't think you're going to like the results.

Yes, there doesn't seem to be an obvious easy solution.