If you are running a large Gerrit server under high load, are considering upgrading to 2.12
*and* rely on online reindexing, then this info is important to you.
We did the upgrade last week and experienced some critical issues which we didn't see
when testing the upgrade on a test Gerrit server.
Due to an issue mentioned in [1] and discussed further in [2] and [3], we had to restart Gerrit
every 2-3 hours because an "NRT open/closed" thread died with a StackOverflowError.
The problem was caused by deep nesting of Lucene IndexReaders, which seems to
happen only under high load (which is why we didn't hit this issue on our test Gerrit server)
and only when Lucene 5 has to read/write Lucene version 4 indexes.
The Gerrit server was slow most of the time and our users kept complaining.
After every restart the online reindexing started from scratch and could not finish before
the next restart, so we had to look for other options.
We resolved the issue in the following way:
* create a copy of the production server on another machine at time T
* run offline reindexing on the copy
* when the offline reindexing is done:
  * shut down the production Gerrit server
  * copy the Lucene changes_0025 index from the copy to the production instance, but
    keep index version 14 active for now
  * start the production instance
  * find all changes which were updated (or created) since time T:
    "select change_id from changes where last_updated_on >= T"
    and reindex these changes via REST API calls. We ran 30 requests in parallel
    the whole time and it took about 10 minutes
  * stop the production Gerrit server
  * activate index version 25, deactivate version 14
  * start the production instance
* Done: the main problem is gone and Gerrit is smooth and fast again.
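
For anyone who wants to script the "reindex the changes updated since T" step, here is a
minimal sketch in Python. It is not the script we used. It assumes the change numbers from the
SQL query are already in a list, and that Gerrit's "Index Change" REST endpoint
(POST /changes/{change-id}/index) is reachable under the authenticated /a/ prefix.
GERRIT_URL is a placeholder, and authentication is deliberately left out: you would need to
add credentials for an account that is allowed to index changes (for example via a
urllib opener) before this works against a real server.

```python
import concurrent.futures
import urllib.request

# Placeholder value (assumption): replace with your own server URL.
GERRIT_URL = "https://gerrit.example.com"
# Number of parallel requests, matching the 30 mentioned in the steps above.
PARALLEL_REQUESTS = 30


def index_url(base_url, change_id):
    """Build the URL of the 'Index Change' REST endpoint for one change."""
    return f"{base_url.rstrip('/')}/a/changes/{change_id}/index"


def post_index(url):
    """Send an empty POST to trigger a reindex and return the HTTP status.

    No credentials are attached here (assumption: you add them yourself).
    """
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status


def reindex_all(change_ids, post=post_index, base_url=GERRIT_URL,
                workers=PARALLEL_REQUESTS):
    """Reindex every change in parallel; return the sorted ids that failed."""
    failed = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each submitted request back to its change id for error reporting.
        futures = {
            pool.submit(post, index_url(base_url, cid)): cid
            for cid in change_ids
        }
        for future in concurrent.futures.as_completed(futures):
            try:
                status = future.result()
                if status >= 300:
                    failed.append(futures[future])
            except Exception:
                # Network or HTTP errors: remember the change so it can be retried.
                failed.append(futures[future])
    return sorted(failed)
```

Calling reindex_all(ids) returns the changes that could not be reindexed, so they can be
retried before switching the active index version. The post function is a parameter so the
request logic (including whatever authentication your server needs) can be swapped out.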
Another option would have been to run offline reindexing on the production instance, but we
couldn't afford 10 hours of downtime.
Ideally we should find and fix the issue with the deep nesting of IndexReaders. Until then,
this info may save you quite some trouble.