Lucene index problem on heavily loaded gerrit server

820 views
Skip to first unread message

Krešimir Tonković

unread,
Jun 9, 2016, 10:50:50 AM6/9/16
to Repo and Gerrit Discussion
Gerrit 2.12.2

Ubuntu 15.04

java version "1.8.0_45" (Oracle)

On AWS


We're getting this exception:

2016-06-09 13:59:22,156] [SSH git-receive-pack '/product-configurator' (<uname>)] ERROR com.google.gerrit.server.git.ReceiveCommitsAdvertiseRefsHook : Cannot list open changes of <repo-name>

com.google.gwtorm.server.OrmException: java.io.IOException: null: NIOFSIndexInput(path="/mnt/ebs2/services/gerrit/index/changes_0025/open/_4g_Lucene50_0.tim")

        at com.google.gerrit.lucene.LuceneChangeIndex$QuerySource.read(LuceneChangeIndex.java:465)

        at com.google.gerrit.server.index.IndexedChangeQuery.read(IndexedChangeQuery.java:106)

        at com.google.gerrit.server.query.change.AndSource.readImpl(AndSource.java:115)

        at com.google.gerrit.server.query.change.AndSource.read(AndSource.java:99)

        at com.google.gerrit.server.query.change.QueryProcessor.queryChanges(QueryProcessor.java:162)

        at com.google.gerrit.server.query.change.QueryProcessor.queryChanges(QueryProcessor.java:103)

        at com.google.gerrit.server.query.change.QueryProcessor.queryChanges(QueryProcessor.java:87)

        at com.google.gerrit.server.query.change.InternalChangeQuery.query(InternalChangeQuery.java:245)

        at com.google.gerrit.server.query.change.InternalChangeQuery.byProjectOpen(InternalChangeQuery.java:207)

        at com.google.gerrit.server.git.ReceiveCommitsAdvertiseRefsHook.advertiseOpenChanges(ReceiveCommitsAdvertiseRefsHook.java:101)

        at com.google.gerrit.server.git.ReceiveCommitsAdvertiseRefsHook.advertiseRefs(ReceiveCommitsAdvertiseRefsHook.java:90)

        at org.eclipse.jgit.transport.AdvertiseRefsHookChain.advertiseRefs(AdvertiseRefsHookChain.java:85)

        at org.eclipse.jgit.transport.BaseReceivePack.sendAdvertisedRefs(BaseReceivePack.java:1042)

        at org.eclipse.jgit.transport.ReceivePack.service(ReceivePack.java:179)

        at org.eclipse.jgit.transport.ReceivePack.receive(ReceivePack.java:161)

....

Caused by: java.io.IOException: null: NIOFSIndexInput(path="/mnt/ebs2/services/gerrit/index/changes_0025/open/_4g_Lucene50_0.tim")

        at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:190)

        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)

        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)

        at org.apache.lucene.store.DataInput.readVInt(DataInput.java:125)

        at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)

        at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock(SegmentTermsEnumFrame.java:157)

        at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:507)

...

Caused by: java.nio.channels.ClosedByInterruptException

        at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)

        at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:713)

        at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:694)

        at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:180)

        ... 47 more


I suspect that the problem is caused by this scenario:

- get a loaded server (in our case load is between 8 and 9). The load is usually coming from gerrit running a lucene reindex after a restart

- try to push a change and impatiently interrupt the push (in our case, eclipse egit automatically interrupts the push after 30s of no reply from server because of the load)

Above exception will pop up in gerrit's error_log

When pushed from the command line (no timeout), there is no exception.


It seems to me that an interrupted connection should not break the connection to the Lucene database.


Any comments?

Kresimir

Shawn Pearce

unread,
Jun 9, 2016, 12:15:43 PM6/9/16
to Krešimir Tonković, Repo and Gerrit Discussion
No way.  You've got to be kidding me.

Gerrit had this exact bug in 2010. We fixed it by moving off NIO and using "old school" IO to read files from disk, because the SSH daemon inside of Gerrit absolutely sends interrupts to threads when users do things like disconnect.

In this case it looks like someone aborted a `git push` by pressing Ctrl-C on their workstation, the SSH daemon interrupted the work thread, the work thread was reading from the Lucene index, and the interrupt shot the index in the head and killed the system.



--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Luca Milanesio

unread,
Jun 10, 2016, 4:28:48 AM6/10/16
to Shawn Pearce, Krešimir Tonković, Repo and Gerrit Discussion
On 9 Jun 2016, at 17:15, 'Shawn Pearce' via Repo and Gerrit Discussion <repo-d...@googlegroups.com> wrote:

No way.  You've got to be kidding me.

Gerrit had this exact bug in 2010. We fixed it by moving off NIO and using "old school" IO to read files from disk, because the SSH daemon inside of Gerrit absolutely sends interrupts to threads when users do things like disconnect.

In this case it looks like someone aborted a `git push` by pressing Ctrl-C on their workstation, the SSH daemon interrupted the work thread, the work thread was reading from the Lucene index, and the interrupt shot the index in the head and killed the system.

Wow! ... I am assuming then that all Gerrit versions are impacted since we started using Lucene :-(

Luca.

Björn Pedersen

unread,
Jun 10, 2016, 4:58:19 AM6/10/16
to Repo and Gerrit Discussion, s...@google.com, kresimir...@gmail.com

Luca Milanesio

unread,
Jun 10, 2016, 5:05:41 AM6/10/16
to Björn Pedersen, Repo and Gerrit Discussion, Shawn Pearce, kresimir...@gmail.com
Yes, it should.

Krešimir Tonković

unread,
Jun 10, 2016, 9:01:37 AM6/10/16
to Repo and Gerrit Discussion, ice...@googlemail.com, s...@google.com, kresimir...@gmail.com
Thanks guys.

If I understand correctly, the change will make sure that there is a pool to the lucene database and if one connection dies another one will be used? Will there be an auto-reconnect option so that the pool would not get drained?

When is this change scheduled for release?

Thanks,
Kresimir

Björn Pedersen

unread,
Jun 10, 2016, 9:40:03 AM6/10/16
to Repo and Gerrit Discussion, ice...@googlemail.com, s...@google.com, kresimir...@gmail.com


Am Freitag, 10. Juni 2016 15:01:37 UTC+2 schrieb Krešimir Tonković:
Thanks guys.

If I understand correctly, the change will make sure that there is a pool to the lucene database and if one connection dies another one will be used? Will there be an auto-reconnect option so that the pool would not get drained?

No, it ensures the interrupt never reaches lucene index writing. So no risk of pool draining.
 

Bassem Rabil

unread,
Jun 10, 2016, 11:30:17 AM6/10/16
to Repo and Gerrit Discussion, ice...@googlemail.com, s...@google.com, kresimir...@gmail.com
You can workaround the issue in Gerrit 2.12.2 by setting the config parameter index.threads [1] to a value larger than 1 which used to be the default in Gerrit 2.12, but now in master branch  the default has changed to half number of CPUs +1.

[1] https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#index

Regards
Bassem

Hugo Arès

unread,
Jun 10, 2016, 12:54:29 PM6/10/16
to Repo and Gerrit Discussion, ice...@googlemail.com, s...@google.com, kresimir...@gmail.com
..to a value larger than 0...

Actually, the real default is 0 (documentation is wrong[1]) which means that indexing it not using a thread pool by default.

[1]https://gerrit-review.googlesource.com/#/c/78710

Krešimir Tonković

unread,
Jun 11, 2016, 5:09:02 AM6/11/16
to Repo and Gerrit Discussion, ice...@googlemail.com, s...@google.com, kresimir...@gmail.com
I have changed the number of threads to 2

[index] 

        type = LUCENE

        threads = 2


and tried reproducing the issue. I still got this exception:

[2016-06-11 08:58:21,306] [SSH git-receive-pack '/<repo name>' (kresimir.tonkovic)] ERROR com.google.gerrit.server.git.ReceiveCommitsAdvertiseRefsHook : Cannot list open changes of <repo name>

com.google.gwtorm.server.OrmException: java.io.IOException: null: NIOFSIndexInput(path="/mnt/ebs2/services/gerrit/index/changes_0025/open/_4g_Lucene50_0.tim")

        at com.google.gerrit.lucene.LuceneChangeIndex$QuerySource.read(LuceneChangeIndex.java:465)

        at com.google.gerrit.server.index.IndexedChangeQuery.read(IndexedChangeQuery.java:106)

        at com.google.gerrit.server.query.change.AndSource.readImpl(AndSource.java:115)

        at com.google.gerrit.server.query.change.AndSource.read(AndSource.java:99)

        at com.google.gerrit.server.query.change.QueryProcessor.queryChanges(QueryProcessor.java:162)

        at com.google.gerrit.server.query.change.QueryProcessor.queryChanges(QueryProcessor.java:103)

        at com.google.gerrit.server.query.change.QueryProcessor.queryChanges(QueryProcessor.java:87)

        at com.google.gerrit.server.query.change.InternalChangeQuery.query(InternalChangeQuery.java:245)

        at com.google.gerrit.server.query.change.InternalChangeQuery.byProjectOpen(InternalChangeQuery.java:207)

        at com.google.gerrit.server.git.ReceiveCommitsAdvertiseRefsHook.advertiseOpenChanges(ReceiveCommitsAdvertiseRefsHook.java:101)

        at com.google.gerrit.server.git.ReceiveCommitsAdvertiseRefsHook.advertiseRefs(ReceiveCommitsAdvertiseRefsHook.java:90)

...

Caused by: java.io.IOException: null: NIOFSIndexInput(path="/mnt/ebs2/services/gerrit/index/changes_0025/open/_4g_Lucene50_0.tim")

        at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:190)

        at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)

        at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)

        at org.apache.lucene.store.DataInput.readVInt(DataInput.java:125)

        at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)

...

Caused by: java.nio.channels.ClosedChannelException

        at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:109)

        at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:688)

        at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:180)

        ... 47 more


I interrupted 5 connections. It seems to me that the interrupt still reaches Lucene, but at least now it looks like it recovers and I can continue work without having to restart gerrit.


Thank you very much for your help!

Kresimir

Saša Živkov

unread,
Jun 17, 2016, 10:43:55 AM6/17/16
to Björn Pedersen, Repo and Gerrit Discussion, Shawn Pearce, kresimir...@gmail.com
On Fri, Jun 10, 2016 at 10:58 AM, 'Björn Pedersen' via Repo and Gerrit Discussion <repo-d...@googlegroups.com> wrote:

It protects from user canceling a push but it doesn't protect when the worker thread is killed
after reaching receive.timeout. See my comment in that change.

Shawn Pearce

unread,
Jun 20, 2016, 12:52:44 AM6/20/16
to Saša Živkov, Björn Pedersen, Repo and Gerrit Discussion, Krešimir Tonković
On Fri, Jun 17, 2016 at 7:42 AM, Saša Živkov <ziv...@gmail.com> wrote:


On Fri, Jun 10, 2016 at 10:58 AM, 'Björn Pedersen' via Repo and Gerrit Discussion <repo-d...@googlegroups.com> wrote:

It protects from user canceling a push but it doesn't protect when the worker thread is killed
after reaching receive.timeout. See my comment in that change.

(Run all Lucene queries on background threads)

This may help, as it addresses the case from the OP when a push is Ctrl-C'd part way through. Almost any Git wire protocol operation (including fetch and push) need to search the index for changes in that project to perform filtering of the refs/changes/ namespace when not all refs are readable by the client.


However I agree with Saša, this is still not enough. The indexAsync() futures can be cancelled by a calling thread, which may cause an interrupt in the wrong place within Lucene. We need to do more work to isolate Lucene from the interrupts.

Shawn Pearce

unread,
Jun 20, 2016, 3:04:26 AM6/20/16
to Saša Živkov, Björn Pedersen, Repo and Gerrit Discussion, Krešimir Tonković
On Sun, Jun 19, 2016 at 9:52 PM, Shawn Pearce <s...@google.com> wrote:
On Fri, Jun 17, 2016 at 7:42 AM, Saša Živkov <ziv...@gmail.com> wrote:


On Fri, Jun 10, 2016 at 10:58 AM, 'Björn Pedersen' via Repo and Gerrit Discussion <repo-d...@googlegroups.com> wrote:

It protects from user canceling a push but it doesn't protect when the worker thread is killed
after reaching receive.timeout. See my comment in that change.

(Run all Lucene queries on background threads)

This may help, as it addresses the case from the OP when a push is Ctrl-C'd part way through. Almost any Git wire protocol operation (including fetch and push) need to search the index for changes in that project to perform filtering of the refs/changes/ namespace when not all refs are readable by the client.


However I agree with Saša, this is still not enough. The indexAsync() futures can be cancelled by a calling thread, which may cause an interrupt in the wrong place within Lucene. We need to do more work to isolate Lucene from the interrupts.

https://gerrit-review.googlesource.com/79044 and its ancestors may be enough. :)
Reply all
Reply to author
Forward
0 new messages