Lucene index problem on heavily loaded gerrit server

Krešimir Tonković

unread,

Jun 9, 2016, 10:50:50 AM6/9/16

to Repo and Gerrit Discussion

Gerrit 2.12.2

Ubuntu 15.04

java version "1.8.0_45" (Oracle)

On AWS

We're getting this exception:

2016-06-09 13:59:22,156] [SSH git-receive-pack '/product-configurator' (<uname>)] ERROR com.google.gerrit.server.git.ReceiveCommitsAdvertiseRefsHook : Cannot list open changes of <repo-name>

com.google.gwtorm.server.OrmException: java.io.IOException: null: NIOFSIndexInput(path="/mnt/ebs2/services/gerrit/index/changes_0025/open/_4g_Lucene50_0.tim")

at com.google.gerrit.lucene.LuceneChangeIndex$QuerySource.read(LuceneChangeIndex.java:465)

at com.google.gerrit.server.index.IndexedChangeQuery.read(IndexedChangeQuery.java:106)

at com.google.gerrit.server.query.change.AndSource.readImpl(AndSource.java:115)

at com.google.gerrit.server.query.change.AndSource.read(AndSource.java:99)

at com.google.gerrit.server.query.change.QueryProcessor.queryChanges(QueryProcessor.java:162)

at com.google.gerrit.server.query.change.QueryProcessor.queryChanges(QueryProcessor.java:103)

at com.google.gerrit.server.query.change.QueryProcessor.queryChanges(QueryProcessor.java:87)

at com.google.gerrit.server.query.change.InternalChangeQuery.query(InternalChangeQuery.java:245)

at com.google.gerrit.server.query.change.InternalChangeQuery.byProjectOpen(InternalChangeQuery.java:207)

at com.google.gerrit.server.git.ReceiveCommitsAdvertiseRefsHook.advertiseOpenChanges(ReceiveCommitsAdvertiseRefsHook.java:101)

at com.google.gerrit.server.git.ReceiveCommitsAdvertiseRefsHook.advertiseRefs(ReceiveCommitsAdvertiseRefsHook.java:90)

at org.eclipse.jgit.transport.AdvertiseRefsHookChain.advertiseRefs(AdvertiseRefsHookChain.java:85)

at org.eclipse.jgit.transport.BaseReceivePack.sendAdvertisedRefs(BaseReceivePack.java:1042)

at org.eclipse.jgit.transport.ReceivePack.service(ReceivePack.java:179)

at org.eclipse.jgit.transport.ReceivePack.receive(ReceivePack.java:161)

....

Caused by: java.io.IOException: null: NIOFSIndexInput(path="/mnt/ebs2/services/gerrit/index/changes_0025/open/_4g_Lucene50_0.tim")

at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:190)

at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:342)

at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:54)

at org.apache.lucene.store.DataInput.readVInt(DataInput.java:125)

at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:221)

at org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock(SegmentTermsEnumFrame.java:157)

at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:507)

...

Caused by: java.nio.channels.ClosedByInterruptException

at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)

at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:713)

at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:694)

at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:180)

... 47 more

I suspect that the problem is caused by this scenario:

- get a loaded server (in our case load is between 8 and 9). The load is usually coming from gerrit running a lucene reindex after a restart

- try to push a change and impatiently interrupt the push (in our case, eclipse egit automatically interrupts the push after 30s of no reply from server because of the load)

Above exception will pop up in gerrit's error_log

When pushed from the command line (no timeout), there is no exception.

It seems to me that an interrupted connection should not break the connection to the Lucene database.

Any comments?

Kresimir

Shawn Pearce

unread,

Jun 9, 2016, 12:15:43 PM6/9/16

to Krešimir Tonković, Repo and Gerrit Discussion

No way. You've got to be kidding me.

http://blog.spearce.org/2010/05/dont-assume-you-know-whats-best.html

Gerrit had this exact bug in 2010. We fixed it by moving off NIO and using "old school" IO to read files from disk, because the SSH daemon inside of Gerrit absolutely sends interrupts to threads when users do things like disconnect.

In this case it looks like someone aborted a `git push` by pressing Ctrl-C on their workstation, the SSH daemon interrupted the work thread, the work thread was reading from the Lucene index, and the interrupt shot the index in the head and killed the system.

--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Luca Milanesio

unread,

Jun 10, 2016, 4:28:48 AM6/10/16

to Shawn Pearce, Krešimir Tonković, Repo and Gerrit Discussion

On 9 Jun 2016, at 17:15, 'Shawn Pearce' via Repo and Gerrit Discussion <repo-d...@googlegroups.com> wrote:

No way. You've got to be kidding me.
http://blog.spearce.org/2010/05/dont-assume-you-know-whats-best.html

Gerrit had this exact bug in 2010. We fixed it by moving off NIO and using "old school" IO to read files from disk, because the SSH daemon inside of Gerrit absolutely sends interrupts to threads when users do things like disconnect.

In this case it looks like someone aborted a `git push` by pressing Ctrl-C on their workstation, the SSH daemon interrupted the work thread, the work thread was reading from the Lucene index, and the interrupt shot the index in the head and killed the system.

Wow! ... I am assuming then that all Gerrit versions are impacted since we started using Lucene :-(

Luca.

Björn Pedersen

unread,

Jun 10, 2016, 4:58:19 AM6/10/16

to Repo and Gerrit Discussion, s...@google.com, kresimir...@gmail.com

https://gerrit-review.googlesource.com/#/c/77380/ seems to address this problem?

Luca Milanesio

unread,

Jun 10, 2016, 5:05:41 AM6/10/16

to Björn Pedersen, Repo and Gerrit Discussion, Shawn Pearce, kresimir...@gmail.com

Yes, it should.

Krešimir Tonković

unread,

Jun 10, 2016, 9:01:37 AM6/10/16

to Repo and Gerrit Discussion, ice...@googlemail.com, s...@google.com, kresimir...@gmail.com

Thanks guys.

If I understand correctly, the change will make sure that there is a pool to the lucene database and if one connection dies another one will be used? Will there be an auto-reconnect option so that the pool would not get drained?

When is this change scheduled for release?

Thanks,

Kresimir

Björn Pedersen

unread,

Jun 10, 2016, 9:40:03 AM6/10/16

to Repo and Gerrit Discussion, ice...@googlemail.com, s...@google.com, kresimir...@gmail.com

Am Freitag, 10. Juni 2016 15:01:37 UTC+2 schrieb Krešimir Tonković:

Thanks guys.

If I understand correctly, the change will make sure that there is a pool to the lucene database and if one connection dies another one will be used? Will there be an auto-reconnect option so that the pool would not get drained?

No, it ensures the interrupt never reaches lucene index writing. So no risk of pool draining.

Bassem Rabil

unread,

Jun 10, 2016, 11:30:17 AM6/10/16

to Repo and Gerrit Discussion, ice...@googlemail.com, s...@google.com, kresimir...@gmail.com

You can workaround the issue in Gerrit 2.12.2 by setting the config parameter index.threads [1] to a value larger than 1 which used to be the default in Gerrit 2.12, but now in master branch the default has changed to half number of CPUs +1.

[1] https://gerrit-review.googlesource.com/Documentation/config-gerrit.html#index

Regards

Bassem

Hugo Arès

unread,

Jun 10, 2016, 12:54:29 PM6/10/16

to Repo and Gerrit Discussion, ice...@googlemail.com, s...@google.com, kresimir...@gmail.com

..to a value larger than 0...

Actually, the real default is 0 (documentation is wrong[1]) which means that indexing it not using a thread pool by default.

[1]https://gerrit-review.googlesource.com/#/c/78710

Krešimir Tonković

unread,

Jun 11, 2016, 5:09:02 AM6/11/16

to Repo and Gerrit Discussion, ice...@googlemail.com, s...@google.com, kresimir...@gmail.com

I have changed the number of threads to 2

[index]

type = LUCENE

threads = 2

and tried reproducing the issue. I still got this exception:

[2016-06-11 08:58:21,306] [SSH git-receive-pack '/<repo name>' (kresimir.tonkovic)] ERROR com.google.gerrit.server.git.ReceiveCommitsAdvertiseRefsHook : Cannot list open changes of <repo name>