Lucene index corrupted from high server load!

Sven Selberg

unread,

Feb 11, 2014, 2:26:04 AM2/11/14

to repo-d...@googlegroups.com

Hi!

We had an incident a couple of days ago when our gerrit-server went haywire after a network glitch. The server ran at 100% for a long time mainly writing network-error-messages to the logs.

Gerrit never recovered from this and we needed to restart, and Gerrit was up and running, except for the Lucene index.

After restart Gerrit seemed to miss the index for all changes made between index-creation (on upgrade to 2.8.1) and restart.

Copying the index to a local machine and attempting to read it with Luke or CheckIndex ended with a “read past EOF” IOException. From what I could tell the index had been corrupted.

After checking the Lucene-code in Gerrit I noticed that the IndexWriters don’t seem to be closed until Gerrit is stopped. Our main lead is that this, in correlation with the issues that high server-load caused, has led to the corrupt index.

I’m guessing that, if indexwriters are indeed only closed when Gerrit is stopped, this has been done to improve performance. However, all tutorials and howto’s I’ve read recommend that writers are closed after every operation.

Any thoughts about my assupmptions?

Should we reconsider this part of the Lucene implementation?

David Ostrovsky

unread,

Feb 11, 2014, 3:14:48 AM2/11/14

to repo-d...@googlegroups.com

On Tuesday, February 11, 2014 8:26:04 AM UTC+1, Sven Selberg wrote:

Hi!

We had an incident a couple of days ago when our gerrit-server went haywire after a network glitch. The server ran at 100% for a long time mainly writing network-error-messages to the logs.

Gerrit never recovered from this and we needed to restart, and Gerrit was up and running, except for the Lucene index.

I am seeing this all the time during Gerrit development:

1. start Gerrit in Debugger

2. upload a new change: it is shown in the change table

3. kill debugger

4. restart Gerrit

The change is missing in the table. Reindex fixes that.

If it is considered to be a feature and not a bug, it is annoying and should be

at least configurable, to enforce index flushes after every write operation.

Sven Selberg

unread,

Feb 11, 2014, 7:10:44 AM2/11/14

to repo-d...@googlegroups.com

It's just that reindex takes aproximately 330 min in our live-environment, it's impossible for us to have that kind of downtime.

I'm thinking there must be a way to implement Lucene that doesn't corrupt the entire index if this happens.

/Sven

Dave Borowitz

unread,

Feb 11, 2014, 10:50:40 AM2/11/14

to Sven Selberg, repo-discuss

I doubt you want this from a performance perspective but it's already configurable, just set index.lucene.maxBufferedDocs to 0.

In any case this is not the problem Sven is describing: dropped writes generally leave the index in a readable, un-corrupt state.

--
--
To unsubscribe, email repo-discuss...@googlegroups.com
More info at http://groups.google.com/group/repo-discuss?hl=en

---
You received this message because you are subscribed to the Google Groups "Repo and Gerrit Discussion" group.
To unsubscribe from this group and stop receiving emails from it, send an email to repo-discuss...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Dave Borowitz

unread,

Feb 11, 2014, 10:56:04 AM2/11/14

to Sven Selberg, repo-discuss

On Mon, Feb 10, 2014 at 11:26 PM, Sven Selberg <sven.s...@sonymobile.com> wrote:

Hi!

We had an incident a couple of days ago when our gerrit-server went haywire after a network glitch. The server ran at 100% for a long time mainly writing network-error-messages to the logs.

Gerrit never recovered from this and we needed to restart, and Gerrit was up and running, except for the Lucene index.

After restart Gerrit seemed to miss the index for all changes made between index-creation (on upgrade to 2.8.1) and restart.

Do you have a sense of approximately how many writes were dropped? The default buffer size is the Lucene default of 16MiB, but it's possible each write is only, say, under 1KiB in which case thousands of writes might be buffered.

We should definitely change the default if it's causing real-world problems.

Copying the index to a local machine and attempting to read it with Luke or CheckIndex ended with a “read past EOF” IOException. From what I could tell the index had been corrupted.

After checking the Lucene-code in Gerrit I noticed that the IndexWriters don’t seem to be closed until Gerrit is stopped. Our main lead is that this, in correlation with the issues that high server-load caused, has led to the corrupt index.

I’m guessing that, if indexwriters are indeed only closed when Gerrit is stopped, this has been done to improve performance. However, all tutorials and howto’s I’ve read recommend that writers are closed after every operation.

I would be surprised if they don't include some caveats like "unless you care about performance."

And again, dropped writes != corruption, so it seems like there is something else going on here. Perhaps it died in the middle of the write? Have you googled the specific error message you've gotten?

Any thoughts about my assupmptions?
Should we reconsider this part of the Lucene implementation?

The only thing I think we don't do that we should is flush writes periodically (say, once every few minutes) even if the buffer size hasn't been reached.

Shawn Pearce

unread,

Feb 11, 2014, 12:58:34 PM2/11/14

to Sven Selberg, repo-discuss

On Tue, Feb 11, 2014 at 4:10 AM, Sven Selberg <sven.s...@sonymobile.com> wrote:

It's just that reindex takes aproximately 330 min in our live-environment, it's impossible for us to have that kind of downtime.
I'm thinking there must be a way to implement Lucene that doesn't corrupt the entire index if this happens.

I agree, 330 minutes of downtime is unacceptable. That is a 5.5 hour indexing operation!

IIRC the indexing runs multi-threaded, but on one machine. If it was better parallelized and you had more hardware you could pull the time in, but even with 5 machines you are talking about more than an hour of downtime, which I think is still bordering on unacceptable. Especially if it was unplanned.

It should be possible to run reindex while the server is online. The Lucene glue is supposed to know how to build up the most recent version of the index in the background while the server stays up answering requests. Unfortunately with no valid index the dashboards are all unavailable, and there is enough important functionality missing that you might as well have the server down.

I think the structure of a Lucene index on disk is designed to be resilient to corruption. You could lose recent updates, but you should never lose the entire index. If it was only discarding a few recent updates and then re-indexing any recently modified change in the background after restart, the server should be able to repair itself and get caught up within minutes. But that didn't work here.

Dave Borowitz

unread,

Feb 11, 2014, 1:03:45 PM2/11/14

to Shawn Pearce, Sven Selberg, repo-discuss

On Tue, Feb 11, 2014 at 9:58 AM, Shawn Pearce <s...@google.com> wrote:

On Tue, Feb 11, 2014 at 4:10 AM, Sven Selberg <sven.s...@sonymobile.com> wrote:

It's just that reindex takes aproximately 330 min in our live-environment, it's impossible for us to have that kind of downtime.
I'm thinking there must be a way to implement Lucene that doesn't corrupt the entire index if this happens.

I agree, 330 minutes of downtime is unacceptable. That is a 5.5 hour indexing operation!

IIRC the indexing runs multi-threaded, but on one machine. If it was better parallelized and you had more hardware you could pull the time in, but even with 5 machines you are talking about more than an hour of downtime, which I think is still bordering on unacceptable.

This would require some extra work on our part, because on-disk Lucene indexes can really only be accessed from one machine. (Solr Cloud is a different story.) So we would need to serialize the writes each parallel machine would make and apply them from the main server. (Mapreduce with a single reducer.)

Especially if it was unplanned.

It should be possible to run reindex while the server is online. The Lucene glue is supposed to know how to build up the most recent version of the index in the background while the server stays up answering requests. Unfortunately with no valid index the dashboards are all unavailable, and there is enough important functionality missing that you might as well have the server down.

I think the structure of a Lucene index on disk is designed to be resilient to corruption. You could lose recent updates, but you should never lose the entire index.

Right. This is why I suggested searching for info on the specific error message. Maybe it's a known issue.

If it was only discarding a few recent updates and then re-indexing any recently modified change in the background after restart, the server should be able to repair itself and get caught up within minutes. But that didn't work here.

s/didn't work/wouldn't have worked/. We've discussed this before as a possibility, but the way you phrased it makes it sound like we've implemented it, which we haven't.

Shawn Pearce

unread,

Feb 11, 2014, 1:26:27 PM2/11/14

to Dave Borowitz, Sven Selberg, repo-discuss

On Tue, Feb 11, 2014 at 10:03 AM, Dave Borowitz <dbor...@google.com> wrote:

On Tue, Feb 11, 2014 at 9:58 AM, Shawn Pearce <s...@google.com> wrote:

On Tue, Feb 11, 2014 at 4:10 AM, Sven Selberg <sven.s...@sonymobile.com> wrote:

It's just that reindex takes aproximately 330 min in our live-environment, it's impossible for us to have that kind of downtime.
I'm thinking there must be a way to implement Lucene that doesn't corrupt the entire index if this happens.

I agree, 330 minutes of downtime is unacceptable. That is a 5.5 hour indexing operation!

IIRC the indexing runs multi-threaded, but on one machine. If it was better parallelized and you had more hardware you could pull the time in, but even with 5 machines you are talking about more than an hour of downtime, which I think is still bordering on unacceptable.

This would require some extra work on our part, because on-disk Lucene indexes can really only be accessed from one machine. (Solr Cloud is a different story.) So we would need to serialize the writes each parallel machine would make and apply them from the main server. (Mapreduce with a single reducer.)

Or let each machine build its own private section of the index, rsync the files to the master, and run a merge. Lucene has a tool to merge indexes together.

Especially if it was unplanned.

It should be possible to run reindex while the server is online. The Lucene glue is supposed to know how to build up the most recent version of the index in the background while the server stays up answering requests. Unfortunately with no valid index the dashboards are all unavailable, and there is enough important functionality missing that you might as well have the server down.

I think the structure of a Lucene index on disk is designed to be resilient to corruption. You could lose recent updates, but you should never lose the entire index.

Right. This is why I suggested searching for info on the specific error message. Maybe it's a known issue.

If it was only discarding a few recent updates and then re-indexing any recently modified change in the background after restart, the server should be able to repair itself and get caught up within minutes. But that didn't work here.

s/didn't work/wouldn't have worked/. We've discussed this before as a possibility, but the way you phrased it makes it sound like we've implemented it, which we haven't.

Er, yes, that is what I meant. We could do this... :-)

Sven Selberg

unread,

Feb 12, 2014, 5:24:47 AM2/12/14

to repo-d...@googlegroups.com, Sven Selberg

On Tuesday, February 11, 2014 4:50:40 PM UTC+1, Dave Borowitz wrote:

I doubt you want this from a performance perspective but it's already configurable, just set index.lucene.maxBufferedDocs to 0.

I cannot find anything pointing towards that maxBufferedDocs = 0 should trigger an autocommit in the IndexWriter in the constant field values [1]

Neither am I able to find where Gerrit handlas maxBufferedDocs = 0.

In any case this is not the problem Sven is describing: dropped writes generally leave the index in a readable, un-corrupt state.

What exactly do you refer to when you say "dropped writes"?

[1]

http://lucene.apache.org/core/4_0_0/core/constant-values.html#org.apache.lucene.index.IndexWriter

David Ostrovsky

unread,

Feb 12, 2014, 6:05:49 AM2/12/14

to repo-d...@googlegroups.com, Sven Selberg

On Wednesday, February 12, 2014 11:24:47 AM UTC+1, Sven Selberg wrote:

On Tuesday, February 11, 2014 4:50:40 PM UTC+1, Dave Borowitz wrote:
I doubt you want this from a performance perspective but it's already configurable, just set index.lucene.maxBufferedDocs to 0.
I cannot find anything pointing towards that maxBufferedDocs = 0 should trigger an autocommit in the IndexWriter in the constant field values [1]

Apparently my search engine is better then yours [1] ;-)

Neither am I able to find where Gerrit handlas maxBufferedDocs = 0.

How that [2]?

writerConfig.setMaxBufferedDocs(cfg.getInt("index", name, "maxBufferedDocs",

IndexWriterConfig.DEFAULT_MAX_BUFFERED_DOCS));

In any case this is not the problem Sven is describing: dropped writes generally leave the index in a readable, un-corrupt state.

What exactly do you refer to when you say "dropped writes"?

I think what Dave was trying to explain is that

* your corrupted index data on the disk

* and my not flushed RAM writes to the disk (aka as "dropped writes") due to killing the debugger

are completely different problems and have nothing in common.

[1] http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/index/IndexWriterConfig.html#setMaxBufferedDocs(int)

[2] https://gerrit.googlesource.com/gerrit/+/master/gerrit-lucene/src/main/java/com/google/gerrit/lucene/LuceneChangeIndex.java

Sven Selberg

unread,

Feb 12, 2014, 6:25:29 AM2/12/14

to repo-d...@googlegroups.com, Sven Selberg

On Tuesday, February 11, 2014 4:56:04 PM UTC+1, Dave Borowitz wrote:

On Mon, Feb 10, 2014 at 11:26 PM, Sven Selberg <sven.s...@sonymobile.com> wrote:

Hi!

We had an incident a couple of days ago when our gerrit-server went haywire after a network glitch. The server ran at 100% for a long time mainly writing network-error-messages to the logs.

Gerrit never recovered from this and we needed to restart, and Gerrit was up and running, except for the Lucene index.

After restart Gerrit seemed to miss the index for all changes made between index-creation (on upgrade to 2.8.1) and restart.

Do you have a sense of approximately how many writes were dropped? The default buffer size is the Lucene default of 16MiB, but it's possible each write is only, say, under 1KiB in which case thousands of writes might be buffered.

We should definitely change the default if it's causing real-world problems.

Our config:

changes_open:

ramBufferSize = 60 m

maxBufferedDocs = 20

changes_closed:

ramBufferSize = 20 m

maxBufferedDocs = 10

So there shouldn't be more than 20 docs in the open-buffer and 10 docs in the closed-buffer as per the first-come-first-served policy of the buffer flush-policy.

Copying the index to a local machine and attempting to read it with Luke or CheckIndex ended with a “read past EOF” IOException. From what I could tell the index had been corrupted.

After checking the Lucene-code in Gerrit I noticed that the IndexWriters don’t seem to be closed until Gerrit is stopped. Our main lead is that this, in correlation with the issues that high server-load caused, has led to the corrupt index.

I’m guessing that, if indexwriters are indeed only closed when Gerrit is stopped, this has been done to improve performance. However, all tutorials and howto’s I’ve read recommend that writers are closed after every operation.

I would be surprised if they don't include some caveats like "unless you care about performance."

And again, dropped writes != corruption, so it seems like there is something else going on here. Perhaps it died in the middle of the write? Have you googled the specific error message you've gotten?

There are no traces of a Lucene malfunction in the logs.

The error message that I get when trying to read the index with Luke or CheckIndex is "read past EOF". Googling this the most common suggestion was to run "CheckIndes --fix". This however, not surprisingly, resulted in a "read past EOF" as well.

Any thoughts about my assupmptions?
Should we reconsider this part of the Lucene implementation?

The only thing I think we don't do that we should is flush writes periodically (say, once every few minutes) even if the buffer size hasn't been reached.

This is a good suggestion but I dont see how it would be helpful in our case, it is highly unlikely that we didn't reach maxBufferedDocs many thousands times over in the period between upgrading to 2.8.1 (creating the index) and the incident.

Sven Selberg

unread,

Feb 12, 2014, 6:37:14 AM2/12/14

to repo-d...@googlegroups.com, Sven Selberg

Thinking about this makes me sad. You're saying that if we left it enabled, there was a possibility that the index would have repaired itself? Not something you consider when you have an image in your head of 1000+ developers coming at you with sharp, and blunt, objects...

We didn't loose the entire index, just the changes post-index-creation and prior-to-restart (a couple of weeks worth of changes). Which contradicts the "read past EOF" error pointing towards a totally corrupt index.

Sven Selberg

unread,

Feb 12, 2014, 6:40:30 AM2/12/14

to repo-d...@googlegroups.com, Shawn Pearce, Sven Selberg

On Tuesday, February 11, 2014 7:03:45 PM UTC+1, Dave Borowitz wrote:

On Tue, Feb 11, 2014 at 9:58 AM, Shawn Pearce <s...@google.com> wrote:

On Tue, Feb 11, 2014 at 4:10 AM, Sven Selberg <sven.s...@sonymobile.com> wrote:

It's just that reindex takes aproximately 330 min in our live-environment, it's impossible for us to have that kind of downtime.
I'm thinking there must be a way to implement Lucene that doesn't corrupt the entire index if this happens.

I agree, 330 minutes of downtime is unacceptable. That is a 5.5 hour indexing operation!

IIRC the indexing runs multi-threaded, but on one machine. If it was better parallelized and you had more hardware you could pull the time in, but even with 5 machines you are talking about more than an hour of downtime, which I think is still bordering on unacceptable.

This would require some extra work on our part, because on-disk Lucene indexes can really only be accessed from one machine. (Solr Cloud is a different story.) So we would need to serialize the writes each parallel machine would make and apply them from the main server. (Mapreduce with a single reducer.)

We have been discussing moving to a Solr solution instead, but we feel we need to understand what went wrong here before taking it under serious consideration.

Especially if it was unplanned.

It should be possible to run reindex while the server is online. The Lucene glue is supposed to know how to build up the most recent version of the index in the background while the server stays up answering requests. Unfortunately with no valid index the dashboards are all unavailable, and there is enough important functionality missing that you might as well have the server down.

I think the structure of a Lucene index on disk is designed to be resilient to corruption. You could lose recent updates, but you should never lose the entire index.

Right. This is why I suggested searching for info on the specific error message. Maybe it's a known issue.

See previous post.

Sven Selberg

unread,

Feb 12, 2014, 7:03:29 AM2/12/14

to repo-d...@googlegroups.com, Sven Selberg

On Wednesday, February 12, 2014 12:05:49 PM UTC+1, David Ostrovsky wrote:

On Wednesday, February 12, 2014 11:24:47 AM UTC+1, Sven Selberg wrote:

On Tuesday, February 11, 2014 4:50:40 PM UTC+1, Dave Borowitz wrote:
I doubt you want this from a performance perspective but it's already configurable, just set index.lucene.maxBufferedDocs to 0.
I cannot find anything pointing towards that maxBufferedDocs = 0 should trigger an autocommit in the IndexWriter in the constant field values [1]

Apparently my search engine is better then yours [1] ;-)

Not only that, your eyesight seems to be superior to mine as well :-)

I can't find anything here pointing towards maxBufferedDocs == 0 triggering a autocommit-behavior. You might read between the lines that this value would trigger an flush after each document, but if our problem was due to documents not being flushed we would miss about 20, not two weeks worth.

Please explain.

Neither am I able to find where Gerrit handlas maxBufferedDocs = 0.

How that [2]?

All i see is taking the value from gerrit.config and populating a writerconfig with those values and passing it on to the IndexWriter ctor. As I stated earlier, I can't see where the special case of maxBufferedDocs == 0 being handled in this piece of code. Please help me understand.

writerConfig.setMaxBufferedDocs(cfg.getInt("index", name, "maxBufferedDocs",
IndexWriterConfig.DEFAULT_MAX_BUFFERED_DOCS));

In any case this is not the problem Sven is describing: dropped writes generally leave the index in a readable, un-corrupt state.
What exactly do you refer to when you say "dropped writes"?

I think what Dave was trying to explain is that

* your corrupted index data on the disk
* and my not flushed RAM writes to the disk (aka as "dropped writes") due to killing the debugger

are completely different problems and have nothing in common.

I concur, if the buffer limits config settings had effect (cannot find anything in Gerrit-code that would suggest they did not, also we could see the files of the index had been updated)

then there would have been a lot of flushing going on, so un-flushed changes in the buffer would have been limited and wouldn't have the effect we experienced.

From where I'm standing this does not look like a flush issue, more like a not-commiting-or-closing-indexwriters-after-addDocument()-call sort of affair.

David Ostrovsky

unread,

Feb 12, 2014, 7:59:13 AM2/12/14

to repo-d...@googlegroups.com, Sven Selberg

On Wednesday, February 12, 2014 1:03:29 PM UTC+1, Sven Selberg wrote:

On Wednesday, February 12, 2014 12:05:49 PM UTC+1, David Ostrovsky wrote:

On Wednesday, February 12, 2014 11:24:47 AM UTC+1, Sven Selberg wrote:

On Tuesday, February 11, 2014 4:50:40 PM UTC+1, Dave Borowitz wrote:
I doubt you want this from a performance perspective but it's already configurable, just set index.lucene.maxBufferedDocs to 0.
I cannot find anything pointing towards that maxBufferedDocs = 0 should trigger an autocommit in the IndexWriter in the constant field values [1]

Apparently my search engine is better then yours [1] ;-)

Not only that, your eyesight seems to be superior to mine as well :-)
I can't find anything here pointing towards maxBufferedDocs == 0 triggering a autocommit-behavior. You might read between the lines that this value would trigger an flush after each document, but if our problem was due to documents not being flushed we would miss about 20, not two weeks worth.
Please explain.

OK, i see your point. It is stated that: [1].

"Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called."

So are we missing the IndexWriter.commit() call?

Neither am I able to find where Gerrit handlas maxBufferedDocs = 0.

How that [2]?

All i see is taking the value from gerrit.config and populating a writerconfig with those values and passing it on to the IndexWriter ctor. As I stated earlier, I can't see where the special case of maxBufferedDocs == 0 being handled in this piece of code. Please help me understand.

Right, was assuming, that the special handling happens on the Lucene site.

[1] http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/index/IndexWriter.html

Sven Selberg

unread,

Feb 12, 2014, 8:57:31 AM2/12/14

to repo-d...@googlegroups.com, Sven Selberg

On Wednesday, February 12, 2014 1:59:13 PM UTC+1, David Ostrovsky wrote:

On Wednesday, February 12, 2014 1:03:29 PM UTC+1, Sven Selberg wrote:

On Wednesday, February 12, 2014 12:05:49 PM UTC+1, David Ostrovsky wrote:

On Wednesday, February 12, 2014 11:24:47 AM UTC+1, Sven Selberg wrote:

On Tuesday, February 11, 2014 4:50:40 PM UTC+1, Dave Borowitz wrote:
I doubt you want this from a performance perspective but it's already configurable, just set index.lucene.maxBufferedDocs to 0.
I cannot find anything pointing towards that maxBufferedDocs = 0 should trigger an autocommit in the IndexWriter in the constant field values [1]

Apparently my search engine is better then yours [1] ;-)

Not only that, your eyesight seems to be superior to mine as well :-)
I can't find anything here pointing towards maxBufferedDocs == 0 triggering a autocommit-behavior. You might read between the lines that this value would trigger an flush after each document, but if our problem was due to documents not being flushed we would miss about 20, not two weeks worth.
Please explain.

OK, i see your point. It is stated that: [1].

"Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called."

So are we missing the IndexWriter.commit() call?

Thank you!

This is what I suspect,

We are running some tests to see if commit() or close() callsf, after adding document, would make this problem go away. If it does, we we'll run some tests to try to find out what sort of effect this has on performance.

Just wanted to make everybody aware that this was a potential Gerrit issue. There have been a similar issue with lost changes (can't find it), and that was considered to be a question of too large buffers.

Lucene index is a great feature, we love it. We apprecite Dave's efforts tremendously. Whether it's a Gerrit issue or an issue with our set up, we just want to fix this as soon as posible so that we can get the index up and running again.

I was just trying to make sure we were not barking up the wrong tree and that this wasn't merely an issue that manifests itself in our specific Gerrit set-up.

Neither am I able to find where Gerrit handlas maxBufferedDocs = 0.

How that [2]?

All i see is taking the value from gerrit.config and populating a writerconfig with those values and passing it on to the IndexWriter ctor. As I stated earlier, I can't see where the special case of maxBufferedDocs == 0 being handled in this piece of code. Please help me understand.

Right, was assuming, that the special handling happens on the Lucene site.

[1] http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/index/IndexWriter.html

...and also thank you all for all the suggestions and clarifications. Forgot that important part, my bad...

Sven

Dave Borowitz

unread,

Feb 12, 2014, 12:09:02 PM2/12/14

to Sven Selberg, repo-discuss

On Wed, Feb 12, 2014 at 5:57 AM, Sven Selberg <sven.s...@sonymobile.com> wrote:

On Wednesday, February 12, 2014 1:59:13 PM UTC+1, David Ostrovsky wrote:

On Wednesday, February 12, 2014 1:03:29 PM UTC+1, Sven Selberg wrote:

On Wednesday, February 12, 2014 12:05:49 PM UTC+1, David Ostrovsky wrote:

On Wednesday, February 12, 2014 11:24:47 AM UTC+1, Sven Selberg wrote:

On Tuesday, February 11, 2014 4:50:40 PM UTC+1, Dave Borowitz wrote:

I doubt you want this from a performance perspective but it's already configurable, just set index.lucene.maxBufferedDocs to 0.

I cannot find anything pointing towards that maxBufferedDocs = 0 should trigger an autocommit in the IndexWriter in the constant field values [1]

Apparently my search engine is better then yours [1] ;-)

Not only that, your eyesight seems to be superior to mine as well :-)
I can't find anything here pointing towards maxBufferedDocs == 0 triggering a autocommit-behavior. You might read between the lines that this value would trigger an flush after each document, but if our problem was due to documents not being flushed we would miss about 20, not two weeks worth.

Please explain.

OK, i see your point. It is stated that: [1].

"Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called."

So are we missing the IndexWriter.commit() call?

Thank you!
This is what I suspect,
We are running some tests to see if commit() or close() callsf, after adding document, would make this problem go away. If it does, we we'll run some tests to try to find out what sort of effect this has on performance.

Just wanted to make everybody aware that this was a potential Gerrit issue. There have been a similar issue with lost changes (can't find it), and that was considered to be a question of too large buffers.

Lucene index is a great feature, we love it. We apprecite Dave's efforts tremendously. Whether it's a Gerrit issue or an issue with our set up, we just want to fix this as soon as posible so that we can get the index up and running again.

I was just trying to make sure we were not barking up the wrong tree and that this wasn't merely an issue that manifests itself in our specific Gerrit set-up.

I think see my misunderstanding here, and why this problem has not come up more frequently. This is my bad in understanding the Lucene docs.

The max buffered docs/RAM size trigger a _flush_ in the writer when exceeded, but not a _commit_. IIUC the key difference here is that flush may write some files, but commit ensures they're fsync'ed. So under normal disk load flushed documents will become permanent relatively quickly, so you'd see the index size grow over time with a running server. But under heavy load like you describe, the fsync might not happen for an indefinite period of time.

Lucene doesn't give particularly good tools for strategizing when to commit. I think we need to implement something like Solr's CommitWithin policy to ensure commits don't fall too behind. Or just auto-commit on every flush.

I still don't think we want to commit after every write by default. Under load there is no telling how how long an fsync will take (think seconds), during which the IndexWriter will block accepting any writes. (But if not there is still some risk of changes getting stale, hence the idea of automatically detecting and fixing up stale changes.) And if we implement CommitWithin, we can at least handle the 0 case.

Honestly, I still don't know what's going on with your index corruption. It might be the case that if Lucene crashes in the middle of some kinds of writes this can happen.

[1] http://wiki.apache.org/solr/CommitWithin

Neither am I able to find where Gerrit handlas maxBufferedDocs = 0.

How that [2]?

All i see is taking the value from gerrit.config and populating a writerconfig with those values and passing it on to the IndexWriter ctor. As I stated earlier, I can't see where the special case of maxBufferedDocs == 0 being handled in this piece of code. Please help me understand.

Right, was assuming, that the special handling happens on the Lucene site.

[1] http://lucene.apache.org/core/4_6_1/core/org/apache/lucene/index/IndexWriter.html

...and also thank you all for all the suggestions and clarifications. Forgot that important part, my bad...

Sven

--

David Ostrovsky

unread,

Feb 12, 2014, 12:37:30 PM2/12/14

to repo-d...@googlegroups.com, Sven Selberg

As pointed out in my original post to this thread, we should reliable offer Lucene autocommit

feature after every write at least for Gerrit developers:

index.lucene.autocommit = true

Currently, after uploading or merging changes, killing debugger all recent index writes are lost.

I don't care about performance, let it be a second. Much more annoying is to restart gerrit

and to see all index changes were lost, stop Gerrit again, run reindex and start Gerrit

again, it is more then a second and it is annoying.

Dave Borowitz

unread,

Feb 12, 2014, 12:52:12 PM2/12/14

to David Ostrovsky, repo-discuss, Sven Selberg

I agree, this is incredibly annoying, I actually wrote a shell alias to find and kill -TERM running Gerrit processes for this very reason.

When I get to poking at the code (hopefully today) I'll see whether it makes sense to make this a separate option from (say) index.lucene.commitwithin=0.

David Ostrovsky

unread,

Feb 12, 2014, 12:59:49 PM2/12/14

to repo-d...@googlegroups.com, David Ostrovsky, Sven Selberg

+2

Sven Selberg

unread,

Feb 13, 2014, 2:50:27 AM2/13/14

to repo-d...@googlegroups.com, David Ostrovsky, Sven Selberg

+2

Sven Selberg

unread,

Feb 13, 2014, 3:26:53 AM2/13/14

to repo-d...@googlegroups.com, David Ostrovsky, Sven Selberg

Great!

Something that would be very valuable to us, and I'm sure to a lot of others is a possibility to configure commit-frequency to fine tune it to fit the specific Gerrit environment and requirements.

I'm thinking about something in the lines of adding a configuration "index.commitOnFlush"

Extend TrackingIndexWriter with a class that overrides flush() and calls commit() if index.commitOnFlush = true.

I'm sure there are more pressing issues right now, but it would be a very useful feature.

/Sven

Dave Borowitz

unread,

Feb 18, 2014, 2:27:16 PM2/18/14

to Sven Selberg, repo-discuss, David Ostrovsky

Just to close the loop on this, the fix will be in 2.8.2:

https://gerrit-review.googlesource.com/54610

Thanks for the report.

--

Sven Selberg

unread,

Feb 19, 2014, 2:58:22 AM2/19/14

to repo-d...@googlegroups.com, Sven Selberg, David Ostrovsky

Thank you for the fix!

I will test it.

/Sven

Reply all

Reply to author

Forward