Hi!
We had an incident a couple of days ago when our Gerrit server went haywire after a network glitch. The server ran at 100% CPU for a long time, mainly writing network error messages to the logs.
Gerrit never recovered from this and we needed to restart. After the restart Gerrit was up and running, except for the Lucene index.
Gerrit seemed to be missing index entries for all changes made between index creation (on the upgrade to 2.8.1) and the restart.
Copying the index to a local machine and attempting to read it with Luke or CheckIndex ended with a “read past EOF” IOException. From what I could tell the index had been corrupted.
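(In case anyone wants to reproduce this: roughly what we ran, as a sketch assuming Lucene 4.x on the classpath and a copy of the index directory; the class name and path are just examples, not our real setup:)

  import java.io.File;
  import org.apache.lucene.index.CheckIndex;
  import org.apache.lucene.store.FSDirectory;

  public class CheckGerritIndex {
    public static void main(String[] args) throws Exception {
      // Point this at a *copy* of the index, never the live directory.
      FSDirectory dir = FSDirectory.open(new File("/tmp/index-copy/changes_open"));
      CheckIndex checker = new CheckIndex(dir);
      checker.setInfoStream(System.out); // print per-segment diagnostics
      CheckIndex.Status status = checker.checkIndex();
      System.out.println(status.clean ? "index is clean" : "index is CORRUPT");
    }
  }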
After checking the Lucene code in Gerrit I noticed that the IndexWriters don't seem to be closed until Gerrit is stopped. Our main lead is that this, combined with the problems caused by the high server load, led to the corrupt index.
I'm guessing that, if IndexWriters are indeed only closed when Gerrit is stopped, this was done to improve performance. However, all the tutorials and how-tos I've read recommend closing writers after every operation.
Any thoughts about my assumptions?
Should we reconsider this part of the Lucene implementation?
It's just that a full reindex takes approximately 330 minutes in our live environment; it's impossible for us to have that kind of downtime. I'm thinking there must be a way to implement Lucene so that it doesn't corrupt the entire index if this happens.
On Tue, Feb 11, 2014 at 4:10 AM, Sven Selberg <sven.s...@sonymobile.com> wrote:
> It's just that a full reindex takes approximately 330 minutes in our live environment; it's impossible for us to have that kind of downtime. I'm thinking there must be a way to implement Lucene so that it doesn't corrupt the entire index if this happens.

I agree, 330 minutes of downtime is unacceptable. That is a 5.5-hour indexing operation!

IIRC the indexing runs multi-threaded, but on one machine. If it were better parallelized and you had more hardware you could pull the time in, but even with 5 machines you are talking about more than an hour of downtime, which I think is still bordering on unacceptable. Especially if it was unplanned.

It should be possible to run reindex while the server is online. The Lucene glue is supposed to know how to build up the most recent version of the index in the background while the server stays up answering requests. Unfortunately, with no valid index the dashboards are all unavailable, and there is enough important functionality missing that you might as well have the server down.

I think the structure of a Lucene index on disk is designed to be resilient to corruption. You could lose recent updates, but you should never lose the entire index. If it was only discarding a few recent updates and then re-indexing any recently modified changes in the background after restart, the server should be able to repair itself and get caught up within minutes. But that didn't work here.
On Tue, Feb 11, 2014 at 9:58 AM, Shawn Pearce <s...@google.com> wrote:
> IIRC the indexing runs multi-threaded, but on one machine. If it were better parallelized and you had more hardware you could pull the time in [...]

This would require some extra work on our part, because on-disk Lucene indexes can really only be accessed from one machine. (Solr Cloud is a different story.) So we would need to serialize the writes each parallel machine would make and apply them from the main server. (Map-reduce with a single reducer.)

> I think the structure of a Lucene index on disk is designed to be resilient to corruption. You could lose recent updates, but you should never lose the entire index.

Right. This is why I suggested searching for info on the specific error message. Maybe it's a known issue.

> If it was only discarding a few recent updates and then re-indexing any recently modified changes in the background after restart, the server should be able to repair itself and get caught up within minutes. But that didn't work here.

s/didn't work/wouldn't have worked/. We've discussed this before as a possibility, but the way you phrased it makes it sound like we've implemented it, which we haven't.
I doubt you want this from a performance perspective, but it's already configurable: just set index.lucene.maxBufferedDocs to 0.
In any case this is not the problem Sven is describing: dropped writes generally leave the index in a readable, un-corrupt state.
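(For reference, a sketch of what maxBufferedDocs controls, assuming Lucene 4.x; the class name and path are placeholders. One caveat: if I read IndexWriterConfig correctly, Lucene itself only accepts -1 (disabled) or values >= 2 here, which may matter for the 0 suggestion above:)

  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class MaxBufferedDocsSketch {
    public static void main(String[] args) throws Exception {
      IndexWriterConfig config = new IndexWriterConfig(
          Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
      // Flush a new segment after this many docs have been buffered in RAM.
      // Flushing alone does not make the docs durable or visible; that
      // still requires commit() or close().
      config.setMaxBufferedDocs(2); // -1 disables; values below 2 are rejected
      IndexWriter writer = new IndexWriter(
          FSDirectory.open(new File("/tmp/idx")), config);
      writer.close();
    }
  }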
On Tuesday, February 11, 2014 4:50:40 PM UTC+1, Dave Borowitz wrote:
> I doubt you want this from a performance perspective, but it's already configurable: just set index.lucene.maxBufferedDocs to 0.

I cannot find anything in the constant field values [1] pointing towards maxBufferedDocs = 0 triggering an autocommit in the IndexWriter.
Neither am I able to find where Gerrit handles maxBufferedDocs = 0.

> In any case this is not the problem Sven is describing: dropped writes generally leave the index in a readable, un-corrupt state.

What exactly do you refer to when you say "dropped writes"?
On Mon, Feb 10, 2014 at 11:26 PM, Sven Selberg <sven.s...@sonymobile.com> wrote:
> [...] After the restart Gerrit seemed to be missing index entries for all changes made between index creation (on the upgrade to 2.8.1) and the restart.
Do you have a sense of approximately how many writes were dropped? The default buffer size is the Lucene default of 16 MiB, and it's possible each write is only, say, under 1 KiB, in which case thousands of writes might be buffered. We should definitely change the default if it's causing real-world problems.
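(Concretely, the knob in question; just a sketch against Lucene 4.x, with a made-up class name:)

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.util.Version;

  public class RamBufferSketch {
    public static void main(String[] args) {
      IndexWriterConfig config = new IndexWriterConfig(
          Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44));
      // Lucene buffers pending updates in RAM up to this many MiB before
      // flushing a segment; the default is 16.0.
      System.out.println(IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB);
      // A smaller buffer bounds how much can sit unflushed at any time.
      config.setRAMBufferSizeMB(1.0);
    }
  }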
> Copying the index to a local machine and attempting to read it with Luke or CheckIndex ended with a "read past EOF" IOException. From what I could tell the index had been corrupted.

> After checking the Lucene code in Gerrit I noticed that the IndexWriters don't seem to be closed until Gerrit is stopped. [...] All the tutorials and how-tos I've read recommend closing writers after every operation.
I would be surprised if they don't include some caveat like "unless you care about performance."

And again, dropped writes != corruption, so it seems like there is something else going on here. Perhaps it died in the middle of a write? Have you googled the specific error message you got?
> Any thoughts about my assumptions?
> Should we reconsider this part of the Lucene implementation?
The only thing I think we don't do that we should is flush writes periodically (say, once every few minutes) even if the buffer size hasn't been reached.
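(Something like this is what I have in mind; just a sketch, with a made-up PeriodicCommitter wrapper and an arbitrary five-minute interval:)

  import java.io.IOException;
  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;
  import org.apache.lucene.index.IndexWriter;

  // Hypothetical helper: commit a long-lived IndexWriter every few minutes.
  class PeriodicCommitter {
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    PeriodicCommitter(final IndexWriter writer) {
      scheduler.scheduleWithFixedDelay(new Runnable() {
        @Override
        public void run() {
          try {
            // commit() fsyncs and publishes a new commit point; after a
            // crash Lucene falls back to the last successful one.
            writer.commit();
          } catch (IOException e) {
            // Log and retry on the next tick.
          }
        }
      }, 5, 5, TimeUnit.MINUTES);
    }

    void stop() {
      scheduler.shutdown();
    }
  }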
On Wednesday, February 12, 2014 11:24:47 AM UTC+1, Sven Selberg wrote:
> I cannot find anything in the constant field values [1] pointing towards maxBufferedDocs = 0 triggering an autocommit in the IndexWriter.

Apparently my search engine is better than yours [1] ;-)

> Neither am I able to find where Gerrit handles maxBufferedDocs = 0.

How about this [2]?

  writerConfig.setMaxBufferedDocs(cfg.getInt("index", name, "maxBufferedDocs",
      IndexWriterConfig.DEFAULT_MAX_BUFFERED_DOCS));

> What exactly do you refer to when you say "dropped writes"?

I think what Dave was trying to explain is that
* your corrupted index data on the disk, and
* my unflushed RAM writes (aka "dropped writes") due to killing the debugger
are completely different problems and have nothing in common.
On Wednesday, February 12, 2014 12:05:49 PM UTC+1, David Ostrovsky wrote:
> Apparently my search engine is better than yours [1] ;-)

Not only that, your eyesight seems to be superior to mine as well :-) I can't find anything here pointing towards maxBufferedDocs == 0 triggering autocommit behavior. You might read between the lines that this value would trigger a flush after each document, but if our problem were due to documents not being flushed we would be missing about 20 documents, not two weeks' worth. Please explain.

> How about this [2]?

All I see is the value being taken from gerrit.config, put into a writer config, and passed on to the IndexWriter ctor. As I stated earlier, I can't see the special case of maxBufferedDocs == 0 being handled anywhere in this piece of code. Please help me understand.
On Wednesday, February 12, 2014 1:03:29 PM UTC+1, Sven Selberg wrote:
> I can't find anything here pointing towards maxBufferedDocs == 0 triggering autocommit behavior. [...] Please explain.

OK, I see your point. It is stated that [1]: "Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called." So are we missing the IndexWriter.commit() call?
> All I see is the value being taken from gerrit.config, put into a writer config, and passed on to the IndexWriter ctor. [...]

Right, I was assuming that the special handling happens on the Lucene side.
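(To make that javadoc quote concrete, a self-contained sketch, assuming Lucene 4.x, using a throwaway RAMDirectory; class and field names are made up:)

  import java.io.IOException;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StringField;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  public class CommitVsFlush {
    public static void main(String[] args) throws IOException {
      Directory dir = new RAMDirectory();
      IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
          Version.LUCENE_44, new StandardAnalyzer(Version.LUCENE_44)));

      Document doc = new Document();
      doc.add(new StringField("change_id", "12345", Field.Store.YES));
      writer.addDocument(doc); // buffered (or flushed), but NOT yet durable

      // Only commit()/close() publish a new commit point that a fresh
      // reader can see and that survives a crash.
      writer.commit();

      DirectoryReader reader = DirectoryReader.open(dir);
      System.out.println("visible docs: " + reader.numDocs()); // 1
      reader.close();
      writer.close();
    }
  }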
On Wednesday, February 12, 2014 1:59:13 PM UTC+1, David Ostrovsky wrote:
> OK, I see your point. It is stated that [1]: "Note that flushing just moves the internal buffered state in IndexWriter into the index, but these changes are not visible to IndexReader until either commit() or close() is called."
> So are we missing the IndexWriter.commit() call?

Thank you! This is what I suspect. We are running some tests to see if commit() or close() calls after adding a document make this problem go away. If they do, we'll run some tests to try to find out what sort of effect this has on performance.

Just wanted to make everybody aware that this was a potential Gerrit issue. There has been a similar issue with lost changes (I can't find it), and that was considered to be a question of too-large buffers.

The Lucene index is a great feature; we love it. We appreciate Dave's efforts tremendously. Whether it's a Gerrit issue or an issue with our setup, we just want to fix this as soon as possible so that we can get the index up and running again. I was just trying to make sure we were not barking up the wrong tree and that this wasn't merely an issue that manifests itself in our specific Gerrit setup.
> Right, I was assuming that the special handling happens on the Lucene side.

...and also thank you all for all the suggestions and clarifications. Forgot that important part, my bad...
Sven
+2