textIndexer -clean problem


Jamie Orchard-Hays

Jun 2, 2008, 10:53:11 AM
to xtf-...@googlegroups.com

Over the weekend I was running the textIndexer and ran out of disk
space. When I cleaned up I ran an -incremental to finish up.
Unfortunately, it stopped with this error:

(88%) Indexing [tss/8-1/tss.div.8.1.99.xml] ... (9 stored keys) ... Done.
*** Error: class java.io.FileNotFoundException
java.io.FileNotFoundException: /home/capistrano/trunk/xtf/apache-tomcat-6.0.10/webapps/xtf/index/_jnz.cfs (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:212)
    at org.apache.lucene.store.FSIndexInput$Descriptor.<init>(FSDirectory.java:497)
    at org.apache.lucene.store.FSIndexInput.<init>(FSDirectory.java:522)
    at org.apache.lucene.store.FSDirectory.openInput(FSDirectory.java:434)
    at org.apache.lucene.index.CompoundFileReader.<init>(CompoundFileReader.java:63)
    at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:154)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:140)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:121)
    at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1473)
    at org.apache.lucene.index.IndexWriter.maybeMergeSegments(IndexWriter.java:1415)
    at org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1352)
    at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:588)
    at org.cdlib.xtf.textIndexer.XMLTextProcessor.close(XMLTextProcessor.java:641)
    at org.cdlib.xtf.textIndexer.SrcTreeProcessor.close(SrcTreeProcessor.java:192)
    at org.cdlib.xtf.textIndexer.TextIndexer.main(TextIndexer.java:330)

Unfortunately, I didn't know about this before our nightly cron job
reindexed. That job always runs with -clean. It threw this error:

TextIndexer v2.1


Indexing New/Updated Documents:
Index: "default"
*** Error: class java.lang.IllegalStateException
java.lang.IllegalStateException: doc counts differ for segment _jo1: fieldsReader shows 1 but segmentInfo shows 100
    at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:164)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:140)
    at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:121)
    at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:166)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:579)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:147)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:131)
    at org.cdlib.xtf.textIndexer.XMLTextProcessor.openIdxForReading(XMLTextProcessor.java:3656)
    at org.cdlib.xtf.textIndexer.XMLTextProcessor.open(XMLTextProcessor.java:517)
    at org.cdlib.xtf.textIndexer.SrcTreeProcessor.open(SrcTreeProcessor.java:150)
    at org.cdlib.xtf.textIndexer.TextIndexer.main(TextIndexer.java:328)

Indexing Process Aborted.
Finished clean index of all data


What I'm curious about is why a -clean run would fail. Doesn't a
-clean just remove the indexed files and start over?

Jamie

Martin Haye

Jun 9, 2008, 6:54:55 PM
to xtf-...@googlegroups.com
Hi Jamie,

I can't think of an explanation for why -clean didn't work properly. I just
tried making a corrupt index (by renaming one of the segment files) and it
successfully blew it away when I indexed with -clean.

Perhaps there is a difference in file ownership that prevented it from being
able to remove the old index? I'm grasping at straws here...

--Martin

Jamie Orchard-Hays

Jun 10, 2008, 10:22:06 AM
to xtf-...@googlegroups.com

I don't think ownership was a problem. <shrugs>

Martin Haye

Jun 12, 2008, 7:51:50 PM
to xtf-...@googlegroups.com
Hi Jamie,

In the last couple of days we stumbled on a possible way for the textIndexer
to fail during a -clean index, and I thought I should share it.

As you pointed out, when you call it with -clean, the indexer tries to blow
away the old index directory. What I didn't realize is that the code
silently ignores errors during this process. So if a file or directory can't
be removed, the indexer just goes on with the index process. Of course this
can come back to bite us later in the process when filenames conflict.

I just checked in a change to throw an exception and abort indexing if the
old directory can't be deleted.
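The fix described above can be sketched roughly like this. This is a hypothetical helper, not XTF's actual code; the class and method names are invented for illustration. The key point is that a failed delete now raises an exception instead of being silently skipped:

```java
import java.io.File;
import java.io.IOException;

public class CleanIndexSketch {
    /**
     * Recursively delete an index directory. Unlike a best-effort
     * delete that ignores failures, this throws as soon as any file
     * or directory cannot be removed, so a -clean run aborts rather
     * than indexing on top of leftover segment files.
     */
    static void deleteDir(File dir) throws IOException {
        File[] entries = dir.listFiles();
        if (entries != null) {
            for (File entry : entries) {
                if (entry.isDirectory())
                    deleteDir(entry);
                else if (!entry.delete())
                    throw new IOException("Cannot delete file: " + entry);
            }
        }
        if (!dir.delete())
            throw new IOException("Cannot delete directory: " + dir);
    }

    public static void main(String[] args) throws IOException {
        // Build a small temporary index-like tree, then clean it.
        File root = new File(System.getProperty("java.io.tmpdir"),
                             "xtf-clean-demo");
        new File(root, "sub").mkdirs();
        new File(new File(root, "sub"), "_seg.cfs").createNewFile();
        deleteDir(root);
        System.out.println("exists after clean: " + root.exists());
    }
}
```

With the earlier behavior, an undeletable file (bad ownership, say) would leave stale segment files behind and the indexer would keep going, which is consistent with the filename conflicts seen later in the run.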

We'll release this and a few other bug fixes next week as XTF 2.1.1.

--Martin

Jamie Orchard-Hays

Jun 13, 2008, 5:54:01 PM
to xtf-...@googlegroups.com
Great! Thanks for the heads-up.

Jamie
