Modifying Lucene index


Colin Schepers

Sep 2, 2013, 9:16:40 AM
to rav...@googlegroups.com
Two indexes:
- TextIdx: Indexes the 'Text' Field of a document. A custom analyzer is used which injects synonyms at the same position as the corresponding term. It retrieves synonyms from the index 'SynonymIdx'.
- SynonymIdx: The index on the synonyms
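
To make the setup concrete, here is a minimal sketch in Python (a toy model, not Lucene or RavenDB code) of an analyzer that injects synonyms at the same position as the term they belong to. The `SYNONYMS` table and the `analyze` function are hypothetical stand-ins for the contents of SynonymIdx and for the custom analyzer.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym table standing in for the contents of SynonymIdx.
SYNONYMS: Dict[str, List[str]] = {"a": ["aa"]}


def analyze(text: str, synonyms: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return (term, position) pairs; a synonym shares the position of its source term."""
    tokens: List[Tuple[str, int]] = []
    for position, term in enumerate(text.lower().split()):
        tokens.append((term, position))
        # Injected synonyms do not advance the position counter.
        for synonym in synonyms.get(term, []):
            tokens.append((synonym, position))
    return tokens


print(analyze("a b", SYNONYMS))  # [('a', 0), ('aa', 0), ('b', 1)]
```

Because the injected terms are baked into the index at analysis time, a later change to the synonym table is not reflected until the affected documents are analyzed again.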

The problem is as follows:
When a synonym is added, modified, or removed, an AbstractDeleteTrigger/AbstractPutTrigger on SynonymIdx should update or remove the entries in TextIdx that are affected by that synonym. Note that no relation exists between the two indexes (the only link is the analyzer, which injects a string). Using Lucene's IndexSearcher/IndexWriter (i.e. IndexWriter.DeleteDocuments(Term)) results in an exception indicating that the index is locked (read-only directory). What is the best way to solve this problem?

Oren Eini (Ayende Rahien)

Sep 2, 2013, 9:22:43 AM
to ravendb
Take 17 steps back.
I have no idea what you are trying to do here. Let's start there.



Itamar Syn-Hershko

Sep 2, 2013, 11:05:59 AM
to rav...@googlegroups.com
No. Whenever a synonym dictionary is updated, you have to do a full reindex. This is also true for raw Lucene indexes, unless you apply synonyms only at query time (aka query expansion).



Colin Schepers

Sep 2, 2013, 1:32:29 PM
to rav...@googlegroups.com
@Oren: I have two collections: documents and synonyms. When a document is indexed, the raw text is parsed using a custom analyzer. This analyzer inserts synonym terms at certain positions depending on the synonym collection. How should I update the index of the documents whenever the synonym collection is updated?

@Itamar: Can you explain why? Suppose doc1 = "a b" and syn1 = "a == aa" (after analysis, doc1 looks like "(a == aa) b"). My ideas were:
- syn1 deleted or updated: delete the documents containing the term "a" from the document index and add them again (re-analyzing them). IndexWriter.updateDocument(Term term, Iterable<? extends IndexableField> doc, Analyzer analyzer) does this: it "updates a document by first deleting the document(s) containing term and then adding the new document."
- syn2 (b == bb) added: query for "b" and re-index these documents
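
The two ideas above can be sketched with a toy inverted index in Python. This is an illustration only: `ToyIndex`, `update_document`, and `change_synonym` are hypothetical names, not Lucene or RavenDB APIs, and the sketch assumes the raw text of every document is still available for re-analysis.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def analyze(text: str, synonyms: Dict[str, List[str]]) -> Set[str]:
    """Return the terms of a text plus the synonyms injected for them."""
    terms: Set[str] = set()
    for term in text.lower().split():
        terms.add(term)
        terms.update(synonyms.get(term, []))
    return terms


class ToyIndex:
    """Hypothetical in-memory stand-in for the document index."""

    def __init__(self, synonyms: Dict[str, List[str]]) -> None:
        self.synonyms = {term: list(syns) for term, syns in synonyms.items()}
        self.sources: Dict[str, str] = {}  # doc id -> raw text
        self.postings: Dict[str, Set[str]] = defaultdict(set)  # term -> doc ids

    def update_document(self, doc_id: str, text: str) -> None:
        # Delete-then-add: drop the old postings, then re-analyze the text.
        for doc_ids in self.postings.values():
            doc_ids.discard(doc_id)
        self.sources[doc_id] = text
        for term in analyze(text, self.synonyms):
            self.postings[term].add(doc_id)

    def change_synonym(self, term: str, new_synonyms: Iterable[str]) -> List[str]:
        """Add, replace, or (with an empty list) remove the synonyms of a term."""
        new_list = list(new_synonyms)
        if new_list:
            self.synonyms[term] = new_list
        else:
            self.synonyms.pop(term, None)
        # Only documents containing the source term are affected.
        affected = sorted(self.postings.get(term, set()))
        for doc_id in affected:
            self.update_document(doc_id, self.sources[doc_id])
        return affected


index = ToyIndex({"a": ["aa"]})
index.update_document("doc1", "a b")
index.update_document("doc2", "c d")
print(index.change_synonym("a", []))      # syn1 deleted
print(index.change_synonym("b", ["bb"]))  # syn2 added
```

Looking up documents by the source term works here because the original term is always indexed alongside its synonyms. A rule whose trigger is not itself an indexed term would need a different way to find the affected documents.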

Still, to do this I need to access the Lucene Index. Is it possible to unlock the index directory (the file write.lock)?


Itamar Syn-Hershko

Sep 2, 2013, 1:47:17 PM
to rav...@googlegroups.com
Synonyms should be handled either at indexing time or at query time. There's no real point in doing both.

What you propose requires too much work, and your first attempt should really try to use RavenDB constructs - for example by updating the RavenDB documents matching the query and by that trigger their reindexing.

The real questions you should ask yourself are how frequently this list is going to be updated, how big it will be, and how many synonyms per word you will have on average.

If the updates are frequent enough, and especially if the majority of queries are expected to be short (Google-style queries), I would highly recommend doing the expansion at query time, either by providing your own Analyzer to the QueryParser or by writing your own QueryParser. I believe there is no separation between search and index analyzers in RavenDB at this point, but there should be one.
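
As a rough illustration of query-time expansion, the Python sketch below leaves the indexed documents untouched and instead rewrites each query term into an OR-group of the term and its synonyms. `expand_query` and `matches` are hypothetical helpers for this toy model, not part of Lucene's QueryParser or of RavenDB.

```python
from typing import Dict, List


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> List[List[str]]:
    """Turn each query term into a group of alternatives: the term plus its synonyms."""
    return [[term] + list(synonyms.get(term, [])) for term in query.lower().split()]


def matches(doc_text: str, expanded: List[List[str]]) -> bool:
    """A document matches if it contains at least one alternative from every group."""
    terms = set(doc_text.lower().split())
    return all(any(alt in terms for alt in group) for group in expanded)


synonyms = {"a": ["aa"]}
expanded = expand_query("a b", synonyms)
print(expanded)                   # [['a', 'aa'], ['b']]
print(matches("aa b", expanded))  # True
print(matches("c b", expanded))   # False
```

With this approach a change to the synonym list takes effect on the next query, and no document has to be reindexed.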

Please note that if you are also using highlighting, Lucene.NET currently has a bug which may cause searches to become unresponsive when highlighting phrases with multiple synonyms.


Colin Schepers

Sep 2, 2013, 2:26:24 PM
to rav...@googlegroups.com
Ok, thanks for your helpful reply. Query expansion might be a better option, I have to think about that.

"What you propose requires too much work, and your first attempt should really try to use RavenDB constructs - for example by updating the RavenDB documents matching the query and by that trigger their reindexing."
I didn't actually mean deleting and adding the document; but using the method "IndexWriter.updateDocument" which does this implicitly (what I understand from reading the documentation). Can you explain what you mean by updating the documents? Nothing changes in a document itself, only the TokenStream generated from the analyzer.

"I believe there is no separation between search and index analyzers in RavenDB at this point, but there should be one."
What about the plugin AbstractAnalyzerGenerator or is this not what you mean?

Colin Schepers

Sep 3, 2013, 2:01:58 AM
to rav...@googlegroups.com
On second thought, query expansion is not an option because I'm also dealing with synonyms that are regular expressions over a sentence of the text. This has to be done at indexing time.
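
One possible reading of "regular expressions over a sentence" is sketched below in Python: a rule matches a span of several words in the raw text, and a synonym term is injected at the position of the first matched word. The rule, the `analyze_with_regex` function, and the example text are all hypothetical. Since such a rule needs the document text, it cannot be evaluated against a short query.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical rule: a multi-word pattern mapped to the term that should be injected.
RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"\bnew york( city)?\b"), "nyc"),
]


def analyze_with_regex(text: str, rules: List[Tuple[Pattern[str], str]]) -> List[Tuple[str, int]]:
    """Tokenize on whitespace and inject a synonym where a rule matches the raw text."""
    lowered = text.lower()
    tokens: List[Tuple[str, int]] = []
    position_at_offset: Dict[int, int] = {}
    for position, word in enumerate(re.finditer(r"\S+", lowered)):
        tokens.append((word.group(), position))
        position_at_offset[word.start()] = position
    for pattern, synonym in rules:
        for match in pattern.finditer(lowered):
            # Only inject when the match begins exactly at a token boundary.
            position = position_at_offset.get(match.start())
            if position is not None:
                tokens.append((synonym, position))
    # A stable sort keeps each injected term right after the word it is attached to.
    return sorted(tokens, key=lambda token: token[1])


print(analyze_with_regex("I love New York City", RULES))
```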

Itamar Syn-Hershko

Sep 3, 2013, 3:43:11 AM
to rav...@googlegroups.com
On Mon, Sep 2, 2013 at 9:26 PM, Colin Schepers <colins...@gmail.com> wrote:
> Ok, thanks for your helpful reply. Query expansion might be a better option, I have to think about that.
>
> "What you propose requires too much work, and your first attempt should really try to use RavenDB constructs - for example by updating the RavenDB documents matching the query and by that trigger their reindexing."
> I didn't actually mean deleting and adding the document; but using the method "IndexWriter.updateDocument" which does this implicitly (what I understand from reading the documentation). Can you explain what you mean by updating the documents? Nothing changes in a document itself, only the TokenStream generated from the analyzer.

Right, but instead of fighting with index locks you can have RavenDB itself trigger the reindexing by touching the documents in the store. If you have other indexes on the same collection, then it may be worth looking into other options. But I would at least start there.

> "I believe there is no separation between search and index analyzers in RavenDB at this point, but there should be one."
> What about the plugin AbstractAnalyzerGenerator or is this not what you mean?

Yes, that will work.

Itamar Syn-Hershko

Sep 3, 2013, 3:46:24 AM
to rav...@googlegroups.com
So this is going to be quite tough. You'll need an Analyzer with an updatable dictionary, the ability to update it from the client side (I'd assume), and a way to find out which documents require reindexing and to trigger reindexing for them. If the list changes frequently you'll always be indexing, and that's not ideal either.
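
The "find out which documents require reindexing" step could look roughly like the Python sketch below: diff the old and new dictionaries, then collect the documents that contain any term whose synonyms changed. `changed_terms`, `documents_to_reindex`, and the `postings` map (term to document ids) are assumptions made for illustration, not RavenDB or Lucene constructs.

```python
from typing import Dict, List, Set


def changed_terms(old: Dict[str, List[str]], new: Dict[str, List[str]]) -> Set[str]:
    """Terms whose synonym set was added, removed, or modified."""
    return {
        term
        for term in old.keys() | new.keys()
        if set(old.get(term, [])) != set(new.get(term, []))
    }


def documents_to_reindex(postings: Dict[str, Set[str]], terms: Set[str]) -> List[str]:
    """Ids of all documents that contain at least one of the changed terms."""
    affected: Set[str] = set()
    for term in terms:
        affected |= postings.get(term, set())
    return sorted(affected)


old = {"a": ["aa"]}
new = {"a": ["aa", "aaa"], "b": ["bb"]}
postings = {"a": {"doc1"}, "b": {"doc1", "doc3"}, "c": {"doc2"}}

changed = changed_terms(old, new)
print(sorted(changed))                          # ['a', 'b']
print(documents_to_reindex(postings, changed))  # ['doc1', 'doc3']
```

The more often the dictionary changes, and the more common the changed terms are, the larger this set becomes, which is the "always indexing" cost mentioned above.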


