Hi, I've got a graph in which I want to index different nodes (Entities and Events) using properties such as time range, location, domain ontology, etc. The two obvious options I've got for doing this are: 1) a Lucene index; or 2) an in-graph index, where I'll use a node to index the nodes I seek. One main advantage of the in-graph index is the versatility it provides: it supports a multilevel index (as shown in http://blog.neo4j.org/2012/02/modeling-multilevel-index-in-neoj4.html), reverse index lookups and other possibilities in traversals. However, it is a bit more complex to maintain and it "pollutes" the graph with "system nodes". Moreover, I'm not sure how the in-graph index compares to the Lucene index in terms of efficiency. More specifically, for time/date indexing, how would the multilevel index above compare to a Lucene "YYYYMMDD" string field index? The in-graph index seems to offer some advantage when indexing a start and end date and searching for date ranges.
I would appreciate any insights about these two indexing approaches. Thanks!
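To make the date-range part of the question concrete, here is a minimal Python sketch of the two shapes being compared. Everything in it is a hypothetical in-memory stand-in: `FlatDateIndex` mimics a sorted "YYYYMMDD" string field (it does not call Lucene), and `MultilevelDateIndex` mimics a year/month/day tree of index nodes (it does not call Neo4j). It only illustrates that zero-padded date strings sort chronologically, so a range lookup works on plain string order, and it says nothing about real-world performance.

```python
from bisect import bisect_left, bisect_right, insort
from datetime import date


def key(d: date) -> str:
    """Zero-padded YYYYMMDD key; string order equals chronological order."""
    return d.strftime("%Y%m%d")


class FlatDateIndex:
    """Hypothetical stand-in for a sorted string-field index."""

    def __init__(self):
        self._entries = []  # sorted list of (YYYYMMDD, node_id)

    def add(self, d: date, node_id: str) -> None:
        insort(self._entries, (key(d), node_id))

    def range(self, start: date, end: date) -> list:
        # Sentinels bracket every node_id stored under the boundary keys.
        lo = bisect_left(self._entries, (key(start), ""))
        hi = bisect_right(self._entries, (key(end), "\U0010ffff"))
        return [node_id for _, node_id in self._entries[lo:hi]]


class MultilevelDateIndex:
    """Hypothetical stand-in for a year -> month -> day tree of index nodes."""

    def __init__(self):
        self._root = {}  # {year: {month: {day: [node_id, ...]}}}

    def add(self, d: date, node_id: str) -> None:
        days = self._root.setdefault(d.year, {}).setdefault(d.month, {})
        days.setdefault(d.day, []).append(node_id)

    def range(self, start: date, end: date) -> list:
        hits = []
        for year in sorted(self._root):
            if not start.year <= year <= end.year:
                continue  # prune a whole year subtree
            for month in sorted(self._root[year]):
                if not (start.year, start.month) <= (year, month) <= (end.year, end.month):
                    continue  # prune a whole month subtree
                for day in sorted(self._root[year][month]):
                    if start <= date(year, month, day) <= end:
                        hits.extend(self._root[year][month][day])
        return hits


def overlaps(ev_start: str, ev_end: str, q_start: str, q_end: str) -> bool:
    """Interval overlap test on YYYYMMDD keys for events with start and end dates."""
    return ev_start <= q_end and q_start <= ev_end


flat, tree = FlatDateIndex(), MultilevelDateIndex()
for d, node_id in [(date(2012, 5, 31), "a"), (date(2012, 6, 1), "b"),
                   (date(2012, 6, 15), "c"), (date(2013, 1, 1), "d")]:
    flat.add(d, node_id)
    tree.add(d, node_id)

flat_hits = flat.range(date(2012, 6, 1), date(2012, 12, 31))
tree_hits = tree.range(date(2012, 6, 1), date(2012, 12, 31))
```

Both lookups return the same nodes here; the difference between the approaches lies in how the structure is stored and maintained, not in what a range query can express.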
I have done some work on in-graph indexes in the past, and my
experience is that they are not always worth the effort. It depends,
however, on the context. If, for example, you want to expose the index
as part of your application, an in-graph index is a great solution.
In my experience, in-graph indexes become less attractive when indexing
large numbers of nodes. Rebalancing index trees can become
prohibitively slow when indexes grow big. In a "normal" B-tree, for
example, the index consists of blocks that can be swapped in and out of
memory as a unit. In-graph indexes use relationships to build up a
tree, but those relationships are not grouped together on disk, so
rebalancing an index tree may require disk reads from many different
places in the relationship file.
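The locality argument above can be illustrated with a toy model. This is only a sketch under loud assumptions: the page size (`RECORDS_PER_PAGE`) and the record ids are made up for illustration and do not reflect Neo4j's actual store layout. It counts how many distinct disk pages must be read to visit 100 index entries when they sit in one contiguous block versus when they are spread across a large file.

```python
RECORDS_PER_PAGE = 100  # assumed page capacity, purely illustrative


def pages_touched(record_ids, records_per_page=RECORDS_PER_PAGE):
    """Number of distinct pages that hold the given record ids."""
    return len({rid // records_per_page for rid in record_ids})


# B-tree style: 100 entries stored together as one block.
btree_block = range(4200, 4300)

# In-graph style: 100 relationship records created at different times,
# so their ids land far apart in the (hypothetical) relationship file.
scattered_relationships = [i * 1_000 + 7 for i in range(100)]

block_pages = pages_touched(btree_block)
scattered_pages = pages_touched(scattered_relationships)
```

In this model the contiguous block costs a single page read, while the scattered records cost one page read each, which is the pattern that makes rebalancing expensive once the index no longer fits in memory.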
In my experience (running on my development machine, without any
additional tuning), an index of up to approximately 100,000 entries
still performs reasonably well; above that number of entries,
performance becomes progressively slower. Of course, tuning can make
the approach work well for higher numbers of entries, but I have to
assume the basic pattern remains.
On Jun 1, 4:39 pm, SimonH <simon.ha...@gmail.com> wrote:
On Friday, June 1, 2012 4:56:01 PM UTC-4, Niels Hoogeveen wrote: