Live indexes


Valentin Tablan

Oct 30, 2013, 11:48:37 AM
to mg...@googlegroups.com
Hi All,

I'm trying to implement a live index, where documents are available for
searching as soon as they've been added to the index. This is somewhat
contrary to the way MG4J normally works, where documents are indexed in
batches, so it will require a bit of work. Before I commit to spending a
lot of time writing code, I thought I'd check with the experts whether
my current plan makes sense, or if there's a better solution.

My current idea is:

- implement some form of in-memory Index
- create a documental cluster containing the main on-disk index and the
in-memory one
- when new documents are added to the index, they go directly into the
in-memory index
- when the memory is full (and also at regular time intervals):
  - dump the in-memory index to an on-disk batch
  - add the new on-disk batch to the documental cluster
  - start a new empty in-memory index, and add it to the cluster

- at regular (longer) time intervals, append all the on-disk batches to
the main index.
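
To make the steps above concrete, here is a rough sketch in plain Java. It is a toy model, not MG4J code: `LiveIndexSketch`, `Segment`, and the document-count threshold are all hypothetical names and assumptions, the "on-disk" batches are just sealed in-memory maps, and a single lock stands in for whatever synchronization a real implementation would need.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Toy model: one mutable in-memory segment plus sealed batches, searched together. */
class LiveIndexSketch {

    /** Maps each term to the ascending list of global document ids containing it. */
    static final class Segment {
        final Map<String, List<Integer>> postings = new HashMap<>();
        int docCount = 0;

        void add(int docId, String text) {
            // De-duplicate terms so each document appears once per posting list.
            for (String term : new TreeSet<>(Arrays.asList(text.toLowerCase().split("\\s+")))) {
                postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
            }
            docCount++;
        }

        List<Integer> search(String term) {
            return postings.getOrDefault(term, Collections.emptyList());
        }
    }

    private final int maxDocsInMemory;                       // stand-in for "memory is full"
    private final Segment main = new Segment();              // stand-in for the main on-disk index
    private final List<Segment> batches = new ArrayList<>(); // stand-in for on-disk batches
    private Segment live = new Segment();                    // the in-memory index
    private int nextDocId = 0;

    LiveIndexSketch(int maxDocsInMemory) {
        this.maxDocsInMemory = maxDocsInMemory;
    }

    /** New documents go directly into the in-memory segment and are searchable at once. */
    synchronized int add(String text) {
        int id = nextDocId++;
        live.add(id, text);
        if (live.docCount >= maxDocsInMemory) {
            flush();
        }
        return id;
    }

    /** Seal the in-memory segment as a batch and start a new empty one. */
    synchronized void flush() {
        if (live.docCount == 0) {
            return;
        }
        batches.add(live);
        live = new Segment();
    }

    /** Append all batches, oldest first, to the main index. */
    synchronized void mergeBatches() {
        for (Segment batch : batches) {
            for (Map.Entry<String, List<Integer>> e : batch.postings.entrySet()) {
                main.postings.computeIfAbsent(e.getKey(), t -> new ArrayList<>()).addAll(e.getValue());
            }
            main.docCount += batch.docCount;
        }
        batches.clear();
    }

    /** Query main index, batches, and live segment in document order. */
    synchronized List<Integer> search(String term) {
        String key = term.toLowerCase();
        List<Integer> result = new ArrayList<>(main.search(key));
        for (Segment batch : batches) {
            result.addAll(batch.search(key));
        }
        result.addAll(live.search(key));
        return result;
    }
}
```

Because document ids are handed out sequentially and segments are always searched oldest first, the concatenated results stay in ascending id order, which is the property a documental cluster over contiguous partitions would rely on.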


Does this sound like a sensible way of approaching the problem? Are
there already classes I should look at, that may provide some of this
functionality?

Thanks,
Valentin



Sebastiano Vigna

Nov 7, 2013, 6:27:32 PM
to mg...@googlegroups.com
On 30 Oct 2013, at 4:48 PM, Valentin Tablan <v.ta...@gmail.com> wrote:

> lot of time writing code, I thought I'd check with the experts whether
> my current plan makes sense, or if there's a better solution.

Sigh. That's been on the to-do list since my last visit to Sheffield.

> - implement some form of in-memory Index

As we discussed, the posting-list representation currently used in memory during indexing is, in fact, fully searchable. It is not really optimized (e.g., no skips), but for a small collection it is fine. It's just a matter of correctly orchestrating the order in which the internal variables are updated, so that no posting list is ever in an inconsistent state.
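
One way to picture the ordering requirement is the sketch below. It is not MG4J's actual posting-list code: `AppendOnlyPostingList` is a hypothetical class, and it assumes a single indexing thread with any number of searching threads. The writer stores the new entry first and publishes the new length last, so a reader that sees length n always finds n fully written entries.

```java
import java.util.Arrays;

/** Hypothetical append-only posting list: one writer, many readers, no locks. */
final class AppendOnlyPostingList {
    private volatile int[] docs = new int[4];
    private volatile int size = 0;

    /** Must be called by the single indexing thread only. */
    void append(int docId) {
        int n = size;
        int[] a = docs;
        if (n == a.length) {
            // Grow into a fresh array; readers keep using the old one until we publish.
            a = Arrays.copyOf(a, n * 2);
            a[n] = docId;
            docs = a;          // publish the larger array before the new size
        } else {
            a[n] = docId;      // write the entry into a slot no reader looks at yet
        }
        size = n + 1;          // publish last: readers now see the new entry
    }

    /** Safe from any thread: returns a consistent prefix of the list. */
    int[] snapshot() {
        int n = size;          // read the length first...
        int[] a = docs;        // ...then an array that holds at least n entries
        return Arrays.copyOf(a, n);
    }
}
```

The same idea (write data, then publish the counter that makes it visible) would have to cover every per-term variable, such as frequencies and positions, which this sketch leaves out.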

> - create a documental cluster containing the main on-disk index and the
> in-memory one
> - when new documents are added to the index, they go directly into the
> in-memory index
> - when the memory is full (and also at regular time intervals):
>   - dump the in-memory index to an on-disk batch
>   - add the new on-disk batch to the documental cluster
>   - start a new empty in-memory index, and add it to the cluster

Yes, that's how I would do it, too.

> - at regular (longer) time intervals, append all the on-disk batches to
> the main index.


There are of course classes, like Combine/Concatenate/Merge/Paste, that can combine the on-disk batches into larger batches or into the main index.

But how would you expose the in-memory index? A server?

Ciao,

seba
