best NoSQL for building an inverted index


jenna_s

Apr 17, 2012, 3:01:28 AM
to NOSQL
Hi,

I'm working on a small project where I need to build an inverted index
and then run similarity algorithms on it based on a user query - basic
information retrieval. Is there one NoSQL product that stands out when
it comes to building & searching inverted indices?

Thanks so much,
J

Itamar Syn-Hershko

Apr 17, 2012, 11:14:43 AM
to nosql-di...@googlegroups.com
RavenDB builds on Lucene and provides all this out of the box: it produces an inverted index from properties you define, and it supports suggestions as well.

But why would you need a NoSQL database? Just use Lucene directly.



Emin Gun Sirer

Apr 17, 2012, 11:30:25 AM
to nosql-di...@googlegroups.com
Hi J,

I assume that your data is too large to fit on a single host, so you cannot use a local database or inverted index. I urge you to look at HyperDex, which has a unique SEARCH primitive that can perform the kind of information retrieval you mentioned efficiently, while also supporting sharding and scaling out. The URL is: http://www.hyperdex.org

- egs

Eric Bloch

Apr 17, 2012, 11:41:33 AM
to nosql-di...@googlegroups.com
MarkLogic is based on inverted indexes: every document inserted or edited (XML, JSON, text, etc.) has its full text indexed in real time, with no subsequent indexing phase. Free academic (and express) licenses are available at http://community.marklogic.com. From "Inside MarkLogic":

When people think of MarkLogic they often think of its text search capabilities. The
founding team has a deep background in search: Chris Lindblad was the architect of the
Ultraseek Server, while Paul Pedersen was the VP of Enterprise Search at Google.
MarkLogic supports numerous search features including word and phrase search, boolean
search, proximity, wildcarding, stemming, tokenization, decompounding, case-sensitivity
options, punctuation-sensitivity options, diacritic-sensitivity options, document quality
settings, numerous relevance algorithms, individual term weighting, topic clustering,
faceted navigation, custom-indexed fields, and more.

It gets used for all sorts of IR and text analysis/analytics. It was designed and architected from scratch and does not rely on some version of Lucene being bolted on later.

There's also an easy-to-use and powerful REST API for it, available at http://github.com/marklogic/Corona.

Eric

Konstantin Osipov

Apr 17, 2012, 11:48:42 AM
to nosql-di...@googlegroups.com
Sphinx?

--
http://tarantool.org - an efficient, extensible in-memory data store

David Bayliss

Apr 17, 2012, 2:59:46 PM
to nosql-di...@googlegroups.com
If you want more control over your inverted index (or ever want to scale to bigger data), you might want to check out HPCC Systems: http://hpccsystems.com/

It gives complete control over the index build (and usage) process.

I give an example of how to build an inverted index using it here: http://www.dabhand.org/ECL/construct_a_simple_bible_search.htm

hth

David

Martin Bruse

Apr 17, 2012, 4:55:15 PM
to nosql-di...@googlegroups.com
Check out elasticsearch as well, lots of people use it simply as a
linearly scalable and fully indexed database.

It is based on Lucene, with all its power, but is also operationally
simple and scales out really well.

//Martin


jenna_s

Apr 17, 2012, 7:37:23 PM
to NOSQL
Thanks everyone. I didn't expect so many replies, or such
different ones. I guess I should start looking at each one.

A friend has suggested that I use Redis, given that my user interface
is implemented in Rails. Does anyone have experience with Redis and
inverted indices, and can you say why I should or should not use Redis
instead of Lucene/HyperDex/MarkLogic/Sphinx/HPCC/ElasticSearch?

Thanks,
J

Brian Bulkowski

Apr 17, 2012, 8:07:04 PM
to nosql-di...@googlegroups.com
Jenna ---

Those two choices are apples and oranges.

Redis is a simple, fast key store that allows you to build your own
reverse index.

Lucene/HyperDex/MarkLogic/Sphinx/... *ARE* text-oriented reverse indexes.
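
To make the "build your own" side of that contrast concrete, here is a minimal sketch of a reverse (inverted) index kept as one set of document ids per term. It is plain in-memory Python, not Redis code; the class and function names are made up for illustration. The comments note the Redis set commands (SADD, SINTER) that the two operations would roughly correspond to if each term were stored as a Redis set, which is an assumption about one possible layout rather than a recommended design.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Hypothetical helper: maps each term to the set of doc ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            # Rough Redis analogue: SADD <term> <doc_id>
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        sets = [self.postings.get(term, set()) for term in terms]
        # Rough Redis analogue: SINTER <term1> <term2> ...
        return set.intersection(*sets)


index = InvertedIndex()
index.add(1, "Redis is a fast key store")
index.add(2, "Lucene is a text search library")
index.add(3, "A fast text index")
print(index.search("fast text"))  # {3}
```

Everything a search engine adds on top (stemming, ranking, phrase queries) would have to be written by hand in this style, which is the trade-off being described.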

Choose whichever you need!

Good luck!
-brian


Ran Tavory

Apr 18, 2012, 12:24:04 AM
to nosql-di...@googlegroups.com
But wait, there's more...
Solr is Lucene wrapped as a web service.
Solandra is a Solr implementation over Cassandra for exceptionally high scale.
For high scale, SolrCloud is also worth a look.

Ran


jenna_s

Apr 18, 2012, 1:25:12 AM
to NOSQL
After reading about all these technologies, I realized I didn't ask
the right question: what's the best NoSQL to roll your own inverted
index? My project is not only about IR. The algorithms I need to
implement are based on other data that's in addition to text indexing.

I do appreciate everyone's answers.


John D. Mitchell

Apr 18, 2012, 1:34:00 AM
to nosql-di...@googlegroups.com
On Apr 17, 2012, at 22:25 , jenna_s wrote:
[...]

> After reading about all these technologies, I realized I didn't ask
> the right question: what's the best NoSQL to roll your own inverted
> index? My project is not only about IR. The algorithms I need to
> implement are based on other data that's in addition to text indexing.

If you're doing this to learn, and expect a fair bit of exploration into composing various data structures into custom solutions, Redis is quite nice for that since it gives you building blocks that are easy to use.
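
As a rough illustration of what composing such building blocks might look like, the sketch below keeps a per-term map of document id to term count and ranks documents for a query with a simple tf * idf sum. This is plain Python with hypothetical names, not Redis code, and the scoring formula is a textbook-style stand-in for whatever similarity measure the project actually needs. In Redis, the per-term maps could plausibly be kept in hashes or sorted sets, but that mapping is an assumption.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ScoredIndex:
    """Hypothetical helper: inverted index with term counts and tf-idf ranking."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term count}
        self.doc_ids = set()

    def add(self, doc_id, text):
        self.doc_ids.add(doc_id)
        for term, count in Counter(tokenize(text)).items():
            self.postings[term][doc_id] = count

    def rank(self, query):
        """Return (doc_id, score) pairs, highest score first."""
        n_docs = len(self.doc_ids)
        scores = defaultdict(float)
        for term in set(tokenize(query)):
            docs = self.postings.get(term)
            if not docs:
                continue  # term appears in no document
            # Rarer terms get a larger weight (smoothed inverse document frequency).
            idf = math.log(1 + n_docs / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        # Sort by descending score, breaking ties by doc id for stable output.
        return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


index = ScoredIndex()
index.add("a", "redis redis sets")
index.add("b", "lucene index")
index.add("c", "redis index")
for doc_id, score in index.rank("redis"):
    print(doc_id, round(score, 3))
```

Because the scoring lives in your own code, extra signals beyond text (the "other data" mentioned above) could be folded into the score at the point where `tf * idf` is added.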

The bigger commercial "databases" (NoSQL or not) all have their own baked-in approaches to dealing with things, so it's more a question of whether they fit your needs, given what they implement under the hood and how.

So, without knowing a lot more about what you're trying to do, it's hard to say much more.

Have fun,
John
