creating an inverted index

jenna_s

Apr 19, 2012, 3:39:55 AM
to Redis DB
Hi,

I'm new to NoSQL and Redis, but I think that's what I need for my new
project. I need to implement an inverted index of English words to do
information retrieval. I can't use an off-the-shelf product (like Solr,
Lucene, etc.) since my algorithm has some twists.

The only thing that worries me about Redis is that my data set is
quite large. First of all, just indexing all the words in the English
language takes somewhere around 6MB for 1,000,000 entries (7
letters per word on average). These are just the keys. I estimate
that on average, each word will appear in 30 documents (the
values), probably putting the whole thing at over 120MB. I'm guessing
that Redis should be able to take on this much, right? What if I
extend to other countries? What's too big of a dataset for Redis? At
what point should I be afraid? What solutions are there to scale?
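
The estimate above can be reproduced with a few lines of arithmetic. This is
only a back-of-envelope sketch: the 4-byte document ID is an assumption (the
post does not give a size), and it counts raw payload only, ignoring the
per-key and per-element overhead that Redis adds on top.

```python
# Back-of-envelope size estimate for the inverted index described above.
# Assumption (not in the post): each document ID is stored as a 4-byte integer.
NUM_WORDS = 1_000_000
AVG_WORD_BYTES = 7        # about 7 letters per word, 1 byte each
AVG_DOCS_PER_WORD = 30
DOC_ID_BYTES = 4          # assumed


def estimate_bytes(num_words, avg_word_bytes, docs_per_word, doc_id_bytes):
    """Return (key_bytes, value_bytes) for the raw payload only."""
    key_bytes = num_words * avg_word_bytes
    value_bytes = num_words * docs_per_word * doc_id_bytes
    return key_bytes, value_bytes


keys, values = estimate_bytes(
    NUM_WORDS, AVG_WORD_BYTES, AVG_DOCS_PER_WORD, DOC_ID_BYTES
)
print(f"keys:   {keys / 1e6:.0f} MB")
print(f"values: {values / 1e6:.0f} MB")
print(f"total:  {(keys + values) / 1e6:.0f} MB")
```

Under these assumptions the raw payload lands in the same range as the rough
figures above; real memory use in Redis will be higher because of its
per-entry bookkeeping.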

Thanks,
Jenna

Didier Spezia

Apr 19, 2012, 5:31:56 PM
to redi...@googlegroups.com
Hi,

The constraint is that all the memory consumed by Redis must fit in RAM.
If you have enough RAM, Redis can handle several GB without any issue.
My personal record is 24 GB, but I know some people have gone further.
120 MB is hardly a problem.

Regards,
Didier.

Arnaud GRANAL

Apr 19, 2012, 5:42:43 PM
to redi...@googlegroups.com
On Fri, Apr 20, 2012 at 12:31 AM, Didier Spezia <didi...@gmail.com> wrote:
> Hi,
>
> the constraint is that all the memory consumed by Redis should fit in RAM.
> If you have enough RAM, Redis can handle several GBs without any issue.
> My personal record is 24 GB, but I know some people went further.
> 120 MB are hardly a problem.
>

I can confirm instances are working with 80GB+ but I don't recommend
running this way because of the slow fork() that will occur during
background saving and the long time the instance takes to boot up or
synchronize.
Also, if you have a big instance, you are likely to have a lot of
clients. Because of the single-threaded nature of Redis, this is not a
good idea.

As a more general recommendation, if you have "big" needs (>= 20 GB)
you should shard your data across more than one Redis instance
(run 64 Redis instances, for example).

120MB won't be a problem but sharding might be good if you think your
dataset or traffic will grow a lot.
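
To make the sharding idea concrete, here is a minimal sketch of client-side
routing: each key is hashed to pick one of several instances. The shard list,
ports, and function names are hypothetical, and plain modulo hashing is only
one of several possible schemes, not something prescribed in this thread.

```python
import zlib

# Hypothetical shard addresses; each would be a separate Redis instance.
SHARDS = [("127.0.0.1", 6379 + i) for i in range(4)]


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (CRC32).

    Python's built-in hash() is randomized per process for strings,
    so it is not suitable for routing that must stay stable.
    """
    return zlib.crc32(key.encode("utf-8")) % num_shards


def address_for(key: str):
    """Return the (host, port) of the instance that owns this key."""
    return SHARDS[shard_for(key, len(SHARDS))]


for word in ("redis", "index", "inverted"):
    print(word, "->", address_for(word))
```

Note that with modulo hashing, changing the number of shards remaps most
keys, so it is worth choosing the shard count with future growth in mind.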

A.

Pierre Chapuis

Apr 20, 2012, 6:11:06 AM
to redi...@googlegroups.com
Hello Jenna,

At Moodstocks we have been using Redis as an inverted index for more than a year, with a lot more data than that (our documents are images...), so I can assure you it works. Our algorithm is a variant of TF-IDF scoring.

I gave a short presentation on the topic at the last Open Source Developers Conference in France; you can find the slides here: http://files.catwell.info/presentations/2011-osdcfr-redis-iidx/

Two key pieces of advice:

1) Use Scripting. As you can see in the slide deck, I was asking for it before it existed, precisely because I had to maintain a fork of Redis with commands for inverted index search.

2) Make a word <-> unique integer map, and store the integers and not the words themselves in your structures. This will save a lot of RAM.
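
The second piece of advice can be sketched as follows. This is an in-memory
stand-in written in plain Python, not Pierre's actual implementation; the
class and function names are hypothetical. In Redis, one possible mapping
(an assumption) would be a counter incremented with INCR to allocate IDs and
a pair of hashes (HSET/HGET) for the two directions of the map.

```python
class WordIds:
    """In-memory stand-in for a word <-> unique integer map."""

    def __init__(self):
        self._id_by_word = {}   # word -> integer ID
        self._word_by_id = []   # integer ID -> word (index is the ID)

    def get_id(self, word):
        """Return the ID for a word, allocating a new one if needed."""
        word_id = self._id_by_word.get(word)
        if word_id is None:
            word_id = len(self._word_by_id)
            self._id_by_word[word] = word_id
            self._word_by_id.append(word)
        return word_id

    def get_word(self, word_id):
        """Return the word for a previously allocated ID."""
        return self._word_by_id[word_id]


ids = WordIds()
postings = {}  # word ID -> set of document IDs


def index_document(doc_id, text):
    """Add every word of the text to the postings, keyed by integer ID."""
    for word in text.lower().split():
        postings.setdefault(ids.get_id(word), set()).add(doc_id)


index_document(1, "redis is fast")
index_document(2, "redis is small")
print(sorted(postings[ids.get_id("redis")]))
```

The postings structures then only ever contain small integers, and the words
themselves are stored once in the map.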

To scale with the size of the index, favor natural sharding (e.g. if you will know the language of the document you are looking for, shard on language). Otherwise, you can store different sets of documents in different Redis instances, get a scored shortlist for your query from each instance (larger than what you need in the end), and finally merge the shortlists by sorting on score.
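
The merge step described above might look like the sketch below. It assumes
each instance returns a list of (document ID, score) pairs, that no document
lives on more than one instance, and that scores are comparable across
instances; the function name and sample data are made up for illustration.

```python
import heapq


def merge_shortlists(shortlists, k):
    """Merge per-instance (doc_id, score) lists and keep the k best overall."""
    all_hits = (hit for shortlist in shortlists for hit in shortlist)
    return heapq.nlargest(k, all_hits, key=lambda hit: hit[1])


# Made-up shortlists, as if returned by two separate instances.
shard_a = [("doc3", 0.91), ("doc7", 0.40)]
shard_b = [("doc12", 0.88), ("doc5", 0.75)]

top = merge_shortlists([shard_a, shard_b], k=3)
print(top)
```

Asking each instance for more results than are finally needed, as suggested
above, reduces the chance that a good document is cut off before the merge.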

-- 
Pierre Chapuis

jenna_s

Apr 20, 2012, 10:49:37 AM
to Redis DB
Hi,

This is great advice. Thanks everyone.

Pierre - I'm reading your presentation.

Thanks!
Jen

hymloth

Apr 20, 2012, 1:28:01 PM
to Redis DB
Hi,

If you can read Python code, I have written a small search engine
that uses Redis to hold the inverted index.
See http://github.com/hymloth/pyredise

cheers