Efficient Faceting


davet

Apr 26, 2011, 4:53:39 AM
to Redis DB
I wonder if someone could help me. I can't outline the actual problem
since I'm working under an NDA, so I'll use a dating site as an example.
I have:

a) A hash of users. The key is "user:xx" where "xx" is the user ID. Each
hash contains username, firstname, surname, gender, age and email.

I want to facet the users on:

a) City
b) Interests e.g. food & drink, cinema, sports, travel etc.

My implementation so far is as follows:

a) City - a set per city with user IDs as members.
b) Interests - a set per interest type with user IDs as members.
c) Gender - a set per gender with user IDs as members.

So to get all women in New York interested in cinema I perform an
intersection on the female, New York and cinema sets to get the user
IDs, then make another call to retrieve the user details from the hashes.
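
For concreteness, here is a minimal sketch of that layout in Python with redis-py (the key names follow the example above; the sample user, field values, and the localhost connection are made up):

import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# a) one hash per user (mapping= needs redis-py 3.5+)
r.hset("user:1", mapping={
    "username": "jane", "firstname": "Jane", "surname": "Doe",
    "gender": "f", "age": 31, "email": "jane@example.com"})

# b)-c) one set per facet value; members are user IDs
r.sadd("gender:f", 1)
r.sadd("city:new_york", 1)
r.sadd("interest:cinema", 1)

# women in New York interested in cinema:
# intersect the facet sets, then fetch each matching user hash
ids = r.sinter("gender:f", "city:new_york", "interest:cinema")
users = [r.hgetall("user:%s" % uid.decode()) for uid in ids]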

With the following settings:

hash-max-zipmap-entries 512
hash-max-zipmap-value 512
list-max-ziplist-entries 512
list-max-ziplist-value 64

My test uses ~30MB of memory for 40k users, 2 gender sets (20k entries per
set), 1 city set (20k entries) and 5 interest sets (again each with
20k entries). In reality my datasets are going to be considerably
larger than this - for example the "interest" sets will have millions
of entries and will continue to grow.

Can anyone advise whether this approach is in line with how Redis would
typically be used to address this kind of problem? For example, I've
wondered whether it would be better to partition the interest sets so
they're city-specific, e.g. "ny:cinema", to keep the overall set size down.

Thanks for any help,

Dvir Volk

Apr 26, 2011, 7:54:19 AM
to redi...@googlegroups.com
Basically, you're doing something similar to what could be described in SQL table terms as:

KEY(city,interest)
or on the other hand
KEY(city),
KEY(interest)

It really depends on your data pattern - many interests with few users each, or a few interests with many users each.
The first approach will likely take up more memory, especially if each user has many interests (the city gets replicated many, many times).
But think not only about memory consumption - think about performance, too. With the second approach you have to intersect sets.
With the first approach you'll get results much faster when you have many records per set, as Redis will not need to do any intersecting.

People who use Redis usually prefer, in very broad terms, performance over memory consumption - that's Redis' point in the first place :)





Didier Spezia

Apr 26, 2011, 12:47:25 PM
to Redis DB

The performance of SINTER or SINTERSTORE depends on the
cardinality of your smallest set. For these operations,
Redis sorts the sets by cardinality, then scans the smallest
set, trying to match each of its items against the other sets
(in the optimal order).

The problem is when you do not have a set significantly
smaller than the other ones.

I have tried an example with 350000 users:

cardinality gender = 175000 (2 genders)
cardinality city = 58000 (6 cities)
cardinality interest = 127000 (5 interest domains,
each user has up to 2 of them)

doing:

sinterstore tmp city:NCE gender:F interest:NATURE
sort tmp by nosort limit 0 100 get *
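
(For anyone driving this from a client, a rough redis-py equivalent is sketched below. The GET * form assumes the user records are plain string keys; with the hash layout from the original post you would pull fields with the *->field form instead - the field names here are assumptions.)

import redis
r = redis.Redis()

# SINTERSTORE tmp city:NCE gender:F interest:NATURE
r.sinterstore("tmp", ["city:NCE", "gender:F", "interest:NATURE"])

# SORT tmp BY nosort LIMIT 0 100 GET *  (for string-valued user keys), or
# pull hash fields explicitly when users are stored as hashes:
page = r.sort("tmp", start=0, num=100, by="nosort",
              get=["#", "*->username", "*->age"])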

My oldish PC (a 2 GHz Athlon X2) cannot sustain more than 18 TPS
of these commands. Do not expect more than 3 or 4 times this
figure on recent server hardware. If you multiply the volume
of the smallest set by 10, you can divide the throughput by the
same value.

You may want to isolate this kind of query on a slave in order to
keep your master responsive.

Now, if you denormalize the criteria, you can expect the cardinalities
to be much smaller and the performance much higher, but you actually
lose some flexibility.

For instance, if you keep only one kind of sets aggregating
gender,city,interest, the same query is covered by the following
command:

sort idx:F:NCE:NATURE by nosort limit 0 100 get *
(no intersection needed anymore)

This query is much faster (my hardware now sustains 250 TPS
rather than 18 TPS). However, you lose the possibility to
query on a subset of the criteria (you need all three of
them, systematically). Also, it only works if the cardinality
of the cartesian product of the criteria is not too high
(here it would be 2*6*5 = 60 -> no problem).
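
(A sketch of the write path for this fully denormalized layout, assuming the idx:gender:city:interest key shape above; the helper name is hypothetical.)

import redis
r = redis.Redis()

def index_user(uid, gender, city, interests):
    # hypothetical helper: add the user to one composite set per interest
    pipe = r.pipeline()
    for interest in interests:
        pipe.sadd("idx:%s:%s:%s" % (gender, city, interest), uid)
    pipe.execute()

index_user(42, "F", "NCE", ["NATURE", "CINEMA"])
# query side, as above: sort idx:F:NCE:NATURE by nosort limit 0 100 get *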

An intermediate solution is to denormalize only a subset of
the criteria, the purpose being to decrease the cardinality
of the smallest set. For instance, if I build a set on
gender+city while keeping the interest sets unchanged,
I instantly double my 18 TPS throughput, trading some
flexibility for performance.
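
(Sketched below under the same assumptions: one combined gender+city set plus the original interest sets.)

import redis
r = redis.Redis()

# write path: one set per (gender, city); interest sets stay as they are
r.sadd("idx:F:NCE", 42)
r.sadd("interest:NATURE", 42)

# query: intersect the smaller combined set with a single interest set
r.sinterstore("tmp", ["idx:F:NCE", "interest:NATURE"])
page = r.sort("tmp", start=0, num=100, by="nosort", get="#")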

Hope this helps ...

Regards,
Didier.

davet

Apr 26, 2011, 4:26:09 PM
to Redis DB
Thanks to both of you. Those are extremely helpful, insightful
answers.

Pierre Chapuis

Apr 27, 2011, 10:00:20 AM
to redi...@googlegroups.com
Something else you should take into account if you
do any kind of memory consumption measurement:
the size of a Redis set in memory follows these
equations:

s = A + n*B if n <= N
s = A + n*C if n > N

where:

* n is the number of elements in the set;
* A is a constant (set creation overhead);
* B and C are constants with C > B;
* N is a constant that corresponds to set-max-intset-entries
in the configuration file (512 by default).

That means that if you store your data in sets, your memory
usage will grow a lot once you go past 512 elements per set.


By the way, it is the same thing with hashes. I have a use
case where I store data in ~1M hashes and my first memory
usage estimation was much lower than reality because of that
effect.
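
(If you want to see the effect directly, the sketch below checks the set encoding and the server's used_memory as a set crosses the 512-element threshold. The key name is made up; note that the compact intset encoding only applies while all members are integers, which is the case for numeric user IDs.)

import redis
r = redis.Redis()

def used_memory():
    return r.info("memory")["used_memory"]

r.delete("s")
before = used_memory()

r.sadd("s", *range(512))                 # exactly set-max-intset-entries
print(r.object("encoding", "s"), used_memory() - before)  # intset, small

r.sadd("s", 512)                         # one element past the threshold
print(r.object("encoding", "s"), used_memory() - before)  # hashtable, much larger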

--
Pierre Chapuis

davet

Apr 28, 2011, 6:01:14 AM
to Redis DB
Thanks Pierre,

I am quite concerned about the memory requirements of my
implementation. Getting data into Redis is very easy; optimising how
the data is stored to reduce memory consumption seems to be a bit of
a black art.

I really wanted to use Redis for the facet implementation just to
reduce the number of technologies involved, but I've started to look at
Solr as an alternative. I currently have 20 facets; in Redis each
sorted set of 30k elements uses ~2MB, for ~40MB in total. Crudely
implementing the facets in Solr for 30k documents consumes ~18MB.
Solr memory usage seems to scale linearly, as it does for Redis (I
haven't validated this yet). I appreciate there's a performance
trade-off, but a ~50% difference in memory usage can't really be ignored.

As I understand it, the next version of Redis will have memory
optimisations for small sorted sets, but it doesn't sound like that
will help my use case.

Thanks again,

Dvir Volk

Apr 28, 2011, 6:07:49 AM
to redi...@googlegroups.com
I really think you should expect a different order of magnitude of RAM consumption if you're going to use Redis for stuff like this.
I consider a <100MB database to be almost non-existent - I have databases of up to 10GB of RAM :)
Simply put - the performance boost is worth it.

davet

Apr 28, 2011, 6:40:56 AM
to Redis DB
The figures I'm using here are just to illustrate the problem with my
pretend dating site. Since memory use seems to increase linearly it
didn't really seem to matter what figures I used to describe the
nature of my task.

The scale of the real problem is much larger. Each hash represents a
result set. Each result set requires faceting and therefore needs the
20 sorted sets. In my simple example 20 sets use ~40MB, so 10 result
sets would require ~400MB of facet data, and so on. The reality is I'm
going to have thousands of result sets.

It's a certainty I'm going to end up distributing the data across
multiple nodes - nevertheless I want to squeeze as much out of one
node as possible.

Dvir Volk

Apr 28, 2011, 8:56:11 AM
to redi...@googlegroups.com
How did you end up implementing the indexing?
Perhaps this can be further optimized.

davet

Apr 28, 2011, 1:58:49 PM
to Redis DB
The set key is "xx:id" where "xx" is a two-character identifier for
the type of facet and "id" is the key (an incrementing integer) of the
hash. The members of the set are the fields (again incrementing
integers) within that hash.

To illustrate:

1) A result set hash has the key "12" and has 3 entries: 1->data,
2->data, 3->data.
2) A single facet set would be "dt:12" (where "dt" identifies the
facet type and "12" is the hash key) and could have members 1, 2 and 3
(i.e. they map to the fields within the hash).

Then of course I have 20 sorted sets for the hash, with each set
providing a different facet on the main result hash.

The result sets will be created and torn down over time, which is one
of the reasons I've partitioned the data like this.
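
For what it's worth, here is a minimal sketch of that layout (the "dt" prefix and hash key "12" are taken from the example above, everything else is made up; shown with plain sets - swap in ZADD/ZRANGE if these are really sorted sets):

import redis
r = redis.Redis()

# result set hash "12" with incrementing integer fields (mapping= needs redis-py 3.5+)
r.hset("12", mapping={1: "data-a", 2: "data-b", 3: "data-c"})

# facet set "dt:12": members map to the field ids of hash "12"
r.sadd("dt:12", 1, 2, 3)

# fetch the entries of result set "12" matching facet "dt"
fields = sorted(int(f) for f in r.smembers("dt:12"))
values = r.hmget("12", fields)

# tear down a result set: delete the hash and its facet sets
facet_prefixes = ["dt"]   # ...plus the other facet prefixes
r.delete("12", *("%s:12" % p for p in facet_prefixes))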

Thanks for your help with this.