a search/modeling question

54 views
Skip to first unread message

fschwiet

unread,
May 16, 2011, 2:12:48 PM5/16/11
to ravendb
Consider the following scenario:
A site has 1-2 million products
Users can keep a list of their favorite products. 100-200 million
favorites have been created.
A user may have 0 to several thousand favorites. Typically users
have at most a few hundred favorites.
The goal is to let users search their favorite products. (Users
also need to search products, but that is not so challenging)

How to let a user search their favorite products?

1) If each favorite is completely denormalized (each favorite
includes the document), searching favorites is pretty
straightforward. However we then have 125 million documents with
mostly redundant data. Updates to the products will take a lot of
work to propagate out. Most of those normalized documents will never
be searched.

2) We could do a search on products where the query includes a list
of the user's favorite product IDs. This works for most cases, except
that Lucene has a limit to the number of terms in a query (1024).
Unfortunately then the most active users won't be able to search all
their favorites.

3) Maybe I could use a custom tokenizer during query generation.
The incoming token would be the userId. The custom tokenizer would
then do a subquery to get all the users favorites, producing a list of
product IDs. The query can then match product IDs on the users
favorites, while working around Lucene's 1024 term limit. This
solution would require the query to be able to specify a custom
tokenizer, which is not something RavenDB currently supports. This
solution would also get awkward when sharding. I could accomplish
same thing using a custom lucene query filter instead of a custom
tokenizer, both approaches would require more plumbing in RavenDB.

Sharding would likely be necessary, if not already then in the
future. That definitely impacts how the query would work. One
sharding model might be to spread the products across all shards, then
store favorites on the shard where the corresponding product is.
Another might be a star deployment, where the central DB stores all
products as a master store, replicating into the star's branches
(where products become readonly). Then favorites could be sharded
arbitrarily across the star's outer nodes.

Itamar Syn-Hershko

unread,
May 16, 2011, 6:19:12 PM5/16/11
to rav...@googlegroups.com
Lucene's 1024 term limit per query is there to protect you from incorrect use. You shouldn't be trying to bypass it.

Off the top of my head, here's something that may work well:

public class FavoriteProduct
{
public Guid ProductId { get; set; }
public string ProductName { get; set; }
public int[] UserIds { get; set; }
}

public class Products_ByUserFavorites : AbstractIndexCreationTask<FavoriteProduct>
{
public Products_ByUserFavorites()
{
Map = favs => from product in favs
             from userId in product.UserIds
             select new { product.ProductId, product.ProductName, userId};
}
}

The idea is having an index with fields for the product name and the user id. You query it with the user id first and then the full-text parameter for the product id, to get a performing query. You can use TransformResults/Includes too to get the actual product entity. Updates to product names will have to be performed twice - on products and on FavoriteProduct.

The only real con of this approach is the cost of saving a new favorite or deleting one, which depends on the amount of users you have. Perhaps a patch operation is going to be useful for doing that. The factor to consider is how common such operations are expected to be.

There's probably even a better way of doing that using a similar logic.

Ayende Rahien

unread,
May 17, 2011, 9:09:00 AM5/17/11
to rav...@googlegroups.com
Basically, this is a many / many operation, and it is probably easier to do this by actually adding FavoriteByUsersIds property to the Product.
That way, you only have to update a single document.
And you can limit on that very easily.

Frank Schwieterman

unread,
May 17, 2011, 1:14:21 PM5/17/11
to rav...@googlegroups.com
What Itamar and you have suggested makes sense, a
Product.UserIdsWhoFavoritedThis field to make the search work. With
currently data one the field could already have 14k entries, and much
more is possible. By "And you can limit on that very easily" I
believe you meant enforce a maximum the length of the field, let me
know if I misunderstood.

I suppose my next step is to make some of these alternatives work, and
then measure the performance. Thanks for the suggestions.

Itamar Syn-Hershko

unread,
May 17, 2011, 1:31:31 PM5/17/11
to rav...@googlegroups.com
The only reservation I have from Oren's suggestion is with the property UserIdsWhoFavoritedThis making your product entity much heavier than it should be for normal e-store operation. Storing this info outside the product entity to save on traffic and loading costs looks more reasonable to me than doing another entity update on the rare occasion of product name change (which you can probably use a trigger for).

This is also why I had a separate object in the first place.

Ayende Rahien

unread,
May 19, 2011, 9:25:27 AM5/19/11
to rav...@googlegroups.com
Hm, I didn't consider that, that is a good point. A separate document per product, then? Similar to how we are handling comments in Raccoon?

Itamar Syn-Hershko

unread,
May 19, 2011, 10:25:06 AM5/19/11
to rav...@googlegroups.com
A separate document per product like I posted above, yes, but not quite like in Racoon where there's integration between Post and PostComments.

All you need here is a simple object with the Product ID and the fields you want to index on (name, desc, tags) and the list of user IDs which you index too, so you can limit the search for one user. When update occurs, you look up this FavoriteProduct object based on the inner product ID.

And again: you want to put the condition of user ID _BEFORE_ the other conditions, that will speed up the index search.
Reply all
Reply to author
Forward
0 new messages