sorting collections in memcache

homli

Jul 24, 2008, 6:58:45 PM
to memcached
I've been using a two-phased fetch to cache arrays of items and the items' details, as described by Paul Stacey (http://lists.danga.com/pipermail/memcached/2007-July/004578.html).
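
For concreteness, here's a rough sketch of that pattern (the python-memcached client is purely an assumption for illustration, as are the key names):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def get_page(list_key, page, per_page=20):
    # phase 1: the cached array of item ids for this collection
    ids = mc.get(list_key) or []
    page_ids = ids[(page - 1) * per_page : page * per_page]
    # phase 2: multi-get the individually cached item details
    cached = mc.get_multi(['item:%d' % i for i in page_ids])
    # any misses would be backfilled from the DB in one query (omitted here)
    return [cached.get('item:%d' % i) for i in page_ids]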

Now I need to sort the items in the array in three different ways
(e.g. by id, name, or creator) and am wondering how others are handling
this. I've considered a few options...

1) Cache three different arrays of ids--one per sort order. However,
for name and creator I'd need to reselect the entire array from the DB
any time an id is added or removed (which could happen relatively
frequently, and requires a join across large tables to do the
sorting).

2) Cache name and creator along with id in the array and re-sort in
memory. Sort performance obviously degrades as the array grows.

3) Cache three different arrays as above, but include the name along
with id in one of the arrays and creator along with id in another.
Now I can pre-sort when adding new items to each array.

4) Cache three arrays as above, but only include the first few letters
of name or creator. Now I can generally pre-sort, and then multi-get
items that cross page boundaries as pages are requested to figure out
any sorting necessary beyond the first few letters (i.e. if I'm asking
for items 21-40 and both items 20 and 21 start with 'abc').

Thoughts?

dormando

Jul 25, 2008, 9:10:45 PM
to memcached
Hmm.

Guess I'd say:

- Benchmark 2) for the min, average, and max list size. It might not make a
huge enough difference.
- 3) ... sounds fragile?
- 4) won't work? The first few letters won't let you match the
full key? Maybe I'm not following.

Uhhh. Blanking on other good ideas, sorry :\
-Dormando

homli

Jul 26, 2008, 2:05:11 AM
to memcached
Thanks for your feedback, Dormando.

The idea with 4) is that I'd cache an array of key-value pairs, where
the key is the item's id and the value is the first few characters of
the item's name. So if the item's name is "Baseball Game", the value
might be "bas", "baseba", or even "baseball gam". (I guess I'd have to
test for optimal length to balance not having too many items with the
same value while keeping the cache value as small as possible.)

In that case, I could do a rough sort in memory based on the truncated
values. Now suppose my user interface lets users page through values
20 at a time. For page one, I'd need to make sure the values for the
20th and 21st item in the array aren't equal. If they are, I'd need
to include item 21's id (and 22, 23, etc. if they also have the same
value) in my multi-get for the detail data (which I'd look up by id)
and use the detailed data for final sorting before deciding which 20
items should be displayed.
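
A rough sketch of that boundary check (here `index` is the cached array of (item_id, prefix) pairs, already sorted by prefix, and get_items is a hypothetical multi-get helper returning full records; pages after the first would need the same tie check at the start of the slice, omitted for brevity):

def page_ids(index, page, per_page=20):
    start, end = (page - 1) * per_page, page * per_page
    candidates = list(index[start:end])
    # pull in items past the boundary whose truncated value ties with the last one
    while end < len(index) and candidates and index[end][1] == candidates[-1][1]:
        candidates.append(index[end])
        end += 1
    if len(candidates) <= per_page:
        return [item_id for item_id, _ in candidates]  # no ties to resolve
    # resolve the tie with the real names from the detail cache
    details = get_items([item_id for item_id, _ in candidates])  # hypothetical
    ordered = sorted(details, key=lambda item: item['name'].lower())
    return [item['id'] for item in ordered[:per_page]]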

Similar code would be used when inserting items into the array... if the
first few letters of the item don't match something already in the
array, I can add the item id and value without any extra work.
Otherwise, I'd have to multi-get any items with the same first few
letters to decide where to add the item in the array. Again, I could
balance how often the extra step is necessary by varying the length of
the values in the cache.
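
Sketched in the same vein (bisect keeps the array ordered on the cheap path; again, get_items is hypothetical and PREFIX_LEN is the tunable truncation length):

import bisect

PREFIX_LEN = 6  # tune: longer means fewer ties but bigger cached values

def insert_item(index, item_id, name):
    prefix = name.lower()[:PREFIX_LEN]
    prefixes = [p for _, p in index]   # O(n), fine for a sketch
    lo = bisect.bisect_left(prefixes, prefix)
    hi = bisect.bisect_right(prefixes, prefix)
    if lo == hi:
        index.insert(lo, (item_id, prefix))  # no matching prefix: cheap path
        return index
    # ties: multi-get the full records to find the exact position
    details = get_items([i for i, _ in index[lo:hi]] + [item_id])  # hypothetical
    ordered = sorted(details, key=lambda item: item['name'].lower())
    tied = [(item['id'], item['name'].lower()[:PREFIX_LEN]) for item in ordered]
    return index[:lo] + tied + index[hi:]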

Obviously the code is more complex here, but as long as there aren't
too many items with the same cached values, it might (or might not) be
more efficient than deleting and re-caching the entire array from SQL
every time an item is added or removed.

Anyone else tackled this before?

Clint Webb

Jul 26, 2008, 4:49:24 AM
to memc...@googlegroups.com
I had a similar quandary with one of my projects. After thinking about it for weeks and coming up with several complex solutions, I decided to take a step back from my database design and look at things differently. I soon realised that my database design was the problem and that a much simpler solution existed.

In my case, when I looked at the reason for the various sorts I was attempting to do, I realized that most of the time only one sort would ever be used. Additionally, giving the user multiple sort options only made the user interface more complicated.

In almost every instance, I have discovered that if I find myself spending a lot of time trying to figure out how to do something complicated with memcache, the case is either BEST handled by the database (and shouldn't be cached at all), or it is an indication that there are problems with my database design or user interface.

To re-iterate: I cache a lot, but I don't cache everything. There is a reason that we use a structured query language (SQL) for databases instead of just a hash object store: it gives you extra flexibility. By trying to fit everything into memcache, you are limiting what you can do with your data.

If you are having database performance issues and you think this particular sorting problem is the primary cause, then I would first suggest that you examine your indexes and make sure they are tuned for the query (I assume you are doing joins). If your indexes are set up correctly, a join should not be very expensive just to return an ordered list (assuming you have your composite objects in cache, and all the query needs to return is ids).

Otherwise I would let the database handle this instead.
--
"Be excellent to each other"

homli

Jul 26, 2008, 7:00:13 PM
to memcached
Thanks for the feedback. I've definitely had to step back from my
design on several occasions. Unfortunately, in this case the limited
sorting is pretty critical to the user experience.

However, I realize I can simplify my question a bit... For those who
are using a two-phased approach to cache collections of items (cache
an array of ids, then multi_get a slice of the array for each "page"
of items, each of which is cached individually), how are you handling
sorting when the sort order isn't as simple as pushing new items onto
the array?

mike

Jul 26, 2008, 7:12:33 PM
to memc...@googlegroups.com
On 7/26/08, homli <hans...@gmail.com> wrote:
>
> Thanks for the feedback. I've definitely had to step back from my
> design on several occasions. Unfortunately, in this case the limited
> sorting is pretty critical to the user experience.
>
> However, I realize I can simplify my question a bit... For those who
> are using a two-phased approach to cache collections of items (cache
> an array of ids, then multi_get a slice of the array for each "page"
> of items, each of which is cached individually), how are you handling
> sorting when the sort order isn't as simple as pushing new items onto
> the array?

I haven't done this in production on any decent-sized site, but our
approach, until we had to think of a new way to scale it, would be:

phase 1: select just the IDs from the database. Do limiting, ordering,
filtering, sorting, whatever in that query.

phase 2: issue a multi_get that hits the cache first; if any items are
missing, grab them from the db (in a single query) and then add them to
the cache for next time.
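
That might look something like the following, where db.query and db.fetch_items are hypothetical stand-ins for whatever data layer you use, and the SQL is purely illustrative:

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def get_page(db, page, per_page=20):
    # phase 1: the database does the join/filter/sort and returns only ids
    rows = db.query(
        "SELECT i.id FROM items i JOIN users u ON u.id = i.creator_id"
        " ORDER BY u.name LIMIT %s OFFSET %s",
        (per_page, (page - 1) * per_page))
    ids = [r[0] for r in rows]
    # phase 2: cache-first multi-get, backfilling misses in a single query
    cached = mc.get_multi(['item:%d' % i for i in ids])
    missing = [i for i in ids if 'item:%d' % i not in cached]
    if missing:
        for item in db.fetch_items(missing):  # one query for all misses
            cached['item:%d' % item['id']] = item
            mc.set('item:%d' % item['id'], item)
    return [cached['item:%d' % i] for i in ids]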

In practice it worked great. How well it would scale, not sure. It was
just our first run at it.

Henrik Schröder

Jul 27, 2008, 6:19:43 AM
to memc...@googlegroups.com
In our product we do a bit of everything with regards to what you describe.

We have simple arrays of items stored where we have one object in memcache for every permutation, i.e. underlying data, sort order and slice. Whenever something in the underlying data changes, we invalidate all cached objects that depend on it.
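
Henrik doesn't say how they find the dependent keys; one common way to get "invalidate everything that depends on it" with plain memcached (an assumption here, not necessarily what their product does) is to fold a generation counter into the permutation keys, so bumping the counter orphans every dependent object at once (mc is a connected client, as in the earlier sketches):

def list_key(sort_order, page):
    gen = mc.get('items:gen') or 1
    return 'items:gen%d:sort=%s:page=%d' % (gen, sort_order, page)

def invalidate_lists():
    # incr returns None when the counter doesn't exist yet
    if mc.incr('items:gen') is None:
        mc.set('items:gen', 2)

The orphaned objects are never deleted explicitly; they simply age out of the LRU.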

We also have more expensive objects that we store individually, and sometimes access as arrays. We store the entire sorted array in memcached, and when we need slices of it, we get the entire array, slice out the ids we want, and then do a second multi-get to "fill" it with the actual items. The only tricky thing there is that you might get holes in your array, and you need to fill those in afterwards from your data store, and put them back in the cache as well. If the underlying data changes, I think we usually just invalidate the entire array. We don't have any that are difficult to re-calculate.

There's a few places where we typically just add an item to the array, and in that case we update the memcached object instead of invalidating it.

If I understood your case correctly, your arrays are expensive to re-calculate from scratch, so you want to avoid that? I would store each sorted array in full, so you always do the slicing in your app. And if you have a small amount of sort orders, if your underlying data changes, you could get all the arrays, do the change in memory, and put them back in memcached. This assumes that you *can* do the changes in your application. If you're dependent on your database to do the sort orders correctly, well, then you *have to* invalidate all arrays and re-calculate them on every change.
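
A sketch of that read-modify-write, assuming the cached arrays hold (sort value, id) pairs so the application can place a new item itself (key names are illustrative, and a plain get/set like this races with concurrent writers, which CAS or a lock would address):

import bisect

ORDERS = {'by_id': 'id', 'by_name': 'name', 'by_creator': 'creator'}

def add_item(item):
    mc.set('item:%d' % item['id'], item)
    arrays = mc.get_multi(['list:%s' % o for o in ORDERS])
    for order, field in ORDERS.items():
        key = 'list:%s' % order
        pairs = arrays.get(key)
        if pairs is None:
            continue  # this order isn't cached right now; rebuild on next read
        bisect.insort(pairs, (item[field], item['id']))
        mc.set(key, pairs)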

Getting a few arrays from memcached, updating them, and storing them back takes some time. You simply have to benchmark a bit and see if this is cheaper than invalidating and re-calculating from scratch; there's no way to say that one case is always faster than the other, it always depends on your application and your domain.


/Henrik Schröder

homli

Jul 29, 2008, 5:45:28 PM
to memcached
Thanks!