This post was just AWESOME!
Many thanks for time spent and so much detail.
Congratulations, really excellent work.
Thanks
Art
Please correct me if I am wrong, but your description WRT subspaces (and an implementation thereof) seems to be equivalent to the following:

1. Whenever you cache a row with column values (c1, c2, c3, ...) from table X, add references to indexes for each of the columns of the data you are caching (there is an index for X.c1, X.c2, etc., as well as for table Y with columns c10, c11... as Y.c10, Y.c11, ...).
2. When you need to invalidate your cache for a given class of rows (you use song_id=3, author_id=7), you can look up those cached entries in your indexes, find the intersecting subset, and discard those cache entries.
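In rough code, that is (a minimal in-memory sketch with hypothetical names; in practice the per-column indexes could just as well be Redis sets):

    # A minimal in-memory sketch of the two steps above (all names
    # hypothetical; the per-column indexes could be Redis sets instead).
    cache = {}   # cache key -> cached row
    index = {}   # (column, value) -> set of cache keys referencing it

    def cache_row(key, row):
        """Step 1: cache a row and register it under every (column, value)."""
        cache[key] = row
        for column, value in row.items():
            index.setdefault((column, value), set()).add(key)

    def invalidate(**constraints):
        """Step 2: drop every cached entry matching all given constraints."""
        sets = [index.get(item, set()) for item in constraints.items()]
        matching = set.intersection(*sets) if sets else set()
        for key in matching:
            cache.pop(key, None)

    cache_row("songs:3", {"song_id": 3, "author_id": 7})
    invalidate(song_id=3, author_id=7)   # discards "songs:3"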
Is it novel? I don't know, I'm not terribly well versed in the realm of cache invalidation algorithms. Though elements of your method (in particular your use of '*' and '?') are reminiscent of the Linda language's tuple spaces, and how they are thought of as just vectors to match patterns against and execute upon.
Hilariously enough, by thinking of it in terms of tuple spaces (and having one available), you can get a distributed cache manager for free!
Regards,
- Josiah
P.S. I'm not a regular contributor to memcached, but your post was just linked from the Redis list, and I thought I'd come over to say hello.
On Saturday, April 28, 2012 at 05:33:15 UTC+2, Josiah Carlson wrote:

> Please correct me if I am wrong, but your description WRT subspaces (and an implementation thereof) seems to be equivalent to the following:
> 1. Whenever you cache a row with column values (c1, c2, c3, ...) from table X, add references to indexes for each of the columns of the data you are caching (there is an index for X.c1, X.c2, etc., as well as for table Y with columns c10, c11... as Y.c10, Y.c11, ...).
> 2. When you need to invalidate your cache for a given class of rows (you use song_id=3, author_id=7), you can look up those cached entries in your indexes, find the intersecting subset, and discard those cache entries.

I'm not sure if I understand your proposition, but it sounds like you are only caching individual rows, and not arbitrary query results. That makes sense if, in your application, you always SELECT single rows by specifying equality constraints for all columns. But even then, what my algorithm would do is quite different from your description.
You mention "indexes", but I am not sure if I understand correctly what kind of indexes they are. From your description it seems that you expect them to be very fast for both updates and selects (perhaps something quite trivial to achieve with Redis, but not so obvious to achieve with MySQL or memcached).
Anyway, if you just mean something that can answer a query "find all rows which match song_id=3 and author_id=7", then it might end up being either:

a) a slow, but reliable and persistent, SQL-like index (which doesn't seem to improve the performance of anything), or
b) a fast, but unreliable, RAM-only list.

Perhaps I am missing something, but if you had something which is both reliable and fast, you would actually have solved the original problem of providing fast answers to any query, as a two-stage process sketched below: find the matching row ids, then fetch their content. If you can do that quickly, then you're done.
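To be concrete, by "two-stage process" I mean nothing more than this hypothetical sketch (index_lookup and fetch_row are stand-ins for whatever the fast, reliable index and the row store would be):

    # Hypothetical sketch: given an index that answers constraint queries
    # both quickly and reliably, any query becomes a two-stage process.
    def answer_query(constraints, index_lookup, fetch_row):
        row_ids = index_lookup(constraints)          # stage 1: matching ids
        return [fetch_row(rid) for rid in row_ids]   # stage 2: row content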
I think that my solution is trying to address a different problem: how to cache queries more complicated than single-row retrieval, in the presence of queries modifying multiple rows at once.
> Is it novel? I don't know, I'm not terribly well versed in the realm of cache invalidation algorithms. Though elements of your method (in particular your use of '*' and '?') are reminiscent of the Linda language's tuple spaces, and how they are thought of as just vectors to match patterns against and execute upon.

Thanks for the hint about the Linda language.

> Hilariously enough, by thinking of it in terms of tuple spaces (and having one available), you can get a distributed cache manager for free!

I'm not sure if I understand this. Could you elaborate?
So let me give you an example, so you can show how your idea works on a real-world case. Suppose you have a web page where you can view all songs of a given author, sorted by song_id, with a pagination size of 10, 20, or 50 results per page, or any other value (say, 42 results) if the user wishes.
Therefore the database query needed is something like:

    SELECT song_id FROM songs WHERE author_id = @A ORDER BY song_id ASC LIMIT @L OFFSET @O;
Since a user can choose many values (10, 20, 50, etc.) for the page size @L, and any natural number for the offset @O, you will get a huge number of different queries all referencing the same row with author_id=7, song_id=3. Let's say there are about 100 different queries which relate to this row. Here are 2 examples of such queries:

    SELECT song_id FROM songs WHERE author_id = 7 ORDER BY song_id ASC LIMIT 2 OFFSET 8;
    SELECT song_id FROM songs WHERE author_id = 7 ORDER BY song_id ASC LIMIT 2 OFFSET 9;

You suggested computing a digest for each such query actually performed, and storing it in the cache along with the value. Then, for each hash and each row in the result set, you want (if I understand) to save a reference from that row to that hash in a separate persistent data structure. At this point you have a persistent data structure with 100 times more entries than the original database.
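To make sure we mean the same bookkeeping, here is a minimal sketch of how I understand your proposal (all names hypothetical; a plain dict stands in for the persistent index):

    import hashlib

    query_cache = {}    # query hash -> cached result
    row_to_hashes = {}  # (author_id, song_id) -> set of query hashes

    def cache_query(sql, result_rows):
        """Cache a query result under its digest and record, for every
        row in the result, a reference from that row back to the hash."""
        h = hashlib.sha1(sql.encode()).hexdigest()
        query_cache[h] = result_rows
        for row in result_rows:
            row_to_hashes.setdefault(row, set()).add(h)

    # 100 pagination variants all end up referencing the row (7, 3):
    for offset in range(100):
        sql = ("SELECT song_id FROM songs WHERE author_id = 7 "
               "ORDER BY song_id ASC LIMIT 2 OFFSET %d;" % offset)
        cache_query(sql, [(7, 3)])

    print(len(row_to_hashes[(7, 3)]))   # 100 hashes for this single row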
Now, if you remove the row author_id=7, song_id=3 from the database (or add it, or modify it), you have to fetch these 100 hashes from the index and invalidate them. This is a correct solution, but not as effective as mine, as it must perform 100 operations, and requires 100 times more storage than necessary.

But that's not the end of the problems. Suppose you have a query like this:

    DELETE FROM songs WHERE author_id = 7;

Suppose that there are 1000 songs by this author. In order to perform a proper invalidation you need to first look up all rows affected by this query, and then, for each such row, find all hashes related to it. Of course, given a reasonable index structure you do not need a JOIN for that, but rather something like "SELECT DISTINCT(hash) FROM index WHERE author_id = 7"; however, this may still require scanning 100*1000 entries in the index, as each of the 1000 deleted songs could be referenced by 100 queries. Did I understand you correctly?
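In code, the invalidation path I am describing above would be roughly this (continuing the same hypothetical sketch):

    def invalidate_author(author_id):
        """Invalidate every cached query touching any row of this author.
        With 1000 rows and ~100 hashes per row this walks ~100,000 index
        entries, even if many of the hashes overlap."""
        rows = [r for r in row_to_hashes if r[0] == author_id]
        for row in rows:
            for h in row_to_hashes.pop(row):
                query_cache.pop(h, None)

    invalidate_author(7)   # after: DELETE FROM songs WHERE author_id = 7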
And, by the way, at nk.pl we have around 500 database servers, and I know what sharding is. Therefore a solution which does not require a central "index" larger than the sharded database itself, such as mine, makes more sense to me.
Awesome!
Thanks a million, will read the paper right after my holiday.
Thanks a lot for sharing.
Art