Algorithm for automatic cache invalidation

Showing 1-19 of 19 messages
Algorithm for automatic cache invalidation Jakub Łopuszański 4/27/12 3:14 PM
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton

Hi, I’d like to share with you an algorithm for cache invalidation that I’ve came up with, and successfully implemented in a real world application. I am a software architect at nk.pl, a Polish social network successfully competing with FB, and a founder and backend architect of Vanisoft, a startup in which I had more freedom to experiment with a different architecture.

Perhaps my idea is not original, but I could not find a description of similar concept anywhere, so even if this is reinventing the wheel, maybe it will help propagate the knowledge and save somebody else’s time. When I was designing the backend for Vanisoft’s project ZnaszMnie (currently over 400k players) all I could find in the internet about ways to cache database queries results was either wrong (such as no cache invalidation at all), or ineffective (trying to keep of a list of all cached results and invalidating them one by one).

I assume you have more than one front-end machines, some databases and cache which are reachable from frontends. Perhaps you have local cache on each front-end machine, which is really fast, but inaccessible from other machines (an thus subject to cache invalidation problems). If so, then my approach can correctly provide cached answer to a query by using one round-trip to global cache, and one to local (the cost of later is usually neglible), and never serve stale data (and the last part is the most important for me).

I assume, that even if your database is not SQL, you could think about your queries as expressible in SQL, using SELECT, UPDATE, or DELETE. This is needed only to understand the presentation of the algorithm, which could be easily adapted to many other cases.

To keep things simple let us consider a binary relation (a table with two columns, or a set of points in 2-dimensional space, if you prefer). The algorithm works for any number of dimensions, but it is easier to grasp for 2-D case.

So let’s think about about a table `songs` with columns “song_id” and “author_id”, where obviously some songs may have several authors and vice-versa. Again, for simplicity assume, that these are integers, while the algorithm works for any (serializable) data type.

For each query, I’d like to introduce a concept of “subspace” on which it operates. If you choose the interpretation of binary relation as a set of points in 2-dimensional space, then subspace, is what you get by fixing some of dimensions.
Some examples -- a list of queries with their subspaces:
INSERT INTO songs (song_id, author_id) VALUES (1,2)  ::  [song_id=1, author_id=2]
SELECT COUNT(*) FROM songs WHERE author_id = 7 ::  [author=7]
DELETE FROM songs WHERE song_id = 13 :: [song_id=13]
DELETE FROM songs :: []

Updates are a bit more complicated, and I’d prefer to think about them as a DELETE followed by INSERT, for example:
UPDATE songs SET author_id=13 WHERE author_id=7 AND song_id=3
can be thought of as :
DELETE FROM songs WHERE author_id=7 AND song_id=3 :: [song_id=3, author_id=7]
INSERT INTO songs (song_id, author_id) VALUES (3,13)   :: [song_id=3, author_id=13]
I know this is not equivalent, and may cause some race conditions, but I believe it does not really matter in what follows.

As you can see in our example, a subspace can be, 0, 1 or 2-dimensional, which depends on the scope on which the query operates. If we are unsure about the actual scope, we need to upperbound it for my algorithm to work. For example :
SELECT COUNT(*) FROM songs WHERE song_id < 13 :: []
here we could not say much about rows which are important for this query, so we upperbound it with the whole space. Please note how subspace has nothing to do with the columns returned by select, but rather by the rows scanned during it.

In general, the subspace is determined by all equality constraints found in the query. This is such a trivial correspondence, that you can (and should) let your backend library deduce it automatically from the query string. Or if you use some object oriented query builder, you should get it for free.

For 2-dimensional space, there are 4 kinds of subspaces :
- the whole space, given by no constraints at all : []
- a point, given by two constraints, for example [song_id=3, author_id=7]
- a vertical, or horizontal line, given by one constraint, for example [song_id=3], or [author_id=7]

If we have two queries A and B, such that A is a write (DELETE or INSERT), and B is a read (SELECT), then a cached result for B should be invalidated after query A, if subspace of A intersects subspace of B. (Actually this is a sufficient condition, not necessary).
For example a query
DELETE FROM songs WHERE author_id=7 AND song_id=3 :: [song_id=3, author_id=7]
affects results of
SELECT COUNT(*) FROM songs WHERE author_id = 7 ::  [author=7]
and you can tell it just by comparing [author=7] with [song_id=3, author_id=7], without looking at the queries. Of course this is some safe approximation : sometimes cache invalidation is not necessary, but performing it will ensure the correctness of algorithm.

Now, notice, that there may be infinitely many queries, which may need to be invalidated, as there are many subspaces that intersect with a given one. For example a query with subspace [author=7] must invalidate all queries with subspace [song_id=1], as well as [song_id=2], etc... Moreover there are infinitely many possible queries with the same subspace. Frameworks that try to keep track of all intersecting queries are doomed to either store everything in a large, permanent storage (which kind of defeats the purpose of in-memory cache to me), or eventually forget about a query which needs to be invalidated (which is an error to me).

The solution is much simpler though, and does not involve operating on infinitely many cache keys... only exponentially many. But, don’t worry, it is exponential in the number of dimensions, not the size of dataset. For example for a binary relation we need to invalidate 2^2 = 4 keys. If you have a table with many columns, please don’t be alarmed neither -- you just need to choose a few columns on which you’d like to focus, and crop the subspace to those few columns -- you will invalidate thinks more often than necessary, but will not get hit by exponential blow. I’ve never experienced a case with more than 3 relevant columns, and 2^3 = 8 is not a big number.

We need some shorter notation for subspaces. Let us write them as 2-dimensional vectors (similar to rows in the table). Instead of [song_id=3,author_id=7], let us write (3,7), and for [song_id=3], we will use (3,*). Whole subspace is (*,*), and [author_id=4] is just (*,4).

Now the tricky part. Let us introduce a new wildcard symbol “?”, which will mean something along lines of “all subspaces which have a number here”. For example (?,3) is a shorthand for the infinite set of subspaces {(0,3),(1,3),(2,3),...}. There is no particular query that has subspace (?,3). This artificial notation corresponds to all queries which have fixed second dimension.

I assume that your global cache (such as memcache) supports atomic increment operation, which is very convenient for the task of maintaining a “revision number”, which we will associate with each subspace (artificial or not). When I say “with each”, I don’t mean buying infinitely large computer -- we will just allocate memory for subspaces actually used in production, and those unused (or rarely used, if you have LRU garbage collector) will simply take no memory at all.
A short note about evictions : if get or increment results in a miss, simply set the revision to current timestamp.

I also assume that you know the idea of generational caching, that is, using a key name with appended revision to store cached data (such as “get_songs_count.123”). This way a single operation of incrementing a revision, can invalidate all data stored at key names which depended on it. This seems like a waste of memory, but LRU garbage collector can take care about leftovers. The nice thing about it, is that while a single integer (the revision number) must be stored in a global cache, the potentially large query result can be stored in local cache, from where it can be quickly retrieved without any bottlenecks of global cache. This is actually the only way (except for using silly small TTLs) for invalidating data stored in local cache that I know and use.

Now the algorithm. Or actually two algorithms, one for read-only queries, and one for writes.

When you perform a read query A, first determine its subspace. I assume you have d columns, so the subspace can be expressed as a d-arry vector of numbers and stars, as explained above. For example a query “SELECT song_id FROM songs WHERE author_id = 3” corresponds to a vector (*,3).
Then compute at most all possible variations of this d-arry vector, by using a single substitution rule : you can put “?” in place of a number. For example, if your original vector was (*,3), then it results in two vectors: (*,?) and the original. If on the other hand, your query had subspace (7,3), then you would end up with four variations: (7,3),(?,3),(7,?),(?,?).
Then, using a bulk (multi)get, fetch from the global cache revisions for those (at most 2^d) subspaces.
Concatenate revisions into one big revision (for example 123.3.114.14).
Then fetch from the local cache the cached result of the query, which should reside at the key which is a concatenation of the query string (or hash of it if you prefer) and the big concatenated revision.
If there is a miss in the local cache, try in the global cache.
If there is a miss in the global cache, try in the database.
Then add missing keys to caches as needed.

When you perform a write query B, determine its subspace.
Then compute exactly 2^d possible variations of it, by using two substitution rules:
a) you can put “?” in place of “*”
b) you can put “*” in place of any number
For example for a subspace (*,3) you’ll end up with four variants: (*,3),(?,3),(*,*),(?,*).
Then perform the query B.
Finally increment revisions for all these variants.

As you can see, updates are a bit slower than reads (which in my opinion is usual, and nothing to be worried about). Some caches implementations allow bulk increments in one round-trip, in which case this might actually be faster then read scenario, as you do not contact local cache at all.

Now, some explanation, of why it works.
Imagine a bipartite graph which on the left hand side has one vertex per each possible subspace of a write query, and on the right side has vertices corresponding to subspaces of read queries. Actually both sets are equal, but we will focus on edges.
Edge goes from left to right, if a query on the left side affects results of a query on the right side. As said before, both sets are infinite, but that’s not the problem. There are infinitely many edges, but it’s also not bad. What’s bad is that there are nodes on the left side with the infinite degree, which means, we need to invalidate infinitely many queries. What the above tricky algorithm does, is adding a third layer to the graph, in the middle between the two, such that the transitive closure of the resulting graph is still the same (in other words: you can still get by using two edges anywhere you could by one edge in the original graph), yet each node on the left, and each node on the right, have finite (actually constant) degree. This middle layer corresponds to the artificial subspaces with “?” marks, and serves as a connecting hub for all the mess. Now, when a query on the left executes, it needs to inform only its (small number of) neighbours about the change, moving the burden of reading this information to the right. That is, a query on the right side needs to check if there is a message in the “inbox” in the middle layer. So you can think about it as a cooperation where the left query makes one step forward, and the right query does a one step back, to meet at the central place, and pass the important information about the invalidation of cache.

Another point of view is that two subspaces (v[1],...,v[d]) and (u[1],...,u[d]) intersect each other if and only if they agree on all positions without stars. Or in other words: there is no such index k, that v[k]<>u[k] and v[k]<>* and u[k]<>*.
So, given a vector (v[1],...,v[d]), you can predict that any intersecting subspace will be described by a vector (u[1],...,u[d]) of some particular form. That is, it can be achieved from (v[1],...,v[d]) by using some simple substitution rules:
a) you can change any number to a star
b) you can change any star to any number
c) you can not use both above rules for the same index k
The problematic rule “b)” allows infinitely many possibilities.
To overcome this we introduced the question mark symbol, which represents “any number” without stating any particular number. A path of operations leading from (v[1],..,v[d]) to (u[1],..,u[d]) was then explored from the other end, by trying to substitute numbers in (u[1],...,u[d]) by question marks. If we could reach the same term from both ends, then we know that there is a path from one end to the other, even though there is an infinite maze in the middle.

As pointed before, you can use this technique for higher number of columns, or keep focus on just a few of them at the expense of premature invalidation. You can easily use this for a topology without local caches, by storing everything in the global cache. You can adapt the technique to columns storing strings, or nulls, or whatever. You can use it for noSQL databases if you need to. But this is not the end of possible improvements.

If you know in advance what kind of write queries you will perform, that is -- if you know what subspaces will they have, you can “trim” the graph, to a smaller one, which will result in faster reads and writes, as you will not have to fetch that many (is 4 many ?) keys.

For example if your modifications are just single-row operations (INSERT one row, DELETE one row), then your modifying queries subspaces never have stars in them. You can immediately see from the substitution rules, that you will never-ever increment revisions of artifcial subspaces (these with question marks), as “?” can be made only from “*”. If so, then it is not necessary to fetch them, or even think about them. You can safely ignore them.
Now, read queries only depend on revisions of vectors which differ from the original vector by replacing something with question marks. Since such revisions are not used in this setting, you only need to fetch revision of the original vector, which obviously had no question marks. This allows for quite fast and simple implementation, which fetches only single key from global cache. Actually my original framework forbid any other modifications than row-by-row just to get this extra-fast reads. I believe that in many real-world scenarios you never INSERT more than a single row at once. I’ve seen cases where you had an urge to DELETE many rows at once (for example when cleaning up dependent records), and may feel bad about the idea of doing it row-by-row. Think about amortized cost though: each of these rows you delete had had to be added at some point in the time, and if you already feel comfortable with the idea of adding rows one-by-one, then you already paid enough credit to not worry about the deletes.

That’s it folks. Now, which frameworks implement this idea, and in which paper I can read about it, and how many years did I overslept?

Re: Algorithm for automatic cache invalidation Artur Ejsmont 4/27/12 4:44 PM

This post was just AWESOME!

Many thanks for time spent and so much detail.

Gratulacje, na prawde doskonala robota.

Thanks

Art

Re: Algorithm for automatic cache invalidation Josiah Carlson 4/27/12 8:33 PM
Please correct me if I am wrong, but your description WRT to subspaces (and an implementation thereof) seems to be the equivalent to the following:

1. Whenever you cache a row with column values (c1, c2, c3, ...) from table X, add a references to indexes for each of the columns of the data you are caching (there is an index for X.c1, X.c2, etc., as well as table Y with columns c10, c11... as Y.c10, Y.c11, ...)
2. When you need to invalidate your cache for a given class of rows (you use song_id=3, author_id=7), you can look up those cached entries in your indexes, find the intersecting subset, and discard those cache entries.

Is it novel? I don't know, I'm not terribly well versed in the realm of cache invalidation algorithms. Though elements of your method (in particular his use of '*' and '?') are reminiscent of the Linda language's tuple spaces, and how they are thought of as just vectors to match patterns against and execute upon.

Hilariously enough, by thinking of it in terms of tuple spaces (and having one available), you can get a distributed cache manager for free!

Regards,
 - Josiah

P.S. I'm not a regular contributor to memcached, but your post was just linked from the Redis list, and I'd thought I'd come over to say hello.
Re: Algorithm for automatic cache invalidation Jakub Łopuszański 5/6/12 12:58 PM

W dniu sobota, 28 kwietnia 2012 05:33:15 UTC+2 użytkownik Josiah Carlson napisał:
Please correct me if I am wrong, but your description WRT to subspaces (and an implementation thereof) seems to be the equivalent to the following:

1. Whenever you cache a row with column values (c1, c2, c3, ...) from table X, add a references to indexes for each of the columns of the data you are caching (there is an index for X.c1, X.c2, etc., as well as table Y with columns c10, c11... as Y.c10, Y.c11, ...)
2. When you need to invalidate your cache for a given class of rows (you use song_id=3, author_id=7), you can look up those cached entries in your indexes, find the intersecting subset, and discard those cache entries.

I'm not sure if I understand your proposition, but it sounds like you are only caching individual rows, and not arbitrary queries results. That makes sense if in your application you always SELECT single rows by specifying equality constraints for all columns. But, even then, what my algorithm would do, is quite different from your description. 
You mention "indexes", but I am not sure if I understand correctly, what kind of indexes they are. From your description it seems that you expect them to be very fast for both updates and selects (perhaps something quite trivial to achieve with Redis, but not so obvious how to achieve this with MySQL, or memcached). Anyway, if you just mean something that can answer a query "find all rows which match song_id=3 and author_id=7", then it might end up being either:
a) slow, but reliable and persistent, sql-like index (which doesn't seem to improve performance of anything)
b) fast, but unreliable, ram-only list 
Perhaps I am missing something, but having something which is both reliable and fast, you would actually solve the original problem of providing fast answers to any query (as a two stage process: find matching rows ids, fetch their content). If you can do that quickly, then you're done.

I think that my solution is trying to address a different problem: how to cache more complicated queries then single row retrieval in the presence of a queries modifying multiple rows at once.


Is it novel? I don't know, I'm not terribly well versed in the realm of cache invalidation algorithms. Though elements of your method (in particular his use of '*' and '?') are reminiscent of the Linda language's tuple spaces, and how they are thought of as just vectors to match patterns against and execute upon.

Thanks for hint about Linda language.
 

Hilariously enough, by thinking of it in terms of tuple spaces (and having one available), you can get a distributed cache manager for free!

I'm not sure if I understand this. Could you elaborate?
 

Regards,
 - Josiah

P.S. I'm not a regular contributor to memcached, but your post was just linked from the Redis list, and I'd thought I'd come over to say hello.

neither am I.
Re: Algorithm for automatic cache invalidation Josiah Carlson 5/8/12 11:02 AM

On Sunday, May 6, 2012 12:58:23 PM UTC-7, Jakub Łopuszański wrote:

W dniu sobota, 28 kwietnia 2012 05:33:15 UTC+2 użytkownik Josiah Carlson napisał:
Please correct me if I am wrong, but your description WRT to subspaces (and an implementation thereof) seems to be the equivalent to the following:

1. Whenever you cache a row with column values (c1, c2, c3, ...) from table X, add a references to indexes for each of the columns of the data you are caching (there is an index for X.c1, X.c2, etc., as well as table Y with columns c10, c11... as Y.c10, Y.c11, ...)
2. When you need to invalidate your cache for a given class of rows (you use song_id=3, author_id=7), you can look up those cached entries in your indexes, find the intersecting subset, and discard those cache entries.

I'm not sure if I understand your proposition, but it sounds like you are only caching individual rows, and not arbitrary queries results. That makes sense if in your application you always SELECT single rows by specifying equality constraints for all columns. But, even then, what my algorithm would do, is quite different from your description. 

Different yes, but potentially functionally equivalent. The method that I described can easily be applied to arbitrary select queries as long as row ids are returned with the other row data. I didn't go into it because I believed that step was obvious, as that kind of thing is already done by any decent RDBMS. But to connect the dots:

Assume that you select some subset of rows X using query Y. You need to have some way of fetching those cached rows via the query itself, so presumably you would store a reference to X via some derivation of Y, something like (using a simple Python hash syntax, though obviously storing it in memcached or some other server would be different):

cache[sha256(Y).digest()] = X

After storing that reference in your cache, you would also store references to all of the ids in X using the indexes that I described.
 
You mention "indexes", but I am not sure if I understand correctly, what kind of indexes they are. From your description it seems that you expect them to be very fast for both updates and selects (perhaps something quite trivial to achieve with Redis, but not so obvious how to achieve this with MySQL, or memcached).

I mean something that is the rough equivalent of a 2-level hash with sets. In Python syntax for the structure itself:

INDEX = {
    'table_name': {
        14: set(['ref_25', 'ref_87', 'ref13'])
    }
}

The only two operations that are necessary are:
1. Add a reference to row A in table B to cache value C
2. Invalidate the cache entries that reference row ids V from table W

Anyway, if you just mean something that can answer a query "find all rows which match song_id=3 and author_id=7", then it might end up being either:
a) slow, but reliable and persistent, sql-like index (which doesn't seem to improve performance of anything)
b) fast, but unreliable, ram-only list 
Perhaps I am missing something, but having something which is both reliable and fast, you would actually solve the original problem of providing fast answers to any query (as a two stage process: find matching rows ids, fetch their content). If you can do that quickly, then you're done.

That's not the intent, but there is also a very fast in-memory SQL-speaking database with persistence called AlchemyDB (which is built on top of Redis). I know, it's a bit blasphemous to mix SQL and noSQL, but the author's execution is quite impressive.

I think that my solution is trying to address a different problem: how to cache more complicated queries then single row retrieval in the presence of a queries modifying multiple rows at once.

The complicated part of your scheme is how to invalidate the data. Keeping references to ids in the tables (as I describe) lets you invalidate data by chasing down references in the simple index for invalidation. It can also be trivially expanded to offer a variety of other referential options.

Is it novel? I don't know, I'm not terribly well versed in the realm of cache invalidation algorithms. Though elements of your method (in particular his use of '*' and '?') are reminiscent of the Linda language's tuple spaces, and how they are thought of as just vectors to match patterns against and execute upon.

Thanks for hint about Linda language.
 

Hilariously enough, by thinking of it in terms of tuple spaces (and having one available), you can get a distributed cache manager for free!

I'm not sure if I understand this. Could you elaborate?

The idea of tuple spaces is to have a region of memory that stores tuples. This memory can be on a single machine, or distributed across a disparate cluster of machines. The only limitation is the semantics of the space itself.

The basic semantic API for the space is:
1. Insert a tuple
2. Find me a tuple that matches the pattern Z (which removes the tuple)

The latter of the two can be asynchronous, that is, it can wait for a given time period for that matching tuple. Some implementations require that you pass both the pattern to match the tuple as well as the code that will execute on that tuple.


There are some minor mental gymnastics required to fit a caching scheme into a tuple space. And for a cache, obviously the first thing you would do upon matching a tuple for a read is to reinsert that tuple in the space. Actual implementations of tuple spaces end up being roughly equivalent to a distributed RPC server, or even a distributed task queue server (which is why I stopped working on the one I was writing in 2004), but they are useful abstractions for thinking of other ideas, and the algorithms for matching tuples efficiently are good exercises.

Regards,
 - Josiah
Re: Algorithm for automatic cache invalidation Jakub Łopuszański 5/10/12 10:12 PM
So you suggest :
1. to build a biderectional graph, where the first direction (which you called "cache") maps a query to the rows, and the second one (which you called "INDEX['table_name']") maps a row to queries which depend on it. 
2. keep the first part in a fast cache
3. keep the second part in a reliable database

What bugs me is: since this is a symetric graph, one should expect that the size of "index" will be at least as large as size of "cache", which means that you will need a lot (potentially infinitely many) of storage to keep track of all queries that the system generated at any point in time. You would need some routine that would check what queries where evicted from cache and thus can be also removed from index, to keep the size of index bounded.

Moreover I do not see how exactly would that work in case of queries which modify many rows.
It seems to me that you would have to lock, select these rows, find all (potentially infinitely many) queries depednent on these rows in index, invalidate them, delete rows, release the lock.

Sounds pretty ineffective and as I mentioned in the begining of my post -- I saw such systems and wanted something better.

W dniu wtorek, 8 maja 2012 20:02:21 UTC+2 użytkownik Josiah Carlson napisał:
Re: Algorithm for automatic cache invalidation Josiah Carlson 5/11/12 1:02 AM
I believe we are talking past each other.

When you are performing a write query against rows matching (author_id=7, song_id=3), presumably you would get a group of subspaces (7,3),(*,3),(7,*),(*,*). The latter subspace doesn't really help you at all, as it merely references all of your cached data involving author_id,song_id pairs. But here's the thing: the indexes that I described will give you the equivalent of your (*,3) and (7,*) subspaces, the intersection of which is exactly (7,3).

So, as long as you index your columns independently, you can invalidate all of the necessary cache entries. It doesn't require invalidating infinitely many rows (unless you've got a magic machine that has infinite memory, and is caching infinitely many rows from an infinite number of queries), and it also doesn't require a huge explosion of space. Worst-case, 2x.

But maybe I'm misreading what you are claiming to have built, as discussion about subspaces, bipartite matching, claimed infitities, really just results in excluding 95%+ of engineers/programmers from the conversation (I only got into it because I saw parallels with other technology I've seen). That said, I doubt Facebook or Twitter implemented something that they described like you are describing, and I'd bet a beer that Google didn't. Not because they couldn't, maybe because they didn't think of it, but probably because they looked at the problem in a different way*.

If you truly believe that you have created something that is unique, I would urge you to release it. If it really is a game changer, people will pay you to use it.

Regards,
 - Josiah

* Relational database queries were made fast by never performing joins, almost always using covering indexes, heavy sharding, and many read slaves (think X read slaves for each of the Y shards, where X and Y were surprisingly large). Combine all of this with huge id -> row caches for entire databases, as much as possible pre-calculated via mapreduce, and the use of bigtable whenever possible (no joins, queries only via composite primary key index, data distributed to 3+ machines so you got results from the fastest of them at any time), and you get high performance without the apparent level of complexity that you seem to be dealing with.
...
Re: Algorithm for automatic cache invalidation Jakub Łopuszański 5/11/12 6:00 AM
So let me present you an example so you could present your idea on a real world examples.
Suppose you have a web page, where you can view all songs of a given author, sorted by song_id, with pagination size set to 10, 20, or 50 results per page or any other value (say 42 results) if they wish.
Therefore a database queries needed are something like:
SELECT song_id FROM songs WHERE author_id = @A ORDER BY song_id ASC LIMIT @L OFFSET @O;
Since a user can use many values (10,20,50, etc.) for @O, and many possible offsets @L (actually any natural number) you will get a huge number of different queries all referencing the same row with author_id=7, song_id=3. Let say there are like 100 different queries which relate to this row.
Here are some 2 examples of such queries:
SELECT song_id FROM songs WHERE author_id = 7 ORDER BY song_id ASC LIMIT 2 OFFSET 8;
SELECT song_id FROM songs WHERE author_id = 7 ORDER BY song_id ASC LIMIT 2 OFFSET 9; 
You suggested to compute a digest for each such query actually perfromed and store them in a cache along with a value.
Then for each hash, and each row in the result set you want (if I understand) to save the reference from that row to that hash in a separate persistent data structure.
At this point you have a persistent data structure with 100 more entries than the original database.

Now, if you remove the row author_id=7 and song_id=3 from the database (or add it, or modify), you have to fetch these 100 hashes from the index, and invalidate them.
This is a correct solution, but not as effective as mine, as it must perform 100 operations, and requires 100 times more storage than necessary.
But that's not the end of the problems.
Suppose you have a query like this:
DELETE FROM songs WHERE author_id = 7.
Suppose that there are 1000 songs by this author.
In order to perform a proper invalidation you need to first look up all rows affected by this query, and then, for each such row you need to find all hashes related to these rows. Of course given a reasonable index structure you do not need a JOIN for that, but rather something like "SELECT DISTINCT(hash) FROM index WHERE author_id = 7", however this may still require scaning 100*1000 entries in the index, as each of 1000 deleted songs could be referenced by 100 queries.

Did I understand you correctly?

And, by the way, in nk.pl we have around 500 database servers, and I know what sharding is. Therefore a solution which does not require a central "index" larger than the sharded database itself, such us mine, makes more sense to me.

W dniu piątek, 11 maja 2012 10:02:52 UTC+2 użytkownik Josiah Carlson napisał:
...
Re: Algorithm for automatic cache invalidation Josiah Carlson 5/11/12 9:13 AM
Okay! Now this is perfect. A driving use-case with an example. Now I can see where you are coming from, and I can also see how this would be an issue for you. Before I get into the point-by-point, first a little story.

A man goes to the doctor and says, "It hurts when I push here." The doctor says, "So don't push there."

End of story. This will become relevant momentarily.


On Friday, May 11, 2012 6:00:45 AM UTC-7, Jakub Łopuszański wrote:
So let me present you an example so you could present your idea on a real world examples.
Suppose you have a web page, where you can view all songs of a given author, sorted by song_id, with pagination size set to 10, 20, or 50 results per page or any other value (say 42 results) if they wish.

Rule # 3 in building scalable systems: there is no customized query. If you need to paginate on variable-sized pages, you create larger pages, then sub-paginate.

Case in point: When Google produces search results for a query, they aren't making a query for the 10-entry page 1 for the first load, followed by another query for the 10-entry page 2 when you click on that link, etc. No. They make a query for the first X documents up front (historically I believe they made X somewhere in the range of 100-1000). They then cache that, and serve sub-results out of the cache for some modest duration.

This happens at YouTube, Facebook, Twitter, and any other site that scales.

Therefore a database queries needed are something like:
SELECT song_id FROM songs WHERE author_id = @A ORDER BY song_id ASC LIMIT @L OFFSET @O;
Since a user can use many values (10,20,50, etc.) for @O, and many possible offsets @L (actually any natural number) you will get a huge number of different queries all referencing the same row with author_id=7, song_id=3. Let say there are like 100 different queries which relate to this row.
Here are some 2 examples of such queries:
SELECT song_id FROM songs WHERE author_id = 7 ORDER BY song_id ASC LIMIT 2 OFFSET 8;
SELECT song_id FROM songs WHERE author_id = 7 ORDER BY song_id ASC LIMIT 2 OFFSET 9; 
You suggested to compute a digest for each such query actually perfromed and store them in a cache along with a value.
Then for each hash, and each row in the result set you want (if I understand) to save the reference from that row to that hash in a separate persistent data structure.
At this point you have a persistent data structure with 100 more entries than the original database.

If this was my problem, I wouldn't allow arbitrary offsets. I'd do as rule #3 says and pre-calculate blocks of whatever would cover 80/95/99% of queries (depending on how desperate I was).

I'd also be building a covering index, putting enough memory in my DB servers, and letting the DB cache that index. You can't get much more optimal for a B-Tree index than that query. You have an index on (author_id, song_id). In the index, you match the first column exactly, and order by the second. Assuming that index locking is rare (or at least optimized), your DB should be able to handle 5-10k requests out of that index every second.

Alternatively, this is also a use-case where Redis shines. This data would fit perfectly into what is known as a ZSET (a sorted set). You can think of it like a hash table with keys and values (we call them members and scores, respectively), where you can reference objects by keys, and you can reference objects by their sorted order which is defined over the tuple (value, key). I would set value = author_id, and key = packed representation of song_id concatenated with the rowid. Then to make queries against that index, I would simply ZRANGEBYSCORE <indexkey> <author_id> <author_id> WITHSCORES LIMIT <offset> <count> . On my little 2.4 ghz dev box, I can move almost 25k requests/second against that query sequentially (one returns before the other is performed), and in the range of 50-60k/second from multiple clients. If I had a more modern processor (and more than 2 cores), I could move 75k-100k/second, and I can expect to get 2/3-3/4 that performance if I ran replicas of Redis on the same machine tied to cores other than core 0.

To make that Redis cache be viable, I would do as I've done and recommended for over 2 years now: In whatever language runtime you are using, implement a post-commit hook that induces a caching task over any rows with inserts/updates/deletes. With Postgres allowing for a variety of languages to be built-in, you could even implement your post-commit hook inside of your database, letting it update your remote caches in real time (you also get row-level locking and other nice things). No need for cache invalidation, as your cache is your index, and your index is kept fresh.

So, long story short: I wouldn't use the system that you propose (or the system that I proposed as a variant) to cache that query. I'd use Redis. And if I was limited to *not* use Redis, I'd use covering indexes + block caches, keep statistics on what queries were common, and pre-cache those queries with rolling cache invalidation. Alternatively, I'd at least give Riak and their secondary indexes a shot, which is supposed to let you do the same kinds of queries across your cluster of machines, while also coming close to scaling linearly with every new box.

Now, if you remove the row author_id=7 and song_id=3 from the database (or add it, or modify), you have to fetch these 100 hashes from the index, and invalidate them.
This is a correct solution, but not as effective as mine, as it must perform 100 operations, and requires 100 times more storage than necessary.
But that's not the end of the problems.
Suppose you have a query like this:
DELETE FROM songs WHERE author_id = 7.
Suppose that there are 1000 songs by this author.
In order to perform a proper invalidation you need to first look up all rows affected by this query, and then, for each such row you need to find all hashes related to these rows. Of course given a reasonable index structure you do not need a JOIN for that, but rather something like "SELECT DISTINCT(hash) FROM index WHERE author_id = 7", however this may still require scaning 100*1000 entries in the index, as each of 1000 deleted songs could be referenced by 100 queries.

Did I understand you correctly?

Yes, but I was operating under the assumption that the types of queries you were caching were pre-sanitized from the perspective of wanting to scale. If you want to scale that system to be 10x or 100x bigger (as the world is somewhat larger than Poland), you can't even think about caching queries that make requests for 2 rows at arbitrary offsets from an index. You cache bigger blocks, or you distribute your index.

And, by the way, in nk.pl we have around 500 database servers, and I know what sharding is. Therefore a solution which does not require a central "index" larger than the sharded database itself, such us mine, makes more sense to me.

That makes sense everywhere. But this particular driving use-case isn't generally applicable for the Facebooks, Twitters, Googles, YouTubes, Ebays, Amazons, etc., of the world. Why? Because they follow the rulebook on scaling (which explicitly says *not* to cache those tiny queries).

Regards,
 - Josiah

<span style="font-size:15px;font-family:A
...
Re: Algorithm for automatic cache invalidation Jakub Łopuszański 5/11/12 9:52 AM
Hm. I was expecting you to explain how your algorithm would work for this particular example. The example (if you couldn't tell from stupid column names and use case) was fictional for the ease of presentation.
Instead of offsets, and limits, you could think on many other different parameters of a query such as:
- constraints on other columns  (for example AND  type =HEAVY_METAL)
- different sorting (for example SORT song_id DESC)
- different columns in result (for example SELECT song_id, type)
- different aggregation in result (for example SELECT COUNT(*) )

If you claim that each row of database can be returned only by O(1) different queries in your application, then I agree that your solution makes some sense for your application.

Unfortunately, this might not be the case for other people, so if you plan to build a framework, that would be a stupid assumption to make.
Sadly, this is an assumption found in several ORMs that I've seen, and now (after reading your posts) I understand why this frameworks are still in use... 

But even if there are only O(1) queries for each row which return it, you still have to deal somehow with the problem of perpendicular queries.
Allow me another simplistic example with three columns: x,y,z.
Now, suppose the queries to the system are always of the following form:
SELECT * FROM space WHERE x = @X AND y=@Y
Clearly, each row of this space can be returned by at most one query. So in your terms "index" for each row will contain reference to only one query (the one with @X and @Y equal to his).
Now, imagine a query:
DELETE FROM space WHERE z = 17.
Now, you need to invalidate all queries from this plane. There might be millions of it. Sure, you can argue, that this invalidation can be amortized by the cost of caching earlier. But will that explanation really bother the end user, which will have to wait for several seconds?


P.S. it hurts me when I ask for explanation on a simplistic example, and what I get is a rant on how stupid the example is...

P.S.2. Poland is small (>40 mln people), but nk.pl has like 12 mln active users monthly (similar to facebook), and 4 billion pageviews daily. As one of architects, I think I know how to keep this great social network working fast, and wanted to share an idea on how to remove the limits imposed by traditional way of thinking about how queries should look like. I really believe that end user experience should not be limited by some _artificial_ constraints from the backend architecture. I believe that the assumption that there should be a fixed number of possible queries is artificial.

W dniu piątek, 11 maja 2012 18:13:41 UTC+2 użytkownik Josiah Carlson napisał:
...
Re: Algorithm for automatic cache invalidation Perrin Harkins 5/11/12 10:57 AM
On Fri, Apr 27, 2012 at 6:14 PM, Jakub Łopuszański <qbo...@gmail.com> wrote:
> Hi, I’d like to share with you an algorithm for cache invalidation that I’ve
> came up with, and successfully implemented in a real world application.

This may be a silly question, but have you benchmarked your cached
application against just going straight to the database?  I've always
had the impression that keeping a perfect cache of a database that
beats it in performance was not possible because the overhead of cache
invalidation (both in the cache and in the application) would ruin the
performance gains on reads.  Caches usually beat database performance
by sacrificing accuracy (e.g. allowing race conditions and non-ACID
behavior) and freshness of data.

To put it another way, if it was possible to have an up-to-date cache
that outperforms an RDBMS with the same data, wouldn't the makers of
that RDBMS simply build that cache into their product?

- Perrin
Re: Algorithm for automatic cache invalidation Josiah Carlson 5/11/12 11:19 AM
On Fri, May 11, 2012 at 9:52 AM, Jakub Łopuszański <qbo...@gmail.com> wrote:
> Hm. I was expecting you to explain how your algorithm would work for this
> particular example. The example (if you couldn't tell from stupid column
> names and use case) was fictional for the ease of presentation.
> Instead of offsets, and limits, you could think on many other different
> parameters of a query such as:
> - constraints on other columns  (for example AND  type =HEAVY_METAL)
> - different sorting (for example SORT song_id DESC)
> - different columns in result (for example SELECT song_id, type)
> - different aggregation in result (for example SELECT COUNT(*) )

None of these exclude any of the alternate methods of caching that I
describe. In fact, every one of them can still be served out of a
proper covering index, including queries like "SELECT song_id,
count(*) ... GROUP BY song_id" or "SELECT sum(listen_count)". The
complexity and sophistication of modern database indexes cannot be
overstated. When used properly, they are borderline miraculous when
combined with a good query parser and processor.

> If you claim that each row of database can be returned only by O(1)
> different queries in your application, then I agree that your solution makes
> some sense for your application.

I make no such claim. Further, I could even construct an adversary for
my system such that every cached query result references a particular
row, and updating that row could invalidate the entirety of my cache
(incidentally, I could do the same for the system that you described).

Thankfully, reality isn't an adversary, and we have the ability to
choose the queries, what we cache, how we cache, etc. (the "how" is
what we are discussing here).

> Unfortunately, this might not be the case for other people, so if you plan
> to build a framework, that would be a stupid assumption to make.
> Sadly, this is an assumption found in several ORMs that I've seen, and now
> (after reading your posts) I understand why this frameworks are still in
> use...

So are you saying that I'm making stupid assumptions?

I claim that my assumptions about database indexes, queries, and
caching behavior, are quite sane. When building a system initially,
you make different decisions than the system you make at scale. You've
been going through growing pains in scaling your system, and your
choices and thoughts about what your system needed to support drove a
sequence of decisions that lead you to your subspace caching.

If you were to sit down today with zero code, but had half an hour to
dig through the features and functionality of your site, could get
factor-of-2 approximation of any number you needed, and were tasked
with 95% feature parity; I can just about guarantee that you would
build a different system. You would build more generic solutions in
advance, as you would see common patterns now that you didn't see when
you were building it the first time. Those other caching methods I
described in my last email? Those were designed, built, and made
standard by people who looked at their systems and asked, "If I were
building from scratch today, what would I do differently?"

It could be that your caching system offers features and functionality
that allow it to do XYZ that other caching systems are unable to do.
But for a systems architect, the better question is always, "Do I need
it to do XYZ, or can it do ABC instead?" As engineers, architects,
programmers, developers, etc., we do not work in a vacuum. Every
solution we choose is based on what we have available, and what we
think we can build in the time we have. Until/unless you release your
system, it is no more useful than a flying car: great in theory, but
not available, so not a solution.

I know you are aware of everything that I've described WRT indexes,
caching, etc. I mentioned those things not for your benefit (except to
remind you that the examples you've chosen don't require your system),
but for the benefit of the dozens/hundreds/thousands of future
engineers/architects that are looking for solutions to their
cache/scaling issues, but don't ask questions (a small vocal minority
of people with questions actually ask them, the rest search for
answers to questions that others have asked). They are going to wander
into this thread, see your description, and say "wow, I should build
that." But that is *exactly* the wrong answer! They shouldn't build
it! They should be using one of the standard practices for caching
that I described, or if you've released your software, they should see
if it offers them what they are looking for.

> But even if there are only O(1) queries for each row which return it, you
> still have to deal somehow with the problem of perpendicular queries.
> Allow me another simplistic example with three columns: x,y,z.
> Now, suppose the queries to the system are always of the following form:
> SELECT * FROM space WHERE x = @X AND y=@Y
> Clearly, each row of this space can be returned by at most one query. So in
> your terms "index" for each row will contain reference to only one query
> (the one with @X and @Y equal to his).
> Now, imagine a query:
> DELETE FROM space WHERE z = 17.
> Now, you need to invalidate all queries from this plane. There might be
> millions of it. Sure, you can argue, that this invalidation can be amortized
> by the cost of caching earlier. But will that explanation really bother the
> end user, which will have to wait for several seconds?

It wouldn't be viable in that situation if you were doing immediate
invalidation, but the same goes for your system. If you are caching
millions of queries, and those millions of queries must be
invalidated, there is no free lunch. Either you invalidate now (paying
the price), or you use one of the variety of ways to lazily invalidate
your cache.

Incidentally, I also wouldn't be caching millions of queries like
this. This is again an example where a covering index solves the
problem correctly and gracefully, and finding a way of keeping a
distributed and consistent covering index for this table will solve
more problems than your caching system.

> P.S. it hurts me when I ask for explanation on a simplistic example, and
> what I get is a rant on how stupid the example is...

I never said that the example was stupid. I just said that the example
provided had alternate solutions that were at least as good, if not
better, than your proposed solution. These alternate solutions are
known to scale, are not surprising, and are supported by dozens of
existing software packages.

You proposed that your solution to the problem of database query
caching was unique (in that you hadn't seen anything else like it),
and further that it was better. In order for it to be legitimately
better for database query caching (or database caching in general), it
will be competing against the current crop of standard practices
(which are standard because they work). If you are not okay with
people like me saying "why are you doing X with that system, Y would
work better at scale" for the examples you that you bring up, then
bring up a real example.

> P.S.2. Poland is small (>40 mln people), but nk.pl has like 12 mln active
> users monthly (similar to facebook), and 4 billion pageviews daily. As one
> of architects, I think I know how to keep this great social network working
> fast, and wanted to share an idea on how to remove the limits imposed by
> traditional way of thinking about how queries should look like. I really
> believe that end user experience should not be limited by some _artificial_
> constraints from the backend architecture. I believe that the assumption
> that there should be a fixed number of possible queries is artificial.

You brought up your idea. Why? Presumably to get feedback, opinions,
ideas to make it work better, maybe even to get responses like "This
post was just AWESOME!" But here's the thing: ideas are easy,
execution is hard. You say that you have built this system, and it is
amazing. I believe that you have built it. But right now, over email?
You can't actually prove anything about the usefulness of your system,
especially not with toy driving use-cases and examples.

On the other hand, if you release it, people who are even smarter than
me will actually take a look at it, and will examine it based on its
features, merits, and drawbacks in certain use-cases. And if it is
better, people will use it. But right now, it's material to you, but
an idea to everyone else. And ideas are cheap.
>>>>>>>>> stored in local cache that I know and use.
>>>>>>>>>
>>>>>>>>> Now the algorithm. Or actually two algorithms, one for read-only
>>>>>>>>> queries, and one for writes.
>>>>>>>>>
>>>>>>>>> When you perform a read query A, first determine its subspace. I
>>>>>>>>> assume you have d columns, so the subspace can be expressed as a d-arry
>>>>>>>>> vector of numbers and stars, as explained above. For example a query "SELECT
>>>>>>>>> song_id FROM songs WHERE author_id = 3" corresponds to a vector (*,3).
>>>>>>>>> Then compute at most all possible variations of this d-arry vector,
>>>>>>>>> by using a single substitution rule : you can put "?" in place of a number.
>>>>>>>>> For example, if your original vector was (*,3), then it results in two
>>>>>>>>> vectors: (*,?) and the original. If on the other hand, your query had
>>>>>>>>> subspace (7,3), then you would end up with four variations:
>>>>>>>>> (7,3),(?,3),(7,?),(?,?).
>>>>>>>>> Then, using a bulk (multi)get, fetch from the global cache
>>>>>>>>> revisions for those (at most 2^d) subspaces.
>>>>>>>>> Concatenate revisions into one big revision (for example
>>>>>>>>> 123.3.114.14).
>>>>>>>>> Then fetch from the local cache the cached result of the query,
>>>>>>>>> which should reside at the key which is a concatenation of the query string
>>>>>>>>> (or hash of it if you prefer) and the big concatenated revision.
>>>>>>>>> If there is a miss in the local cache, try in the global cache.
>>>>>>>>> If there is a miss in the global cache, try in the database.
>>>>>>>>> Then add missing keys to caches as needed.
>>>>>>>>>
>>>>>>>>> When you perform a write query B, determine its subspace.
>>>>>>>>> Then compute exactly 2^d possible variations of it, by using two
>>>>>>>>> substitution rules:
>>>>>>>>> a) you can put "?" in place of "*"
>>>>>>>>> b) you can put "*" in place of any number
>>>>>>>>> For example for a subspace (*,3) you'll end up with four variants:
>>>>>>>>> (*,3),(?,3),(*,*),(?,*).
>>>>>>>>> Then perform the query B.
>>>>>>>>> Finally increment revisio...
REMOVE ME Erik Seaberg 5/11/12 1:58 PM


Re: REMOVE ME Erik Seaberg 5/11/12 2:03 PM
Sorry everyone, I was following the instructions in
http://support.google.com/groups/bin/answer.py?hl=en&answer=46608 and didn't
expect "REMOVE ME" to be broadcast to the group like an ordinary message.

Re: Algorithm for automatic cache invalidation Jakub Łopuszański 5/22/12 3:08 PM
Well this is already deployed to a production system and works just great.
The reason this is faster is that in some situations you can not easily shard a database as some queries would cross shards boundaries. Of course this means that a regular DB will not scale unless you change your queries. On the other hand a cache is easier to distribute/shard/scale.
It is also easier (and in fact I do that on production) to have a memcache runing on each frontend machine (while this would be a strange idea to put shards of database on frontends which are usually diskless).
Of course one could argue, that you should just avoid complicated queries and shard everything, but even then, the cost of communicating over network between a frontend and mysql backend can be larger than a cost of a single multiget to a memcached (even if the network latency is similar, I believe that network stack of memcahed is one of the fastest).

In my particular example I have 6 frontend diskless machines, each running apache2 and memached, and single database and single global memcached. Most queries are resolved without any network communication at all, as they result in local cache hit.

I think that developers of mysql have no way (or incentive) to build as powerfull cache into a database, as no single database has not as much RAM, not as many network cards and not as many CPUs as a cloud of 55 memcaches used in nk.pl


W dniu piątek, 11 maja 2012 19:57:58 UTC+2 użytkownik Perrin Harkins napisał:
Re: Algorithm for automatic cache invalidation rspadim 5/22/12 7:52 PM
i readed some nosql features at mysql and mariadb (a mysql trunk) they
are starting some nosql/cache solutions, could be nice check it
i know that oracle company is a 'problem'/'solution' to mysql open
source version, check others trunks, mariadb is a nice trunk and is
more 'opensource' with some interesting features
the memcache + db solution is nice, but i think in futures a more
unified solution will exists just wait or develop it =)


2012/5/22 Jakub Łopuszański <qbo...@gmail.com>:
--
Roberto Spadim
Spadim Technology / SPAEmpresarial
Re: Algorithm for automatic cache invalidation Jakub Łopuszański 2/6/13 3:32 AM
As I really want this ideas to be well understood and wide spread I prepared:
- a paper with nice ilustrations and proofs of correctness,
- a presentation with more context and examples from everyday life of a programmer
- the source code of a working implementation used in "Znawcy Futbolu" and "Znasz Mnie?", the applications by Vanisoft

Hate it or ignore it, but at least do not say it is impractical, nor that I did not try to explain it:)

W dniu sobota, 28 kwietnia 2012 00:14:49 UTC+2 użytkownik Jakub Łopuszański napisał:
Re: Algorithm for automatic cache invalidation Artur Ejsmont 2/6/13 4:56 AM

Awesome !

Thanks a million , will read the paper right after my holiday.

Thanks a lot for sharing.

Art

--
 
---
You received this message because you are subscribed to the Google Groups "memcached" group.
To unsubscribe from this group and stop receiving emails from it, send an email to memcached+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 
Re: Algorithm for automatic cache invalidation Marcin Drążek 8/15/13 1:36 AM
Hello


> For example a query
> “SELECT song_id FROM songs WHERE author_id = 3”
> corresponds to a vector (*,3).
> [...]

> you can put “?” in place of a number.
> For example, if your original vector was (*,3), then it results in two vectors: (*,?) and the original
DEFINE: ALL_NUMBERS_GENERATE_WILDCARD


> When you perform a write query B, determine its subspace.
> a) you can put “?” in place of “*”
> b) you can put “*” in place of any number
> For example for a subspace (*,3) you’ll end up with four variants: (*,3),(?,3),(*,*),(?,*).

1. I need example for (?,3) but I don't understand expression 'any number'

Example query:
DELETE FROM songs WHERE song_id < 100 AND author_id = 3

If result is (?,3),(*,3),(?,*),(*,*) then '?' are logically simmilar to '*'

but DEFINED(ALL_NUMBERS_GENERATE_WILDCARD) AND 'you can put “?” in place of “*”'

Simply:
DELETE '*' => WILDCARD -> inc revision
SELECT NUMBER => WILDCARD -> check revision

And we have to clear all cache from table if we change multiple rows

Please answer, maybe I don't understand something