Algorithm for automatic cache invalidation


Jakub Łopuszański Apr 27, 2012 3:14 PM
Posted in group: memcached
There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton

Hi, I’d like to share with you an algorithm for cache invalidation that I came up with and successfully implemented in a real-world application. I am a software architect at nk.pl, a Polish social network successfully competing with FB, and a founder and backend architect of Vanisoft, a startup in which I had more freedom to experiment with a different architecture.

Perhaps my idea is not original, but I could not find a description of a similar concept anywhere, so even if this is reinventing the wheel, maybe it will help propagate the knowledge and save somebody else’s time. When I was designing the backend for Vanisoft’s project ZnaszMnie (currently over 400k players), all I could find on the internet about ways to cache database query results was either wrong (such as no cache invalidation at all) or ineffective (trying to keep a list of all cached results and invalidating them one by one).

I assume you have more than one front-end machine, some databases, and a cache reachable from the front-ends. Perhaps you also have a local cache on each front-end machine, which is really fast but inaccessible from other machines (and thus subject to cache invalidation problems). If so, then my approach can correctly provide a cached answer to a query using one round-trip to the global cache and one to the local cache (the cost of the latter is usually negligible), and never serve stale data (and this last part is the most important for me).

I assume that even if your database is not SQL, you can think of your queries as expressible in SQL, using SELECT, INSERT, UPDATE, or DELETE. This is needed only to follow the presentation of the algorithm, which can easily be adapted to many other cases.

To keep things simple, let us consider a binary relation (a table with two columns, or a set of points in 2-dimensional space, if you prefer). The algorithm works for any number of dimensions, but it is easier to grasp in the 2-D case.

So let’s think about a table `songs` with columns “song_id” and “author_id”, where obviously some songs may have several authors and vice versa. Again, for simplicity, assume that these are integers; the algorithm works for any (serializable) data type.

For each query, I’d like to introduce the concept of the “subspace” on which it operates. If you choose the interpretation of a binary relation as a set of points in 2-dimensional space, then a subspace is what you get by fixing some of the dimensions.
Some examples -- a list of queries with their subspaces:
INSERT INTO songs (song_id, author_id) VALUES (1,2)  ::  [song_id=1, author_id=2]
SELECT COUNT(*) FROM songs WHERE author_id = 7 ::  [author_id=7]
DELETE FROM songs WHERE song_id = 13 :: [song_id=13]
DELETE FROM songs :: []

Updates are a bit more complicated, and I prefer to think of them as a DELETE followed by an INSERT, for example:
UPDATE songs SET author_id=13 WHERE author_id=7 AND song_id=3
can be thought of as:
DELETE FROM songs WHERE author_id=7 AND song_id=3 :: [song_id=3, author_id=7]
INSERT INTO songs (song_id, author_id) VALUES (3,13)   :: [song_id=3, author_id=13]
I know this is not exactly equivalent and may cause some race conditions, but I believe it does not really matter in what follows.

As you can see in our example, a subspace can be 0-, 1- or 2-dimensional, depending on the scope on which the query operates. If we are unsure about the actual scope, we need to upper-bound it for my algorithm to work. For example:
SELECT COUNT(*) FROM songs WHERE song_id < 13 :: []
Here we cannot say much about which rows are important for this query, so we upper-bound it with the whole space. Please note that the subspace has nothing to do with the columns returned by the SELECT, but rather with the rows scanned during it.

In general, the subspace is determined by all the equality constraints found in the query. This is such a trivial correspondence that you can (and should) let your backend library deduce it automatically from the query string. Or, if you use some object-oriented query builder, you get it for free.
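For illustration, here is a minimal Python sketch of this deduction (the function name and the constraint-dict interface are my own invention, not from any particular framework). It represents a subspace as a tuple with a “*” for each unconstrained column, anticipating the vector notation introduced below:

COLUMNS = ("song_id", "author_id")  # the columns we track, fixed up front

def subspace(eq_constraints):
    # eq_constraints: dict of column -> value, one entry per "column = value"
    # condition in the query; unconstrained columns become the wildcard "*"
    return tuple(eq_constraints.get(col, "*") for col in COLUMNS)

# The examples from the list above:
assert subspace({"song_id": 1, "author_id": 2}) == (1, 2)  # INSERT ... VALUES (1,2)
assert subspace({"author_id": 7}) == ("*", 7)              # WHERE author_id = 7
assert subspace({}) == ("*", "*")                          # DELETE FROM songs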

For a 2-dimensional space, there are 4 kinds of subspaces:
- the whole space, given by no constraints at all: []
- a point, given by two constraints, for example [song_id=3, author_id=7]
- a vertical or horizontal line, given by one constraint, for example [song_id=3] or [author_id=7]

If we have two queries A and B, such that A is a write (DELETE or INSERT) and B is a read (SELECT), then a cached result of B should be invalidated after query A if the subspace of A intersects the subspace of B. (Actually this is a sufficient condition, not a necessary one.)
For example a query
DELETE FROM songs WHERE author_id=7 AND song_id=3 :: [song_id=3, author_id=7]
affects results of
SELECT COUNT(*) FROM songs WHERE author_id = 7 ::  [author_id=7]
and you can tell it just by comparing [author_id=7] with [song_id=3, author_id=7], without looking at the queries. Of course this is a safe approximation: sometimes cache invalidation is not necessary, but performing it anyway ensures the correctness of the algorithm.
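In code, this intersection test is a one-liner. A sketch (my own naming), using the tuple representation from the sketch above:

def intersects(a, b):
    # subspaces intersect iff they agree on every position
    # where neither of them has a star
    return all(x == "*" or y == "*" or x == y for x, y in zip(a, b))

assert intersects((3, 7), ("*", 7))      # the DELETE above affects [author_id=7]
assert not intersects((3, 7), ("*", 8))  # ...but not [author_id=8]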

Now, notice that there may be infinitely many queries which need to be invalidated, as there are infinitely many subspaces that intersect a given one. For example, a query with subspace [author_id=7] must invalidate all queries with subspace [song_id=1], as well as [song_id=2], etc... Moreover, there are infinitely many possible queries with the same subspace. Frameworks that try to keep track of all intersecting queries are doomed either to store everything in a large, permanent storage (which kind of defeats the purpose of an in-memory cache, to me), or to eventually forget about a query which needs to be invalidated (which is an error, to me).

The solution is much simpler, though, and does not involve operating on infinitely many cache keys... only exponentially many. But don’t worry: it is exponential in the number of dimensions, not in the size of the dataset. For example, for a binary relation we need to invalidate 2^2 = 4 keys. If you have a table with many columns, please don’t be alarmed either -- you just need to choose a few columns on which you’d like to focus, and crop the subspace to those few columns -- you will invalidate things more often than necessary, but you will not get hit by the exponential blow-up. I’ve never experienced a case with more than 3 relevant columns, and 2^3 = 8 is not a big number.

We need some shorter notation for subspaces. Let us write them as 2-dimensional vectors (similar to rows in the table). Instead of [song_id=3, author_id=7], let us write (3,7), and for [song_id=3] we will use (3,*). The whole space is (*,*), and [author_id=4] is just (*,4).

Now the tricky part. Let us introduce a new wildcard symbol “?”, which means something along the lines of “all subspaces which have a number here”. For example, (?,3) is shorthand for the infinite set of subspaces {(0,3),(1,3),(2,3),...}. There is no particular query that has subspace (?,3); this artificial notation corresponds to all queries which have the second dimension fixed.

I assume that your global cache (such as memcached) supports an atomic increment operation, which is very convenient for maintaining a “revision number”, which we will associate with each subspace (artificial or not). When I say “with each”, I don’t mean buying an infinitely large computer -- we will just allocate memory for the subspaces actually used in production, and those unused (or rarely used, if you have an LRU garbage collector) will simply take no memory at all.
A short note about evictions: if a get or increment results in a miss, simply set the revision to the current timestamp.

I also assume that you know the idea of generational caching, that is, using a key name with an appended revision to store cached data (such as “get_songs_count.123”). This way a single operation of incrementing a revision can invalidate all data stored at key names which depended on it. This seems like a waste of memory, but an LRU garbage collector can take care of the leftovers. The nice thing about it is that while the single small integer (the revision number) must be stored in the global cache, the potentially large query result can be stored in the local cache, from where it can be quickly retrieved without any bottlenecks of the global cache. This is actually the only way (except for using silly small TTLs) that I know and use for invalidating data stored in a local cache.
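To make these two assumptions concrete, here is a sketch of the revision helpers, assuming a python-memcached-style client (get/set/add/incr); the key naming scheme is my own:

import hashlib, time

def key_of(vec):
    # e.g. ("*", 7) -> "rev.*,7"
    return "rev." + ",".join(str(v) for v in vec)

def get_revision(cache, vec):
    rev = cache.get(key_of(vec))
    if rev is None:
        # evicted (or never set): seed with the current timestamp,
        # per the note about evictions above
        rev = int(time.time())
        cache.add(key_of(vec), rev)  # add() is a no-op if someone was faster
    return rev

def bump_revision(cache, vec):
    # atomic increment; on a miss fall back to the timestamp rule
    if cache.incr(key_of(vec)) is None:
        cache.add(key_of(vec), int(time.time()))

def data_key(sql, revisions):
    # generational key: a hash of the query plus the concatenated revisions
    return hashlib.md5(sql.encode()).hexdigest() + "." + ".".join(str(r) for r in revisions)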

Now the algorithm. Or actually two algorithms, one for read-only queries, and one for writes.

When you perform a read query A, first determine its subspace. I assume you have d columns, so the subspace can be expressed as a d-ary vector of numbers and stars, as explained above. For example the query “SELECT song_id FROM songs WHERE author_id = 3” corresponds to the vector (*,3).
Then compute all possible variations of this d-ary vector, by using a single substitution rule: you can put “?” in place of a number. For example, if your original vector was (*,3), this results in two vectors: (*,?) and the original. If, on the other hand, your query had subspace (7,3), you would end up with four variations: (7,3), (?,3), (7,?), (?,?).
Then, using a bulk (multi)get, fetch from the global cache the revisions of those (at most 2^d) subspaces.
Concatenate the revisions into one big revision (for example 123.3.114.14).
Then fetch from the local cache the cached result of the query, which should reside at a key which is the concatenation of the query string (or a hash of it, if you prefer) and the big concatenated revision.
If there is a miss in the local cache, try the global cache.
If there is a miss in the global cache, query the database.
Then add the missing keys to the caches as needed.
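Put together, the read path might look like the following sketch (reusing key_of and data_key from above; db.execute is a hypothetical stand-in for your database call):

import time
from itertools import product

def read_variants(vec):
    # one rule: any number may be replaced by "?"; stars stay stars
    options = [("*",) if v == "*" else (v, "?") for v in vec]
    return [tuple(p) for p in product(*options)]

def cached_select(global_cache, local_cache, db, sql, vec):
    variants = read_variants(vec)            # at most 2^d of them
    keys = [key_of(v) for v in variants]
    found = global_cache.get_multi(keys)     # one global round-trip
    revs = []
    for k in keys:
        if k in found:
            revs.append(found[k])
        else:                                # evicted: seed with a timestamp
            rev = int(time.time())
            global_cache.add(k, rev)
            revs.append(rev)
    key = data_key(sql, revs)                # query hash + big concatenated revision
    result = local_cache.get(key)            # cheap, machine-local lookup
    if result is None:
        result = global_cache.get(key)
        if result is None:
            result = db.execute(sql)
            global_cache.set(key, result)
        local_cache.set(key, result)
    return result

For example, cached_select(mc, local, db, "SELECT COUNT(*) FROM songs WHERE author_id = 7", ("*", 7)) fetches the revisions of (*,7) and (*,?) in a single multiget.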

When you perform a write query B, determine its subspace.
Then compute exactly 2^d possible variations of it, by using two substitution rules:
a) you can put “?” in place of “*”
b) you can put “*” in place of any number
For example, for the subspace (*,3) you’ll end up with four variants: (*,3), (?,3), (*,*), (?,*).
Then perform the query B.
Finally, increment the revisions of all these variants.
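And the write path, again as a sketch on top of the same helpers:

from itertools import product

def write_variants(vec):
    # two rules, applied independently per position: "*" may become "?",
    # and a number may become "*" -- exactly 2^d combinations
    options = [("*", "?") if v == "*" else (v, "*") for v in vec]
    return [tuple(p) for p in product(*options)]

def invalidating_write(global_cache, db, sql, vec):
    db.execute(sql)                    # perform the write itself
    for v in write_variants(vec):      # then bump all 2^d revisions
        bump_revision(global_cache, v) # one incr each, unless your client can batch them

assert set(write_variants(("*", 3))) == {("*", 3), ("?", 3), ("*", "*"), ("?", "*")}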

As you can see, updates are a bit slower than reads (which in my opinion is usual, and nothing to be worried about). Some cache implementations allow bulk increments in one round-trip, in which case a write might actually be faster than the read scenario, as you do not contact the local cache at all.

Now, some explanation of why it works.
Imagine a bipartite graph which on the left-hand side has one vertex for each possible subspace of a write query, and on the right side has vertices corresponding to the subspaces of read queries. Actually both sets are equal, but we will focus on the edges.
An edge goes from left to right if a query on the left side affects the results of a query on the right side. As said before, both sets are infinite, but that’s not the problem. There are infinitely many edges, but that’s also not bad. What’s bad is that there are nodes on the left side with infinite degree, which means we would need to invalidate infinitely many queries. What the above tricky algorithm does is add a third layer to the graph, in the middle between the two, such that the transitive closure of the resulting graph stays the same (in other words: you can still get, using two edges, anywhere you could get by one edge in the original graph), yet each node on the left and each node on the right has finite (actually constant) degree. This middle layer corresponds to the artificial subspaces with “?” marks, and serves as a connecting hub for all the mess.
Now, when a query on the left executes, it needs to inform only its (small number of) neighbours about the change, moving the burden of reading this information to the right. That is, a query on the right side needs to check if there is a message in the “inbox” in the middle layer. So you can think about it as a cooperation where the left query makes one step forward, and the right query makes one step back, to meet at the central place and pass the important information about the invalidation of the cache.

Another point of view is that two subspaces (v[1],...,v[d]) and (u[1],...,u[d]) intersect each other if and only if they agree on all positions without stars. Or in other words: there is no index k such that v[k]<>u[k], v[k]<>*, and u[k]<>*.
So, given a vector (v[1],...,v[d]), you can predict that any intersecting subspace is described by a vector (u[1],...,u[d]) of a particular form. That is, it can be obtained from (v[1],...,v[d]) by using some simple substitution rules:
a) you can change any number to a star
b) you can change any star to any number
c) you cannot use both of the above rules on the same index k (which would let you turn one number into a different one)
The problematic rule b) allows infinitely many possibilities.
To overcome this we introduced the question mark symbol, which represents “any number” without stating any particular number. The path of substitutions leading from (v[1],...,v[d]) to (u[1],...,u[d]) is then explored from the other end as well, by substituting question marks for the numbers in (u[1],...,u[d]). If we can reach the same term from both ends, then we know there is a path from one end to the other, even though there is an infinite maze in the middle.
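This meeting-in-the-middle property can be checked mechanically. A quick sanity test on top of the sketches above: two subspaces intersect exactly when the write’s variants and the read’s variants share a vector (that shared vector is the “inbox” in the middle layer):

from itertools import product

for w in product((1, 2, "*"), repeat=2):
    for r in product((1, 2, "*"), repeat=2):
        meet = set(write_variants(w)) & set(read_variants(r))
        assert bool(meet) == intersects(w, r)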

As pointed out before, you can use this technique for a higher number of columns, or keep the focus on just a few of them at the expense of premature invalidation. You can easily use it in a topology without local caches, by storing everything in the global cache. You can adapt the technique to columns storing strings, or nulls, or whatever. You can use it for NoSQL databases if you need to. But this is not the end of the possible improvements.

If you know in advance what kind of write queries you will perform -- that is, if you know what subspaces they will have -- you can “trim” the graph to a smaller one, which will result in faster reads and writes, as you will not have to fetch that many (is 4 many?) keys.

For example, if your modifications are just single-row operations (INSERT one row, DELETE one row), then the subspaces of your modifying queries never have stars in them. You can immediately see from the substitution rules that you will never-ever increment revisions of the artificial subspaces (those with question marks), as “?” can only be made from “*”. If so, then it is not necessary to fetch them, or even think about them. You can safely ignore them.
Now, read queries only depend on the revisions of vectors which differ from the original vector by replacing something with question marks. Since such revisions are never used in this setting, you only need to fetch the revision of the original vector, which obviously has no question marks. This allows for a quite fast and simple implementation, which fetches only a single key from the global cache. Actually, my original framework forbade any modifications other than row-by-row, just to get these extra-fast reads. I believe that in many real-world scenarios you never INSERT more than a single row at once. I have seen cases where you had an urge to DELETE many rows at once (for example when cleaning up dependent records), and you may feel bad about the idea of doing it row by row. Think about the amortized cost, though: each of the rows you delete had to be added at some point in time, and if you are already comfortable with adding rows one by one, then you have already paid enough credit not to worry about the deletes.
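As a sketch, the trimmed read path for this row-by-row setting needs just a single get instead of a multiget (again reusing the helpers from above):

def cached_select_point_writes(global_cache, local_cache, db, sql, vec):
    # point writes never bump "?"-revisions, so a read only needs
    # the revision of its own subspace vector
    rev = get_revision(global_cache, vec)
    key = data_key(sql, [rev])
    result = local_cache.get(key)
    if result is None:
        result = global_cache.get(key)
        if result is None:
            result = db.execute(sql)
            global_cache.set(key, result)
        local_cache.set(key, result)
    return result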

That’s it, folks. Now, which frameworks implement this idea, in which paper can I read about it, and how many years did I oversleep?