On Thu, Mar 20, 2014 at 11:57 PM, Josiah Carlson <josiah....@gmail.com> wrote:
>
>
> I'm sure you mean "very good approximation of LRU-based eviction". And yes, it should be very well behaved in practice. At 10 probes, you have roughly a 65% chance of getting one of the oldest 10% of elements, or a ~94% chance of getting one of the oldest 25% of elements. But the secret is that just because you evict literally the oldest element in LRU doesn't mean that the oldest element was necessarily the right element to evict. Perfect eviction (in the case where the TTL hasn't passed) requires knowing whether or not a given element will be used in the future prior to its original expiration time, and if so, when, so you know whether there are other elements that would be requested more often.
Yes, sorry. Totally agree with you: what I'm doing here is just to
compare the approximated algorithm with what LRU would do under the
same conditions. The real miss rate depends on the access pattern,
however it is interesting to note that when you start to consider
access patterns, our approximation is even better than it looks when
you just compare it with the behavior of true LRU.
If you look at http://antirez.com/misc/lru.html, which shows the
algorithm's behavior with 5 samples after I applied the pool (of 16
elements), the more recently an object was accessed, the less likely
it is to be handled incorrectly.
Now if we apply a power-law access pattern and count cache misses, our
simulated application will put more stress on exactly the elements that
we handle with a better approximation, which is why Matt measured just
2.5% misses in this setup.
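As a quick sanity check on the numbers above: the probability that k
independent random probes hit at least one element in the oldest
fraction p of the keyspace is 1 - (1-p)^k. A tiny standalone C check
(purely illustrative, not Redis code):

#include <stdio.h>
#include <math.h>

/* Probability that k independent random probes hit at least one element
 * belonging to the oldest fraction p of the keyspace: 1 - (1-p)^k. */
static double hit_probability(double p, int k) {
    return 1.0 - pow(1.0 - p, k);
}

int main(void) {
    printf("10 probes, oldest 10%%: %.2f\n", hit_probability(0.10, 10)); /* ~0.65 */
    printf("10 probes, oldest 25%%: %.2f\n", hit_probability(0.25, 10)); /* ~0.94 */
    return 0;
}

With k=10 this gives ~0.65 for p=0.10 and ~0.94 for p=0.25, matching
the figures you quoted.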
> The only time there are substantial wins WRT using LRU vs some other scheme is when the additional overhead of keeping the LRU data is minor in comparison - like page caches (where pages are anywhere from 4k to 1m, the latter in the case of jumbo pages) or db caches (that typically reference chunks of a table space or chunks of an index).
>
> In practice, choosing the oldest after X probes is very common - in CPU caches in particular. If you've ever heard of "X-way set associative cache", that just means that the cache is arranged in such a way that if you have data that you are going to cache, you can store it in one of X places, and which of those places is determined by the oldest among them. Fully associative is LRU, but the 2-, 4-, 8-, and 16- way associative caches were and still are very common (sometimes using different levels of associativity in different caches).
Yep, in hardware many LRU approximations are used, and it makes sense
and is "tolerated" :-) I believe we encountered more resistance to this
approach in Redis land because in software there is the idea that you
need to do the Right Thing. I believe the right thing here was to avoid
using more CPU and memory for something that provides the same result;
the paradox is that the current Redis approach actually allows us to
experiment in the future with more complex stuff that would be hard to
accomplish if you tried to do the right thing.
For example, an interesting scheme to try is based on frequency, where
you evict the key that was used the least. This scheme is not widely
used in practice since it has two issues: one is that an object may be
accessed many times in a small timespan and then not accessed for a
long time, yet it still will not get evicted. The other issue is that a
monotonically increasing counter is not great when you have a
long-running cache: in the end everything converges to the max value.
Our "sampling" method would allow us to try a more complex LFU
approach where objects start with the counter set to 0, and they get
the counter incremented on access, but decremented via random sampling
(in order to simulate a "decay" due to time). To test this only in
theory is pointless, so to play with such a thing we definitely would
need a big Redis as a cache use case with engineers that want to
experiment to get more out of Redis.
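Just to make the idea concrete, here is a minimal sketch of what such a
counter could look like (hypothetical code, the names are invented for
illustration and none of this exists in Redis today):

/* Hypothetical per-object access counter with probabilistic decay: it is
 * incremented on every access and decremented on random samples of
 * objects, so rarely-touched keys drift back toward zero over time. */
struct cacheObject {
    unsigned char lfu_counter;   /* Saturates at 255. */
    /* ... the rest of the object ... */
};

static void lfuOnAccess(struct cacheObject *o) {
    if (o->lfu_counter < 255) o->lfu_counter++;
}

/* Called from time to time on a few randomly sampled objects. */
static void lfuDecaySample(struct cacheObject **sample, int count) {
    for (int j = 0; j < count; j++) {
        if (sample[j]->lfu_counter > 0) sample[j]->lfu_counter--;
    }
}

Eviction would then pick, among a random sample, the object with the
smallest counter, exactly as the current code picks the oldest one.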
> The statistical guarantees that you get from that will still be very good, initially starting at around 96.5% to get an item from the oldest 10% of items, improving as you add more probes/evictions. There are some Bayesian stats that describe how it works, but the non-Bayesian methods are easier to describe and calculate. Long story short, if you flip it around and perform 62 probes initially to evict 10 elements (only evicting after you've done all of the probes, 32 initial plus 3 for each desired eviction), you have a 99.6% chance of only evicting keys that are in the oldest 10% of keys. Performing 32, then evicting 1, doing 3 probes to add to your pool, then evicting 1, etc., evicting a total of 10 items will put the probability of evicting *only* items in the oldest 10% at between 96.5% and 99.6%, which I'd call pretty good odds.
Yes, this is in accordance with what I see. It is non-trivial to model
mathematically given that there is the pool (which may contain a
variable amount of data over time), so honestly I resorted to
graphical representations to compare the performance of different
algorithms. Sometimes I get, numerically, the same number of
"errors", but graphically the distribution of items looks different and
it is pretty clear whether one algorithm is performing better.
> One of the other valuable insights that they gained is that you never use sparse hash tables. To the point, chained hash tables work best when they are under some fixed ratio full, with typical ratios in practice ranging from 50%-80%. But when you have blocks for your buckets, you want most/all of your buckets to have entries, so rather than having 2**k buckets, you have 2**(k-3) buckets, ensuring that a typical bucket itself will average 50%-80% full. Over-full buckets are okay because you've got a pointer for your chained buckets. What that means is that the smallest hash table that you'd use would have 1 bucket, making the smallest hashes not require hashing at all!
>
> On the rehashing side of things, you can also incrementally rehash one item at a time, you would just need to start from the last item inserted into that (chained) bucket. Alternatively, you can handle a chunk of 8 at a time (shifting items from the next bucket into the main hash, a quick memcpy operation), which will increase the incremental time to perform a single rehash operation, but would reduce rehash operation count.
I'm sure they did a good job for their needs, and probably some ideas
could be applied, but overall I have the feeling that this is a bit of
a different design, given that in Redis we try to find a balance
between memory usage, performance, and the ability to easily rehash.
And on top of that we have to support the get-random-element operation.
However, it would surely be instructive to look at other hash table
implementations and apply the best-matching ideas to dict.c.
> It's CPU use, RAM size, and RAM latency when not performing sequential operations (though short sequential operations offer little benefit). You should have observed that chasing pointers is never a sequential operation, and the destination is almost never cached, so any time you can eliminate following a pointer, you win.
And you win big. The embedded sds object that we introduced some time
ago provided a 20% performance boost with minor changes. It is a lot
for an apparently small change...
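The trick, roughly (a simplified sketch, not the actual robj/sdshdr
definitions): the header and the string bytes share a single
allocation, so reading the value never follows a second,
probably-uncached pointer.

#include <stdlib.h>
#include <string.h>

/* Simplified illustration of the embedded-string idea: one allocation
 * holds both the header and the bytes. Not the real Redis layout. */
struct embeddedString {
    unsigned int len;
    char buf[];              /* The bytes live right after the header. */
};

static struct embeddedString *createEmbeddedString(const char *s, unsigned int len) {
    struct embeddedString *o = malloc(sizeof(*o) + len + 1);
    if (o == NULL) return NULL;
    o->len = len;
    memcpy(o->buf, s, len);
    o->buf[len] = '\0';
    return o;
}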
> Redis currently uses a dict entry of (key*, value or value*, next*). If all keys for hash tables were "interned"* before use, then key comparison could be a pointer comparison. Even better, if you know that all hash keys are interned before use, and you are attempting to read information from any hash, then if the interning would have added a new entry to the intern mapping, you know that the entry doesn't exist. It's trivial to do this with a reference counted/garbage collected language, a bit more work for Redis. The plus side is that you get faster queries if you have one entry that is making multiple probes (think SET/ZSET intersection/union), never mind memory savings in some use-cases (search indexes utilizing ZSETs, non-integer SETs, or large enough integer sets so intsets aren't used), at the cost of some lookup time in advance, along with some potentially annoying reference counting.
I used this technique in the past; the interesting thing is that you
don't really have to *guarantee* that every distinct string has a
unique allocation to get at least some of the benefits.
You could do:
if (ptr1 == ptr2) return 1;     /* Same allocation: match for free. */
/* Otherwise fall back to a byte by byte comparison. */
return strcmp(ptr1, ptr2) == 0;
If you have a hash table implementation where there are usually just
1-2 keys chained per bucket (the Redis hash table is like this on
average), you have a good probability of hitting the right key ASAP
and skipping the comparison.
We did this when the "object sharing" feature existed; however,
currently it would require a dedicated hash table just to "intern"
strings.
For example, I have the key "foo" in my database and the user sends the
command "GET foo". Turning "foo" into its canonical pointer takes a
hash table lookup, so if I later need to query "foo" again and again I
get some win; otherwise it is not worth the effort.
Moreover, as you noticed, this would require reference-counted sds
strings, which would add some memory overhead.
> There are several methods to minimize lookup times, chief among them is instead of mapping the string to itself, you map the hash of the string to the string. With a 64 bit hash, the probability of collision is very small, even with huge datasets, and collision resolution is a string comparison operation - far fewer than you would see in a typical hash lookup scenario when you have chains.
Of all the things we discussed in this email exchange, I bet the
biggest wins, both memory- and CPU-wise, are in allowing dict.c to keep
a few items per bucket at the default fill-level ranges...
It would mean something like dropping the "next" pointer, which is 4 or
8 bytes, and substituting it with bitfields to efficiently handle the
array of items.
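Something along these lines, as a rough hypothetical layout rather than
a worked-out design (all the names here are invented):

/* Hypothetical bucket layout: a small fixed array of entries per bucket
 * plus a bitfield of used slots, instead of one "next" pointer per
 * entry. Only buckets that overflow pay for a chained block. */
#define ENTRIES_PER_BUCKET 4

struct packedDictEntry {
    void *key;
    void *val;
};

struct packedDictBucket {
    unsigned char used;                    /* Bit j set => slot j in use. */
    struct packedDictEntry entries[ENTRIES_PER_BUCKET];
    struct packedDictBucket *overflow;     /* NULL for most buckets. */
};

static int bucketFindFreeSlot(struct packedDictBucket *b) {
    for (int j = 0; j < ENTRIES_PER_BUCKET; j++)
        if (!(b->used & (1 << j))) return j;
    return -1;  /* Bucket full: caller falls back to the overflow block. */
}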
Another thing that could be done easily is to change the
dictSdsKeyCompare() function to do the "check by pointer" thing before
doing the actual comparison, at least for the expires hash table,
because we actually always use the same SDS string for the key in the
main dictionary and the key in the expires hash table:
void setExpire(redisDb *db, robj *key, long long when) {
    dictEntry *kde, *de;

    /* Reuse the sds from the main dict in the expire dict */
    kde = dictFind(db->dict,key->ptr);
    redisAssertWithInfo(NULL,key,kde != NULL);
    de = dictReplaceRaw(db->expires,dictGetKey(kde));
    dictSetSignedIntegerVal(de,when);
}
This would work very well if we also change the code so that it tries
to look up the expires hash table with the pointer found in the
database key every time this is possible (often, AFAIK).
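Sketching it against the current shape of dictSdsKeyCompare() (from
memory, so treat it as approximate; it assumes the usual redis.c /
sds.h / dict.h context), the only change is the pointer check at the
top:

int dictSdsKeyCompare(void *privdata, const void *key1, const void *key2)
{
    int l1,l2;
    DICT_NOTUSED(privdata);

    /* Same sds allocation (as setExpire() above guarantees for expires):
     * a pointer check settles it with no length or byte comparison. */
    if (key1 == key2) return 1;

    l1 = sdslen((sds)key1);
    l2 = sdslen((sds)key2);
    if (l1 != l2) return 0;
    return memcmp(key1, key2, l1) == 0;
}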
Many things to try, and little time as usual :-) Thanks for the chat.