Why is there no real LRU eviction algorithm in Redis?


Dmitriy Faleychik

Dec 9, 2013, 10:30:36 AM
to redi...@googlegroups.com
In my company I need to use Redis in a Memcached-like way, and I don't want to add Memcached to the project, because I want to minimize the overall number of technologies used and keep the system simple. But Redis with "maxmemory-policy allkeys-lru" sometimes evicts very recently used elements even with a high "maxmemory-samples" value, because the current eviction algorithm picks its victims from randomly sampled elements.

I was thinking about how the eviction algorithm could be improved and came up with a simple idea that, as I later discovered, is used in Memcached. The idea is really simple: if we keep a linked list of elements in the order they were accessed, we can get the least recently used element very cheaply and evict it. This approach also allows storing elements in the RDB snapshot in access order, so after loading the database from the RDB we can keep the access list.
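To make the idea concrete, here is a minimal sketch (plain C, not taken from the actual patch; all names are illustrative): keep every key on a doubly linked access list, move its node to the head on every access, and evict from the tail when memory must be freed.

typedef struct lruNode {
    struct lruNode *prev, *next;
    void *key;                      /* points back to the key owned by the dict */
} lruNode;

typedef struct lruList {
    lruNode *head, *tail;           /* head = most recently used, tail = least */
} lruList;

/* Called on every key access: unlink the node and relink it at the head. */
void lruTouch(lruList *l, lruNode *n) {
    if (l->head == n) return;                   /* already the most recent */
    if (n->prev) n->prev->next = n->next;
    if (n->next) n->next->prev = n->prev;
    if (l->tail == n) l->tail = n->prev;
    n->prev = NULL;
    n->next = l->head;
    if (l->head) l->head->prev = n;
    l->head = n;
    if (l->tail == NULL) l->tail = n;
}

/* Called when over maxmemory: the eviction victim is simply the tail. */
lruNode *lruVictim(lruList *l) {
    return l->tail;
}

The prev/next pointers (plus the link from the dict entry) are where the extra 16-24 bytes per key mentioned below come from.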

I forked Redis 2.8.2 and implemented the algorithm as an additional maxmemory policy, so you can take a look at it and comment on what you think about it. I've implemented a simple list without changing the RDB save algorithm, as a preview of what can be done. In my tests this eviction algorithm slows Redis down by less than 1% and uses an extra 16 bytes per key on a 64-bit system (I don't know why 16; it should be 24 extra bytes, maybe because of some structure alignment). 16 bytes per key is 16 megabytes per 1 million keys, so I don't think it is too much.

So it would be great if you considered implementing this algorithm in Redis, because there are people who would love to have it there.

My implementation can be found in my GitHub repo at https://github.com/fadimko/redis/tree/allkeys-real-lru, in the allkeys-real-lru branch. To turn the algorithm on, set "maxmemory-policy allkeys-real-lru" in the configuration.

Thank you.

Matt Palmer

Dec 9, 2013, 4:52:51 PM
to redi...@googlegroups.com
Hi Dmitriy,

On Mon, Dec 09, 2013 at 07:30:36AM -0800, Dmitriy Faleychik wrote:
> I forked redis 2.8.2 and implemented the algorithm as an additional
> maxmemory policy, so you can take a look at it, and comment on what you
> think about it. I've implemented simple list without changing rdb save
> algorithm, as it is a preview of what can be done. After my tests I found
> out that this eviction algorithm slows down Redis by less than 1% and uses
> extra 16 bytes for each key in a 64-bit system (I don't know why 16, it
> should be 24 extra bytes, may be because of some structure alignments).

Shiny! Thanks for putting this together, and also for benchmarking it and
determining that it doesn't have a huge performance impact. I can't speak
for Salvatore and getting it into Redis proper, but I'll definitely take a
look at incorporating it into NDS -- having a more effective eviction
algorithm is important for cache performance, and we typically store less
keys in memory, so the memory overhead will be much less pronounced.

- Matt

--
I am cow, hear me moo, I weigh twice as much as you. I'm a cow, eating
grass, methane gas comes out my ass. I'm a cow, you are too; join us all!
Type apt-get moo.

Dmitriy Faleychik

Dec 10, 2013, 7:53:12 AM
to redi...@googlegroups.com, mpa...@hezmatt.org
Hi Matt,

I'm glad to hear that you're considering incorporating my code into your project. I haven't used Redis NDS myself, but I've read some very positive reports about it online.

I also want to add that my code evicts data instantly, because all it needs to do to find the least recently used key is take it from the tail of the linked list.

I'm looking forward to your further feedback.

Salvatore Sanfilippo

Mar 19, 2014, 5:18:58 AM
to Redis DB
Hello Dmitriy,

I'm reviewing your code. I would probably rework it a little bit here
and there for consistency with the function calls and the dict.h API,
but my first impression is that it looks really cool and well thought
out. Thanks for providing this patch!

My plan is to use today to continue the review, tweak something if
needed, perform some testing, and finally comment more here about the
implementation: what we could change, whether or not it is a good idea
to also store the objects' age in the RDB, and so forth.

More news ASAP,
Salvatore

On Mon, Dec 9, 2013 at 4:30 PM, Dmitriy Faleychik <dima...@gmail.com> wrote:
> In my company I need to use Redis in a Memcached-like way, and I don't want
> to include Memcached in the project, because I want to minimize the overall
> number of technologies used to simplify the system. But Redis with
> "maxmemory-policy allkeys-lru" sometimes evicts very recently used elements
> even with high "maxmemory-samples" value, because current eviction algorithm
> deletes randomly taken elements.
>
> I was thinking -- how can an eviction algorithm be improved and came up with
> simple idea, that later I discovered is being used in Memcached. The idea is
> really simple -- if we will store linked list of elements in order they were
> accessed, we can get the least recent used element in a very simple way and
> evict it. This algorithm also allows to store elements in rdb snapshot in
> order they were accessed, so after loading the database from the rdb we can
> keep the access list.
>
> I forked redis 2.8.2 and implemented the algorithm as an additional
> maxmemory policy, so you can take a look at it, and comment on what you
> think about it. I've implemented simple list without changing rdb save
> algorithm, as it is a preview of what can be done. After my tests I found
> out that this eviction algorithm slows down Redis by less than 1% and uses
> extra 16 bytes for each key in a 64-bit system (I don't know why 16, it
> should be 24 extra bytes, may be because of some structure alignments). 16
> bytes for each key gives 16 megabytes for 1 million keys, so I don't think
> it is too much.
>
> So it would be great if you consider implementing this algorithm in Redis,
> because there are some people who would love to have it there.
>
> My implementation can be found in my github repo
> https://github.com/fadimko/redis/tree/allkeys-real-lru in branch
> allkeys-real-lru. To turn the algorithm on you should set "maxmemory-policy
> allkeys-real-lru" in configuration.
>
> Thank you.
>



--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

To "attack a straw man" is to create the illusion of having refuted a
proposition by replacing it with a superficially similar yet
unequivalent proposition (the "straw man"), and to refute it
-- Wikipedia (Straw man page)

Josiah Carlson

Mar 19, 2014, 2:03:47 PM
to redi...@googlegroups.com
As a follow-up to this, Salvatore posted on Twitter that this might not be the best thing to merge, primarily because it adds roughly 20% memory overhead per key: https://twitter.com/antirez/status/446321008904273920 . I would also suspect that they discovered that the eviction quality itself is not substantially better for the average workload: with 10 probes, you are about 65% likely to get one of the oldest 10% of keys, and ~95% likely to get one of the oldest 25% of keys. Double the probe count to 20 and those become 88% and 99.7%, respectively.
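For reference, those percentages follow from treating the probes as independent uniform samples: the chance that at least one of n probes falls in the oldest fraction p of keys is 1 - (1 - p)^n. A quick check (a standalone snippet, not Redis code):

#include <math.h>
#include <stdio.h>

int main(void) {
    int probes[] = {10, 20};
    double oldest[] = {0.10, 0.25};          /* "oldest 10%" and "oldest 25%" */
    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 2; j++) {
            /* P(at least one probe in the oldest fraction) = 1 - (1-p)^n */
            double hit = 1.0 - pow(1.0 - oldest[j], probes[i]);
            printf("%2d probes, oldest %2.0f%%: %.1f%%\n",
                   probes[i], oldest[j] * 100.0, hit * 100.0);
        }
    }
    return 0;
}

This prints roughly 65.1%, 94.4%, 87.8%, and 99.7%, matching the figures above.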

Incidentally, these are some of the reasons why I didn't continue development of, or push for inclusion of, the similar thing I did last August.

That said, I've got an idea that could substantially improve the eviction quality and make better statistical guarantees during eviction cycles. I need to do a bit of legwork on my end, but I think it could reasonably solve the random vs. LRU concerns that periodically come up.

 - Josiah




Salvatore Sanfilippo

Mar 19, 2014, 5:34:43 PM
to Redis DB
On Wed, Mar 19, 2014 at 7:03 PM, Josiah Carlson
<josiah....@gmail.com> wrote:
> Incidentally, I've got an idea that could substantially improve the eviction
> quality and make better statistical guarantees during eviction cycles. I
> need to do a bit of legwork on my end, but I think that it could reasonably
> solve the random vs. LRU concerns that periodically come up.

Thanks Josiah, I agree, I think I'll try something tomorrow.

My approach will be information-theory based: basically, I think that
at every eviction cycle the N elements we sample are wasted, as we
only use the best candidate and throw away the rest of the gathered
information.
What I'm doing is creating an O(1)-memory (fixed-size) data structure
that remembers the oldest candidates seen so far. The eviction cycle
is then modified to do this:

1) Sample N. Don't take action based on the sampled elements, just
populate the M best picks we have.
2) Evict the best among the M in our fixed-size data structure.

This way we don't waste the knowledge accumulated across eviction cycles.
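Roughly, such a pool could look like this (an illustrative sketch, not the actual branch; duplicate keys and locking are ignored here):

#include <string.h>

#define EVICTION_POOL_SIZE 16

typedef struct poolEntry {
    unsigned long long idle;   /* estimated idle time of the sampled key */
    void *key;                 /* NULL when the slot is empty */
} poolEntry;

/* Pool kept sorted by descending idle time: pool[0] is the oldest candidate
 * seen so far. Insert a sampled key, dropping the youngest entry if full. */
void poolInsert(poolEntry *pool, void *key, unsigned long long idle) {
    int i = 0;
    /* Find the first empty slot, or the first entry younger than this key. */
    while (i < EVICTION_POOL_SIZE && pool[i].key != NULL && pool[i].idle >= idle) i++;
    if (i == EVICTION_POOL_SIZE) return;   /* pool full of older candidates: discard */
    /* Shift the tail right by one; the youngest entry falls off the end. */
    memmove(pool+i+1, pool+i, sizeof(*pool) * (EVICTION_POOL_SIZE-i-1));
    pool[i].key = key;
    pool[i].idle = idle;
}

/* Eviction step: take the oldest candidate and shift the rest left. */
void *poolPop(poolEntry *pool) {
    void *victim = pool[0].key;
    memmove(pool, pool+1, sizeof(*pool) * (EVICTION_POOL_SIZE-1));
    pool[EVICTION_POOL_SIZE-1].key = NULL;
    return victim;
}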

Today Matt and I came up with two different simulations that both
showed that, in practice, the current algorithm is not too different
from true LRU, exactly as you noted.
So why make more changes to improve it?

1) Because we can :-)
2) Because between 3, 5, and 10 samples there are real differences in performance.

Benchmark for 3, 5, 10, 30 samples (in that order):

maxmemory-samples 3
~ ➤ redis-benchmark -r 1000000 -n 10000000 -t set -q -P 32
SET: 312201.38

maxmemory-samples 5
~ ➤ redis-benchmark -r 1000000 -n 10000000 -t set -q -P 32
SET: 291264.00

maxmemory-samples 10
~ ➤ redis-benchmark -r 1000000 -n 10000000 -t set -q -P 32
SET: 227968.00

maxmemory-samples 30
~ ➤ redis-benchmark -r 1000000 -n 10000000 -t set -q -P 32
SET: 127894.68

Most of the performance hit here comes from cache misses, so together
with what I described earlier, I'm going to try a new dict.c interface
that is able to directly return N random keys, but random in a
different way: we just seek to a random offset and walk forward,
collecting the first N keys we find. Since the keys are hashed, this
should be fine for our eviction sampling (but I'll verify with my
tests that this is true), and it will cost far fewer cache misses.
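As an illustration of that sampling idea (simplified, not the real dict.c code; the bucket layout and names here are placeholders), the interface could pick one random bucket and walk forward until it has collected N keys:

#include <stdlib.h>

typedef struct sEntry { void *key; struct sEntry *next; } sEntry;
typedef struct sDict { sEntry **buckets; unsigned long size; /* power of two */ } sDict;

/* Collect up to `count` keys starting from a single random bucket and
 * scanning forward, instead of doing `count` independent random lookups. */
unsigned int sampleContiguousKeys(sDict *d, void **keys, unsigned int count) {
    unsigned int collected = 0;
    unsigned long visited = 0;
    unsigned long idx = (unsigned long)rand() & (d->size - 1);  /* random start */

    while (collected < count && visited < d->size) {
        sEntry *he = d->buckets[idx];
        while (he && collected < count) {       /* harvest the whole chain */
            keys[collected++] = he->key;
            he = he->next;
        }
        idx = (idx + 1) & (d->size - 1);        /* next bucket, wrapping around */
        visited++;
    }
    return collected;
}

Since adjacent buckets are adjacent in memory, the walk is mostly sequential and much friendlier to the CPU cache than independent random lookups.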

Yet another improvement: currently we handle the wrapped LRU clock in
estimateObjectIdleTime(), but we can do better. When we find an object
whose lru is greater than the current LRU timestamp, we shouldn't just
compute the age in a different way, as we do now; we should also set
its age to LRU_MAX, marking it as old so that it is no longer mistaken
for a recently used object later.
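A hedged sketch of that change (the constants and names only approximate the 2.8 code, and the wrapped case is simplified to "treat as maximally old"):

#define LRU_CLOCK_MAX        ((1 << 22) - 1)   /* 22-bit lru field */
#define LRU_CLOCK_RESOLUTION 10                /* seconds per LRU tick */

unsigned long long estimateIdleTimeSketch(unsigned long obj_lru, unsigned long lruclock) {
    if (lruclock >= obj_lru) {
        /* Normal case: the age is the distance between the two clocks. */
        return (unsigned long long)(lruclock - obj_lru) * LRU_CLOCK_RESOLUTION;
    }
    /* The global clock wrapped around: instead of guessing an age from the
     * wrapped distance, treat the object as maximally old. The proposal is to
     * also write this back into the object so it stays "old" from now on. */
    return (unsigned long long)LRU_CLOCK_MAX * LRU_CLOCK_RESOLUTION;
}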

On top of all that, the precision of the LRU clock will be moved from
10 seconds to 1 second, because there are 2 additional bits in the
object structure that we can use to enlarge the lru timestamp from 22
to 24 bits.
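For context, this refers to the bitfields in the Redis object header; illustratively (a sketch, not the exact redis.h struct), the change amounts to something like:

typedef struct robjSketch {
    unsigned type:4;
    unsigned encoding:4;
    unsigned lru:24;       /* previously 22 bits plus 2 unused bits */
    int refcount;
    void *ptr;
} robjSketch;

With 24 bits at 1-second resolution the clock wraps roughly every 194 days, versus about 1.3 years for 22 bits at 10-second resolution, which is why the wrap handling above still matters.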

Long story short: regardless of the complaints, the Redis LRU
algorithm is actually a good approximation of true LRU for real-world
applications, so we are saving memory for a good reason. But it is not
as good as it could be, and it can be improved with the right changes
without adding too much complexity.

Given that the results were positive, it's time to invest in that.
Tomorrow I'll post here my results and possibly the branch
implementing all the ideas above.

Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

To "attack a straw man" is to create the illusion of having refuted a
proposition by replacing it with a superficially similar yet
unequivalent proposition (the "straw man"), and to refute it
— Wikipedia (Straw man page)

Josiah Carlson

Mar 19, 2014, 9:03:56 PM
to redi...@googlegroups.com
Those are good changes, and your picking of a fixed-size group to start with is about 90% the same as what I was thinking:
1. Assume that you get an "evict some data" call, and you have knowledge of the number of evictions that happened in the last pass
2. Perform #past evictions * probes initially, storing the results
3. Evict the oldest items until you've got the ratio that you want/expect, or you run out of sufficient statistical information to make an inference
4. Optionally keep those results between passes

There are certain guarantees that you can make about the first item that you evict in the above way, and other, weaker guarantees for the remainder. Good rules of thumb for very reasonable guarantees:
* At least 5x as many probes as evictions (consider each key ttl discovered as a "probe")
* At least 100 probes per eviction cycle

If you really want to cut down on your cache miss rate with the hash table, you should really take a look at what the Go programming language does. Instead of a bucket being a pointer to a linked list of entries, it's actually a chunk of 8 entries, with overflow handled by a pointer to another (external) 8-entry block (forming a singly linked list as you go along). They proved experimentally that 8 was the right number for each block, and that when the average number of entries per bucket exceeded a certain limit, that is when rehashing should occur. Information about their experiments and design decisions is in the source.
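For reference, a hypothetical C translation of that bucket layout (illustrative names, not Go's actual internals or a proposed dict.c patch) could look like:

#include <stdint.h>

#define BUCKET_SLOTS 8

typedef struct bucketSketch {
    uint8_t tophash[BUCKET_SLOTS];       /* high byte of each entry's hash, 0 = empty */
    void *keys[BUCKET_SLOTS];
    void *vals[BUCKET_SLOTS];
    struct bucketSketch *overflow;       /* extra 8-slot block when this one spills */
} bucketSketch;

A lookup scans the eight tophash bytes (a single cache line) before touching any key, so a probe usually touches one or two cache lines instead of chasing a pointer per chained entry.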

 - Josiah




Salvatore Sanfilippo

Mar 20, 2014, 2:12:20 AM
to Redis DB
On Thu, Mar 20, 2014 at 2:03 AM, Josiah Carlson
<josiah....@gmail.com> wrote:
> Those are good changes, and your picking of a fixed-size group to start is
> about 90% the same as what I was thinking:
> 1. Assume that you get a "evict some data" call, and you have knowledge of
> the number of evictions that happened last pass
> 2. Perform #past evictions * probes initially, storing the results
> 3. Evict the oldest items until you've got the ratio that you want/expect,
> or you run out of sufficient statistical information to make an inference
> 4. Optionally keep those results between passes

This looks like a good idea for adapting the number of probes
dynamically. My guess, however, is that with the fixed pool of M good
candidates you get asymptotically the same eviction quality as M
probes while actually doing only N probes to populate the pool. That
would be a big CPU win, since in the tests Matt and I performed we
noticed that with N=10 what you get is basically already a very good
approximation of true eviction.
So maybe with a pool of just M=32 elements, which I can manage with
minimal constant overhead by picking the right (rigid but
cache-friendly) data structure, and N=2 or 3 probes per cycle, I may
get an algorithm that performs well in different conditions while
using an amount of CPU comparable to the current one.

A few data points about this. The graph for 10 probes already looks
like a good approximation:

http://antirez.com/misc/lru_10.html

Matt tested a power-law distribution and the number of misses was ~2.5%.

> If you really want to cut down on your cache miss rate with the hash
> table... you should really take a look at what the Go programming language
> does. Instead of a bucket being a pointer to a linked list of entries, it's
> actually a chunk of 8 entries, with overflow having a pointer to another
> (external) 8 item block (singly linked list as you go along). They proved
> experimentally that 8 was the right number for each block, and that when the
> average number of blocks per bucket exceeded a certain limit, that is when
> rehashing should occur. Information about their experiments and design
> decisions are in the source.

I looked into this, but unfortunately with sparsely populated hash
tables the price to pay is quite big... we would need to do this on
the fly by converting the linked list into a fixed array (and the
reverse, since during rehashing we move single items from the old
hash table to the new hash table incrementally).
However, in the past Pieter and I looked at different possibilities,
and there is also a far more extreme one: using a ziplist of
key-pointer-key-pointer pairs. That would save *tons* of memory. Back
then it looked like too CPU-expensive a solution, but the reality is
that this is a moving target: it depends on the speed at which CPU
power grows relative to available RAM :-)

Thanks for your insights,
Salvatore

Salvatore Sanfilippo

Mar 20, 2014, 8:36:58 AM
to Redis DB
Update: I tried a first implementation of the eviction pool idea
above, and the results seem promising:

http://antirez.com/misc/lru_before.png
http://antirez.com/misc/lru_after.png

This is with a pool of M=16 elements; enlarging the pool does not seem
to win much more, and the whole code is designed to be cache-friendly
and fast with M very small.
Sampling is 5 elements in both graphs.

There is a branch called "lru" on GitHub with an implementation;
there are many other small improvements in the branch.

I'll keep you posted,
Salvatore

Josiah Carlson

Mar 20, 2014, 6:57:06 PM
to redi...@googlegroups.com
On Wed, Mar 19, 2014 at 11:12 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
> On Thu, Mar 20, 2014 at 2:03 AM, Josiah Carlson
> <josiah....@gmail.com> wrote:
>> Those are good changes, and your picking of a fixed-size group to start is
>> about 90% the same as what I was thinking:
>> 1. Assume that you get a "evict some data" call, and you have knowledge of
>> the number of evictions that happened last pass
>> 2. Perform #past evictions * probes initially, storing the results
>> 3. Evict the oldest items until you've got the ratio that you want/expect,
>> or you run out of sufficient statistical information to make an inference
>> 4. Optionally keep those results between passes
>
> This looks like a good idea to adapt the number of probes dynamically,
> my guess is however that with the fixed pool of M elements of good
> candidates you get asymptotically the same eviction quality of M
> probes by only actually doing N to populate the pool of M. That would
> be a big cpu win, since in tests Matt and I performed, we noticed that
> with N=10 what you get is basically already a very very good
> approximation of true eviction.

I'm sure you mean "very good approximation of LRU-based eviction". And yes, it should be very well behaved in practice. At 10 probes, you have a roughly 65% chance of getting one of the oldest 10% of elements or ~94% chance of getting one of the oldest 25% of elements. But the secret is that just because you evict literally the oldest element in LRU doesn't mean that the oldest element was necessarily the right element to evict. Perfect eviction (in the case where the TTL hasn't passed) requires knowing whether or not a given element will be used in the future prior to its original expiration time, and if so, when, so you know whether there are other elements that would be requested more often.

The only time there are substantial wins WRT using LRU vs. some other scheme is when the additional overhead of keeping the LRU data is minor in comparison - like page caches (where pages are anywhere from 4k to 1m, the latter in the case of jumbo pages) or db caches (which typically reference chunks of a table space or chunks of an index).

In practice, choosing the oldest after X probes is very common - in CPU caches in particular. If you've ever heard of an "X-way set associative cache", that just means that the cache is arranged so that any piece of data you are going to cache can be stored in one of X places, and which of those places gets replaced is determined by the oldest among them. Fully associative is LRU, but 2-, 4-, 8-, and 16-way associative caches were and still are very common (sometimes using different levels of associativity in different caches).
 
> So maybe with a poll of just M=32 elements that I can manage with
> minimal constant times picking the right (rigid but cache obvious)
> data structure, and N=2/3 probes per cycle, I may get an algorithm
> that performs well in different conditions while using an amount of
> CPU comparable to the current one.

The statistical guarantees that you get from that will still be very good, initially starting at around 96.5% to get an item from the oldest 10% of items, improving as you add more probes/evictions. There are some Bayesian stats that describe how it works, but the non-Bayesian methods are easier to describe and calculate. Long story short, if you flip it around and perform 62 probes initially to evict 10 elements (only evicting after you've done all of the probes, 32 initial plus 3 for each desired eviction), you have a 99.6% chance of only evicting keys that are in the oldest 10% of keys. Performing 32, then evicting 1, doing 3 probes to add to your pool, then evicting 1, etc., evicting a total of 10 items will put the probability of evicting *only* items in the oldest 10% at between 96.5% and 99.6%, which I'd call pretty good odds.

> A few data points about this. The graph for 10 probes looks already a
> good approximation:
>
> http://antirez.com/misc/lru_10.html
>
> Matt tested a power law distribution and the number of misses were ~ 2.5%

>> If you really want to cut down on your cache miss rate with the hash
>> table... you should really take a look at what the Go programming language
>> does. Instead of a bucket being a pointer to a linked list of entries, it's
>> actually a chunk of 8 entries, with overflow having a pointer to another
>> (external) 8 item block (singly linked list as you go along). They proved
>> experimentally that 8 was the right number for each block, and that when the
>> average number of blocks per bucket exceeded a certain limit, that is when
>> rehashing should occur. Information about their experiments and design
>> decisions are in the source.
>
> I looked into this but unfortunately with sparsely populated hash
> tables the price to pay is quite big... we would need to do this on
> the fly by converting the linked list in fixed array (and the reverse,
> since during rehashing we move single items from the old hash table to
> the new hash table incrementally).

One of the other valuable insights that they gained is that you never use sparse hash tables. To the point, chained hash tables work best when they are under some fixed ratio full, with typical ratios in practice ranging from 50%-80%. But when you have blocks for your buckets, you want most/all of your buckets to have entries, so rather than having 2**k buckets, you have 2**(k-3) buckets, ensuring that a typical bucket itself will average 50%-80% full. Over-full buckets are okay because you've got a pointer for your chained buckets. What that means is that the smallest hash table that you'd use would have 1 bucket, making the smallest hashes not require hashing at all!

On the rehashing side of things, you can also incrementally rehash one item at a time; you would just need to start from the last item inserted into that (chained) bucket. Alternatively, you can handle a chunk of 8 at a time (shifting items from the next bucket into the main hash, a quick memcpy operation), which will increase the incremental time to perform a single rehash operation but reduce the rehash operation count.
 
> However in the past with Pieter we looked at different possibilities
> and there is also a far more extreme one of using a ziplist of
> key-pointer-key-pointer: that would save *tons* of memory. Back then
> it looked like a too CPU expensive solution, but the reality is that
> this is a moving target, it depends on the speed at which CPU power
> grows over available RAM :-)

It's CPU use, RAM size, and RAM latency when not performing sequential operations (though short sequential operations offer little benefit). You should have observed that chasing pointers is never a sequential operation, and the destination is almost never cached, so any time you can eliminate following a pointer, you win.

But there is another trick to make this all even better by eliminating many string key comparisons (it is a strict reduction in the number of string comparison operations performed to look up an entry in any hash), as well as pointer following, though it would require some surgery on the dict access methods. You may already know some or all of this or have considered doing it before, but I'd be surprised if this didn't improve main hash lookup times by a factor of 2-3x in practice. I did something similar to what I describe below about 9 years ago using P4s (which had nasty pipeline reset times), and I was able to do 6 million hash lookups/second - about 2x faster than the C++ STL hash table equivalent.

Redis currently uses a dict entry of (key*, value or value*, next*). If all keys for hash tables were "interned"* before use, then key comparison could be a pointer comparison. Even better, if you know that all hash keys are interned before use, and you are attempting to read information from any hash, then if the interning would have added a new entry to the intern mapping, you know that the entry doesn't exist. It's trivial to do this with a reference counted/garbage collected language, a bit more work for Redis. The plus side is that you get faster queries if you have one entry that is making multiple probes (think SET/ZSET intersection/union), never mind memory savings in some use-cases (search indexes utilizing ZSETs, non-integer SETs, or large enough integer sets so intsets aren't used), at the cost of some lookup time in advance, along with some potentially annoying reference counting.

Another benefit is that if your OS-level memory allocation uses address randomization, you can use the address itself as a replacement for hashing the string, use the byteswapped address, or pass the address through your standard hash function (which can have an optimized version for fixed-size inputs) - all of which would effectively eliminate attacks relating to hash collisions, without requiring random hash initialization on startup (though keeping it doesn't hurt). I've used byteswapped addresses before in other contexts to get more shuffling in the upper bits, but hashing the address is probably better from a randomization standpoint.
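To illustrate (purely hypothetical helper names, not a proposed API):

#include <stdint.h>

/* Use the (ASLR-randomized) address of an interned key as its hash. */
static inline uint64_t hashFromAddress(const void *p) {
    return (uint64_t)(uintptr_t)p;
}

/* Byteswapped variant: moves the more variable low-order bytes of the
 * address into the upper bits, as described above (GCC/Clang builtin). */
static inline uint64_t hashFromAddressSwapped(const void *p) {
    return __builtin_bswap64((uint64_t)(uintptr_t)p);
}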

* String interning is the process of mapping an arbitrary string to its canonical version. There's a wikipedia article on it highlighting some of its uses in programming language runtimes.

There are several methods to minimize lookup times, chief among them is instead of mapping the string to itself, you map the hash of the string to the string. With a 64 bit hash, the probability of collision is very small, even with huge datasets, and collision resolution is a string comparison operation - far fewer than you would see in a typical hash lookup scenario when you have chains.


 - Josiah

Salvatore Sanfilippo

Mar 21, 2014, 4:01:49 AM
to Redis DB
On Thu, Mar 20, 2014 at 11:57 PM, Josiah Carlson
<josiah....@gmail.com> wrote:
>
>
> I'm sure you mean "very good approximation of LRU-based eviction". And yes, it should be very well behaved in practice. At 10 probes, you have a roughly 65% chance of getting one of the oldest 10% of elements or ~94% chance of getting one of the oldest 25% of elements. But the secret is that just because you evict literally the oldest element in LRU doesn't mean that the oldest element was necessarily the right element to evict. Perfect eviction (in the case where the TTL hasn't passed) requires knowing whether or not a given element will be used in the future prior to it's original expiration time, and if so, when, so you know whether there are other elements that would be requested more often.

Yes, sorry. I totally agree with you: what I'm doing here is just
comparing the approximated algorithm with what LRU would do under the
same conditions. The real miss rate depends on the access pattern;
however, it is interesting to note that when you start to consider
access patterns, our approximation is even better than it looks when
you just compare it with the behavior of true LRU.

If you look at http://antirez.com/misc/lru.html, which shows the
algorithm's behavior with 5 samples after I applied the pool (of 16
elements), the more recently an object was accessed, the less likely
it is to be handled incorrectly.
Now, if we apply a power-law access pattern and count cache misses,
our simulated application stresses most the elements that we handle
with the best approximation, so Matt measured just 2.5% misses in this
setup.

> The only time there is substantial wins WRT using LRU vs some other scheme is when the additional overhead of keeping the LRU data is minor in comparison - like page caches (where pages are anywhere from 4k to 1m, the latter in the case of jumbo pages) or db caches (that typically reference chunks of a table space or chunks of an index).
>
> In practice, choosing the oldest after X probes is very common - in CPU caches in particular. If you've ever heard of "X-way set associative cache", that just means that the cache is arranged in such a way that if you have data that you are going to cache, you can store it in one of X places, and which of those places is determined by the oldest among them. Fully associative is LRU, but the 2-, 4-, 8-, and 16- way associative caches were and still are very common (sometimes using different levels of associativity in different caches).

Yep, in hardware many LRU approximations are used, and that makes
sense and is "tolerated" :-) I believe we encountered more resistance
to this approach in Redis land because in software there is the idea
that you need to do the Right Thing. I believe the right thing was to
avoid using more CPU and memory for something that provides the same
result; a paradox, however, is that the current Redis approach
actually allows us to experiment with more complex things in the
future that would be hard to accomplish if you tried to do the right
thing.

For example, an interesting scheme to try is based on frequency,
where you evict the key that was used the least. This scheme is not
used much in practice since it has two issues: one is that if an
object is accessed a number of times in a small timespan and later
not accessed for a long time, it still will not get evicted. The
other issue is that a monotonically increasing counter is not great
when you have a long-running cache: in the end everything converges
to the max value.

Our "sampling" method would allow us to try a more complex LFU
approach where objects start with the counter set to 0, and they get
the counter incremented on access, but decremented via random sampling
(in order to simulate a "decay" due to time). To test this only in
theory is pointless, so to play with such a thing we definitely would
need a big Redis as a cache use case with engineers that want to
experiment to get more out of Redis.
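As a toy illustration of that idea (hypothetical names, not a proposal for the actual object header):

#include <stdint.h>

typedef struct lfuSketch {
    uint8_t freq;                        /* small, saturating access counter */
} lfuSketch;

void lfuOnAccess(lfuSketch *o) {
    if (o->freq < 255) o->freq++;        /* increment on access, saturate at 255 */
}

/* Applied to objects picked by the same random sampling used for eviction,
 * so counters of keys that are no longer accessed slowly "decay". */
void lfuOnSample(lfuSketch *o) {
    if (o->freq > 0) o->freq--;
}

/* The eviction cycle would then prefer the sampled key with the lowest freq. */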

> The statistical guarantees that you get from that will still be very good, initially starting at around 96.5% to get an item from the oldest 10% of items, improving as you add more probes/evictions. There are some Bayesian stats that describe how it works, but the non-Bayesian methods are easier to describe an calculate. Long story short, if you flip it around and perform 62 probes initially to evict 10 elements (only evicting after you've done all of the probes, 32 initial plus 3 for each desired eviction), you have a 99.6% chance of only evicting keys that are in the oldest 10% of keys. Performing 32, then evicting 1, doing 3 probes to add to your pool, then evicting 1, etc., evicting a total of 10 items will put the probability of evicting *only* items in the oldest 10% at between 96.5% and 99.6%, which I'd call pretty good odds.

Yes, this is in accordance with what I see. It is non-trivial to
model mathematically given the pool (which may contain a variable
amount of data over time), so honestly I resorted to graphical
representations to try to compare the performance of the different
algorithms. Sometimes I get, numerically, the same number of
"errors", but graphically the distribution of items looks different
and it is pretty clear whether one algorithm is performing better.

> One of the other valuable insights that they gained is that you never use sparse hash tables. To the point, chained hash tables work best when they are under some fixed ratio full, with typical ratios in practice ranging from 50%-80%. But when you have blocks for your buckets, you want most/all of your buckets to have entries, so rather than having 2**k buckets, you have 2**(k-3) buckets, ensuring that a typical bucket itself will average 50%-80% full. Over-full buckets are okay because you've got a pointer for your chained buckets. What that means is that the smallest hash table that you'd use would have 1 bucket, making the smallest hashes not require hashing at all!
>
> On the rehashing side of things, you can also incrementally rehash one item at a time, you would just need to start from the last item inserted into that (chained) bucket. Alternatively, you can handle a chunk of 8 at a time (shifting items from the next bucket into the main hash, a quick memcpy operation), which will increase the incremental time to perform a single rehash operation, but would reduce rehash operation count.

I'm sure they did a good job for their needs, and probably some ideas
could be applied, but overall I have the feeling that this is a bit
of a different design, given that in Redis we try to find a balance
between memory usage, performance, and the ability to easily rehash.
And on top of that we have to support the get-random-element
operation.
However, it could for sure be super instructive to look at other hash
table implementations and apply the best-matching ideas to dict.c.

> It's CPU use, RAM size, and RAM latency when not performing sequential operations (though short sequential operations offer little benefit). You should have observed that chasing pointers is never a sequential operation, and the destination is almost never cached, so any time you can eliminate following a pointer, you win.

And you win big. The embedded sds object that we introduced some time
ago provided a 20% performance boost with minor changes. It is a lot
for an apparently small change...

> Redis currently uses a dict entry of (key*, value or value*, next*). If all keys for hash tables were "interned"* before use, then key comparison could be a pointer comparison. Even better, if you know that all hash keys are interned before use, and you are attempting to read information from any hash, then if the interning would have added a new entry to the intern mapping, you know that the entry doesn't exist. It's trivial to do this with a reference
> counted/garbage collected language, a bit more work for Redis. The
> plus side is that you get faster queries if you have one entry that is
> making multiple probes (think SET/ZSET intersection/union), never mind
> memory savings in some use-cases (search indexes utilizing ZSETs,
> non-integer SETs, or large enough integer sets so intsets aren't
> used), at the cost of some lookup time in advance, along with some
> potentially annoying reference counting.

I used this technique in the past; the interesting thing is that you
don't really have to *guarantee* that all distinct strings have a
unique allocation to get at least some of the benefits.
You could do:

if (ptr1 == ptr2) return MATCH;
/* Otherwise compare strings... */
return strcmp(...);

If you have a hash table implementation where usually there are just
1-2 keys chained per bucket (the Redis hash table is like this on
average), you have a good probability of hitting the right key
immediately and skipping the comparison.
We did this when there was the "object sharing" feature; however,
currently it would require a dedicated hash table in order to
"intern" strings.

For example, I have the key "foo" in my database. The user sends the command "GET foo".
Turning "foo" into its canonical pointer takes a hash table lookup,
so if I later need to query foo again and again there is some win;
otherwise it is not worth the effort.
Moreover, as you noticed, this would require reference-counted sds
strings, which would add some memory overhead.

> There are several methods to minimize lookup times, chief among them is instead of mapping the string to itself, you map the hash of the string to the string. With a 64 bit hash, the probability of collision is very small, even with huge datasets, and collision resolution is a string comparison operation - far fewer than you would see in a typical hash lookup scenario when you have chains.

Of all the things we discussed in this email exchange, I bet the
biggest wins, both memory- and CPU-wise, are in allowing dict.c to
keep a few items per bucket at the default fill-level ranges...
It would be something like dropping the "next" pointer, which is 4 or
8 bytes, and substituting bitfields to handle the array of items
efficiently.

Another thing that could be done easily is to change the
dictSdsKeyCompare() function to do the "check by pointer" thing
before doing the actual comparison, at least for the expires hash
table, because we always use the same SDS string for the key in the
main database dict and the key in the expires hash table:

void setExpire(redisDb *db, robj *key, long long when) {
    dictEntry *kde, *de;

    /* Reuse the sds from the main dict in the expire dict */
    kde = dictFind(db->dict,key->ptr);
    redisAssertWithInfo(NULL,key,kde != NULL);
    de = dictReplaceRaw(db->expires,dictGetKey(kde));
    dictSetSignedIntegerVal(de,when);
}

This would work very well if we also changed the code to look up the
expires hash table using the pointer found in the database key
whenever possible (which is often, AFAIK).
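A hedged sketch of that comparison change (modeled on the shape of dictSdsKeyCompare() in 2.8, but not the actual patch):

#include <string.h>
#include "sds.h"   /* sds type and sdslen() */

int dictSdsKeyCompareSketch(void *privdata, const void *key1, const void *key2) {
    (void)privdata;
    if (key1 == key2) return 1;              /* same sds object: skip the byte comparison */
    size_t l1 = sdslen((sds)key1);
    size_t l2 = sdslen((sds)key2);
    if (l1 != l2) return 0;
    return memcmp(key1, key2, l1) == 0;
}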

Many things to try, and little time as usual :-) Thanks for the chat.

Dmitry Faleychik

Mar 24, 2014, 9:25:55 AM
to redi...@googlegroups.com
Hello, Salvatore,

I'm glad that you looked at the LRU problem again after my patch. The improvements you made significantly increased the speed and quality of the current LRU algorithm. One thought: what if we ran the very first eviction cycle not <maxmemory_samples> times, but, say, 1000 times? I think it might improve the overall data-eviction picture a little bit by filling the eviction pool with good values from the start.

I also benchmarked the old and new Redis LRU algorithms and my patch:
maxmemory_samples 10
redis.oldlru # redis-benchmark -r 1000000 -n 10000000 -t set -q -P 32
SET: 416008.00 requests per second

maxmemory_samples 5
redis.oldlru # redis-benchmark -r 1000000 -n 10000000 -t set -q -P 32
SET: 542593.56 requests per second

maxmemory_samples 5
redis.newlru # redis-benchmark -r 1000000 -n 10000000 -t set -q -P 32
SET: 638896.00 requests per second

redis.reallru # redis-benchmark -r 1000000 -n 10000000 -t set -q -P 32
SET: 771962.31 requests per second

As you can see, my implementation of the LRU algorithm works 20% faster than the new algorithm. I think that may be important for people who want to use LRU data eviction. As my implementation needs an additional 16 bytes of memory per key, maybe it could be included in the project as a compile-time feature? Then anyone who doesn't want the additional memory overhead, or doesn't want to use this feature, can compile Redis without it. I can provide a patch if needed.

As for the additional 20% memory overhead, I wrote above that it is 16 bytes of memory per key, or 16 MB for every 1,000,000 keys. I believe that is not such a high cost to pay. The "problem" is that Redis is really memory-efficient, so even 16 bytes amounts to a 20% memory overhead. :-)

Salvatore Sanfilippo

Mar 24, 2014, 12:32:05 PM
to Redis DB
On Mon, Mar 24, 2014 at 2:25 PM, Dmitry Faleychik <dima...@gmail.com> wrote:
> Hello, Salvatore,
>
> I'm glad that you looked at the LRU problem again after my patch.
> Improvements you made significantly increased speed and quality of current
> LRU algorithm. One thought: what if we run the very first eviction cycle not
> <maxmemory_samples>, but like 1000 times? I think it may improve the overall
> data eviction picture a little bit by filling the eviction pool with good
> values from the start.

Thank you Dmitry, your contribution was very important in getting us
to finally look at the LRU again.

What you proposed is interesting: indeed, the first picks will not be
very good.
Maybe we want to do this periodically to "reseed" the pool instead of
doing it just once.
However, there is probably no need to do it 1000 times; a (small)
multiple of the size of the eviction pool should be enough.

I'll play with this idea, thanks! I'll report back my findings.

> As you can see, my implementation of LRU algorithm works 20% faster than new
> algorithm. I think that it may be important for people, who want to use LRU

True, but this test case shows the largest difference you'll see,
because of the pipelining; if you test without a pipeline, or with a
smaller one, the difference is smaller.
However, LRU as implemented in your patch will probably always be
more efficient, because random sampling is costly from the point of
view of cache misses.
But I must admit I should profile this code path. Today I started
profiling the INFO command: in 4 commits it went from 9,500 to 58k
ops/sec using a profiler-driven optimization approach.
We won't be able to improve eviction like that, for sure :-) But
maybe there is some room for improvement.

Btw, I currently don't think it is a good idea to merge true LRU
eviction just for the performance difference; it is not big enough to
justify that. As I said, if it is 20% with a pipeline of 32 requests,
in the real world this is going to be 2-5% in most applications. If
you compare that with the cost of merging, that is:

1) The memory overhead can't easily be enabled/disabled; it is a
fixed cost. This alone is a strong argument against merging this
patch, because the majority of Redis users don't use LRU eviction and
would still pay the overhead.
2) What happens when the option is enabled/disabled via CONFIG SET?
It starts evicting not what is oldest, but what it starts to collect
after it gets enabled. Also, it is not possible to avoid the CPU cost
the patch has when eviction is not needed: it is basically mandatory
to maintain the dictionary linked list even when we don't need to use
it.

That's why I think the current setup is a wise choice, but mind you,
it is important to specify that this is a subjective evaluation of
mine, so it is not "right", just my personal, limited sensibility
about what to add or not.

> data eviction. As my implementation needs additional 16 bytes of memory for
> a key, may be it can be inclulded in the project as a compile time feature?
> So if someone doesn't want to have additional memory overhead or don't want
> to use this feature he can compile Redis without it. I can provide a patch,
> if needed.

Redis is averse to compile-time things ;-) Everything should work out
of the box if it is a feature.
We are even averse to configuration-time things, and almost
everything can be enabled while the server is running.

> As for additional 20% memory overhead, I wrote above that it is 16 bytes of
> a memory for a key or 16MB for every 1'000'000 keys. I believe that it is
> not such high cost to pay. The "problem" is that Redis is really memory
> effective and even 16 bytes make 20% of total memory overhead. :-)

To put this into perspective, with the same amount of RAM via
maxmemory:

Without the patch -> 10619 keys added.
With the patch -> 8994 keys added.

This is a non-trivial difference. Note that the difference also
partly compensates for the 20% performance gap you noticed, since
technically speaking the instance without the patch is dealing with
more keys.

Thanks for your valuable feedback,
Salvatore

Salvatore Sanfilippo

Mar 25, 2014, 5:37:55 AM
to Redis DB
Hello, we found that the slowdown was actually caused by a bug in
object creation introduced with the LRU patch:

https://github.com/antirez/redis/pull/1635

So apparently the LRU pool is performing better than expected. Still,
some profiling would be a good thing.

Salvatore