Patch for O(1) garbage collection of expired entries in memcached

Skip to first unread message

Aug 6, 2009, 1:01:34 AM8/6/09
to memcached,
here at, we were always a bit worried about the way
that memcached handles expired elements, which causes evictions of
unexpired entries even though there is a lot of garbage that could be
After introducing the patch below we managed to eliminate evictions
entirely without sacrificing speed, memory or cache hit/miss ratio.

Another benefit of this patch is that if you know that there is no
garbage in memory, then you can put more trust in stats such as number
of cached items, and then decide how much RAM do you really need, or
how long should the TTLs be.

It was already known, that a lot of memory can be wasted by expired
entries. In our case it was at least 50%, but actually it could be
arbitrary large number, as memcached always consumes all memory that
you assign to it.

It seems that many people believed for some reason that freeing
expired entries can not be performed in O(1), or even that it requires
Omega(N) operations. Using a heap, or other priority queue data
structures can obviously give you O(log N), but the patch below aims
at O(1).
This is achieved by having one bucket for each second, and by putting
each entry in a proper bucket by looking at the expiration time. Then
at each second we just free elements from the next bucket.
This is obviously O(1) amortized time, but it can happen that you have
a lot of elements that expire at the very same moment, so the patch
offers a fixed limit for number of operations that you are willing to
perform per one request, to make the garbage collection run smoother.

You can tweak this solutions in many ways.

For example you do not really need one bucket for each second. You
could group elements from a same minute, or same hour, which will make
the number of buckets smaller at the cost of garbage collection
occurring a bit later.

The patch introduces 16 additional bytes per item for doubly-linked
list in each bucket. Another change that you may consider is to use
single-linked lists for buckets, which can give you 8 bytes per item
This will lead to a linear access time to buckets, so you should make
sure that each bucket is really small.
This can be achieved by using two dimensional array of buckets : one
dimension being the expiration time, and the second being some
pseudorandom hash function of a key.
We find such an improvement both risky and unnecessary, as a number of
elements on a single server multiplied by 8 bytes is not really a big

Here at we run memcached in a single threaded mode, so
we have not tested the patch for the multithread set-ups. We've beeb
testing the patch on production for several days right know, though.
There were no problems, evictions dropped to zero, and number of
cached items is about 50% of full capacity. The cache hit/miss ratio
is same as before the change, and the speed seems similar as well, but
we were never too much concerned about the speed so you can test it by

Here is the patch:

Index: memcached-autoexpire/memcached-1.2.8/items.c
--- memcached-autoexpire/memcached-1.2.8/items.c (revision 16750)
+++ memcached-autoexpire/memcached-1.2.8/items.c (revision 17295)
@@ -31,28 +31,58 @@
rel_time_t evicted_time;
unsigned int outofmemory;
unsigned int tailrepairs;
} itemstats_t;

static item *heads[LARGEST_ID];
static item *tails[LARGEST_ID];
static itemstats_t itemstats[LARGEST_ID];
static unsigned int sizes[LARGEST_ID];

+/* Number of buckets should be larger than the largest TTL in use,
+ * to prevent non-expired items from being processed in the cleanup
+ * If this is a problem, you can change EXP_HASH macro.
+ * Number of buckets should be a power of two, to allow bitwise &
instead of %,
+ * when determining the bucket.
+ */
+#define EXPIRE_BUCKETS_CNT (1<<20)
+/* If you have TTLs larger than EXPIRE_BUCKETS_CNT,
+ * you may consider replacing (x) with something like ((x)>>6),
+ * to group elements in buckets that span 64 seconds,
+ * and setting MAX_DELETIONS_PER_CLEANUP to a smaller value.
+ */
+/* This value limits number of steps performed on a single request
and should be larger than 1.
+ * This is used to amortize the cost of cleanup (which is constant)
in time,
+ * so that we do not have to delete too many elements during a single
request, just because
+ * many objects had same exptime.
+ */
+/* These buckets are used to keep track of all elements that expire
at the same moment.
+ * Each item should be stored in item->exptime%EXPIRE_BUCKETS_CNT
unless it has inifinite TTL.
+ */
+static item *expire_bucket[EXPIRE_BUCKETS_CNT];
+#define EXPIRE_BUCKET(x) expire_bucket[EXP_HASH(x)]
void item_init(void) {
int i;
memset(itemstats, 0, sizeof(itemstats_t) * LARGEST_ID);
for(i = 0; i < LARGEST_ID; i++) {
heads[i] = NULL;
tails[i] = NULL;
sizes[i] = 0;
+ //cleanup of global variables seems unnecessary,
+ //but this line is here for completeness
+ memset(expire_bucket, 0, sizeof(item*) * EXPIRE_BUCKETS_CNT);

/* Get the next CAS id for a new item. */
uint64_t get_cas_id() {
static uint64_t cas_id = 0;
return ++cas_id;

/* Enable this for reference-count debugging. */
#if 0
@@ -78,30 +108,59 @@
* Returns the total size of the header.
static size_t item_make_header(const uint8_t nkey, const int flags,
const int nbytes,
char *suffix, uint8_t *nsuffix) {
/* suffix is defined at 40 chars elsewhere.. */
*nsuffix = (uint8_t) snprintf(suffix, 40, " %d %d\r\n", flags,
nbytes - 2);
return sizeof(item) + nkey + *nsuffix + nbytes;

+ * Tries to free all expired elements from the bucket pointed by
+ * It performs at most MAX_DELETIONS_PER_CLEANUP deletions.
+ * It skips elements that are locked (refcount!=0).
+ * If it reaches the end of the bucket it returns true.
+ * Otherwise it signals that there is something more to do by
returning false.
+ * Elements with refcount!=0 are not signalized, and do not affect
the limit of deletions.
+ */
+bool clean_up() {
+ item * it, *e_next;
+ uint32_t deletions_left = MAX_DELETIONS_PER_CLEANUP;
+ for (it = EXPIRE_BUCKET(last_cleanup_time); it; ) {
+ e_next = it->e_next;
+ if (0 == it->refcount && last_cleanup_time == it->exptime) {
+ if(deletions_left--){
+ do_item_unlink(it);
+ }else{
+ return false;
+ }
+ }
+ it = e_next;
+ }
+ return true;
item *do_item_alloc(char *key, const size_t nkey, const int flags,
const rel_time_t exptime, const int nbytes) {
uint8_t nsuffix;
item *it;
char suffix[40];
size_t ntotal = item_make_header(nkey + 1, flags, nbytes, suffix,

unsigned int id = slabs_clsid(ntotal);
if (id == 0)
return 0;
+ while (last_cleanup_time < current_time && clean_up()) {
+ ++last_cleanup_time;
+ }

it = slabs_alloc(ntotal, id);
if (it == 0) {
int tries = 50;
item *search;

/* If requested to not push old items out of cache when
memory runs out,
* we're out of luck at this point...

@@ -202,56 +261,77 @@
bool item_size_ok(const size_t nkey, const int flags, const int
nbytes) {
char prefix[40];
uint8_t nsuffix;

return slabs_clsid(item_make_header(nkey + 1, flags, nbytes,
prefix, &nsuffix)) != 0;

static void item_link_q(item *it) { /* item is the new head */
- item **head, **tail;
+ item **head, **tail, **e_bucket;
/* always true, warns: assert(it->slabs_clsid <= LARGEST_ID); */
assert((it->it_flags & ITEM_SLABBED) == 0);

head = &heads[it->slabs_clsid];
tail = &tails[it->slabs_clsid];
assert(it != *head);
assert((*head && *tail) || (*head == 0 && *tail == 0));
it->prev = 0;
it->next = *head;
if (it->next) it->next->prev = it;
*head = it;
if (*tail == 0) *tail = it;
+ if (it->exptime) {
+ e_bucket = &EXPIRE_BUCKET(it->exptime);
+ it->e_prev = NULL;
+ it->e_next = *e_bucket;
+ if (it->e_next) it->e_next->e_prev = it;
+ *e_bucket = it;
+ }

static void item_unlink_q(item *it) {
- item **head, **tail;
+ item **head, **tail, **e_bucket;
/* always true, warns: assert(it->slabs_clsid <= LARGEST_ID); */
head = &heads[it->slabs_clsid];
tail = &tails[it->slabs_clsid];

if (*head == it) {
assert(it->prev == 0);
*head = it->next;
if (*tail == it) {
assert(it->next == 0);
*tail = it->prev;
assert(it->next != it);
assert(it->prev != it);

if (it->next) it->next->prev = it->prev;
if (it->prev) it->prev->next = it->next;
+ if (it->exptime) {
+ e_bucket = &EXPIRE_BUCKET(it->exptime);
+ if (*e_bucket == it) {
+ assert(it->e_prev == NULL);
+ *e_bucket = it->e_next;
+ }
+ assert(it->e_next != it);
+ assert(it->e_prev != it);
+ if (it->e_next) it->e_next->e_prev = it->e_prev;
+ if (it->e_prev) it->e_prev->e_next = it->e_next;
+ }

int do_item_link(item *it) {
MEMCACHED_ITEM_LINK(ITEM_key(it), it->nbytes);
assert((it->it_flags & (ITEM_LINKED|ITEM_SLABBED)) == 0);
assert(it->nbytes < (1024 * 1024)); /* 1MB max size */
it->it_flags |= ITEM_LINKED;
it->time = current_time;
Index: memcached-autoexpire/memcached-1.2.8/items.h
--- memcached-autoexpire/memcached-1.2.8/items.h (revision 16750)
+++ memcached-autoexpire/memcached-1.2.8/items.h (revision 17289)
@@ -1,13 +1,15 @@
/* See items.c */
void item_init(void);
uint64_t get_cas_id(void);
+rel_time_t last_cleanup_time; /** relative timestamp pointing a
bucket that we should now clean up **/

item *do_item_alloc(char *key, const size_t nkey, const int flags,
const rel_time_t exptime, const int nbytes);
void item_free(item *it);
bool item_size_ok(const size_t nkey, const int flags, const int

int do_item_link(item *it); /** may fail if transgresses limits
void do_item_unlink(item *it);
void do_item_remove(item *it);
void do_item_update(item *it); /** update LRU time to current and
reposition */
Index: memcached-autoexpire/memcached-1.2.8/memcached.h
--- memcached-autoexpire/memcached-1.2.8/memcached.h (revision 16750)
+++ memcached-autoexpire/memcached-1.2.8/memcached.h (revision 17289)
@@ -109,20 +109,22 @@
#define ITEM_LINKED 1
#define ITEM_DELETED 2

/* temp */
#define ITEM_SLABBED 4

typedef struct _stritem {
struct _stritem *next;
struct _stritem *prev;
struct _stritem *h_next; /* hash chain next */
+ struct _stritem *e_next; /* next item in the same
expire_bucket */
+ struct _stritem *e_prev; /* prev item in the same
expire_bucket */
rel_time_t time; /* least recent access */
rel_time_t exptime; /* expire time */
int nbytes; /* size of data */
unsigned short refcount;
uint8_t nsuffix; /* length of flags-and-length string
uint8_t it_flags; /* ITEM_* above */
uint8_t slabs_clsid;/* which slab class we're in */
uint8_t nkey; /* key length, w/terminating null and
padding */
uint64_t cas_id; /* the CAS identifier */
void * end[];


Aug 6, 2009, 1:58:33 AM8/6/09
to memcached

I don't think we'll be able to take this patch due to changes we've
been making, however this is an *excellent* use case for our new
pluggable engine mechanism that's opening up for development soon.

On Aug 5, 10:01 pm, ""

Jaime Medrano

Aug 7, 2009, 7:12:59 AM8/7/09
Great idea.

> Here at we run memcached in a single threaded mode, so
> we have not tested the patch for the multithread set-ups.

There should be no problem with multithread. clean_up() is only called
inside do_item_alloc(), which always has the global 'cache_lock' mutex held.

I'm a little concerned about the latency that such operation could add
at alloc time. I think a background thread that does the cleaning every
second and doesn't bother the event loop should be better.


Reply all
Reply to author
0 new messages