Extstore revival after crash


Danny Kopping

Apr 23, 2023, 4:13:49 AM
to memcached
First off, thanks for the amazing work @dormando & others!

Context:
I work at Grafana Labs, and we are very interested in trying out extstore for some very large (>50TB) caches. We plan to split this 50TB cache across about 35 nodes, each with 1.5TB of NVMe & a small memcached instance. Losing any given node will result in losing ~3% of the overall cache, which is acceptable; however, if we somehow lose all nodes at once, losing our entire cache would be pretty bad and would put severe pressure on our backend.

Ask:
Having looked at the file that extstore writes on disk, it looks like it has both keys & values contained in it. Would it be possible to "re-warm" the cache on startup by scanning this data and resubmitting it to itself? We could then add a condition to our k8s readiness check to wait until all the data is re-warmed before allowing traffic to flow to those instances. Is this feature planned for anytime soon?

Thanks!

dormando

Apr 23, 2023, 1:24:28 PM
to 'Danny Kopping' via memcached
Hey,

Thanks for reaching out!

There is no crash safety in memcached or extstore; it does look like the
data is on disk but it is actually spread across memory and disk, with
recent or heavily accessed data staying in RAM. Best case you only recover
your cold data. Further, keys can appear multiple times in the extstore
datafile and we rely on the RAM index to know which one is current.
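[Editor's illustration] The point about duplicate keys can be pictured with a toy model. This is only a sketch in Python, not extstore's actual on-disk format or code: `log` stands in for an append-only datafile, `index` for the in-RAM index, and all function names are hypothetical. The toy deliberately does not assume that position in the file reflects recency.

```python
def write(log, index, key, value):
    """Append a record to the toy log and point the in-memory index at it."""
    log.append((key, value))
    index[key] = len(log) - 1


def read_current(log, index, key):
    """Look up the current value through the index; None if the key is unknown."""
    pos = index.get(key)
    return None if pos is None else log[pos][1]


def candidates_without_index(log):
    """Scan the log alone: every copy of a key is just a candidate value."""
    found = {}
    for key, value in log:
        found.setdefault(key, []).append(value)
    return found


log, index = [], {}
write(log, index, "a", "v1")
write(log, index, "b", "v1")
write(log, index, "a", "v2")  # "a" now appears twice in the log

current_a = read_current(log, index, "a")   # resolved through the index
copies_of_a = candidates_without_index(log)["a"]  # two copies, no marker of which is live
```

With the index, a lookup for "a" resolves to a single record. Once the index is gone (as after a crash), a scan of the toy log still finds both copies of "a" but nothing in the records themselves says which one was current.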

I've never heard of anyone losing an entire cluster, but people do try to
mitigate this by replicating cache across availability zones/regions.
This can be done with a few methods, like our new proxy code. I'd be happy
to go over a few scenarios if you'd like.

-Dormando
> --
>
> ---
> You received this message because you are subscribed to the Google Groups "memcached" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to memcached+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/memcached/cc45382b-eee7-4e37-a841-d210bf18ff4bn%40googlegroups.com.
>
>

Javier Arias Losada

Apr 24, 2023, 7:05:23 AM
to memcached
Hi there,

One thing we've done to mitigate this kind of risk is to keep two copies of every shard in different availability zones of our cloud provider. We also run in Kubernetes, so nodes leaving the cluster is a relatively frequent event for us, and we are experimenting with a small process that warms up new nodes more quickly.

Since we have more than one copy of the data, we do a warmup process. Our cache nodes are MUCH MUCH smaller... so this approach might not be reasonable for your use-case.

This is how our process works. When a node is restarted, or in any other situation where an empty memcached process starts, our warmup process:
1. locates the warmer node for the shard
2. gets all the keys and TTLs from the warmer node: lru_crawler metadump all
3. traverses the list of keys in reverse (lru_crawler goes from the least recently used; for warming it's better to start from the most recent)
4. for each key, gets the value from the warmer node and adds (not sets) it to the cold node, including the TTL
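
[Editor's illustration] The steps above can be sketched roughly as follows. This is a minimal Python sketch, not the actual warmup tool: the names `parse_metadump_line`, `exptime_for_add` and `warm` are hypothetical, and `fetch` / `add` are stand-ins for whatever memcached client calls are used against the warmer and cold nodes. It assumes each metadump line looks like `key=<url-encoded key> exp=<absolute unix time, or -1 for no expiry> ...`, which should be checked against the memcached version in use.

```python
import time
from urllib.parse import unquote

THIRTY_DAYS = 60 * 60 * 24 * 30  # above this, memcached treats exptime as absolute


def parse_metadump_line(line):
    """Return (key, exp) for one metadump line, or None for END/blank/odd lines."""
    line = line.strip()
    if not line or line == "END":
        return None
    fields = dict(p.split("=", 1) for p in line.split() if "=" in p)
    if "key" not in fields or "exp" not in fields:
        return None
    return unquote(fields["key"]), int(fields["exp"])


def exptime_for_add(exp, now):
    """Translate the dumped expiry into an exptime for `add`; None if expired."""
    if exp == -1:            # assumption: -1 means the item never expires
        return 0
    remaining = exp - now
    if remaining <= 0:
        return None          # already expired on the warmer node, skip it
    # Relative TTLs are only valid up to 30 days; beyond that pass the
    # absolute timestamp instead.
    return remaining if remaining <= THIRTY_DAYS else exp


def warm(metadump_lines, fetch, add, now=None):
    """Copy items from the warmer node to the cold node, most recent first.

    fetch(key) -> value or None      (hypothetical: a get on the warmer node)
    add(key, value, exptime) -> bool (hypothetical: an add on the cold node)
    Returns the number of items actually stored on the cold node.
    """
    now = int(time.time()) if now is None else now
    entries = [e for e in map(parse_metadump_line, metadump_lines) if e]
    copied = 0
    for key, exp in reversed(entries):   # the dump is described as LRU-first
        exptime = exptime_for_add(exp, now)
        if exptime is None:
            continue
        value = fetch(key)
        if value is None:                # evicted or expired since the dump
            continue
        if add(key, value, exptime):     # add never overwrites an existing key
            copied += 1
    return copied
```

Using `add` rather than `set` means a value the cold node has already stored from live traffic is left alone instead of being replaced by an older copy from the warmer node.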

This process might lead to some small data inconsistencies; how important that is will depend on your use case.

Since our access patterns are very skewed (a small % of keys gets most of the traffic, at least for some period), going through the LRU dump in reverse makes the warmup much more effective.

Best
Javier Arias

Danny Kopping

Apr 24, 2023, 5:33:23 PM
to memcached
Thanks for the reply @dormando!

> Best case you only recover your cold data. Further, keys can appear multiple times in the extstore datafile and we rely on the RAM index to know which one is current.

This is actually perfect for our use-case. We just need a big ol' cache of cold data, and we never overwrite keys; they're immutable in our system.
The volume of data we're dealing with is so big that there will be very little hotspotting on any particular keys, so I'm intending to force most of the data into cold storage.

The cache will be used as a read-through cache to protect our upstream service (object storage), from which we load many millions of files, sometimes at up to several hundred thousand RPS.
It's true that it's unlikely that we'll lose everything all at once, and we will design for frequent failure, but as ever "hope is not a strategy" (although it springs eternal... :))

Aside:
I'm actually busy trying to parse the datafile with a small Go program to try and replay all the data. Solving this warming will give us a lot of confidence to roll this out in a big way across our infra.
What're your thoughts on this and the above?

@Javier, thanks for your thoughts here too. Replication is not an option for us at this scale; that said, your solution is pretty cool!

dormando

Apr 24, 2023, 5:50:34 PM
to memc...@googlegroups.com
Hey,



> Aside:
> I'm actually busy trying to parse the datafile with a small Go program to try and replay all the data. Solving this warming will give us a lot of confidence to roll this out in a big way across our infra.
> What're your thoughts on this and the above?

It would be really bad for both of us if you created a mission critical backup solution based off of an undocumented, unsupported dataformat which potentially changes with version updates. I think you may have also misunderstood me; the data is actually partially in RAM.

Is there any chance I could get you into the MC discord to chat a bit further about your use case? (linked from https://memcached.org/) - easier to play 20 questions there. If that's not possible I'll list a bunch of questions in the mailing list here instead :)


> @Javier, thanks for your thoughts here too. Replication is not an option for us at this scale; that said, your solution is pretty cool!

One of many questions; is this due to cost? (ie; don't want to double the cache storage) or some other reason?

Danny Kopping

Apr 25, 2023, 7:27:06 AM
to memc...@googlegroups.com
> It would be really bad for both of us if you created a mission critical backup solution based off of an undocumented, unsupported dataformat which potentially changes with version updates.

Oh absolutely haha! This is more of a POC to prove feasibility, and I was also just curious about what data was actually in the file.

> One of many questions; is this due to cost? (ie; don't want to double the cache storage) or some other reason?

Mostly about cost, yeah.

I'll hit you up on Discord
