Hi there,
One thing we've done to mitigate this kind of risk is to keep two copies of every shard in different availability zones in our cloud provider. Also, we run in Kubernetes, so for us nodes leaving the cluster is a relatively frequent event... we are experimenting with a small process that warms up new nodes more quickly.
Since we have more than one copy of the data, we can warm up a new node from the other copy. Our cache nodes are MUCH MUCH smaller than yours, though, so this approach might not be reasonable for your use case.
This is how our process works. When a node is restarted, or in any other situation where an empty memcached process starts, our warmup process:
locates the warmer node for the shard
gets all the keys and TTLs from the warmer node with: lru_crawler metadump all
traverses the list of keys in reverse (lru_crawler starts from the least recently used key; for warmup it's better to start from the most recently used).
For each key: gets the value from the warmer node and adds (not sets) it to the cold node, including the TTL. add only stores a key that is not already present, so it never overwrites a value the cold node has received in the meantime.
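
The steps above could be sketched roughly as follows. This is not our actual code, just a minimal illustration in Python. The names parse_metadump and warm are made up for this example, and hot and cold are assumed to be client objects exposing get(key) and add(key, value, expire=...) (pymemcache's Client has methods of that shape, as far as I know). It also assumes the metadump lines look like "key=... exp=... la=...", with URL-encoded keys, exp as an absolute Unix timestamp, and exp=-1 meaning the key never expires.

```python
import time
from urllib.parse import unquote


def parse_metadump(lines, now=None):
    """Turn `lru_crawler metadump all` output into (key, remaining_ttl) pairs.

    remaining_ttl is in seconds; 0 means "never expires".
    Keys that have already expired are dropped.
    """
    now = int(time.time()) if now is None else now
    entries = []
    for line in lines:
        line = line.strip()
        if not line or line == "END":
            continue
        # Each line is a space-separated list of name=value fields.
        fields = dict(f.split("=", 1) for f in line.split(" ") if "=" in f)
        if "key" not in fields:
            continue
        key = unquote(fields["key"])  # assumption: keys are URL-encoded
        exp = int(fields.get("exp", -1))
        if exp == -1:
            ttl = 0  # no expiry
        else:
            ttl = exp - now  # assumption: exp is an absolute timestamp
            if ttl <= 0:
                continue
        entries.append((key, ttl))
    return entries


def warm(hot, cold, entries):
    """Copy entries from the warmer node to the cold node, dump order reversed.

    Returns how many add() calls reported success. That count is only
    meaningful if the client's add() actually waits for the server reply.
    """
    copied = 0
    for key, ttl in reversed(entries):
        value = hot.get(key)
        if value is None:
            continue  # evicted or expired on the warmer node since the dump
        # add, not set: never overwrite what the cold node already holds.
        if cold.add(key, value, expire=ttl):
            copied += 1
    return copied
```

In practice you would feed parse_metadump the raw lines read from a socket to the warmer node after sending the lru_crawler metadump all command, and stop reading at the END line.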
This process might lead to some small data inconsistencies; how important that is will depend on your use case.
Since our access patterns are very skewed (a small % of keys gets most of the traffic, at least for some period of time), going through the LRU dump in reverse makes the warmup much more effective.
Best
Javier Arias