Blade: A data center garbage collector


Daniel Compton

Oct 28, 2015, 7:25:17 PM
to mechanica...@googlegroups.com
I haven't seen this paper on Blade discussed on Mechanical Sympathy, and thought it fits into this group's wheelhouse.

From an analysis of the paper:

GC times are a major cause of latency in the tail – Blade aims to fix this. By taking a distributed systems perspective rather than just a single node view, Blade collaborates with the application to move load away from a node that is about to garbage collect. It’s a simple scheme, but very effective. As a bonus, it works best with the highest throughput ‘stop-the-world’ garbage collection algorithm – which also happens to be one of the simplest.

What are people's thoughts on this approach of removing nodes from a load balancer while they're GCing?
--
Daniel

Gil Tene

Oct 28, 2015, 11:48:50 PM
to mechanical-sympathy
It's a polite chaos monkey on a schedule.

Richard Warburton

Oct 29, 2015, 6:26:21 AM
to mechanica...@googlegroups.com
Hi,

What are people's thoughts on this approach of removing nodes from a load balancer while they're GCing?

Sigh.

I suspect what they are trying to do is remove the kind of large, long pause times that people sometimes get from combining stop-the-world throughput collectors with large heaps, i.e. full-GC-type pauses. Their approach doesn't seem appropriate for solving the more frequent, shorter pauses that you get from stop-the-world young-gen GCs. Those may only be a few milliseconds, but that can still be painful in some problem domains.

I don't think it's a very well-considered proposal, to be honest. What it amounts to is: in order to solve a problem that already has other solutions, we need to invoke "distributed systems", with all the conceptual overhead, testing overhead, maintenance overhead and complexity that entails. I just don't see the win. Also, GCs aren't the only reason a JVM pauses, let alone a whole system, so you would need all sorts of pause detectors to apply this in practice.

There are also other options for reducing pause times if you have multiple machines. For example, if you have a system that can process events deterministically, you can send an event to multiple machines at the same time and simply accept the first answer you get back. This kind of approach also has other advantages in terms of availability, fault tolerance, etc.
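
As a rough Java sketch of that idea (an illustration only; Replica, Request and Response are hypothetical stand-ins for whatever client stubs a real system would use), invokeAny returns the first result that completes successfully and cancels the rest:

    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.stream.Collectors;

    // Hypothetical types standing in for real client stubs.
    interface Request {}
    interface Response {}
    interface Replica {
        Response process(Request request) throws Exception;
    }

    class HedgedRequests {
        private final ExecutorService pool = Executors.newCachedThreadPool();

        // Send the same (idempotent) request to every replica and return whichever
        // answer comes back first; a replica stuck in a GC pause simply loses the
        // race instead of adding to the latency tail.
        Response firstAnswer(List<Replica> replicas, Request request) throws Exception {
            List<Callable<Response>> calls = replicas.stream()
                    .map(replica -> (Callable<Response>) () -> replica.process(request))
                    .collect(Collectors.toList());
            return pool.invokeAny(calls);
        }
    }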

regards,

  Richard Warburton

Greg Young

Oct 29, 2015, 7:03:37 AM
to mechanica...@googlegroups.com
This reminds me of the old expression "once you learn to use one
machine you can have more"



--
Studying for the Turing test

Richard Warburton

Oct 29, 2015, 7:18:56 AM
to mechanica...@googlegroups.com
Hi,

This reminds me of the old expression "once you learn to use one
machine you can have more"

Exactly.

james bedenbaugh

Oct 29, 2015, 11:46:09 AM
to mechanical-sympathy
<laughing so hard snorting milk out of my nose>

Gil Tene

Oct 29, 2015, 12:09:40 PM
to mechanica...@googlegroups.com
I wouldn't be that harsh. While I don't see this as a generic solution, I do think it reflects something people often have to resort to. I've met many people who apply some sort of operational "take myself out in a coordinated fashion by redirecting traffic away, apply System.gc(), and get back in the game" combo to work around the very real problems of large GC pauses that occur in all collectors, including all throughput or mostly-concurrent collectors in JVMs (except for one, of course ;-) ). This is a nice generalization of the work such people end up doing.
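
A minimal sketch of that operational combo, assuming a hypothetical LoadBalancerClient with register/deregister calls (this illustrates the hand-rolled pattern, not the Blade protocol itself):

    // Rough sketch only: bow out, drain, force a GC, rejoin.
    class CoordinatedGcBowOut {
        // Hypothetical interface to whatever fronts the traffic.
        interface LoadBalancerClient {
            void deregisterSelf();
            void registerSelf();
        }

        private final LoadBalancerClient loadBalancer;
        private final long drainMillis = 20_000;    // time to bleed off in-flight traffic

        CoordinatedGcBowOut(LoadBalancerClient loadBalancer) {
            this.loadBalancer = loadBalancer;
        }

        void collectOutOfBand() throws InterruptedException {
            loadBalancer.deregisterSelf();          // stop taking new requests
            Thread.sleep(drainMillis);              // let in-flight requests finish
            System.gc();                            // take the pause while out of rotation
            loadBalancer.registerSelf();            // get back in the game
        }
    }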

Unfortunately, while such schemes can be applied in some use cases (e.g. some stateless load-balanced clusters), there are many use cases where they do not apply (e.g. search, caching, Cassandra, etc.). They also aren't practical for addressing "frequent but small" pauses (like the common newgen patterns of 10s to 100s of msec every 1-30 seconds). As you note, idempotent mechanisms that send requests down multiple active paths and can safely act when the first one completes would be more robust, achieve a better SLA, and require no more resources [to actually work] than this scheme would. But they also require better architecture, and are harder to post-slap onto existing software whose internals you don't control...

A common mistake I've seen is for people to do this without doubling the associated server resources compared to the amounts their reliable systems require without it, arguing that this sort of coordinated bow-out can leverage the extra resources normally used to absorb failure. The fallacy in such an approach is that an intentional, coordinated bow-out does not remove the possibility of a concurrent failure mode. The opposite tends to be true: these schemes work well as long as there is no failure, but when some other failure does occur, they tend to amplify it.

Kevin Burton

Oct 29, 2015, 2:08:42 PM
to mechanical-sympathy
I think one problem is how to handle latent requests.  But I guess if you put a HARD deadline of 500ms on all requests then you would only have some stragglers at the end.

But I would think there would be a bit of time where the box isn't fully utilized right before a GC cycle, which might negate any speedup.

This could be mitigated by having one process per core though.



Gil Tene

Oct 29, 2015, 3:00:10 PM
to mechanica...@googlegroups.com


On Thursday, October 29, 2015 at 11:08:42 AM UTC-7, Kevin Burton wrote:
I think one problem is how to handle latent requests.  But I guess if you put a HARD deadline of 500ms on all requests then you would only have some stragglers at the end.

It's all about being able to predict the need for GC, or more properly the lack of a need for a GC pause in the next N seconds. E.g. if you think you may incur a long pause in the next minute, you can bow out, wait 20 seconds to bleed off traffic, force a GC, and come back in. Only requests latent for more than 20 seconds will be affected.
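
One crude way to approximate that prediction (a sketch only: the 0.6 occupancy threshold is an arbitrary illustration, and a real scheme would look at old-gen occupancy and allocation rate rather than total heap):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    // Bow out once heap occupancy crosses a conservative threshold, on the
    // "suspicion" that a big pause may be coming soon.
    class PauseSoonPredictor {
        private static final double OCCUPANCY_THRESHOLD = 0.6;   // illustrative only
        private final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();

        boolean shouldBowOutSoon() {
            MemoryUsage heap = memory.getHeapMemoryUsage();
            long max = heap.getMax();                             // -1 if undefined
            return max > 0 && (double) heap.getUsed() / max > OCCUPANCY_THRESHOLD;
        }
    }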
 

But I would think there would be a bit of time where the box isn't fully utilized right before a GC cycle, which might negate any speedup.

Not a bit of time. A lot of time. This is not about efficiency or speedups; it's about dealing with long tails. You'd want huge safety margins: you'd be forcing GCs on the "suspicion" that one might be coming soon, which means that you'll probably be forcing full GCs at 3x+ the rate at which they would normally occur, given the continuous need for headroom against unforeseen behavior in the near future. You'll also "pay" (in low utilization or idle behavior) for the bleed-off time.

I would expect N+3 deployments to be the sane way to do this, where N is counted in whatever unit the system needs to maintain operation and prevent loss. E.g. in a stateless cluster, N is whatever is needed to carry the load. But in e.g. a database or a partitioned cache, N is 1 and the math is done separately per shard, i.e. if data is sharded across M nodes, N+3 means M * (1 + 3) nodes.
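(To put made-up numbers on that arithmetic: a cache partitioned across M = 16 shards, run at N+3 with N = 1 per shard, needs 16 * (1 + 3) = 64 nodes, versus 16 * 2 = 32 for plain N+1.)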

The reason scheduled coordinated GCs need N+3 is that N+2 is the baseline for sanity, and this sort of scheme needs a +1 on top of that. Here is how that works:

N+1 is something that people do only until the first time they incur a real unit failure in production. At that point, they experience a curious and rapid learning: On the one hand, they feel the elation that comes from the +1 part having saved them. They have one of those "this could have been sooooo baaaaad" moments, which brings all the terrible things that could have happened (if the +1 redundancy wasn't there) to the surface in a very vivid way. Losing data. Angry phone calls. Getting fired. That sort of thing. Then, with that fresh in their minds, they almost immediately experience extreme angst, as they realize that the +1 protection is temporarily gone (for the next N minutes), and that if anything bad happens before the +1 comes back to life, any and all of those bad things WILL happen. When the +1 unit does come back to life, they feel an enormous sense of relief. The next morning, they usually start working on a +2 solution in order to avoid that angst, and cut down on the nightmares they probably had the night before.

N+2 works. When a unit fails in an N+2 setup, a curious feeling of happiness spreads around. Some of it is pride, but a lot of it is just knowing that things are right.

If you are going to regularly do a -1 (bow out to take care of embarrassing private business in the background), and you have ever experienced actual things failing in production systems that care about stuff like long tails, uptime, and angry phone calls, you will insist on the -1 still leaving you with N+2. When someone inevitably says something to the effect of "but what are the chances...", you'll ask them to come back and say that to your face during an actual unit failure, when you are temporarily down to a +1 situation, scrambling to get back up to +2, and what they are asking you to do is to take that remaining +1 down for a scheduled GC operation "in order to improve SLAs..."