Modern Garbage Collection (good article)

1,774 views

Skip to first unread message

Greg Young

unread,

Dec 21, 2016, 8:23:02 AM12/21/16

to mechanica...@googlegroups.com

Thought people on here would enjoy this
https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2

--
Studying for the Turing test

Heath Borders

unread,

Dec 21, 2016, 7:02:40 PM12/21/16

to mechanical-sympathy

This article is also about golang's GC (it does mention GC tradeoffs, and why golang is a good fit for their problem). At the end, they have a benchmark series for a bunch of different GCs. People that have an interest in public perception of Java's GC performance might want to contribute a better Java implementation (or simply brag about Azul's GC in the comments).

https://blog.pusher.com/golangs-real-time-gc-in-theory-and-practice/

Zellyn

unread,

Dec 21, 2016, 11:38:23 PM12/21/16

to mechanical-sympathy

You might be interested in this Golang-group follow-up thread too… https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4 (naturally a more pro-Go
take on things…)

Vitaly Davidovich

unread,

Dec 22, 2016, 7:12:14 AM12/22/16

to mechanica...@googlegroups.com

FWIW, I think the Go team is right in favoring lower latency over throughput of their GC given the expected usage scenarios for Go.

In fact, most of the (Hotspot based) Java GC horror stories involve very long pauses (G1 and CMS not excluded) - I've yet to hear anyone complain that their "Big Data" servers are too slow because the GC is chipping away at throughput too much (most of those servers aren't CPU bound to begin with or can't become CPU bound due to bottlenecks elsewhere).

In the Hotspot world, there's also the unfortunate situation right now where G1 isn't really solving any problem nor advancing the performance (latency) boundaries. In fact, it taxes throughput quite heavily (e.g. significantly more expensive write barriers) but isn't breaking any new ground.

--

You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Sent from my phone

Remi Forax

unread,

Dec 22, 2016, 7:37:46 AM12/22/16

to mechanica...@googlegroups.com

De: "Vitaly Davidovich" <vit...@gmail.com>
À: mechanica...@googlegroups.com
Envoyé: Jeudi 22 Décembre 2016 13:12:00
Objet: Re: Modern Garbage Collection (good article)

FWIW, I think the Go team is right in favoring lower latency over throughput of their GC given the expected usage scenarios for Go.

In fact, most of the (Hotspot based) Java GC horror stories involve very long pauses (G1 and CMS not excluded) - I've yet to hear anyone complain that their "Big Data" servers are too slow because the GC is chipping away at throughput too much (most of those servers aren't CPU bound to begin with or can't become CPU bound due to bottlenecks elsewhere).

As an anecdote, in my lab, the only go program we had was a program that optimizes on the fly a huge homemade RDF store depending on most the frequent queries. It was re-written in Java by a Phd student last September. Maybe it was slow because of the Go GC avoiding long pauses, but from the perspective of the team leader, it was just that Go was too slow (and getting slower at each release).

In the Hotspot world, there's also the unfortunate situation right now where G1 isn't really solving any problem nor advancing the performance (latency) boundaries. In fact, it taxes throughput quite heavily (e.g. significantly more expensive write barriers) but isn't breaking any new ground.

Rémi

Vitaly Davidovich

unread,

Dec 22, 2016, 7:47:03 AM12/22/16

to mechanica...@googlegroups.com

It's hard to interpret anecdotes for more than just that - an anecdote :). Rewrites in general tend to be better (for some definition of better, performance included) even if done in the same language.

But, I think as a *principle* of favoring lower latency at the cost of throughput is the right choice for them and how they see Go being used.

Mind you, I'm not a fan nor a user of Go so I'm referring purely to their stipulated strategy on how to evolve their GC.

Remi Forax

unread,

Dec 22, 2016, 8:05:54 AM12/22/16

to mechanica...@googlegroups.com

De: "Vitaly Davidovich" <vit...@gmail.com>
À: mechanica...@googlegroups.com
Envoyé: Jeudi 22 Décembre 2016 13:46:50
Objet: Re: Modern Garbage Collection (good article)

It's hard to interpret anecdotes for more than just that - an anecdote :). Rewrites in general tend to be better (for some definition of better, performance included) even if done in the same language.

yes, it's just an anecdote.

But, I think as a *principle* of favoring lower latency at the cost of throughput is the right choice for them and how they see Go being used.

I don't think it's a good long term strategy for Go even if it can be a good one for how Google uses Go now.

There are a lot of tools written in Go, docker/kubernetes are maybe the poster children but for something like InfluxDB (disclaimer, they give me a t-shirt :) ) it's worrying.

Mind you, I'm not a fan nor a user of Go so I'm referring purely to their stipulated strategy on how to evolve their GC.

Rémi

Vitaly Davidovich

unread,

Dec 22, 2016, 8:15:50 AM12/22/16

to mechanical-sympathy

On Thu, Dec 22, 2016 at 8:05 AM, Remi Forax <fo...@univ-mlv.fr> wrote:

De: "Vitaly Davidovich" <vit...@gmail.com>
À: mechanical-sympathy@googlegroups.com
Envoyé: Jeudi 22 Décembre 2016 13:46:50
Objet: Re: Modern Garbage Collection (good article)
It's hard to interpret anecdotes for more than just that - an anecdote :). Rewrites in general tend to be better (for some definition of better, performance included) even if done in the same language.

yes, it's just an anecdote.

But, I think as a *principle* of favoring lower latency at the cost of throughput is the right choice for them and how they see Go being used.

I don't think it's a good long term strategy for Go even if it can be a good one for how Google uses Go now.
There are a lot of tools written in Go, docker/kubernetes are maybe the poster children but for something like InfluxDB (disclaimer, they give me a t-shirt :) ) it's worrying.

So what would you suggest to them? Obviously everyone would love high throughput and low latency, but realistically/practically speaking here. Also, Go isn't as heap obsessed as Java, so it seems like GC pressure should be lower there (or could be made lower without contorting code) - correct me if I'm wrong though.

I don't know why InfluxDB chose Go. People building infrastructure/systems at that level should not use such languages/runtimes IMO.

Mind you, I'm not a fan nor a user of Go so I'm referring purely to their stipulated strategy on how to evolve their GC.

Rémi

On Thu, Dec 22, 2016 at 7:37 AM Remi Forax <fo...@univ-mlv.fr> wrote:

De: "Vitaly Davidovich" <vit...@gmail.com>
À: mechanical-sympathy@googlegroups.com
Envoyé: Jeudi 22 Décembre 2016 13:12:00
Objet: Re: Modern Garbage Collection (good article)
FWIW, I think the Go team is right in favoring lower latency over throughput of their GC given the expected usage scenarios for Go.

In fact, most of the (Hotspot based) Java GC horror stories involve very long pauses (G1 and CMS not excluded) - I've yet to hear anyone complain that their "Big Data" servers are too slow because the GC is chipping away at throughput too much (most of those servers aren't CPU bound to begin with or can't become CPU bound due to bottlenecks elsewhere).

As an anecdote, in my lab, the only go program we had was a program that optimizes on the fly a huge homemade RDF store depending on most the frequent queries. It was re-written in Java by a Phd student last September. Maybe it was slow because of the Go GC avoiding long pauses, but from the perspective of the team leader, it was just that Go was too slow (and getting slower at each release).

In the Hotspot world, there's also the unfortunate situation right now where G1 isn't really solving any problem nor advancing the performance (latency) boundaries. In fact, it taxes throughput quite heavily (e.g. significantly more expensive write barriers) but isn't breaking any new ground.

Rémi

On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:

You might be interested in this Golang-group follow-up thread too… https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4 (naturally a more pro-Go
take on things…)

On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young wrote:
Thought people on here would enjoy this

https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2

--

Studying for the Turing test

--

You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Sent from my phone

--

You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--

You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Sent from my phone
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

Avi Kivity

unread,

Dec 22, 2016, 8:17:04 AM12/22/16

to mechanica...@googlegroups.com

On 12/22/2016 03:05 PM, Remi Forax wrote:

De: "Vitaly Davidovich" <vit...@gmail.com>
À: mechanica...@googlegroups.com
Envoyé: Jeudi 22 Décembre 2016 13:46:50
Objet: Re: Modern Garbage Collection (good article)

It's hard to interpret anecdotes for more than just that - an anecdote :). Rewrites in general tend to be better (for some definition of better, performance included) even if done in the same language.

yes, it's just an anecdote.

But, I think as a *principle* of favoring lower latency at the cost of throughput is the right choice for them and how they see Go being used.

I don't think it's a good long term strategy for Go even if it can be a good one for how Google uses Go now.

There are a lot of tools written in Go, docker/kubernetes are maybe the poster children but for something like InfluxDB (disclaimer, they give me a t-shirt :) ) it's worrying.

Go changed its character mid-life. From being a systems language it morphed into a network service language. Frankly I don't see how they thought it could be a systems language while sporting a garbage collector.

It's fine for network services like docker, even those that need a high request rate (unlike docker). Not for a database.

Gil Tene

unread,

Dec 22, 2016, 12:14:22 PM12/22/16

to mechanical-sympathy

Go's GC story is evolving. And it's doing so at a fairly quick pace. Judging it based on a current implementation and current choices is premature, I think.

I think that the Go team's *implementation priority* focus on latency is a good one. And unlike others, I don't actually think there is a strong latency vs. throughput tradeoff in the long run. Achieving great latency in GC does not inherently come at the expense of GC throughput or application throughput. Both can one had at the same time. My view is that GC implementations in general (for all runtimes) will move towards good latency + good throughput at the same time at they mature. And Go is no different here.

A key architectural choice that has not been completely finalized in Go (AFAICT) has to do with the ability to move objects. It is common to see early GC-based runtime implementations that do not move objects around succumb to an architectural "hole" that creates a reliance on that fact early in the maturity cycles. This usually happens by allowing native code extensions to assume object addresses are fixed (i.e. that objects will not move during their lifetime) and getting stuck with that "assumed quality" in the long run due to practical compatibility needs. From what I can tell, it is not too late for Go to avoid this problem. Go still has precise GC capability (it knows where all the heap references are), and technically can still allow the movement of objects in the heap, even if current GC implementations do not make use of that ability. My hope for Go is that it is not too late to forgo the undesirable architectural limitation of requiring objects to stay at fixed addresses, because much of what I think is ahead for GC in Go may rely on breaking that assumption. [to be clear, supporting various forms of temporary "pinning" of objects is fine, and does not represent a strong architectural limitation.]

The ability to move objects is key for two main qualities that future Go GC's (and GCs in pretty much any runtime) will find desirable:

1. The ability to defragment the heap in order to support indefinite execution lengths that do not depend on friendly behavior patterns for object sizes over time. And defragmentation requires moving objects around before they die. Heaps that can compact are dramatically more resilient than ones that don't. Much like GC itself, self-defragmenting heaps allow the relatively easy construction of software that is free to deal with arbitrarily complex data shapes and scenarios. The value of this freedom is often disputed by people who focus on "things that have worked in C/C++ for a long time", but is easily seen in the multitude of application logic that breaks when restricted to using "malloc'ed heap" or "arena allocator" memory management mechanisms.

2. The ability of move objects is fundamental to many of the techniques that support high throughput GC and fast allocation logic. Generational GC is a proven, nearly-universally-used technique that tends to improve GC throughput by at least one order of magnitude [compared to non-generational uses of similar algorithms] across a wide spectrum of application behaviors in many different runtimes. It achieves that by fundamentally focusing some efficient-at-sparse-occupancy algorithm on a known-to-be-sparse logical subset of the heap. And it maintains the qualities needed (tracking the logical subset of the heap, and keeping it sparse) by using object relocation (e.g. for promotion). While there *may* exist alternative techniques that achieve similar throughput improvements without moving objects, I don't know of any that have been widely and successfully used so far, in any runtime.

I would predict that as Go's use application cases expand it's GC implementations will evolve to include generational moving collectors. This *does not* necessarily mean a monolithic STW copying newgen. Given their [very correct] focus on latency, I'd expect even the newgen collectors in future generational GC collectors to prioritize the minimization or elimination of STW pauses, which will probably make them either fully concurrent, incremental-STW (with viable arbitrarily small incremental steps), or a combination of the two.

And yes, doing that (concurrent generational GC) while supporting great latency, high throughput, and high efficiency all at the same time, is very possible. It is NOT the hard or inherent tradeoff many people seem to believe it to be. And practical proof points already exist.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Vitaly Davidovich

unread,

Dec 22, 2016, 12:33:09 PM12/22/16

to mechanical-sympathy

On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:

Go's GC story is evolving. And it's doing so at a fairly quick pace. Judging it based on a current implementation and current choices is premature, I think.

Doesn't seem premature at all to me. Current implementation is what runs in production, what people will experience, and what they should expect/write their code towards. This is not too dissimilar from the Sufficiently Smart Compiler - it *could* do something in *theory*, but practical/real implementations don't. That's not to say their story won't improve over time, but judging current impl is pretty fair IMO.

I think that the Go team's *implementation priority* focus on latency is a good one. And unlike others, I don't actually think there is a strong latency vs. throughput tradeoff in the long run. Achieving great latency in GC does not inherently come at the expense of GC throughput or application throughput. Both can one had at the same time. My view is that GC implementations in general (for all runtimes) will move towards good latency + good throughput at the same time at they mature. And Go is no different here.

I don't see how minimizing latency cannot impact application throughput. The impact can be minimized, but I don't see how it can be (nearly) zero. The impact will vary from app to app, depending on how much it intrinsically can make use of CPU before bottlenecking elsewhere and giving up CPU for reasons other than GC.

A key architectural choice that has not been completely finalized in Go (AFAICT) has to do with the ability to move objects. It is common to see early GC-based runtime implementations that do not move objects around succumb to an architectural "hole" that creates a reliance on that fact early in the maturity cycles. This usually happens by allowing native code extensions to assume object addresses are fixed (i.e. that objects will not move during their lifetime) and getting stuck with that "assumed quality" in the long run due to practical compatibility needs. From what I can tell, it is not too late for Go to avoid this problem. Go still has precise GC capability (it knows where all the heap references are), and technically can still allow the movement of objects in the heap, even if current GC implementations do not make use of that ability. My hope for Go is that it is not too late to forgo the undesirable architectural limitation of requiring objects to stay at fixed addresses, because much of what I think is ahead for GC in Go may rely on breaking that assumption. [to be clear, supporting various forms of temporary "pinning" of objects is fine, and does not represent a strong architectural limitation.]

The ability to move objects is key for two main qualities that future Go GC's (and GCs in pretty much any runtime) will find desirable:

1. The ability to defragment the heap in order to support indefinite execution lengths that do not depend on friendly behavior patterns for object sizes over time. And defragmentation requires moving objects around before they die. Heaps that can compact are dramatically more resilient than ones that don't. Much like GC itself, self-defragmenting heaps allow the relatively easy construction of software that is free to deal with arbitrarily complex data shapes and scenarios. The value of this freedom is often disputed by people who focus on "things that have worked in C/C++ for a long time", but is easily seen in the multitude of application logic that breaks when restricted to using "malloc'ed heap" or "arena allocator" memory management mechanisms.

2. The ability of move objects is fundamental to many of the techniques that support high throughput GC and fast allocation logic. Generational GC is a proven, nearly-universally-used technique that tends to improve GC throughput by at least one order of magnitude [compared to non-generational uses of similar algorithms] across a wide spectrum of application behaviors in many different runtimes. It achieves that by fundamentally focusing some efficient-at-sparse-occupancy algorithm on a known-to-be-sparse logical subset of the heap. And it maintains the qualities needed (tracking the logical subset of the heap, and keeping it sparse) by using object relocation (e.g. for promotion). While there *may* exist alternative techniques that achieve similar throughput improvements without moving objects, I don't know of any that have been widely and successfully used so far, in any runtime.

I would predict that as Go's use application cases expand it's GC implementations will evolve to include generational moving collectors. This *does not* necessarily mean a monolithic STW copying newgen. Given their [very correct] focus on latency, I'd expect even the newgen collectors in future generational GC collectors to prioritize the minimization or elimination of STW pauses, which will probably make them either fully concurrent, incremental-STW (with viable arbitrarily small incremental steps), or a combination of the two.

And yes, doing that (concurrent generational GC) while supporting great latency, high throughput, and high efficiency all at the same time, is very possible. It is NOT the hard or inherent tradeoff many people seem to believe it to be. And practical proof points already exist.

If it's not "hard", why do we not see more GCs that have great throughput and low latency?

On Thursday, December 22, 2016 at 4:12:14 AM UTC-8, Vitaly Davidovich wrote:

FWIW, I think the Go team is right in favoring lower latency over throughput of their GC given the expected usage scenarios for Go.

In fact, most of the (Hotspot based) Java GC horror stories involve very long pauses (G1 and CMS not excluded) - I've yet to hear anyone complain that their "Big Data" servers are too slow because the GC is chipping away at throughput too much (most of those servers aren't CPU bound to begin with or can't become CPU bound due to bottlenecks elsewhere).

In the Hotspot world, there's also the unfortunate situation right now where G1 isn't really solving any problem nor advancing the performance (latency) boundaries. In fact, it taxes throughput quite heavily (e.g. significantly more expensive write barriers) but isn't breaking any new ground.

On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:

You might be interested in this Golang-group follow-up thread too… https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4 (naturally a more pro-Go
take on things…)

On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young wrote:
Thought people on here would enjoy this

https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2

--

Studying for the Turing test

--

You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Sent from my phone

Vitaly Davidovich

unread,

Dec 22, 2016, 12:45:54 PM12/22/16

to mechanical-sympathy

On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:

1. The ability to defragment the heap in order to support indefinite execution lengths that do not depend on friendly behavior patterns for object sizes over time. And defragmentation requires moving objects around before they die. Heaps that can compact are dramatically more resilient than ones that don't. Much like GC itself, self-defragmenting heaps allow the relatively easy construction of software that is free to deal with arbitrarily complex data shapes and scenarios. The value of this freedom is often disputed by people who focus on "things that have worked in C/C++ for a long time", but is easily seen in the multitude of application logic that breaks when restricted to using "malloc'ed heap" or "arena allocator" memory management mechanisms.

Let's not overlook the multitude of applications that simply break because of GC issues :).

Transparent defragmentation is unpredictable and there's no user control. For some apps, that's perfectly fine. For others, it's unacceptable. It's not even just GC - let's not forget linux THP having similar problems. Now, it's very convenient from a developer's perspective to let something else manage it, but it comes at a cost that needs to be kept in mind.

There's also the issue that C/C++ code aimed at high performance does not malloc like crazy. This isn't so much a GC issue, but rather over-reliance on heap (as in the case of Java). In high performance systems, you need control - the control comes at the cost of having to do more things manually via code. For some systems, that's mandatory and for others it's not (why folks write databases in Java/Go is a bit of a puzzler to me). A general-purpose automatic GC cannot, by definition, really know the memory usage patterns. It can try to adapt via heuristics, provide a plethora of tunables ("we don't know the application - we'll give users knobs they can turn themselves because they know their patterns better"), or it can be "always-on" trying to avoid worst case. But so long as there's no cooperation between user code and GC, there's a mismatch there.

Dan Eloff

unread,

Dec 22, 2016, 12:46:25 PM12/22/16

to mechanica...@googlegroups.com

In response to whether Go has limited itself to non-moving garbage collectors, I want to clarify that it has not. Legal uses of unsafe and the C API are subject to strict rules which exist for the sole purpose of keeping that option open. For example, if you take the address of an object and store it in a uintptr, which is treated as an integer by the GC, you may not convert that back into a pointer safely. Any pointer math through uintptr must have both conversions in the same expression, and that expression can't include things like function calls. I'm simplifying a little, but the reason for such a strict rule is precisely that the GC is allowed to move your object and it won't update uintptr references (because it's a precise collector.) Calling into C has similar limitations for the same reasons.

I applaud such forward thinking by the Go team, especially as a moving collector is likely unrivalled precisely for the reasons mentioned in this thread re fragmentation and generational collection.

Gil Tene

unread,

Dec 22, 2016, 12:59:19 PM12/22/16

to mechanical-sympathy

On Thursday, December 22, 2016 at 9:33:09 AM UTC-8, Vitaly Davidovich wrote:

On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:
Go's GC story is evolving. And it's doing so at a fairly quick pace. Judging it based on a current implementation and current choices is premature, I think.
Doesn't seem premature at all to me. Current implementation is what runs in production, what people will experience, and what they should expect/write their code towards. This is not too dissimilar from the Sufficiently Smart Compiler - it *could* do something in *theory*, but practical/real implementations don't. That's not to say their story won't improve over time, but judging current impl is pretty fair IMO.

Looking at the current state of implementation to judge current use is fine. But projecting from the current state to Go as a whole and assuming inherent qualities (something that I see a lot of) would be similar to concluding that "Java is slow because it is an interpreted language" and basing those conclusions on 1997 JVMs.

I think that the Go team's *implementation priority* focus on latency is a good one. And unlike others, I don't actually think there is a strong latency vs. throughput tradeoff in the long run. Achieving great latency in GC does not inherently come at the expense of GC throughput or application throughput. Both can one had at the same time. My view is that GC implementations in general (for all runtimes) will move towards good latency + good throughput at the same time at they mature. And Go is no different here.
I don't see how minimizing latency cannot impact application throughput. The impact can be minimized, but I don't see how it can be (nearly) zero. The impact will vary from app to app, depending on how much it intrinsically can make use of CPU before bottlenecking elsewhere and giving up CPU for reasons other than GC.

The are tiny tradeoffs involved, but they are "in the noise", and are usually immeasurable in practice. So yes, I'm saying that it is "nearly zero". Concurrency and latency are not inherently coupled with efficiency choices in GC work. In practice, they tend to have very little to do with each other.

The fundamentals are simple:

For a given live set, heap set, and rate of allocation, a GC algorithm has a certain amount of work to do. It needs to locate the dead material and reclaim it such that it can be allocated into again. The choice of algorithms used affect the amount of work needed to achieve this. But this amount of work tends to be only minutely affected by concurrency concerns, and is dramatically affected by other choices. E.g. the choice to use compacting collectors dramatically affects the cost of allocation (bump pointer allocation is dramatically cheaper than allocation in non-contiguously empty regions). E.g. the choice of using a generational filter (or not) has several orders of magnitude higher impact on GC throughput than the choice to do concurrent or tiny-incremental-STW work (or not). The choice to use evacuation as a compaction technique (common in copying and regional collectors, but not in classic MarkSweepCompact) can dramatically affect efficiency (in two opposite directions) and depends a lot more on the amount of empty heap (aka "heap headroom") available than on concurrency or incremental STW design choices.

Given the existence of concurrent collectors that are already able to sustain higher application throughput that STW in many practical cases (e.g. "Cassandra"), the "nearly free" or "even cheaper than free" argument has practical proof points.

A key architectural choice that has not been completely finalized in Go (AFAICT) has to do with the ability to move objects. It is common to see early GC-based runtime implementations that do not move objects around succumb to an architectural "hole" that creates a reliance on that fact early in the maturity cycles. This usually happens by allowing native code extensions to assume object addresses are fixed (i.e. that objects will not move during their lifetime) and getting stuck with that "assumed quality" in the long run due to practical compatibility needs. From what I can tell, it is not too late for Go to avoid this problem. Go still has precise GC capability (it knows where all the heap references are), and technically can still allow the movement of objects in the heap, even if current GC implementations do not make use of that ability. My hope for Go is that it is not too late to forgo the undesirable architectural limitation of requiring objects to stay at fixed addresses, because much of what I think is ahead for GC in Go may rely on breaking that assumption. [to be clear, supporting various forms of temporary "pinning" of objects is fine, and does not represent a strong architectural limitation.]

The ability to move objects is key for two main qualities that future Go GC's (and GCs in pretty much any runtime) will find desirable:

1. The ability to defragment the heap in order to support indefinite execution lengths that do not depend on friendly behavior patterns for object sizes over time. And defragmentation requires moving objects around before they die. Heaps that can compact are dramatically more resilient than ones that don't. Much like GC itself, self-defragmenting heaps allow the relatively easy construction of software that is free to deal with arbitrarily complex data shapes and scenarios. The value of this freedom is often disputed by people who focus on "things that have worked in C/C++ for a long time", but is easily seen in the multitude of application logic that breaks when restricted to using "malloc'ed heap" or "arena allocator" memory management mechanisms.

2. The ability of move objects is fundamental to many of the techniques that support high throughput GC and fast allocation logic. Generational GC is a proven, nearly-universally-used technique that tends to improve GC throughput by at least one order of magnitude [compared to non-generational uses of similar algorithms] across a wide spectrum of application behaviors in many different runtimes. It achieves that by fundamentally focusing some efficient-at-sparse-occupancy algorithm on a known-to-be-sparse logical subset of the heap. And it maintains the qualities needed (tracking the logical subset of the heap, and keeping it sparse) by using object relocation (e.g. for promotion). While there *may* exist alternative techniques that achieve similar throughput improvements without moving objects, I don't know of any that have been widely and successfully used so far, in any runtime.

I would predict that as Go's use application cases expand it's GC implementations will evolve to include generational moving collectors. This *does not* necessarily mean a monolithic STW copying newgen. Given their [very correct] focus on latency, I'd expect even the newgen collectors in future generational GC collectors to prioritize the minimization or elimination of STW pauses, which will probably make them either fully concurrent, incremental-STW (with viable arbitrarily small incremental steps), or a combination of the two.

And yes, doing that (concurrent generational GC) while supporting great latency, high throughput, and high efficiency all at the same time, is very possible. It is NOT the hard or inherent tradeoff many people seem to believe it to be. And practical proof points already exist.
If it's not "hard", why do we not see more GCs that have great throughput and low latency?

Focus? Working on the right problems?

It all starts with saying that keeping latency down is a priority, and that STW pauses (larger than other OS artifacts at least, at least) are unacceptable. The rest is good engineering and implementation.

In that sense, I think the Go team has the right focus and started from working on the right problems. They have a bunch more engineering and implementation work ahead to cover wider use cases, but their choices are already having an effect.

On Thursday, December 22, 2016 at 4:12:14 AM UTC-8, Vitaly Davidovich wrote:

FWIW, I think the Go team is right in favoring lower latency over throughput of their GC given the expected usage scenarios for Go.

In fact, most of the (Hotspot based) Java GC horror stories involve very long pauses (G1 and CMS not excluded) - I've yet to hear anyone complain that their "Big Data" servers are too slow because the GC is chipping away at throughput too much (most of those servers aren't CPU bound to begin with or can't become CPU bound due to bottlenecks elsewhere).

In the Hotspot world, there's also the unfortunate situation right now where G1 isn't really solving any problem nor advancing the performance (latency) boundaries. In fact, it taxes throughput quite heavily (e.g. significantly more expensive write barriers) but isn't breaking any new ground.

On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:

You might be interested in this Golang-group follow-up thread too… https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4 (naturally a more pro-Go
take on things…)

On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young wrote:
Thought people on here would enjoy this

https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2

--

Studying for the Turing test

--

You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Sent from my phone

Vitaly Davidovich

unread,

Dec 22, 2016, 1:46:47 PM12/22/16

to mechanical-sympathy

On Thu, Dec 22, 2016 at 12:59 PM, Gil Tene <g...@azul.com> wrote:

On Thursday, December 22, 2016 at 9:33:09 AM UTC-8, Vitaly Davidovich wrote:

On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:
Go's GC story is evolving. And it's doing so at a fairly quick pace. Judging it based on a current implementation and current choices is premature, I think.
Doesn't seem premature at all to me. Current implementation is what runs in production, what people will experience, and what they should expect/write their code towards. This is not too dissimilar from the Sufficiently Smart Compiler - it *could* do something in *theory*, but practical/real implementations don't. That's not to say their story won't improve over time, but judging current impl is pretty fair IMO.

Looking at the current state of implementation to judge current use is fine. But projecting from the current state to Go as a whole and assuming inherent qualities (something that I see a lot of) would be similar to concluding that "Java is slow because it is an interpreted language" and basing those conclusions on 1997 JVMs.

Sure.

I think that the Go team's *implementation priority* focus on latency is a good one. And unlike others, I don't actually think there is a strong latency vs. throughput tradeoff in the long run. Achieving great latency in GC does not inherently come at the expense of GC throughput or application throughput. Both can one had at the same time. My view is that GC implementations in general (for all runtimes) will move towards good latency + good throughput at the same time at they mature. And Go is no different here.
I don't see how minimizing latency cannot impact application throughput. The impact can be minimized, but I don't see how it can be (nearly) zero. The impact will vary from app to app, depending on how much it intrinsically can make use of CPU before bottlenecking elsewhere and giving up CPU for reasons other than GC.

The are tiny tradeoffs involved, but they are "in the noise", and are usually immeasurable in practice. So yes, I'm saying that it is "nearly zero". Concurrency and latency are not inherently coupled with efficiency choices in GC work. In practice, they tend to have very little to do with each other.

They're necessarily coupled IMO. If a CPU is occupied with GC work that could otherwise be used by the user app, it's a throughput hit. If extra instructions are executed on behalf of GC while also performing user work, it's a loss in throughput (e.g. write barriers, indirection barriers that may hijack execution, etc). I'm not even talking about the possible d-cache and i-cache pollution, cache misses, registers reserved to support GC, etc that may be involved. Since GC cannot know the allocation/memory pattern perfectly, it'll ultimately perform more CPU work than could otherwise be necessary. This is the price of admission for this type of automatic management. There's nothing inherently wrong with it if that cost is acceptable. Likewise, there's a development/mental cost to manual memory management as well. It's all about priorities and cost.

The fundamentals are simple:

For a given live set, heap set, and rate of allocation, a GC algorithm has a certain amount of work to do. It needs to locate the dead material and reclaim it such that it can be allocated into again. The choice of algorithms used affect the amount of work needed to achieve this. But this amount of work tends to be only minutely affected by concurrency concerns, and is dramatically affected by other choices. E.g. the choice to use compacting collectors dramatically affects the cost of allocation (bump pointer allocation is dramatically cheaper than allocation in non-contiguously empty regions). E.g. the choice of using a generational filter (or not) has several orders of magnitude higher impact on GC throughput than the choice to do concurrent or tiny-incremental-STW work (or not). The choice to use evacuation as a compaction technique (common in copying and regional collectors, but not in classic MarkSweepCompact) can dramatically affect efficiency (in two opposite directions) and depends a lot more on the amount of empty heap (aka "heap headroom") available than on concurrency or incremental STW design choices.

Concurrent collectors, at least in Hotspot, have a tendency to be outpaced by the mutators because they try to minimize throughput impact (i.e. they don't run all the time, but get triggered when certain thresholds are breached at certain check points). Then you have a problem of object graph mutation rate (not allocation rate), which in the case of G1 will also increase SATB queue processing and finalize marking phases. Yes, one can tune (or code) around it, and throw lots more headroom when all else fails.

So an alternative is you make mutator threads assist in relocations/fixups, which I believe is what C4 does (in a nutshell, I'm glossing over a bunch of things). But, this adds read barriers. I know we talked about their purported cost before. Certainly one can try to minimize the impact, but I'm having a hard time believing that it approaches the same level of impact as a full STW collector. Of course the impact will vary based on app profile (e.g. reference/pointer heavy vs flatter structures, object graph mutation rate, etc).

Given the existence of concurrent collectors that are already able to sustain higher application throughput that STW in many practical cases (e.g. "Cassandra"), the "nearly free" or "even cheaper than free" argument has practical proof points.

A key architectural choice that has not been completely finalized in Go (AFAICT) has to do with the ability to move objects. It is common to see early GC-based runtime implementations that do not move objects around succumb to an architectural "hole" that creates a reliance on that fact early in the maturity cycles. This usually happens by allowing native code extensions to assume object addresses are fixed (i.e. that objects will not move during their lifetime) and getting stuck with that "assumed quality" in the long run due to practical compatibility needs. From what I can tell, it is not too late for Go to avoid this problem. Go still has precise GC capability (it knows where all the heap references are), and technically can still allow the movement of objects in the heap, even if current GC implementations do not make use of that ability. My hope for Go is that it is not too late to forgo the undesirable architectural limitation of requiring objects to stay at fixed addresses, because much of what I think is ahead for GC in Go may rely on breaking that assumption. [to be clear, supporting various forms of temporary "pinning" of objects is fine, and does not represent a strong architectural limitation.]

The ability to move objects is key for two main qualities that future Go GC's (and GCs in pretty much any runtime) will find desirable:

1. The ability to defragment the heap in order to support indefinite execution lengths that do not depend on friendly behavior patterns for object sizes over time. And defragmentation requires moving objects around before they die. Heaps that can compact are dramatically more resilient than ones that don't. Much like GC itself, self-defragmenting heaps allow the relatively easy construction of software that is free to deal with arbitrarily complex data shapes and scenarios. The value of this freedom is often disputed by people who focus on "things that have worked in C/C++ for a long time", but is easily seen in the multitude of application logic that breaks when restricted to using "malloc'ed heap" or "arena allocator" memory management mechanisms.

2. The ability of move objects is fundamental to many of the techniques that support high throughput GC and fast allocation logic. Generational GC is a proven, nearly-universally-used technique that tends to improve GC throughput by at least one order of magnitude [compared to non-generational uses of similar algorithms] across a wide spectrum of application behaviors in many different runtimes. It achieves that by fundamentally focusing some efficient-at-sparse-occupancy algorithm on a known-to-be-sparse logical subset of the heap. And it maintains the qualities needed (tracking the logical subset of the heap, and keeping it sparse) by using object relocation (e.g. for promotion). While there *may* exist alternative techniques that achieve similar throughput improvements without moving objects, I don't know of any that have been widely and successfully used so far, in any runtime.

I would predict that as Go's use application cases expand it's GC implementations will evolve to include generational moving collectors. This *does not* necessarily mean a monolithic STW copying newgen. Given their [very correct] focus on latency, I'd expect even the newgen collectors in future generational GC collectors to prioritize the minimization or elimination of STW pauses, which will probably make them either fully concurrent, incremental-STW (with viable arbitrarily small incremental steps), or a combination of the two.

And yes, doing that (concurrent generational GC) while supporting great latency, high throughput, and high efficiency all at the same time, is very possible. It is NOT the hard or inherent tradeoff many people seem to believe it to be. And practical proof points already exist.
If it's not "hard", why do we not see more GCs that have great throughput and low latency?

Focus? Working on the right problems?

It all starts with saying that keeping latency down is a priority, and that STW pauses (larger than other OS artifacts at least, at least) are unacceptable. The rest is good engineering and implementation.

But why focus on latency if you're saying it's not "hard" to have low latency and high throughput? Surely if that's the case and you can have your cake and eat it too, it would be done routinely? I think Hotspot initially focused on pure throughput - when the java threads are running, the only overhead is card marks; otherwise, they own the CPU for the time slice they're there and all (otherwise) eligible CPUs are available for the app to use.

In that sense, I think the Go team has the right focus and started from working on the right problems. They have a bunch more engineering and implementation work ahead to cover wider use cases, but their choices are already having an effect.

On Thursday, December 22, 2016 at 4:12:14 AM UTC-8, Vitaly Davidovich wrote:

FWIW, I think the Go team is right in favoring lower latency over throughput of their GC given the expected usage scenarios for Go.

In fact, most of the (Hotspot based) Java GC horror stories involve very long pauses (G1 and CMS not excluded) - I've yet to hear anyone complain that their "Big Data" servers are too slow because the GC is chipping away at throughput too much (most of those servers aren't CPU bound to begin with or can't become CPU bound due to bottlenecks elsewhere).

In the Hotspot world, there's also the unfortunate situation right now where G1 isn't really solving any problem nor advancing the performance (latency) boundaries. In fact, it taxes throughput quite heavily (e.g. significantly more expensive write barriers) but isn't breaking any new ground.

On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:

You might be interested in this Golang-group follow-up thread too… https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4 (naturally a more pro-Go
take on things…)

On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young wrote:
Thought people on here would enjoy this

https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2

--

Studying for the Turing test

--

You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Sent from my phone

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Gil Tene

unread,

Dec 22, 2016, 3:58:28 PM12/22/16

to mechanica...@googlegroups.com

On Thursday, December 22, 2016 at 10:46:47 AM UTC-8, Vitaly Davidovich wrote:

On Thu, Dec 22, 2016 at 12:59 PM, Gil Tene <g...@azul.com> wrote:

On Thursday, December 22, 2016 at 9:33:09 AM UTC-8, Vitaly Davidovich wrote:

On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:
Go's GC story is evolving. And it's doing so at a fairly quick pace. Judging it based on a current implementation and current choices is premature, I think.
Doesn't seem premature at all to me. Current implementation is what runs in production, what people will experience, and what they should expect/write their code towards. This is not too dissimilar from the Sufficiently Smart Compiler - it *could* do something in *theory*, but practical/real implementations don't. That's not to say their story won't improve over time, but judging current impl is pretty fair IMO.

Looking at the current state of implementation to judge current use is fine. But projecting from the current state to Go as a whole and assuming inherent qualities (something that I see a lot of) would be similar to concluding that "Java is slow because it is an interpreted language" and basing those conclusions on 1997 JVMs.
Sure.

I think that the Go team's *implementation priority* focus on latency is a good one. And unlike others, I don't actually think there is a strong latency vs. throughput tradeoff in the long run. Achieving great latency in GC does not inherently come at the expense of GC throughput or application throughput. Both can one had at the same time. My view is that GC implementations in general (for all runtimes) will move towards good latency + good throughput at the same time at they mature. And Go is no different here.
I don't see how minimizing latency cannot impact application throughput. The impact can be minimized, but I don't see how it can be (nearly) zero. The impact will vary from app to app, depending on how much it intrinsically can make use of CPU before bottlenecking elsewhere and giving up CPU for reasons other than GC.

The are tiny tradeoffs involved, but they are "in the noise", and are usually immeasurable in practice. So yes, I'm saying that it is "nearly zero". Concurrency and latency are not inherently coupled with efficiency choices in GC work. In practice, they tend to have very little to do with each other.
They're necessarily coupled IMO. If a CPU is occupied with GC work that could otherwise be used by the user app, it's a throughput hit.

You may be confusing what you know about the details of some "somewhat concurrent, but not designed for simultaneous concurrency and efficiency" implementations with what is inherent to performing concurrent GC work.

The bulk (as in the vast vast vast majority) of overall rate of CPU consumed by GC work is an inherent result of allocation rate (which is a unit of throughput). How much of that work you do per allocation unit is NOT strongly affected by concurrency and interleaving choices in GC algorithms. I listed examples of dramatically more dominant algorithmic choices that have nothing to do with concurrency or incremental ability.

If extra instructions are executed on behalf of GC while also performing user work, it's a loss in throughput (e.g. write barriers, indirection barriers that may hijack execution, etc).

Again, most of that "extra instruction" cost is driven not by concurrency or incremental ability, but by other large-throughput-affecting choices. E.g. the need for a write barrier is usually dictated by generational collection (STW or not doesn't matter), not by concurrency. And adding a write barrier (to achieve generational collection) is a proven way to dramatically increase GC efficiency.

I'm not even talking about the possible d-cache and i-cache pollution, cache misses, registers reserved to support GC, etc that may be involved.

The impacts of these is so tiny that it is hard to measure. Certainly a few orders of magnitude smaller than e.g. generational collection yes/no decisions.

Since GC cannot know the allocation/memory pattern perfectly, it'll ultimately perform more CPU work than could otherwise be necessary. This is the price of admission for this type of automatic management.

When a concurrent GC's triggering and throttling mechanisms successfully get to 98%+ occupancy on regular basis, this perceived cost fades away. And it is easy and common to see that level of full use of the heap resources [yes, "70% occupancy trigger" is a silly thing, but it is in no way inherent or needed, it's just an artifact of some implementations]. Even where things are more conservative (GC kicks off earlier than it should), the impacts of the extra time spent tend to be significantly lower than other choices, and are often offset by the easier choices that concurrency allows.

To use a pratical example: a concurrent generational collector can easily dispense with the artificial line between oldgen and newgen heaps that many STW collectors employ, because when neither generation incurs significant pauses, the original reason for having the line there is gone (when oldgen STW was significant, exhausting the oldgen early was unacceptable). This allows the newgen to always use the entire available heap, reducing newgen collection frequency by a significant factor, which eliminates a significant amount of GC work (with the same application throughput in the same heap size). This results in LESS work for GC, and higher throughput. This increased efficiency is more an artifact of GC algorithm choice (share the heap between old and newgen rather than segregate it) than a result of concurrency. Concurrency simply enables that choice to be made much more easily.

There's nothing inherently wrong with it if that cost is acceptable. Likewise, there's a development/mental cost to manual memory management as well. It's all about priorities and cost.

A virtually-zero additional cost in CPU is acceptable by definition.

The fundamentals are simple:

For a given live set, heap set, and rate of allocation, a GC algorithm has a certain amount of work to do. It needs to locate the dead material and reclaim it such that it can be allocated into again. The choice of algorithms used affect the amount of work needed to achieve this. But this amount of work tends to be only minutely affected by concurrency concerns, and is dramatically affected by other choices. E.g. the choice to use compacting collectors dramatically affects the cost of allocation (bump pointer allocation is dramatically cheaper than allocation in non-contiguously empty regions). E.g. the choice of using a generational filter (or not) has several orders of magnitude higher impact on GC throughput than the choice to do concurrent or tiny-incremental-STW work (or not). The choice to use evacuation as a compaction technique (common in copying and regional collectors, but not in classic MarkSweepCompact) can dramatically affect efficiency (in two opposite directions) and depends a lot more on the amount of empty heap (aka "heap headroom") available than on concurrency or incremental STW design choices.
Concurrent collectors, at least in Hotspot, have a tendency to be outpaced by the mutators because they try to minimize throughput impact (i.e. they don't run all the time, but get triggered when certain thresholds are breached at certain check points). Then you have a problem of object graph mutation rate (not allocation rate), which in the case of G1 will also increase SATB queue processing and finalize marking phases. Yes, one can tune (or code) around it, and throw lots more headroom when all else fails.

You note examples from partially concurrent collectors that had never started with a "stay away from STW pauses" approach, and were not designed to maintain collector efficiencies independent of mutation and/or allocation rates. Future Go collectors do not have to mimic those design choices, and neither are other already-successful concurrent collectors that produce both consistent low latency and high throughput at the same time.

Being outpaced by mutators is a problem only if the algorithms you choose are susceptible to such outpacing. Same for mutation rates. Making it impossible to be outpaced by the mutator by guaranteeing single pass completion and making the collector work invariant to mutation (e.g. by not needing to track or react to mutations to complete a GC phase) eliminates those problems. They are design choices driven by simple core statements, like "don't make collector work per allocation unit vary with allocation rate" and "don't make collector work per allocation unit vary with mutation rate". Both are doable without impacting efficiency.

So an alternative is you make mutator threads assist in relocations/fixups, which I believe is what C4 does (in a nutshell, I'm glossing over a bunch of things). But, this adds read barriers. I know we talked about their purported cost before. Certainly one can try to minimize the impact, but I'm having a hard time believing that it approaches the same level of impact as a full STW collector. Of course the impact will vary based on app profile (e.g. reference/pointer heavy vs flatter structures, object graph mutation rate, etc).

There are many possible designs that can meet those needs without some inherent sacrifice in efficiency. Especially when that efficiency is often regained through some other side-effect.

Yes, the C4 LVB barrier in one such design example. It amounts to adding a test+jmp sequence in reference load path [no memory access, virtually-always-predictable, devolves to a single u-op on x86]. While there is some "cost" to this extra u-op, it's throughput impact is insignificant when compared to other things (e.g. a typical 3-4x reduction in newgen frequency). It is already significantly cheaper than the extra write barrier work done in G1 for remembered set tracking for STW compaction, for example.

An extreme way to think of a STW collector is as "the mutators are actually doing all the GC work". It's just that the mutators have stopped doing any mutator-useful work until they are done with all the GC work when a collection is triggered. A concurrent collector shifts the bulk of the work to background GC threads that may be able to do that work concurrently with the mutator threads doing something useful. Loss of efficiency is in no way inherent to that shift of work (at least not any more than STW parallelism with work stealing queues is less efficient that single-threaded operation).

A self-healing barrier (like the one used in C4) is one example of a design mechanism that maintains this efficiency. Self-healing barriers eliminate the inefficiency sometimes found in concurrent collectors by ensuring that barriers only [effectively] trigger at most once per work unit. This makes the GC work invariant to heap topology, object graph mutations, etc., and to who [mutator or GC thread] is actually doping the work. The work per allocation unit then remains fairly fixed. The wish to shift the vast-vast-vast majority of that work to collector threads is driven purely by mutator latency concerns, but does not affect efficiency in any fundamental or significant way.

Given the existence of concurrent collectors that are already able to sustain higher application throughput that STW in many practical cases (e.g. "Cassandra"), the "nearly free" or "even cheaper than free" argument has practical proof points.

A key architectural choice that has not been completely finalized in Go (AFAICT) has to do with the ability to move objects. It is common to see early GC-based runtime implementations that do not move objects around succumb to an architectural "hole" that creates a reliance on that fact early in the maturity cycles. This usually happens by allowing native code extensions to assume object addresses are fixed (i.e. that objects will not move during their lifetime) and getting stuck with that "assumed quality" in the long run due to practical compatibility needs. From what I can tell, it is not too late for Go to avoid this problem. Go still has precise GC capability (it knows where all the heap references are), and technically can still allow the movement of objects in the heap, even if current GC implementations do not make use of that ability. My hope for Go is that it is not too late to forgo the undesirable architectural limitation of requiring objects to stay at fixed addresses, because much of what I think is ahead for GC in Go may rely on breaking that assumption. [to be clear, supporting various forms of temporary "pinning" of objects is fine, and does not represent a strong architectural limitation.]

The ability to move objects is key for two main qualities that future Go GC's (and GCs in pretty much any runtime) will find desirable:

1. The ability to defragment the heap in order to support indefinite execution lengths that do not depend on friendly behavior patterns for object sizes over time. And defragmentation requires moving objects around before they die. Heaps that can compact are dramatically more resilient than ones that don't. Much like GC itself, self-defragmenting heaps allow the relatively easy construction of software that is free to deal with arbitrarily complex data shapes and scenarios. The value of this freedom is often disputed by people who focus on "things that have worked in C/C++ for a long time", but is easily seen in the multitude of application logic that breaks when restricted to using "malloc'ed heap" or "arena allocator" memory management mechanisms.

2. The ability of move objects is fundamental to many of the techniques that support high throughput GC and fast allocation logic. Generational GC is a proven, nearly-universally-used technique that tends to improve GC throughput by at least one order of magnitude [compared to non-generational uses of similar algorithms] across a wide spectrum of application behaviors in many different runtimes. It achieves that by fundamentally focusing some efficient-at-sparse-occupancy algorithm on a known-to-be-sparse logical subset of the heap. And it maintains the qualities needed (tracking the logical subset of the heap, and keeping it sparse) by using object relocation (e.g. for promotion). While there *may* exist alternative techniques that achieve similar throughput improvements without moving objects, I don't know of any that have been widely and successfully used so far, in any runtime.

I would predict that as Go's use application cases expand it's GC implementations will evolve to include generational moving collectors. This *does not* necessarily mean a monolithic STW copying newgen. Given their [very correct] focus on latency, I'd expect even the newgen collectors in future generational GC collectors to prioritize the minimization or elimination of STW pauses, which will probably make them either fully concurrent, incremental-STW (with viable arbitrarily small incremental steps), or a combination of the two.

And yes, doing that (concurrent generational GC) while supporting great latency, high throughput, and high efficiency all at the same time, is very possible. It is NOT the hard or inherent tradeoff many people seem to believe it to be. And practical proof points already exist.
If it's not "hard", why do we not see more GCs that have great throughput and low latency?

Focus? Working on the right problems?

It all starts with saying that keeping latency down is a priority, and that STW pauses (larger than other OS artifacts at least, at least) are unacceptable. The rest is good engineering and implementation.
But why focus on latency if you're saying it's not "hard" to have low latency and high throughput?

I used "hard" in "hard tradeoff" above in the sense off "rigid", "necessary", or "inherent". Not in the sense "difficult" or "complicated".

As to why focus on latency? I'm not saying focusing only on latency is any more right than focusing only on throughput. But ignoring latency in the core design choices is something that its hard to correct later (as evidenced by 20 years of attempts to deal with this in HotSpot collectors, that skipped latency-derived core design drivers).

Focusing on Latency AND Throughout is the best way too go IMO. And I'm saying that the two are in no way contradictory or opposite. They can sometimes even complimentary. A high throughput concurrent collector is a very possible (and very real) thing.

The Go team have started from the latency side, and IMO they will make lots of future progress on the throughput side. That's why I predict moving generational collection and compacting oldgen in their future. But it's also why I predict that those collectors will not be doing monolithic stop-the-world moving and compaction. i.e. I think they will end up with either concurrent or finely-incremental-STW relocation [of which I think concurrent is both more likely and more desirable] in all generations.

Surely if that's the case and you can have your cake and eat it too, it would be done routinely?

We've been routinely eating our cake and having it with C4. And I firmly believe there are at least 17 other ways for others to skin that same cat. A focused set of people undeterred by "Surely if this was doable it would already have been done" talk is all it takes. And especially since it actually HAS been done at least once, so the question of feasibility is already answered.

I think Hotspot initially focused on pure throughput - when the java threads are running, the only overhead is card marks; otherwise, they own the CPU for the time slice they're there and all (otherwise) eligible CPUs are available for the app to use.

The overall approach with HotSpot can be thought of as "start with a STW generational collector and increment your way to lower average/typical pauses in the old generations". E.g. all practical collectors in HotSpot use a monolithic stop-the-world copying newgen collector (not concurrent or incremental, monolithic, run-to-completion STW, always). And all collectors in HotSpot have a fallback to a full STW oldgen pause. The various non-monolithic-STW "cool modes" (Partially Concurrent non-compacting Oldgen in CMS, Incremental STW compaction in G1) are filters in the oldgen that occur between full STW newgen pauses and full STW oldgen pauses. They are "happy" when the filters work well enough to keep the STW newgen pauses in the tens of msec and delay the full-STW pauses for hours or days. And they consider it to be "ok and still working" even when those levels are not met.

What HotSpot has never focused on is capping STW pause times, which would require the fundamental ways of collecting both the oldgen and the newgen to not be STW (including any fallback). And since that has never been the focus, not a lot of design has been put into achieving concurrency and throughput at the same time. E.g. since the vast amount of CPU in HotSpot GC tends to be spent in the newgen collectors, and those are always monolithic-STW (i.e. they have zero concurrency with the application), considerations of concurrency and throughout don't pop up to the top of anyone's list of concerns.

But that's just HotSpot. It's not "Java". Just one [very] popular implementation of a Java runtime.

Go doesn't have to start from the same point. Instead they seem to be starting from a "start with a low-latency GC and increment your way to higher throughput". They seem to have taken that [very smart] step early [after having a pretty annoying STW early history, which I'd equate to the "classic VM" times in the early Java days], and I'm hoping they'll stick with that path.

In that sense, I think the Go team has the right focus and started from working on the right problems. They have a bunch more engineering and implementation work ahead to cover wider use cases, but their choices are already having an effect.

On Thursday, December 22, 2016 at 4:12:14 AM UTC-8, Vitaly Davidovich wrote:

FWIW, I think the Go team is right in favoring lower latency over throughput of their GC given the expected usage scenarios for Go.

In fact, most of the (Hotspot based) Java GC horror stories involve very long pauses (G1 and CMS not excluded) - I've yet to hear anyone complain that their "Big Data" servers are too slow because the GC is chipping away at throughput too much (most of those servers aren't CPU bound to begin with or can't become CPU bound due to bottlenecks elsewhere).

In the Hotspot world, there's also the unfortunate situation right now where G1 isn't really solving any problem nor advancing the performance (latency) boundaries. In fact, it taxes throughput quite heavily (e.g. significantly more expensive write barriers) but isn't breaking any new ground.

On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:

You might be interested in this Golang-group follow-up thread too… https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4 (naturally a more pro-Go
take on things…)

On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young wrote:
Thought people on here would enjoy this

https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2

--

Studying for the Turing test

--

You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Sent from my phone

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Vitaly Davidovich

unread,

Dec 22, 2016, 5:18:30 PM12/22/16

to mechanical-sympathy

So I agree that Go focusing on low latency is the way to go. I'm not sold that they can make it so low cost that building, for example, a high performance database in Go would have comparable latency and throughput to one built in a non-GC'd language/runtime. All the "little/insignificant" inefficiencies you brush away as being virtually zero/undetectable/etc add up. Heck, we've talked about the costs of context (or mode) switches on this thread before, and how they can be dominated by the caches needing to be warmed up (vs register/context being saved/restored). But you deem that type of cost negligible. I'll agree that it's negligible compared to 100s of millis (or worse - multiple secs or tens of secs) GC pauses. It's definitely not negligible for the lower level systems being built.

I'm curious about this bit:

Yes, the C4 LVB barrier in one such design example. It amounts to adding a test+jmp sequence in reference load path [no memory access, virtually-always-predictable, devolves to a single u-op on x86]. While there is some "cost" to this extra u-op, it's throughput impact is insignificant when compared to other things (e.g. a typical 3-4x reduction in newgen frequency). It is already significantly cheaper than the extra write barrier work done in G1 for remembered set tracking for STW compaction, for example.

So there's a test+jmp inserted before every load, which is pretty standard and common fare in Java code. But how is it "no memory access"? IIRC, when we last discussed the intricacies, a LVB would check whether a phase change occurred, which I think is a memory read, isn't it? Or by "no memory access" you mean it should still be valid in the cache? Also, I'm assuming once inlining stops at some point, the LVB/phase change information needs to be regathered by callees that were not inlined? So in other words, this will devolve into another prologue (beyond call frame setup)? I can't see how this won't impact i-cache or pollute the BTB history.

On Thu, Dec 22, 2016 at 3:58 PM, Gil Tene <g...@azul.com> wrote:

On Thursday, December 22, 2016 at 10:46:47 AM UTC-8, Vitaly Davidovich wrote:

On Thu, Dec 22, 2016 at 12:59 PM, Gil Tene <g...@azul.com> wrote:

On Thursday, December 22, 2016 at 9:33:09 AM UTC-8, Vitaly Davidovich wrote:

On Thu, Dec 22, 2016 at 12:14 PM, Gil Tene <g...@azul.com> wrote:
Go's GC story is evolving. And it's doing so at a fairly quick pace. Judging it based on a current implementation and current choices is premature, I think.
Doesn't seem premature at all to me. Current implementation is what runs in production, what people will experience, and what they should expect/write their code towards. This is not too dissimilar from the Sufficiently Smart Compiler - it *could* do something in *theory*, but practical/real implementations don't. That's not to say their story won't improve over time, but judging current impl is pretty fair IMO.

Looking at the current state of implementation to judge current use is fine. But projecting from the current state to Go as a whole and assuming inherent qualities (something that I see a lot of) would be similar to concluding that "Java is slow because it is an interpreted language" and basing those conclusions on 1997 JVMs.
Sure.

I think that the Go team's *implementation priority* focus on latency is a good one. And unlike others, I don't actually think there is a strong latency vs. throughput tradeoff in the long run. Achieving great latency in GC does not inherently come at the expense of GC throughput or application throughput. Both can one had at the same time. My view is that GC implementations in general (for all runtimes) will move towards good latency + good throughput at the same time at they mature. And Go is no different here.
I don't see how minimizing latency cannot impact application throughput. The impact can be minimized, but I don't see how it can be (nearly) zero. The impact will vary from app to app, depending on how much it intrinsically can make use of CPU before bottlenecking elsewhere and giving up CPU for reasons other than GC.

The are tiny tradeoffs involved, but they are "in the noise", and are usually immeasurable in practice. So yes, I'm saying that it is "nearly zero". Concurrency and latency are not inherently coupled with efficiency choices in GC work. In practice, they tend to have very little to do with each other.
They're necessarily coupled IMO. If a CPU is occupied with GC work that could otherwise be used by the user app, it's a throughput hit.

You may be confusing what you know about the details of some "somewhat concurrent, but not designed for simultaneous concurrency and efficiency" implementations with what is inherent to performing concurrent GC work.

The bulk (as in the vast vast vast majority) of overall rate of CPU consumed by GC work is an inherent result of allocation rate (which is a unit of throughput). How much of that work you do per allocation unit is NOT strongly affected by concurrency and interleaving choices in GC algorithms. I listed examples of dramatically more dominant algorithmic choices that have nothing to do with concurrency or incremental ability.

If extra instructions are executed on behalf of GC while also performing user work, it's a loss in throughput (e.g. write barriers, indirection barriers that may hijack execution, etc).

Again, most of that "extra instruction" cost is driven not by concurrency or incremental ability, but by other large-throighpout-affecting choices. E.g. the need for a write barrier is usually dictated by generational collection (STW or not doesn't matter), not by concurrency. And adding a write barrier (to achieve generational collection) is a proven way to dramatically increase GC efficiency.

I'm not even talking about the possible d-cache and i-cache pollution, cache misses, registers reserved to support GC, etc that may be involved.

The impacts of these is so tiny that it is hard to measure. Certainly a few orders of magnitude smaller than e.g. generational collection yes/no decisions.

Since GC cannot know the allocation/memory pattern perfectly, it'll ultimately perform more CPU work than could otherwise be necessary. This is the price of admission for this type of automatic management.

When a concurrent GC's triggering and throttling mechanisms successfully get to 98%+ occupancy on regular basis, this perceived cost fades away. And it is easy and common to see that level of full use of the heap resources [yes, "70% occupancy trigger" is a silly thing, but it is in no way inherent or needed, it's just an artifact of some implementations]. Even where things are more conservative (GC kicks off earlier than it should), the impacts of the extra time spent tend to be significantly lower than other choices, and are often offset by the easier choices that concurrency allows.

To use a pratical example: a concurrent generational collector can easily dispense with the artificial line between oldgen and newgen heaps that many STW collectors employ, because when neither generation incurs significant pauses, the original reason for having the line there is gone (when oldgen STW was significant, exhausting the oldgen early was unacceptable). This allows the newgen to always use the entire available heap, reducing newgen collection frequency by a significant factor, which eliminates a significant amount of GC work (with the same application throughput in the same heap size). This results in LESS work for GC, and higher throughput. This increased efficiency is more an artifact of GC algorithm choice (share the heap between old and newgen rather than segregate it) than a result of concurrency. Concurrency simply enables that choice to be made much more easily.

There's nothing inherently wrong with it if that cost is acceptable. Likewise, there's a development/mental cost to manual memory management as well. It's all about priorities and cost.

A virtually-zero additional cost in CPU is acceptable by definition.

The fundamentals are simple:

For a given live set, heap set, and rate of allocation, a GC algorithm has a certain amount of work to do. It needs to locate the dead material and reclaim it such that it can be allocated into again. The choice of algorithms used affect the amount of work needed to achieve this. But this amount of work tends to be only minutely affected by concurrency concerns, and is dramatically affected by other choices. E.g. the choice to use compacting collectors dramatically affects the cost of allocation (bump pointer allocation is dramatically cheaper than allocation in non-contiguously empty regions). E.g. the choice of using a generational filter (or not) has several orders of magnitude higher impact on GC throughput than the choice to do concurrent or tiny-incremental-STW work (or not). The choice to use evacuation as a compaction technique (common in copying and regional collectors, but not in classic MarkSweepCompact) can dramatically affect efficiency (in two opposite directions) and depends a lot more on the amount of empty heap (aka "heap headroom") available than on concurrency or incremental STW design choices.
Concurrent collectors, at least in Hotspot, have a tendency to be outpaced by the mutators because they try to minimize throughput impact (i.e. they don't run all the time, but get triggered when certain thresholds are breached at certain check points). Then you have a problem of object graph mutation rate (not allocation rate), which in the case of G1 will also increase SATB queue processing and finalize marking phases. Yes, one can tune (or code) around it, and throw lots more headroom when all else fails.

You note examples from partially concurrent collectors that had never started with a "stay away from STW pauses" approach, and were not designed to maintain collector efficiencies independent of mutation and/or allocation rates. Future Go collectors do not have to mimic those design choices, and neither are other already-successful concurrent collectors that produce both consistent low latency and high throughput at the same time.

Being outpaced by mutators is a problem only if the algorithms you choose are susceptible to such outpacing. Same for mutation rates. Making it impossible to be outpaced by the mutator by guaranteeing single pass completion and making the collector work invariant to mutation (e.g. by not needing to track or react to mutations to complete a GC phase) eliminates those problems. They are design choices driven by simple core statements, like "don't make collector work per allocation unit vary with allocation rate" and "don't make collector work per allocation unit vary with mutation rate". Both are doable without impacting efficiency.

So an alternative is you make mutator threads assist in relocations/fixups, which I believe is what C4 does (in a nutshell, I'm glossing over a bunch of things). But, this adds read barriers. I know we talked about their purported cost before. Certainly one can try to minimize the impact, but I'm having a hard time believing that it approaches the same level of impact as a full STW collector. Of course the impact will vary based on app profile (e.g. reference/pointer heavy vs flatter structures, object graph mutation rate, etc).

There are many possible designs that can meet those needs without some inherent sacrifice in efficiency. Especially when that efficiency is often regained through some other side-effect.

Yes, the C4 LVB barrier in one such design example. It amounts to adding a test+jmp sequence in reference load path [no memory access, virtually-always-predictable, devolves to a single u-op on x86]. While there is some "cost" to this extra u-op, it's throughput impact is insignificant when compared to other things (e.g. a typical 3-4x reduction in newgen frequency). It is already significantly cheaper than the extra write barrier work done in G1 for remembered set tracking for STW compaction, for example.

An extreme way to think of a STW collector is as "the mutators are actually doing all the GC work". It's just that the mutators have stopped doing any mutator-useful work until they are done with all the GC work when a collection is triggered. A concurrent collector shifts the bulk of the work to background GC threads that may be able to do that work concurrently with the mutator threads doing something useful. Loss of efficiency is in no way inherent to that shift of work (at least not any more than STW parallelism with work stealing queues is less efficient that single-threaded operation).

A self-healing barrier (like the one used in C4) is one example of a design mechanism that maintains this efficiency. Self-healing barriers eliminate the inefficiency sometimes found in concurrent collectors buy ensuring that barriers only [effectively] trigger at most once per work unit. This makes the GC work invariant to heap topology, object graph mutations, etc., and to who [mutator or GC thread] is actually doping the work. The work per allocation unit then remains fairly fixed. The wish to shift the vast-vast-vast majority of that work to collector threads is driven purely by mutator latency concerns, but does not affect efficiency in any fundamental or significant way.

Given the existence of concurrent collectors that are already able to sustain higher application throughput that STW in many practical cases (e.g. "Cassandra"), the "nearly free" or "even cheaper than free" argument has practical proof points.

A key architectural choice that has not been completely finalized in Go (AFAICT) has to do with the ability to move objects. It is common to see early GC-based runtime implementations that do not move objects around succumb to an architectural "hole" that creates a reliance on that fact early in the maturity cycles. This usually happens by allowing native code extensions to assume object addresses are fixed (i.e. that objects will not move during their lifetime) and getting stuck with that "assumed quality" in the long run due to practical compatibility needs. From what I can tell, it is not too late for Go to avoid this problem. Go still has precise GC capability (it knows where all the heap references are), and technically can still allow the movement of objects in the heap, even if current GC implementations do not make use of that ability. My hope for Go is that it is not too late to forgo the undesirable architectural limitation of requiring objects to stay at fixed addresses, because much of what I think is ahead for GC in Go may rely on breaking that assumption. [to be clear, supporting various forms of temporary "pinning" of objects is fine, and does not represent a strong architectural limitation.]

The ability to move objects is key for two main qualities that future Go GC's (and GCs in pretty much any runtime) will find desirable:

1. The ability to defragment the heap in order to support indefinite execution lengths that do not depend on friendly behavior patterns for object sizes over time. And defragmentation requires moving objects around before they die. Heaps that can compact are dramatically more resilient than ones that don't. Much like GC itself, self-defragmenting heaps allow the relatively easy construction of software that is free to deal with arbitrarily complex data shapes and scenarios. The value of this freedom is often disputed by people who focus on "things that have worked in C/C++ for a long time", but is easily seen in the multitude of application logic that breaks when restricted to using "malloc'ed heap" or "arena allocator" memory management mechanisms.

2. The ability of move objects is fundamental to many of the techniques that support high throughput GC and fast allocation logic. Generational GC is a proven, nearly-universally-used technique that tends to improve GC throughput by at least one order of magnitude [compared to non-generational uses of similar algorithms] across a wide spectrum of application behaviors in many different runtimes. It achieves that by fundamentally focusing some efficient-at-sparse-occupancy algorithm on a known-to-be-sparse logical subset of the heap. And it maintains the qualities needed (tracking the logical subset of the heap, and keeping it sparse) by using object relocation (e.g. for promotion). While there *may* exist alternative techniques that achieve similar throughput improvements without moving objects, I don't know of any that have been widely and successfully used so far, in any runtime.

I would predict that as Go's use application cases expand it's GC implementations will evolve to include generational moving collectors. This *does not* necessarily mean a monolithic STW copying newgen. Given their [very correct] focus on latency, I'd expect even the newgen collectors in future generational GC collectors to prioritize the minimization or elimination of STW pauses, which will probably make them either fully concurrent, incremental-STW (with viable arbitrarily small incremental steps), or a combination of the two.

And yes, doing that (concurrent generational GC) while supporting great latency, high throughput, and high efficiency all at the same time, is very possible. It is NOT the hard or inherent tradeoff many people seem to believe it to be. And practical proof points already exist.
If it's not "hard", why do we not see more GCs that have great throughput and low latency?

Focus? Working on the right problems?

It all starts with saying that keeping latency down is a priority, and that STW pauses (larger than other OS artifacts at least, at least) are unacceptable. The rest is good engineering and implementation.
But why focus on latency if you're saying it's not "hard" to have low latency and high throughput?

I used "hard" in "hard tradeoff" above in the sense off "rigid", "necessary", or "inherent". Not in the sense "difficult" or "complicated".

As to why focus on latency? I'm not saying focusing only on latency is any more right than focusing only on throughput. But ignoring latency in the core design choices is something that its hard to correct later (as evidenced by 20 years of attempts to deal with this in HotSpot collectors, that skipped latency-derived core design drivers).

Focusing on Latency AND Throughout is the best way too go IMO. And I'm saying that the two are in no way contradictory or opposite. They can sometimes even complimentary. A high throughput concurrent collector is a very possible (and very real) thing.

The Go team have started from the latency side, and IMO they will make lots of future progress on the throughput side. That's why I predict moving generational collection and compacting oldgen in their future. But it's also why I predict that those collectors will not be doing monolithic stop-the-world moving and compaction. i.e. I think they will end up with either concurrent or finely-incremental-STW relocation [of which I think concurrent is both more likely and more desirable] in all generations.

Surely if that's the case and you can have your cake and eat it too, it would be done routinely?

We've been routinely eating our cake and having it with C4. And I firmly believe there are at least 17 other ways for other to skin that same cat. A focused set of people undeterred by "Surely if this was doable it would already have been done" talk is all it takes. And especially since it actually HAS been done at least once, the question of feasibility is already answered.

I think Hotspot initially focused on pure throughput - when the java threads are running, the only overhead is card marks; otherwise, they own the CPU for the time slice they're there and all (otherwise) eligible CPUs are available for the app to use.

The overall approach with HotSpot can be though of as "start with a STW generational collector and increment your way to lower average/typical pauses in the old generations". E.g. all practical collectors in HotSpot use a monolithic stop-the-world copying newgen collector (not concurrent or incremental, monolithic, run-to-completion STW, always). And all collectors in HotSpot have a fallback to a full STW oldgen pause. The various non-monolithic-STW "cool modes" (Partially Concurrent non-compacting Oldgen in CMS, Incremental STW compaction in G1) are filters in the oldgen that occur between full STW newgen pauses and full STW oldgen pauses. They are "happy" when the filters work well enough to keep the STW newgen pauses in the tens of msec and delay the full-STW pauses for hours or days. And they consider it to be "ok and still working" even when those levels are not met.

What HotSpot has never focused on is capping STW pause times, which would require the fundamental way of collecting both the oldgen and the newgen to now be STW (including any fallback). And since that has never been the focus, not a lot of design has been put into achieving concurrency and throughput at the same time. E.g. since the vast amount of CPU in HotSpot GC tends to be spent in the newgen collectors, and those are always monolithic-STW (i.e. they have zero concurrency with the application), considerations of concurrency and throughout don't pop up to the top of anyone's list of concerns.

But that's just HotSpot. It's not "Java". Just one [very] popular implementation of a Java runtime.

Go doesn't have to start from the same point. Instead they seem to be starting from a "start with a low-latency GC and increment your way to higher throughput". They seem to have taken that [very smart] step early [after having a pretty annoying STW early history, which I'd equate to the "classic VM" times in the early Java days], and I'm hoping they'll stick with that path.

In that sense, I think the Go team has the right focus and started from working on the right problems. They have a bunch more engineering and implementation work ahead to cover wider use cases, but their choices are already having an effect.

On Thursday, December 22, 2016 at 4:12:14 AM UTC-8, Vitaly Davidovich wrote:

FWIW, I think the Go team is right in favoring lower latency over throughput of their GC given the expected usage scenarios for Go.

In fact, most of the (Hotspot based) Java GC horror stories involve very long pauses (G1 and CMS not excluded) - I've yet to hear anyone complain that their "Big Data" servers are too slow because the GC is chipping away at throughput too much (most of those servers aren't CPU bound to begin with or can't become CPU bound due to bottlenecks elsewhere).

In the Hotspot world, there's also the unfortunate situation right now where G1 isn't really solving any problem nor advancing the performance (latency) boundaries. In fact, it taxes throughput quite heavily (e.g. significantly more expensive write barriers) but isn't breaking any new ground.

On Wed, Dec 21, 2016 at 11:38 PM Zellyn <zel...@gmail.com> wrote:

You might be interested in this Golang-group follow-up thread too… https://groups.google.com/forum/#!topic/golang-nuts/DyxPz-cQDe4 (naturally a more pro-Go
take on things…)

On Wednesday, December 21, 2016 at 8:23:02 AM UTC-5, Greg Young wrote:
Thought people on here would enjoy this

https://medium.com/@octskyward/modern-garbage-collection-911ef4f8bd8e#.ptdzwcmq2

--

Studying for the Turing test

--

You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
Sent from my phone

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.

To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-sympathy+unsubscribe...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Gil Tene

unread,

Dec 22, 2016, 8:14:03 PM12/22/16

to mechanical-sympathy

On Thursday, December 22, 2016 at 2:18:30 PM UTC-8, Vitaly Davidovich wrote:
> So I agree that Go focusing on low latency is the way to go. I'm not sold that they can make it so low cost that building, for example, a high performance database in Go would have comparable latency and throughput to one built in a non-GC'd language/runtime. All the "little/insignificant" inefficiencies you brush away as being virtually zero/undetectable/etc add up. Heck, we've talked about the costs of context (or mode) switches on this thread before, and how they can be dominated by the caches needing to be warmed up (vs register/context being saved/restored). But you deem that type of cost negligible. I'll agree that it's negligible compared to 100s of millis (or worse - multiple secs or tens of secs) GC pauses. It's definitely not negligible for the lower level systems being built.
>
>
> I'm curious about this bit:Yes, the C4 LVB barrier in one such design example. It amounts to adding a test+jmp sequence in reference load path [no memory access, virtually-always-predictable, devolves to a single u-op on x86]. While there is some "cost" to this extra u-op, it's throughput impact is insignificant when compared to other things (e.g. a typical 3-4x reduction in newgen frequency). It is already significantly cheaper than the extra write barrier work done in G1 for remembered set tracking for STW compaction, for example.
> So there's a test+jmp inserted before every load, which is pretty standard and common fare in Java code. But how is it "no memory access"? IIRC, when we last discussed the intricacies, a LVB would check whether a phase change occurred, which I think is a memory read, isn't it? Or by "no memory access" you mean it should still be valid in the cache? Also, I'm assuming once inlining stops at some point, the LVB/phase change information needs to be regathered by callees that were not inlined? So in other words, this will devolve into another prologue (beyond call frame setup)? I can't see how this won't impact i-cache or pollute the BTB history.

The test+jmp happens after a reference load, not before a load (or a store) that uses a reference [in this sense it is very different and less frequent than "use barrier" variants of read barriers, such at brooks style read barriers].

The test is between the just-loaded reference and a very stable (changes only when phase changes occur) value. That value can be read from memory (from a thread local location for sample), but can (and is) cached in a register. And that caching can be even done across frames, such that only leaving and entering Java (e.g. Making JNI calls) needs to re-establish it. The choice of keeping the value in a register or "spilling/filling" it from memory can be left to the register allocator (its just like anything else held in registers, except that "spilling" can be made free).

There is certainly an I-cache footprint here, and also an increased density of branches (when compared to not having the test+jmp at all). It's just that those effects are tiny in the greater scheme of things. For some perspective: the I-cache and branch density impacts of lvbs in current implementations are much smaller than the extra I-cache and branch density impacts of the current G1 write barrier implementations (compared to the cheaper non-G1 write barriers).

> To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

>
>
> For more options, visit https://groups.google.com/d/optout.
>
>
>
> --
>
> Sent from my phone
>
>
>
>
>
>
>
> --
>
> You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
>

> To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

>
> For more options, visit https://groups.google.com/d/optout.
>
>
>
>
>
>
>
>
>
>
> --
>
> You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
>

> To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

>
> For more options, visit https://groups.google.com/d/optout.
>
>
>
>
>
>
>
>
>
>
> --
>
> You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
>

> To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.

Reply all

Reply to author

Forward

0 new messages