Transaction Oriented Collector (TOC) - Go's next GC

r...@golang.org

Jun 24, 2016, 8:16:03 AM
to golang-dev
Over the course of the next several months we hope to implement, test, optimize, and if the numbers warrant, deploy a brand new garbage collection (GC) algorithm for Go. The Transaction Oriented Collector, or TOC, algorithm will be integrated with and augment Go's current low-latency collector. golang.org/s/gctoc contains an overview, some informal proofs showing correctness, and our implementation approach. Comments and suggestions are always welcome.
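
To make the idea concrete, here is a toy sketch (all names invented for illustration; this is not the runtime implementation) of the per-goroutine sweep-pointer mechanics the document describes: a goroutine bump-allocates past a recorded initial sweep pointer, and if it exits without publishing anything, the span's cursor simply rewinds.

  package main

  import "fmt"

  type span struct {
      mem     []byte
      current int // current sweep pointer (bump-allocation cursor)
  }

  type goroutineAlloc struct {
      s         *span
      initial   int  // initial sweep pointer, recorded when allocation starts
      published bool // would be set by a write barrier on publication
  }

  func (g *goroutineAlloc) alloc(n int) []byte {
      p := g.s.mem[g.s.current : g.s.current+n]
      g.s.current += n
      return p
  }

  // exit models goroutine termination: if nothing was published, all of
  // the goroutine's transaction-local garbage is reclaimed in one step.
  func (g *goroutineAlloc) exit() {
      if !g.published {
          g.s.current = g.initial
      }
  }

  func main() {
      s := &span{mem: make([]byte, 1024)}
      g := &goroutineAlloc{s: s, initial: s.current}
      g.alloc(64)
      g.alloc(128)
      g.exit()
      fmt.Println(s.current) // 0: both allocations reclaimed at once
  }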

tri...@gmail.com

Jun 24, 2016, 9:52:12 AM
to golang-dev
We expect to hear great news, cheers.

no...@google.com

Jun 24, 2016, 1:33:32 PM
to golang-dev, tri...@gmail.com
Do I understand correctly that "(Before|After): 0 <text>" actually means "(Before|After) 0 indicates <text>"? I spent some time scratching my head trying to understand why there was "0 allocated since the last GC".

Nodir Turakulov

Jun 24, 2016, 2:25:20 PM
to golang-dev, tri...@gmail.com
I am a bit confused. Assuming heap spans are shared by different goroutines, does the "Current Sweep Pointer" belong to a span or to a goroutine?

If a span owns the current sweep pointer and an allocation in any goroutine can move it, the sentence "Objects between the initial sweep pointer and the current sweep pointer were either marked reachable or have been allocated by this goroutine." does not seem correct, because an object may belong to another goroutine if that goroutine moved the current sweep pointer.
It is also unclear to me how the algorithm can "reset the Current Sweep Pointer to the goroutine Initial Sweep Pointer" in Figure 7; wouldn't that break other goroutines?

If a goroutine owns the current sweep pointer, then the sentence "Each span maintains a current sweep pointer" is confusing, because it seems to imply the opposite.
I also assume the algorithm will make sure that the segments [initial sweep pointer, current sweep pointer) of different goroutines never intersect; otherwise an unmarked bit past the current sweep pointer does not necessarily mean that the mutator can use the associated memory for a new object, because it may be used by an unpublished object of another goroutine. If segments of different goroutines never intersect, does that mean there may be only one segment per span per goroutine, or will a goroutine manage multiple segments per span?

thank you

Caleb Spare

Jun 24, 2016, 2:37:46 PM
to Rick Hudson, golang-dev
This is an interesting document. I have a few questions for you:

1. How does the GC understand that a goroutine is "dormant" between
requests? (As opposed to, say, waiting on i/o to finish processing a
request.)
1a. Do you think this algorithm will implicitly encourage people
towards one-goroutine-per-request architectures?

2. How do you think TOC will degenerate on workloads that do not match
the transactional hypothesis?
2a. What transactional and non-transactional workloads will you use to
evaluate TOC?

Thanks!
Caleb

Matthew Dempsky

Jun 24, 2016, 6:09:35 PM
to r...@golang.org, golang-dev
Currently, spans allocate objects of a single allocation size class, and the per-P mcache structure caches a span per size class.  From the doc's description, it's not obvious to me if these will still be true under TOC.
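
For reference, a rough sketch of that status quo (field and type names are illustrative, not the runtime's): each P's cache holds one span per size class, and a span only serves objects of its own class.

  const numSizeClasses = 67 // the runtime's _NumSizeClasses at the time of writing

  type mspan struct {
      sizeClass uint8 // every object in this span has this size class
      freeIndex int   // next free object slot in the span
      nelems    int   // number of object slots in the span
  }

  type mcache struct {
      alloc [numSizeClasses]*mspan // per-P: one cached span per size class
  }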

E.g., does "running goroutine" refer to a G or P when talking about maintaining an initial sweep pointer?  If G, does that mean we'll need more spans so that we can cache them per G?  Or is it generally already the case that we can cache spans per-G?

Also, is it over-reading to understand that "an" initial sweep pointer means a single span is used for all recently allocated objects, thereby suggesting size classes are going away?  Or would it actually be an initial sweep pointer per size class?

On Fri, Jun 24, 2016 at 5:16 AM, <r...@golang.org> wrote:
> Over the course of the next several months we hope to implement, test, optimize, and if the numbers warrant, deploy a brand new garbage collection (GC) algorithm for Go. [...]

Yura Sokolov

Jun 25, 2016, 8:32:03 PM
to golang-dev
Some (possibly useless) thoughts:

I think it could be good to maintain "generations of goroutines per thread" (a toy sketch follows at the end of this message):
- a thread (P) maintains a list of young generations of goroutines
- a newly born goroutine falls into the youngest generation
- if the size of the youngest generation overflows some limit (for example, 10), a new generation is added as the youngest
- all goroutines from the same "young generation" (of the same P) use the same TOC: the same heap span and the same Initial Sweep and Current Sweep Pointers
- if all goroutines from the same "generation" die before a GC, the "Current Sweep Pointer" for this TOC is reset and the TOC is reused
- when a GC comes, all surviving young generations (i.e., those with at least one live goroutine) become tenured generations
- tenured generations behave the same as young generations until the next GC
- when that GC comes, all surviving tenured goroutines become ordinary goroutines and no longer use a TOC

Pitfalls of this approach:
- a young or tenured goroutine cannot be individually stolen by another thread (P),
  only a whole "young/tenured" generation at once
- long-lived goroutines use a different allocation mechanism than young/tenured ones (not TOC)

Probably there could be a way to explicitly mark some "long-lived" goroutines as using an individual TOC:
- for example, I mark some long-living loop as "long lived with a 10MB TOC"
- the goroutine allocates using the TOC until it has allocated 10MB
- then the goroutine performs an "individual mark phase", marking all of its reachable and not-published objects
- after that it may safely reset the "current sweep pointer" to the "initial sweep pointer"
- perhaps a copying collection could be used here instead:
  since reachable and not-published objects are invisible to other goroutines,
  it is safe to move them around
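
A toy shape for what I mean (all names hypothetical):

  type tocSpan struct{ initial, current int } // sweep pointers shared by a generation

  type generation struct {
      span    *tocSpan
      live    int  // goroutines still running in this generation
      tenured bool // survived one GC
  }

  type pState struct {
      young []*generation // the youngest generation is the last element
  }

  // spawn assigns a newly born goroutine to the youngest generation,
  // opening a new one when the youngest exceeds the size limit.
  func (p *pState) spawn(limit int) *generation {
      if n := len(p.young); n == 0 || p.young[n-1].live >= limit {
          p.young = append(p.young, &generation{span: &tocSpan{}})
      }
      g := p.young[len(p.young)-1]
      g.live++
      return g
  }

  // exit records a goroutine's death; if its whole generation is now
  // dead before a GC, the shared TOC is reset and can be reused.
  func (g *generation) exit() {
      g.live--
      if g.live == 0 {
          g.span.current = g.span.initial
      }
  }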

With regards,
Sokolov Yura

On Friday, June 24, 2016 at 15:16:03 UTC+3, r...@golang.org wrote:

Dmitry Vyukov

Jun 26, 2016, 12:49:54 PM
to golang-dev
On Friday, June 24, 2016 at 2:16:03 PM UTC+2, r...@golang.org wrote:
> Over the course of the next several months we hope to implement, test, optimize, and if the numbers warrant, deploy a brand new garbage collection (GC) algorithm for Go. [...]

What percentage of objects that can be collected by TOC could also potentially be stack allocated?
TOC and stack allocation solve the same problem, but stack allocation is cheaper and provides better locality and shorter recycling time.

Keith Randall

Jun 26, 2016, 1:22:17 PM
to Dmitry Vyukov, golang-dev
TOC can handle all kinds of non-compile-time-constant-size and loop-allocated objects, which stack allocation cannot handle.
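
For example, something like this (a sketch) defeats stack allocation but fits TOC:

  // The first allocation has a size the compiler cannot bound; the loop
  // produces a fresh object per iteration, all simultaneously live, so a
  // single stack slot cannot hold them.
  func handle(n int) [][]byte {
      out := [][]byte{make([]byte, n)} // non-compile-time-constant size
      for i := 0; i < 4; i++ {
          out = append(out, make([]byte, 8)) // loop-allocated
      }
      return out
  }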


Dmitry Vyukov

Jun 27, 2016, 5:30:08 AM
to Keith Randall, golang-dev
I mean that aggressive stack allocation is an essential part of the
performance story long term, regardless of what we do in malloc/GC. It
avoids malloc overhead, cuts GC overhead, and provides good locality and
quick reuse. TOC, by contrast, incurs malloc overhead and the cost of
barriers enabled all the time. If we ignore opportunities for more
aggressive stack allocation while evaluating TOC, then we may falsely
conclude that it is profitable, while if we allocate more objects on the
stack it may turn out that TOC's savings do not outweigh the cost of
write barriers and increased request latency.
That is why I am asking about the ratios. When I looked at HTTP
allocations, some of them were clear candidates for stack allocation,
and some were leaking to other goroutines. The question is: what is the
size of the middle ground that does not leak but is inherently
unallocatable on the stack?

Dmitry Vyukov

Jun 27, 2016, 5:55:17 AM
to Keith Randall, golang-dev
On Mon, Jun 27, 2016 at 11:29 AM, Dmitry Vyukov <dvy...@google.com> wrote:
> [...]
>
> On Sun, Jun 26, 2016 at 7:22 PM, Keith Randall <k...@google.com> wrote:
>> TOC can handle all kinds of non-compile-time-constant-size and
>> loop-allocated objects, which stack allocation cannot handle.


Besides the more obvious stack allocation opportunities that we currently
miss, dynamically-sized and loop-allocated objects can be handled by
allocating them on a separate shadow stack with bump-the-pointer
allocation and a rewind of the shadow stack pointer on return. That
should handle the "request deserialization" case.
On top of that, since the shadow stack is not copied on normal stack
growth, it can also handle some cases of shared objects where we can
statically prove strict scoping. For example, if a function spawns
a set of subgoroutines and joins them before exit, or if a function
sends a request to another goroutine over a channel and waits for a
response, the shared objects can be allocated on the shadow stack.
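
A minimal sketch of the discipline I have in mind (hypothetical API, not a proposal for concrete runtime interfaces):

  // Toy shadow stack: bump-the-pointer allocation, with a rewind on
  // return that bulk-frees everything the call tree allocated from it.
  type shadowStack struct {
      mem []byte
      top int
  }

  func (s *shadowStack) alloc(n int) []byte {
      p := s.mem[s.top : s.top+n]
      s.top += n
      return p
  }

  func (s *shadowStack) mark() int       { return s.top }
  func (s *shadowStack) rewind(mark int) { s.top = mark }

  // deserialize sketches the "request deserialization" case: dynamically
  // sized scratch space, reclaimed in bulk when the function returns.
  func deserialize(s *shadowStack, req []byte) {
      m := s.mark()
      defer s.rewind(m)
      tmp := s.alloc(len(req))
      copy(tmp, req)
      // ... parse req using tmp ...
  }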

Rick Hudson

Jun 27, 2016, 8:18:21 AM
to Yura Sokolov, golang-dev
At a higher, non-implementation level, what is being discussed here is how one creates or recognizes a federation of goroutines, where local means local to the federation, so that when all the goroutines in a federation exit, TOC can free all objects not published outside the federation.

Using birthdays or ages and a universal clock to identify such federations seems misplaced in Go. Each goroutine can be thought of as having its own clock, and unlike languages where only a handful of threads (and clocks) are concurrently active, in Go there may be tens or hundreds of thousands of goroutines active.


r...@google.com

Jun 27, 2016, 10:06:18 AM
to golang-dev, tri...@gmail.com, no...@google.com
That is correct: "Before 0" _indicates_ <text>.

Randall Farmer

Jun 27, 2016, 10:16:34 AM
to golang-dev
I'm enthused. Clever but understandable, and it seems like a great fit for Go and a lot of the stuff folks build in it.

Am I guessing right that the single contiguous span is a simplification to describe the idea, and there are still spans per size class, etc.? If so, does that mean that a goroutine's minimum footprint while running might include a span for each size class it uses, where it doesn't now? That seems fine (we're still talking KBs and you get a lot in return), but it could rate a sentence in any eventual release notes.


Something fun re: data being published in possibly surprising ways. The net/http server uses a sync.Pool to get a pointer to a bufio.Writer likely allocated by some other goroutine, and then sets the bufio.Writer's target io.Writer to a chunkWriter for writing to the HTTP response. The chunkWriter holds a pointer to the response, so it can reach request and response headers, etc. So by using a *bufio.Writer that isn't in your local spans, you prevent the TOC from collecting the whole request/response objects. Not sure how big a deal that is in terms of bytes, but it's surprising behavior! (Also, to be clear, I'm talking about the HTTP/1.1 code here. I know the h2 stuff is intentionally communicating a lot.)

Here I see two workarounds. If you copy a bufio.Writer by value instead of by pointer, it really is in local memory, and setting its Writer doesn't publish anything. (If you're doing that, you'd also need to reset the target io.Writer to nil before returning it to the pool.) I imagine something similar may come up in other situations: you wish some struct were local but it's not, so you copy it and your wish comes true, heh. Or, a much simpler fix, though more specific to this situation, is to drop the sync.Pools if TOC makes it cheap to just locally allocate a bufio.Writer per response.
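
Concretely, the copy-by-value idea might look like this (a sketch; the real net/http plumbing is more involved):

  import (
      "bufio"
      "io"
      "sync"
  )

  var writerPool = sync.Pool{
      New: func() interface{} { return bufio.NewWriterSize(nil, 4096) },
  }

  // The local copy of the Writer struct lives in this goroutine's spans,
  // so pointing it at conn publishes nothing beyond the buffer bytes
  // (which hold no pointers).
  func serve(conn io.Writer) {
      pw := writerPool.Get().(*bufio.Writer) // pointer, likely allocated elsewhere
      w := *pw                               // copy by value into local memory
      w.Reset(conn)
      w.WriteString("HTTP/1.1 200 OK\r\n\r\n")
      w.Flush()
      w.Reset(nil) // drop the target io.Writer before handing it back
      *pw = w
      writerPool.Put(pw)
  }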


Assuming this goes in, I'm sort of curious what your position winds up being re: launching goroutines you otherwise wouldn't as a hint to the collector, and curious whether a runtime.LocalGC() will be exposed. I'm not sure I have much to add on either.

r...@golang.org

Jun 27, 2016, 10:34:50 AM
to golang-dev, r...@golang.org
Good questions, but I will try to avoid speculation. Ask us again in several months when we have better answers.

On Friday, June 24, 2016 at 2:37:46 PM UTC-4, Caleb Spare wrote:
> This is an interesting document. I have a few questions for you:
>
> 1. How does the GC understand that a goroutine is "dormant" between
> requests? (As opposed to, say, waiting on i/o to finish processing a
> request.)
Blocked with a very small stack seems to be a reasonable heuristic and is where we intend to start. Build and measure.

> 1a. Do you think this algorithm will implicitly encourage people
> towards one-goroutine-per-request architectures?
Complicating a clean architecture to facilitate memory management is not one of our goals. The best way forward is to create clean, maintainable architectures and trust us to make them fast.

> 2. How do you think TOC will degenerate on workloads that do not match
> the transactional hypothesis?
It will degenerate to what we have now plus the cost of the write barrier. Keeping the cost of the write barrier low is the challenge.

> 2a. What transactional and non-transactional workloads will you use to
> evaluate TOC?
Our evaluations will be based on real-world production systems as well as the current set of benchmarks and tests. Who we are able to recruit to help will play a large part in the workloads we use to evaluate TOC.

r...@golang.org

Jun 27, 2016, 10:44:16 AM
to golang-dev
Inline.


On Monday, June 27, 2016 at 10:16:34 AM UTC-4, Randall Farmer wrote:
> [...]
>
> Am I guessing right that the single contiguous span is a simplification to describe the idea, and there are still spans per size class, etc.?
That is correct. We tried to avoid implementation detail in this document.

> Something fun re: data being published in possibly surprising ways. [...]
>
> Here I see two workarounds. [...]
>
> Assuming this goes in, I'm sort of curious what your position winds up being re: launching goroutines you otherwise wouldn't as a hint to the collector, and curious whether a runtime.LocalGC() will be exposed. I'm not sure I have much to add on either.
runtime.LocalGC will not be exposed. My position is to launch goroutines when it makes the application cleaner. Do not warp application logic or complicate APIs to fit some version of the GC.

hay

Aug 6, 2016, 11:10:56 AM
to golang-dev, funny....@gmail.com
Hi Sir,

How about adding a keyword to launch a goroutine in gctoc, like: tgo TestGoroutine()? And would it be possible to set an option for a gctoc goroutine so the GC doesn't come in until the goroutine is finished? It would help with building soft real-time systems.

Kind regards,

Ian Lance Taylor

Aug 6, 2016, 12:03:13 PM
to hay, golang-dev, Юрий Соколов
On Sat, Aug 6, 2016 at 7:20 AM, hay <habiba...@gmail.com> wrote:
>
> How about adding a keyword to launch a goroutine in gctoc, like: tgo
> TestGoroutine()? And would it be possible to set an option for a gctoc
> goroutine so the GC doesn't come in until the goroutine is finished?
> It would help with building soft real-time systems.

I think the current approach that the runtime is taking is already
making Go suitable for soft real-time systems. The GC is largely
concurrent, and stop-the-world times are steadily shrinking.

In the current runtime it does not make sense to speak of stopping the
GC. The GC is always running. It does make sense to speak of
blocking a stop-the-world step. But it would not be a good idea for a
Go program to be able to block a stop-the-world step for an arbitrary
length of time. That makes it very easy for a program to break
itself.

I do not think that is a direction we should move in. And if we did,
I do not think it should be done by adding a new language keyword.
The same effect could be had by calling a runtime function
"runtime.EnableGC(enable bool)". But I don't think that runtime
function would be a good idea either.
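
(For reference, the closest supported knob today is runtime/debug.SetGCPercent; setting it to -1 already produces that effect, along with exactly the break-yourself hazard described above:)

  import "runtime/debug"

  func critical() {
      old := debug.SetGCPercent(-1) // effectively stops GC from being triggered
      defer debug.SetGCPercent(old) // restore on the way out
      // ... soft real-time section; the heap grows unchecked meanwhile ...
  }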

Ian

aguila...@gmail.com

May 26, 2017, 7:47:58 PM
to golang-dev, r...@golang.org
On Monday, June 27, 2016 at 10:34:50 AM UTC-4, r...@golang.org wrote:
> Good questions, but I will try to avoid speculation. Ask us again in several months when we have better answers.

I was reminded of this thread today. Do you have any updates to share on the status of this work?

r...@golang.org

May 28, 2017, 11:14:53 AM
to golang-dev, r...@golang.org
ROC* is running on a separate branch and there are CLs out for review. The algorithm is being studied to see how it runs on a cross-section of Go programs, and we are using the results to understand and improve performance. We have yet to perform a rigorous, publishable, and most importantly reproducible evaluation that compares ROC to the current approach. Stay tuned.

*ROC (Request Oriented Collector) is the new name for TOC that avoids the acidic confusion of the word transaction.

Henrik Johansson

Oct 10, 2017, 8:30:24 AM
to r...@golang.org, golang-dev
Is there any news on this?
I saw this talk from SRECon17 https://www.youtube.com/watch?v=q1h2g84EX1M and he mentions ROC at some point close to the end.
It is from June, I guess, so I may very well be a bit late to the GC party.

Rick Hudson

Oct 10, 2017, 8:44:37 AM
to Henrik Johansson, golang-dev
We are working on a write-up, collecting numbers, and getting on paper what we learned. It is a bit overdue, but new ideas keep popping up.


Henrik Johansson

Oct 10, 2017, 8:48:07 AM
to Rick Hudson, golang-dev
Great, thanks!


Dmitry Vyukov

Jul 18, 2018, 7:28:11 AM
to golang-dev
On Monday, June 27, 2016 at 11:55:17 AM UTC+2, Dmitry Vyukov wrote:
Thinking about this more: if we have not just one secondary shadow stack, but a secondary shadow stack per _frame_, then we could even express things like: if this conditional, dynamically-sized allocation site allocates, the allocation needs to go to the parent-of-parent's secondary stack, because the compiler proved that it is scoped to the parent-of-parent frame.
Such a scheme is potentially capable of capturing most of what TOC would capture, and maybe even more (e.g., allocations that are temporarily accessible to another goroutine but still scoped to the host goroutine), while still providing cheap allocation, perfect locality, and prompt, bulk deallocation with no barriers involved.

Philip Hofer

Jul 27, 2018, 1:41:40 PM
to golang-dev

> Thinking about this more: if we have not just one secondary shadow stack, but a secondary shadow stack per _frame_, then we could even express things like: if this conditional, dynamically-sized allocation site allocates, the allocation needs to go to the parent-of-parent's secondary stack, because the compiler proved that it is scoped to the parent-of-parent frame.
> Such a scheme is potentially capable of capturing most of what TOC would capture, and maybe even more (e.g., allocations that are temporarily accessible to another goroutine but still scoped to the host goroutine), while still providing cheap allocation, perfect locality, and prompt, bulk deallocation with no barriers involved.

It would be interesting to see what cases the compiler could detect with moderate effort.

One problem I foresee is that many functions that could have this optimization applied at some call sites would not be able to use it at others. So, would you monomorphize call sites based on escape analysis?

More concretely, let's say I have the following:

  func f() []byte {
    o := make([]byte, 1)
    o[0] = 0x3f
    return o
  }

  func g() bool {
    return f()[0] == 0x3f
  }

  func h() {
    r := f()
    // ... do something with 'r' like assign it to a global or send it on a channel
  }

(Let's ignore the effects of inlining.) The call site of f() in g() can use g's shadow stack. The call site of f() in h() must use the heap. How do you compile only one copy of f? Or do you compile two?

Austin Clements

Jul 27, 2018, 1:53:18 PM
to Philip Hofer, golang-dev
Specializing f is definitely a possibility, though it has the usual downsides: how far do you have to specialize up the call stack? Do you specialize on allocation of things other than the return value and, if so, how many copies of f might you need?

Another possibility is to pass this information around dynamically in a hidden argument. We would only have to add this in cases where it's useful (e.g., when f's allocation could go on the shadow stack) and it would be easy to map different bits to different allocations. With this sort of approach, there's even a fighting chance of making this work through interface and closure calls. This is a little tricky with the calling convention, though. Maybe we steal the bottom three bits from the context register. :)
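
A hand-written sketch of that hidden-argument lowering (hypothetical; reusing the toy shadowStack type sketched earlier in the thread):

  // The compiler passes f a small bitmask saying which of its allocation
  // sites may target the caller's shadow stack instead of the heap.
  type allocCtx uint8

  const fRetOnShadow allocCtx = 1 << 0 // bit for f's return-value allocation

  func f(ctx allocCtx, ss *shadowStack) []byte {
      var o []byte
      if ctx&fRetOnShadow != 0 {
          o = ss.alloc(1) // caller proved the result is scoped to its frame
      } else {
          o = make([]byte, 1) // escapes: heap, as today
      }
      o[0] = 0x3f
      return o
  }

  func g(ss *shadowStack) bool {
      m := ss.mark()
      defer ss.rewind(m)
      return f(fRetOnShadow, ss)[0] == 0x3f // one compiled copy of f serves both
  }

  func h() {
      r := f(0, nil) // result outlives h, so it must come from the heap
      _ = r          // ... e.g., assign to a global or send on a channel ...
  }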

Philip Hofer

Jul 27, 2018, 2:04:01 PM
to Austin Clements, golang-dev
> Another possibility is to pass this information around dynamically in a
> hidden argument. We would only have to add this in cases where it's useful
> (e.g., when f's allocation could go on the shadow stack) and it would be
> easy to map different bits to different allocations. With this sort of
> approach, there's even a fighting chance of making this work through
> interface and closure calls. This is a little tricky with the calling
> convention, though. Maybe we steal the bottom three bits from the context
> register. :)

Yes, that seems like the way to go. My intuition is that you'd end up
with exponentially growing specialization opportunities. (... unless
you can determine that the function is only called once, but in that
case it should just be inlined.)

With closure calls, you're already passing around "hidden" values, so
it would be easy enough to add the information to the closure context.

For regular function calls, you could re-use the closure context register.