[sync] errgroup: rethink concurrency patterns

Bryan C. Mills (Gerrit)

unread,

Sep 10, 2018, 4:01:48 PM9/10/18

to Ian Lance Taylor, goph...@pubsubhelper.golang.org, Bryan C. Mills, golang-co...@googlegroups.com

Bryan C. Mills has uploaded this change for review.

errgroup: rethink concurrency patterns

A few events in the last couple of weeks have prompted me to revise
errgroup a bit.

1. https://golang.org/cl/131815 pointed out that errgroup can cause
tests to deadlock if t.Fatal or t.Skip is called in one of the
goroutines in the group. While we do not recomment t.Fatal or t.Skip
outside the test's main goroutine, it's difficult to debug such a
deadlock and arguably more useful if we build in some sort of support.

2. In my GopherCon talk, “Rethinking Classical Concurrency Patterns”,
I recommended that API authors “[m]ake concurrency an internal
detail.” In multiple discussions after the talk, folks asked me how to
handle panics in goroutines, and I realized that making concurrency an
internal detail requires that we propagate panics (and runtime.Goexit
calls) back to the caller's goroutine.

3. Kevin Burke asked whether I would be interested in a proposal to
move errgroup to the standard library. I replied that there is at
least one open problem with the API: namely, that unlike
sync.WaitGroup, an errgroup.Group cannot be reused after Wait is
called. My reasoning was that the associated context.CancelFunc must
be called in order to avoid leaking resources, and I did not want to
expand the required API surface with a boilerplate cancel call.

I looking at the examples again with that reasosing in mind, and
realized that it is almost always more robust to defer two operations:
“cancel the goroutines” and “wait for the group to exit”. A single
call suffices for both of those operations, and we can make even that
optional by attaching a finalizer to the Group: the Context associated
with a Group is already closely tied to the lifetimes of its
goroutines, so we can reasonably cancel the Context as soon as the
Group becomes unreachable.

Updates golang/go#15758.
Updates golang/go#25448.

Change-Id: Ica4ce9e4569867d1485f19365af76ca010d7b6aa
---
M errgroup/errgroup.go
M errgroup/errgroup_example_md5all_test.go
M errgroup/errgroup_test.go
3 files changed, 281 insertions(+), 54 deletions(-)

diff --git a/errgroup/errgroup.go b/errgroup/errgroup.go
index 533438d..5d00513 100644
--- a/errgroup/errgroup.go
+++ b/errgroup/errgroup.go
@@ -7,6 +7,9 @@
 package errgroup
 
 import (
+	"bytes"
+	"fmt"
+	"runtime"
 	"sync"
 
 	"golang.org/x/net/context"
@@ -17,51 +20,186 @@
 //
 // A zero Group is valid and does not cancel on error.
 type Group struct {
-	cancel func()
-
 	wg sync.WaitGroup
 
+	cancelOnce sync.Once
+	cancelCtx  func()
+
 	errOnce sync.Once
 	err     error
+
+	terminateOnce sync.Once
+	panic         interface{}
+	goexit        bool
 }
 
 // WithContext returns a new Group and an associated Context derived from ctx.
 //
-// The derived Context is canceled the first time a function passed to Go
-// returns a non-nil error or the first time Wait returns, whichever occurs
-// first.
+// The derived Context is canceled if any goroutine in the group returns a
+// non-nil error, panics, or invokes runtime.Goexit, if any goroutine calls Stop
+// on the returned Group, or if the Group becomes unreachable.
 func WithContext(ctx context.Context) (*Group, context.Context) {
 	ctx, cancel := context.WithCancel(ctx)
-	return &Group{cancel: cancel}, ctx
+	g := new(Group)
+	g.cancelCtx = cancel
+
+	// Per https://golang.org/pkg/context/, “Failing to call the CancelFunc leaks
+	// the child and its children until the parent is canceled or the timer
+	// fires.” However, here we can do a bit better: the only goroutines using the
+	// Context attached to a Group should be the goroutines associated with that
+	// group, so we can cancel the Context (and reclaim any associated resources)
+	// as soon as the group becomes unreachable.
+	runtime.SetFinalizer(g, (*Group).cancel)
+
+	return g, ctx
 }
 
-// Wait blocks until all function calls from the Go method have returned, then
-// returns the first non-nil error (if any) from them.
+// Wait blocks until all goroutines in the group have exited.
+//
+// If any goroutine panicked or invoked runtime.Goexit, Wait panics with a
+// corresponding value or invokes runtime.Goexit.
+//
+// Otherwise, Wait returns the first non-nil error (if any) returned by any of
+// the functions passed to Go.
 func (g *Group) Wait() error {
 	g.wg.Wait()
-	if g.cancel != nil {
-		g.cancel()
+	if g.panic != nil {
+		panic(g.panic)
+	}
+	if g.goexit {
+		runtime.Goexit()
 	}
 	return g.err
 }
 
-// Go calls the given function in a new goroutine.
+// Stop cancels the Context associated with g, if any, then waits for all
+// goroutines started by the Go method to exit.
+func (g *Group) Stop() {
+	g.cancel()
+	g.wg.Wait()
+}
+
+func (g *Group) cancel() {
+	g.cancelOnce.Do(func() {
+		if g.cancelCtx != nil {
+			g.cancelCtx()
+			runtime.SetFinalizer(g, nil)
+		}
+	})
+}
+
+// Go calls the given function in a new goroutine,
+// adding that goroutine to the group.
 //
-// The first call to return a non-nil error cancels the group; its error will be
-// returned by Wait.
+// The first goroutine in the group that returns a non-nil error, panics, or
+// invokes runtime.Goexit will cancel the group.
 func (g *Group) Go(f func() error) {
 	g.wg.Add(1)
 
 	go func() {
-		defer g.wg.Done()
+		var goexiting bool
 
-		if err := f(); err != nil {
+		// ⚠ Even though we recover (and save) any panic from f, runtime.Goexit
+		// cannot be recovered: we must defer at least enough of the result
+		// processing to handle the goexit path.
+		defer func() {
+			if goexiting {
+				g.terminateOnce.Do(func() {
+					g.goexit = true
+					g.cancel()
+				})
+			}
+			g.wg.Done()
+		}()
+
+		panicValue, err := doubleDeferSandwich(&goexiting, f)
+		if panicValue != nil {
+			g.terminateOnce.Do(func() {
+				g.panic = panicValue
+				g.cancel()
+			})
+		} else if err != nil {
 			g.errOnce.Do(func() {
 				g.err = err
-				if g.cancel != nil {
-					g.cancel()
-				}
+				g.cancel()
 			})
 		}
 	}()
 }
+
+// doubleDeferSandwich uses two 'defer' statements to determine whether
+// f panicked, invoked runtime.Goexit, or returned normally.
+//
+// (See https://golang.org/issue/25448#issuecomment-420006237.)
+func doubleDeferSandwich(goexiting *bool, f func() error) (panicValue interface{}, err error) {
+	normalReturn := false
+	recovered := false
+
+	defer func() {
+		if !normalReturn && !recovered {
+			*goexiting = true
+		}
+	}()
+
+	func() {
+		defer func() {
+			if !normalReturn {
+				// Ideally, we would wait to take a stack trace until we've determined
+				// whether this is a panic or a runtime.Goexit.
+				//
+				// Unfortunately, the only way we can distinguish the two is to see
+				// whether the recover stopped the goroutine from terminating, and by
+				// the time we know that, the part of the stack trace relevant to the
+				// panic has been discarded.
+				panicValue = errorOrStack(recover())
+			}
+		}()
+
+		err = f()
+		normalReturn = true
+	}()
+
+	if !normalReturn {
+		recovered = true
+	}
+	return panicValue, err
+}
+
+// A panicStack is an arbitrary value recovered from a panic
+// augmented with the stack trace at which the panic occurred.
+type panicStack struct {
+	value interface{}
+	stack []byte
+}
+
+func (p panicStack) String() string {
+	return fmt.Sprintf("%v\n\nvia errgroup.Go:\n%s", p.value, p.stack)
+}
+
+// errorOrStack returns v if it implements error, or a panicStack
+// that wraps v otherwise.
+func errorOrStack(v interface{}) interface{} {
+	// If the panic value is an error (such as http.ErrAbortHandler or
+	// bytes.ErrTooLarge), preserve its structure.
+	if _, ok := v.(error); ok {
+		return v
+	}
+
+	// Otherwise, capture a stack trace to aid in debugging.
+	stack := make([]byte, 2<<10)
+	n := runtime.Stack(stack, false)
+	for n == len(stack) {
+		stack = make([]byte, len(stack)*2)
+		n = runtime.Stack(stack, false)
+	}
+	stack = stack[:n]
+
+	// The first line of the stack trace is of the form "goroutine N [status]:"
+	// but by the time the panic reaches Wait the goroutine will no longer exist
+	// and its status will have changed. Trim out the misleading line.
+	if line := bytes.IndexByte(stack[:], '\n'); line >= 0 {
+		stack = stack[line+1:]
+	}
+
+	return panicStack{value: v, stack: stack}
+}
diff --git a/errgroup/errgroup_example_md5all_test.go b/errgroup/errgroup_example_md5all_test.go
index 714b5ae..4e068bd 100644
--- a/errgroup/errgroup_example_md5all_test.go
+++ b/errgroup/errgroup_example_md5all_test.go
@@ -35,16 +35,20 @@
 	sum  [md5.Size]byte
 }
 
+type token struct{}
+
 // MD5All reads all the files in the file tree rooted at root and returns a map
 // from file path to the MD5 sum of the file's contents. If the directory walk
 // fails or any read operation fails, MD5All returns an error.
 func MD5All(ctx context.Context, root string) (map[string][md5.Size]byte, error) {
-	// ctx is canceled when g.Wait() returns. When this version of MD5All returns
-	// - even in case of error! - we know that all of the goroutines have finished
-	// and the memory they were using can be garbage-collected.
+	// ctx is canceled when g.Stop is called or any goroutine returns an error.
+	// When this version of MD5All returns — even in case of error! — we know that
+	// all of the goroutines have finished and the memory they were using can be
+	// garbage-collected.
 	g, ctx := errgroup.WithContext(ctx)
-	paths := make(chan string)
+	defer g.Stop()
 
+	paths := make(chan string)
 	g.Go(func() error {
 		defer close(paths)
 		return filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
@@ -63,39 +67,46 @@
 		})
 	})
 
-	// Start a fixed number of goroutines to read and digest files.
-	c := make(chan result)
-	const numDigesters = 20
-	for i := 0; i < numDigesters; i++ {
+	// Read and digest files concurrently,
+	// storing the results so far in a channel buffer.
+	results := make(chan map[string][md5.Size]byte, 1)
+	results <- make(map[string][md5.Size]byte)
+
+	// Use a semaphore to bound the number of files in flight.
+	const maxInFlight = 20
+	semaphore := make(chan token, maxInFlight)
+	for path := range paths {
+		select {
+		case <-ctx.Done():
+			// We can return immediately without leaving any goroutines behind:
+			// the 'defer g.Stop()' above will finish cleaning up.
+			return nil, ctx.Err()
+
+		case semaphore <- token{}:
+		}
+
+		path := path
 		g.Go(func() error {
-			for path := range paths {
-				data, err := ioutil.ReadFile(path)
-				if err != nil {
-					return err
-				}
-				select {
-				case c <- result{path, md5.Sum(data)}:
-				case <-ctx.Done():
-					return ctx.Err()
-				}
+			defer func() { <-semaphore }()
+
+			data, err := ioutil.ReadFile(path)
+			if err != nil {
+				return err
 			}
+			sum := md5.Sum(data)
+
+			m := <-results
+			m[path] = sum
+			results <- m
 			return nil
 		})
 	}
-	go func() {
-		g.Wait()
-		close(c)
-	}()
 
-	m := make(map[string][md5.Size]byte)
-	for r := range c {
-		m[r.path] = r.sum
-	}
 	// Check whether any of the goroutines failed. Since g is accumulating the
 	// errors, we don't need to send them (or check for them) in the individual
 	// results sent on the channel.
 	if err := g.Wait(); err != nil {
 		return nil, err
 	}
-	return m, nil
+	return <-results, nil
 }
diff --git a/errgroup/errgroup_test.go b/errgroup/errgroup_test.go
index 6a9696e..41545dd 100644
--- a/errgroup/errgroup_test.go
+++ b/errgroup/errgroup_test.go
@@ -9,6 +9,7 @@
 	"fmt"
 	"net/http"
 	"os"
+	"strings"
 	"testing"
 
 	"golang.org/x/net/context"
@@ -35,6 +36,8 @@
 // the sync.WaitGroup example at https://golang.org/pkg/sync/#example_WaitGroup.
 func ExampleGroup_justErrors() {
 	var g errgroup.Group
+	defer g.Stop()
+
 	var urls = []string{
 		"http://www.golang.org/",
 		"http://www.google.com/",
@@ -65,6 +68,7 @@
 func ExampleGroup_parallel() {
 	Google := func(ctx context.Context, query string) ([]Result, error) {
 		g, ctx := errgroup.WithContext(ctx)
+		defer g.Stop()
 
 		searches := []Search{Web, Image, Video}
 		results := make([]Result, len(searches))
@@ -149,6 +153,7 @@
 
 	for _, tc := range cases {
 		g, ctx := errgroup.WithContext(context.Background())
+		defer g.Stop()
 
 		for _, err := range tc.errs {
 			err := err
@@ -161,16 +166,89 @@
 				g, tc.errs, err, tc.want)
 		}
 
-		canceled := false
-		select {
-		case <-ctx.Done():
-			canceled = true
-		default:
-		}
-		if !canceled {
+		if (ctx.Err() != nil) != (tc.want != nil) {
 			t.Errorf("after %T.Go(func() error { return err }) for err in %v\n"+
-				"ctx.Done() was not closed",
-				g, tc.errs)
+				"ctx.Err() = %v",
+				g, tc.errs, ctx.Err())
 		}
 	}
 }
+
+func terminateInGroup(t *testing.T, terminate func() error) (panicValue interface{}) {
+	t.Helper()
+
+	defer func() {
+		panicValue = recover()
+	}()
+
+	g, ctx := errgroup.WithContext(context.Background())
+	defer g.Stop()
+
+	var waited = false
+	g.Go(func() error {
+		<-ctx.Done()
+		waited = true
+		return ctx.Err()
+	})
+	defer func() {
+		if !waited {
+			t.Errorf("did not wait for other goroutines to exit")
+		}
+	}()
+
+	g.Go(terminate)
+
+	err := g.Wait()
+	t.Fatalf("g.Wait() unexpectedly returned (with error %v)", err)
+	return panicValue
+}
+
+func TestPanic(t *testing.T) {
+	t.Run("<nil>", func(t *testing.T) {
+		got := terminateInGroup(t, func() error {
+			panic(nil)
+		})
+		if !strings.HasPrefix(fmt.Sprint(got), "<nil>") {
+			t.Errorf("panicked with %v; want .String() beginning with <nil>", got)
+		}
+	})
+
+	t.Run("non-error", func(t *testing.T) {
+		const s = "some string"
+		got := terminateInGroup(t, func() error {
+			panic(s)
+		})
+		if !strings.HasPrefix(fmt.Sprint(got), s) {
+			t.Errorf("panicked with %v; want .String() beginning with %#v", got, s)
+		}
+	})
+
+	t.Run("error", func(t *testing.T) {
+		var errPanic = errors.New("errPanic")
+		got := terminateInGroup(t, func() error {
+			panic(errPanic)
+		})
+		if got != errPanic {
+			t.Errorf("panicked with %v; want %v", got, errPanic)
+		}
+	})
+}
+
+func TestGoexit(t *testing.T) {
+	// We have to test runtime.Goexit from a separate goroutine: testing.T itself
+	// uses runtime.Goexit for SkipNow and FailNow, so it gets cranky if a test
+	// goroutine calls runtime.Goexit otherwise.
+	c := make(chan interface{}, 1)
+	t.Run("goexit via Skip", func(inner *testing.T) {
+		defer close(c)
+		c <- terminateInGroup(t, func() error {
+			inner.Skip("goexit!")
+			return nil
+		})
+	})
+	got := <-c
+
+	if got != nil {
+		t.Errorf("panicked with %v; want runtime.Goexit()", got)
+	}
+}

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Bryan C. Mills (Gerrit)

unread,

Sep 10, 2018, 4:07:42 PM9/10/18

to Bryan C. Mills, Brad Fitzpatrick, goph...@pubsubhelper.golang.org, Sameer Ajmani, Damien Neil, Kevin Burke, golang-co...@googlegroups.com

Bryan C. Mills uploaded patch set #2 to this change.

View Change

errgroup: rethink concurrency patterns

A few events in the last couple of weeks have prompted me to revise
errgroup a bit.

1. https://golang.org/cl/131815 pointed out that errgroup can cause
tests to deadlock if t.Fatal or t.Skip is called in one of the

goroutines in the group. While we do not recommend t.Fatal or t.Skip


outside the test's main goroutine, it's difficult to debug such a
deadlock and arguably more useful if we build in some sort of support.

2. In my GopherCon talk, “Rethinking Classical Concurrency Patterns”,
I recommended that API authors “[m]ake concurrency an internal
detail.” In multiple discussions after the talk, folks asked me how to
handle panics in goroutines, and I realized that making concurrency an
internal detail requires that we propagate panics (and runtime.Goexit
calls) back to the caller's goroutine.

3. Kevin Burke asked whether I would be interested in a proposal to
move errgroup to the standard library. I replied that there is at
least one open problem with the API: namely, that unlike
sync.WaitGroup, an errgroup.Group cannot be reused after Wait is
called. My reasoning was that the associated context.CancelFunc must
be called in order to avoid leaking resources, and I did not want to
expand the required API surface with a boilerplate cancel call.

I looked at the examples again with that reasoning in mind, and


realized that it is almost always more robust to defer two operations:
“cancel the goroutines” and “wait for the group to exit”. A single

call suffices for both, and we can make even that optional by


attaching a finalizer to the Group: the Context associated with a
Group is already closely tied to the lifetimes of its goroutines, so
we can reasonably cancel the Context as soon as the Group becomes
unreachable.

Updates golang/go#15758.
Updates golang/go#25448.

Change-Id: Ica4ce9e4569867d1485f19365af76ca010d7b6aa
---
M errgroup/errgroup.go
M errgroup/errgroup_example_md5all_test.go
M errgroup/errgroup_test.go
3 files changed, 281 insertions(+), 54 deletions(-)

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Bryan C. Mills (Gerrit)

unread,

Sep 10, 2018, 4:07:43 PM9/10/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, Damien Neil, golang-co...@googlegroups.com

Uploaded patch set 2: Commit message was updated.

View Change

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Kevin Burke (Gerrit)

unread,

Sep 10, 2018, 4:10:02 PM9/10/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Sameer Ajmani, Damien Neil, golang-co...@googlegroups.com

Looking forward to reviewing this! May take me a little time though :(

View Change

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Damien Neil (Gerrit)

unread,

Sep 10, 2018, 5:41:36 PM9/10/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, golang-co...@googlegroups.com

I replied that there is at least one open problem with the API: namely, that unlike sync.WaitGroup, an errgroup.Group cannot be reused after Wait is called.

This is still the case after these changes, no?

View Change

2 comments:

File errgroup/errgroup.go:
- Patch Set #2, Line 52: runtime.SetFinalizer(g, (*Group).cancel)
  I wonder if this is too clever.
- Patch Set #2, Line 133: // (See https://golang.org/issue/25448#issuecomment-420006237.)
  This seems like a huge amount of effort and complexity to distinguish panic(nil) from runtime.Goexit(). ~Nobody does panic(nil) intentionally and the only common use of runtime.Goexit() is in the testing package--and if you're handling a Goexit here you're already in violation of the testing package's usage guidelines.
  It seems like it would be a lot simpler to just convert Goexit into a panic.

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Bryan C. Mills (Gerrit)

unread,

Sep 11, 2018, 9:43:53 AM9/11/18

to Bryan C. Mills, Brad Fitzpatrick, goph...@pubsubhelper.golang.org, Sameer Ajmani, Damien Neil, Kevin Burke, golang-co...@googlegroups.com

Bryan C. Mills uploaded patch set #3 to this change.

View Change

errgroup: rethink concurrency patterns

A few events in the last couple of weeks have prompted me to revise
errgroup a bit.

1. https://golang.org/cl/131815 pointed out that errgroup can cause
tests to deadlock if t.Fatal or t.Skip is called in one of the

goroutines in the group. While we do not recommend t.Fatal or t.Skip


outside the test's main goroutine, it's difficult to debug such a
deadlock and arguably more useful if we build in some sort of support.

2. In my GopherCon talk, “Rethinking Classical Concurrency Patterns”,
I recommended that API authors “[m]ake concurrency an internal
detail.” In multiple discussions after the talk, folks asked me how to
handle panics in goroutines, and I realized that making concurrency an
internal detail requires that we propagate panics (and runtime.Goexit
calls) back to the caller's goroutine.

3. Kevin Burke asked whether I would be interested in a proposal to
move errgroup to the standard library. I replied that there is at
least one open problem with the API: namely, that unlike
sync.WaitGroup, an errgroup.Group cannot be reused after Wait is
called. My reasoning was that the associated context.CancelFunc must
be called in order to avoid leaking resources, and I did not want to
expand the required API surface with a boilerplate cancel call.

I looked at the examples again with that reasoning in mind, and


realized that it is almost always more robust to defer two operations:
“cancel the goroutines” and “wait for the group to exit”. A single

call suffices for both, and simplifies many functions that return
early on error.



Updates golang/go#15758.
Updates golang/go#25448.

Change-Id: Ica4ce9e4569867d1485f19365af76ca010d7b6aa
---
M errgroup/errgroup.go
M errgroup/errgroup_example_md5all_test.go
M errgroup/errgroup_test.go

3 files changed, 322 insertions(+), 46 deletions(-)

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Bryan C. Mills (Gerrit)

unread,

Sep 11, 2018, 9:43:54 AM9/11/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, Damien Neil, golang-co...@googlegroups.com

Uploaded patch set 3.

(2 comments)

View Change

2 comments:

File errgroup/errgroup.go:
- Patch Set #2, Line 52: // The derived Context is canceled if any goroutine in the group returns a

- I wonder if this is too clever.

- Hmm, I think you're right.
  It could lead to spurious cancellation in call sites that look like:
```
	g, ctx := errgroup.WithContext(ctx)
	[…]
```

- ```
	if err := g.Wait(); err != nil {
```

- ```
		return […]
	}
	keepUsing(ctx)
```
  Ideally, I think we should redefine context.WithCancel itself to avoid the leak (e.g., using the same finalizer approach as here), but in the meantime perhaps it's better to require the user to defer a call to Stop.
- Patch Set #2, Line 133: })
  ~Nobody does panic(nil) intentionally
  That's true, but an *unintentional* panic(nil) would be much more difficult to diagnose if we turn it into a Goexit: the former halts the program with an error while the latter silently terminates the goroutine.

- > and the only common use of runtime.Goexit() is in the testing package

- I checked the call graph in Google's codebase before I wrote the CL, and (disappointingly) found at least seven distinct callers outside the testing package.
  So you're right that it's uncommon, but I don't think it's uncommon enough to ignore entirely.
  (If you think runtime.Goexit adds too much complexity in general, I would tend to agree: but I would argue that the proper solution to that would be to propose that we remove runtime.Goexit for Go 2.)

- > and if you're handling a Goexit here you're already in violation of the testing package's usage guidelines.

- That's true. But absent a proper diagnostic for that (https://golang.org/issue/15758), propagating the goexit arguably provides a much better user experience than hanging the test or converting the failure to `panic(nil)`.
  Today it would take quite a bit of thought on the user's part to get from the `panic(nil)` error message back to the root cause (calling `t.FailNow` or `t.SkipNow` or one of their wrappers in the wrong goroutine), and even more thought to figure out how to restructure the test to avoid it, when they could just write the code they intend and rely on errgroup to fix up the goroutines in the “obvious” way.
  (See also Ch. 10 of the Ousterhout book: I think he takes it too far in general, but his argument does seem valid here.)

- > It seems like it would be a lot simpler to just convert Goexit into a panic.

- It turns out not to be all that much simpler.
  We would still need an out-parameter even if we failed to distinguish the two cases, since we can't recover a Goexit in order to return a value from the function, and we would still need the `normalReturn` variable in order to detect `panic(nil)` or `runtime.Goexit` even if we treat them identically.
  What *is* a lot simpler is to ignore the possibility of `runtime.Goexit` entirely, but that results in hard-to-debug deadlocks.

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Bryan C. Mills (Gerrit)

unread,

Sep 11, 2018, 9:47:41 AM9/11/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, Damien Neil, golang-co...@googlegroups.com

Patch Set 2:
(2 comments)

I replied that there is at least one open problem with the API: namely, that unlike sync.WaitGroup, an errgroup.Group cannot be reused after Wait is called.

This is still the case after these changes, no?

No: it is now possible to reuse an errgroup.Group arbitrarily many times provided that no error occurs.

Most call sites that would benefit from reuse return after the first error anyway, so that limitation is fairly insignificant in practice — and, I suspect, is a much better match for users' intuitions about the behavior of the package.

View Change

1 comment:

File errgroup/errgroup.go:
- Patch Set #2, Line 52: // The derived Context is canceled if any goroutine in the group returns a
  Hmm, I think you're right. […]
  Removed the finalizer (and added a new `New` function to preserve compatibility for existing callers of `WithContext`).

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Damien Neil (Gerrit)

unread,

Sep 11, 2018, 7:32:19 PM9/11/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, golang-co...@googlegroups.com

View Change

2 comments:

File errgroup/errgroup.go:
- Patch Set #2, Line 133: })

- That's true, but an *unintentional* panic(nil) would be much more difficult to diagnose if we turn it into a Goexit: the former halts the program with an error while the latter silently terminates the goroutine.

- What if you turn both panic(nil) and Goexit into a panic with an informational message that says it may have been triggered by either?
File errgroup/errgroup.go:
- Patch Set #3, Line 44: // Deprecated: use New instead, and defer a call to Stop to clean up.
  I don't understand why this needs to be deprecated.
  If I'm following correctly, the only difference between WithContext and New is that a group created with New won't cancel the group context after Wait determines all goroutines in the group have exited. Is this change really important enough to introduce a new API?
  I also don't understand why it is desirable to be able to reuse a Group, but only if it hasn't encountered any errors. That's very subtle. What's wrong with saying you can't add new goroutines to a Group after calling Wait?

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Bryan C. Mills (Gerrit)

unread,

Sep 12, 2018, 4:48:17 PM9/12/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, Damien Neil, golang-co...@googlegroups.com

View Change

2 comments:

File errgroup/errgroup.go:

- Patch Set #2, Line 133: })
  That's true, but an *unintentional* panic(nil) would be much more difficult to diagnose if we turn it into a Goexit: the former halts the program with an error while the latter silently terminates the goroutine.
  What if you turn both panic(nil) and Goexit into a panic with an informational message that says it may have been triggered by either?

- That's still more complex for the user than turning panic(nil) into a panic and Goexit into a Goexit. Goexit more-or-less unambiguously means “shut down this call tree without reporting an error”, and in the current draft that's exactly what we do.
  (Again, I tend to agree that Goexit is a misfeature, but given that it exists and is not deprecated, I think it's better to support it than to subvert its intent.)
File errgroup/errgroup.go:

- Patch Set #3, Line 44: // Deprecated: use New instead, and defer a call to Stop to clean up.
  I don't understand why this needs to be deprecated.

- The awkward interaction between WithContext, Wait, and cancellation is literally the #1 reason I told Kevin I wouldn't be comfortable with a proposal to move errgroup into the standard library.
  Cancel-on-wait introduces subtle bugs under obscure conditions, and it seems possible to make it less bug-prone.

- > If I'm following correctly, the only difference between WithContext and New is that a group created with New won't cancel the group context after Wait determines all goroutines in the group have exited. Is this change really important enough to introduce a new API?

Yes.

Consider a function that looks like this:

	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() (err error) {
		x, err = […]
		return err
	})
	g.Go(func() (err error) {
		y, err = […]
		return err
	})

- ```
	if err := g.Wait(); err != nil {
```

- ```
		return err
	}
```
```
	return use(ctx, x, y)
```
  That looks like perfectly reasonable code,¹ but it has a bug: at the call to `use`, `ctx` is always cancelled. That bug will hopefully be caught during testing, but it may be difficult to track down,² and will likely frustrate the user greatly in the meantime.
  In contrast, without implicit cancellation it will do exactly what it appears to do: first do some processing in the group, then (absent an error) do some additional processing.
  ¹ Redeclarations are tricky. (See also https://golang.org/issue/377.)
  ² https://golang.org/issue/26356

- > I also don't understand why it is desirable to be able to reuse a Group, but only if it hasn't encountered any errors. That's very subtle. What's wrong with saying you can't add new goroutines to a Group after calling Wait?

- See the above example. Reusing a group looks like a reasonable thing to do given the `if err := g.Wait(); err != nil { return err }` idiom (and will look even more reasonable if something like `check` is added in Go 2!). It works fine with the `sync.WaitGroup`, and the technical reason why it shouldn't work is (at best) obscure.
  Declaring that you can't add new goroutines optimizes for implementation simplicity (here and in the `context` package), not for API simplicity, and the latter is more important.

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Damien Neil (Gerrit)

unread,

Sep 12, 2018, 5:01:15 PM9/12/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, golang-co...@googlegroups.com

View Change

1 comment:

File errgroup/errgroup.go:

- Patch Set #3, Line 44: // Deprecated: use New instead, and defer a call to Stop to clean up.

- > I don't understand why this needs to be deprecated. […]
  But this change doesn't let you reuse a group; it lets you reuse a group under some circumstances (no errors, no panic) but not others. That seems strictly more complicated than never being able to reuse one.
  Or as a variation on the example above:
```
defer someOperation(ctx)
```

- ```
if err := g.Wait(); err != nil {
```

- ```
  return err
}
```
  Now someOperation gets a context that may or may not be canceled depending on whether an error occurred in the group. That may well be what the user wants, but it's not at all obvious that this is what's happening.
  Implementation complexity frequently hints at API complexity; if it's difficult to implement, it's probably difficult to describe what it's doing. I think that's the case here.

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Bryan C. Mills (Gerrit)

unread,

Sep 12, 2018, 5:54:32 PM9/12/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, Damien Neil, golang-co...@googlegroups.com

View Change

1 comment:

File errgroup/errgroup.go:

- Patch Set #3, Line 44: // Deprecated: use New instead, and defer a call to Stop to clean up.

- But this change doesn't let you reuse a group; it lets you reuse a group under some circumstances (no errors, no panic) but not others. That seems strictly more complicated than never being able to reuse one.
  This change makes it so that the derived context is canceled on any “abnormal” exit.
  That seems strictly simpler that “canceled on any abnormal exit, or if you've already verified that nothing abnormal happened so far”.

- > Or as a variation on the example above:
  >
  > defer someOperation(ctx)

- > if err := g.Wait(); err != nil {

- > return err
  > }
  >
  > Now someOperation gets a context that may or may not be canceled depending on whether an error occurred in the group. That may well be what the user wants, but it's not at all obvious that this is what's happening.

- I think that to a large extent people already know to be wary of `defer` operations using `ctx` variables. (For example, `someOperation` will also fail to execute if the function is returning due to a propagated timeout error.)
  And arguably a `defer` with a `Context` parameter is much more obscure than “doing more work after checking for an error” in general.

- > Implementation complexity frequently hints at API complexity; if it's difficult to implement, it's probably difficult to describe what it's doing. I think that's the case here.

- The implementation complexity here is almost entirely due to language complexity.
  `errgroup` combines error-handling with goroutine management. It is difficult to implement because the language has two different error-propagation regimes (`error` and `panic`) and three different ways to terminate a goroutine (`Context`, `panic` and `runtime.Goexit`) and they're not orthogonal.
  I can't make the language any less complex than it already is. However, if I can make the package API more closely match the language, then users of the package might only have to deal with one source of complexity instead of two.

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Bryan C. Mills (Gerrit)

unread,

Oct 31, 2018, 10:02:37 AM10/31/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, Damien Neil, golang-co...@googlegroups.com

Damien and Brad: can we make any further progress on this review? Please let me know if I should find a different reviewer.

View Change

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Damien Neil (Gerrit)

unread,

Oct 31, 2018, 12:38:00 PM10/31/18

to Bryan C. Mills, goph...@pubsubhelper.golang.org, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, golang-co...@googlegroups.com

I still think this is taking on more complexity than is necessary.

Doing a lot of complex, hard work to hide underlying attributes of the system (e.g., interactions between Goexit and the testing package) makes the errgroup package more difficult to reason about, not less. It invites adopting an ever more complex implementation to paper over increasingly obscure corner cases, as well as encouraging users to maintain a flawed understanding of the language.

However, those are philosophical points. The implementation looks correct.

+1, but I'll leave a +2 to a second opinion.

Patch set 3:Code-Review +1

View Change

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

roger peppe (Gerrit)

unread,

Jan 24, 2020, 11:01:09 AM1/24/20

to Bryan C. Mills, goph...@pubsubhelper.golang.org, roger peppe, Tobias Schottdorf, Damien Neil, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, golang-co...@googlegroups.com

I'm not convinced this is a good idea...

Patch set 3:Code-Review -1

View Change

1 comment:

File errgroup/errgroup.go:
- Patch Set #3, Line 142: func doubleDeferSandwich(goexiting *bool, f func() error) (panicValue interface{}, err error) {
  I'm sorry, this is clever but hideous.
  It's feeling like this non-test-oriented code is being polluted by test-oriented concerns. I'm not sure that this `runtime.Goexit` propagation-to-parent is really something worth preserving in normal production code, and it has runtime performance implications too. I don't mind the propagation of panic, but that on its own would be easy to do.
  The changes to this package seem to me like a good argument for adding a `T.Go` method to the testing package, tbh, where these changes would feel appropriate, rather than making what was nice and simple code into something quite arcane.

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Bryan C. Mills (Gerrit)

unread,

Jan 24, 2020, 11:20:03 AM1/24/20

to Bryan C. Mills, goph...@pubsubhelper.golang.org, roger peppe, Tobias Schottdorf, Damien Neil, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, golang-co...@googlegroups.com

View Change

1 comment:

File errgroup/errgroup.go:

- Patch Set #3, Line 142: func doubleDeferSandwich(goexiting *bool, f func() error) (panicValue interface{}, err error) {

- I'm sorry, this is clever but hideous. […]
  The concerns here are not strictly test-oriented — they apply to any code that may panic (i.e. any code), and any code that may invoke runtime.Goexit (which is admittedly usually just the `testing` package).
  The general concept is to make the concurrency properties of a function implemented using `errgroup` an implementation detail: if such a function invokes a user-provided callback or method, and that method panics or invokes runtime.Goexit, that panic or `runtime.Goexit` should behave the same as if the function did not use internal concurrency. (This is closely related to the point around slide 33 of my 2018 GopherCon talk.¹)
  
  Note that the runtime performance implications should be much less of a concern in Go 1.14, since the compiler now open-codes unconditional or bounded `defer` statements: the performance of this function in the general case should now be comparable to a straight-line call to `f()`.
  
  ¹https://drive.google.com/file/d/1nPdvhB0PutEJzdCq5ms6UI58dp50fcAN/view

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Bryan Mills (Gerrit)

unread,

Jul 8, 2022, 2:25:50 PM7/8/22

to Bryan Mills, goph...@pubsubhelper.golang.org, roger peppe, Tobias Schottdorf, Damien Neil, Brad Fitzpatrick, Kevin Burke, Sameer Ajmani, golang-co...@googlegroups.com

Bryan Mills abandoned this change.

View Change

Abandoned

Carried forward in part in https://go.dev/issue/53757.

To view, visit change 134395. To unsubscribe, or for help writing mail filters, visit settings.

Reply all

Reply to author

Forward