Question on 'src/internal/singleflight': The return result of the ForgetUnshared method seems to be inaccurate

236 views
Skip to first unread message

atomic

unread,
Sep 21, 2022, 2:45:24 AM9/21/22
to golang-nuts
hello

I find that the `src/internal/singleflight/singleflight.go ForgetUnshared()` method returns results that are not always expected

For this I wrote a test code, I copied the code in the src/internal/singleflight/singleflight.go file to the main package, and wrote a main function to test it, if ForgetUnshared() returns correctly, this code It should not panic, but the fact that it will panic every time it runs, is there something wrong with my understanding of ForgetUnshared()?

The test code cannot be run in goplay, so I posted a link: 

The go version I use for testing is 1.16, 1.19.1

result:
```
$ go run cmd/main.go
panic: callUUID=[9314284969] err=[context canceled] currentUUId=[6980556786]
```

Cuong Manh Le

unread,
Sep 21, 2022, 4:17:35 AM9/21/22
to golang-nuts
Hello,

You use time.Sleep in your program, so the behavior is not predictable. In fact, I get it success or panic randomly.

You can see https://go-review.googlesource.com/c/sync/+/424114 to see a predictable test of ForgetUnshared .

atomic

unread,
Sep 21, 2022, 5:01:22 AM9/21/22
to golang-nuts
Thanks for your reply, but I still don't understand why time.Sleep is causing my test program to panic.

In fact, this is a real online environment problem. My application uses http.Client.Do(), but it occasionally has errors: [lookup xxxxx on xxxxx: dial udp xxxxx: operation was canceled], after looking at the code, I found that it may be There is a problem with ForgetUnshared, lookupIPAddr uses ForgetUnshared: https://github.com/golang/go/blob/4a4127bccc826ebb6079af3252bc6bfeaec187c4/src/net/lookup.go#L336

Brian Candler

unread,
Sep 21, 2022, 8:18:30 AM9/21/22
to golang-nuts
Notice that DoChan starts a goroutine for the task...

        go g.doCall(c, key, fn)

... and then returns immediately.

Also notice that the random time you pick for cancelTime can be longer than the different random time you sleep inside the goroutine (i.e. the function which you pass to DoChan).  Hence the goroutine can return a result, before the cancelTime is reached.

Try this modification:

--- main.go.orig    2022-09-21 13:14:10.000000000 +0100
+++ main.go    2022-09-21 13:13:43.000000000 +0100
@@ -144,7 +144,7 @@
             defer wg.Done()

             ch, _ := g.DoChan(key, func() (interface{}, error) {
-                time.Sleep(randTimeout())
+                time.Sleep(5000 * time.Millisecond)
                 if ctx.Err() == context.Canceled {
                     return nil, fmt.Errorf("callUUID=[%d] err=[%s]", uuid, ctx.Err())
                 }
@@ -152,7 +152,7 @@
             })

             // randomly choose a timeout to cancel
-            cancelTime := time.After(randTimeout())
+            cancelTime := time.After(10 * time.Millisecond)
             select {
             case <-cancelTime:
                 // cancel only if no other goroutines share

atomic

unread,
Sep 21, 2022, 10:12:47 PM9/21/22
to golang-nuts
> Also notice that the random time you pick for cancelTime can be longer than the different random time you sleep inside the goroutine (i.e. the function which you pass to DoChan).  Hence the goroutine can return a result, before the cancelTime is reached.

Although the goroutine can return a result before cancelTime arrives, the returned result should not be err because I haven't had time to call cancel(). 

Brian Candler

unread,
Sep 22, 2022, 3:37:07 PM9/22/22
to golang-nuts
OK, I see where you're coming from - and I agree, this is a difficult one!

The point you were making is that

                                if g.ForgetUnshared(key) {
                                        cancel()
                                }

should only invoke cancel() if this result wasn't shared: i.e. there's only one receiver in the c.chans array, and c.dups == 0.  So where's the race, given that everything in g is done under a mutex?

What I have discovered so far is: when g.ForgetUnshared(key) returns true and the problem occurs, the key is not present in the map (as opposed to being present with c.dups == 0).  But I've not been able to work out why yet.

Incidentally, a minor style observation: you passed in ctx to your go func(...), but not cancel. As far as I can see, both ctx and cancel are local variables which drop immediately out of scope - there's no way they can be modified later outside of the goroutine.  So I believe you don't need to pass ctx at all: you can access it via the closure.  But if you do pass one "to be on the safe side", then I think the other should be passed as well - otherwise it's confusing why you passed in only one.

In fact, in this case, you could move the ctx/cancel creation inside the go func(...) anyway.  The only thing which needs to be outside is the wg.Add(1).

Brian Candler

unread,
Sep 22, 2022, 6:16:22 PM9/22/22
to golang-nuts
OK, I think I have it.  It's ugly.

Firstly, note that multiple instances of doCall can be running for the same key.  This happens when:

1. you invoke DoChan.  This inserts a 'c' (call struct) into the map and starts doCall in a goroutine.
2. at this point it's not shared: i.e. you don't call DoChan again with the same key (yet).
3. you invoke ForgetUnshared on this key. This "detaches" it, but doCall carries on running. It has its own local copy of 'c' so it knows where to send the result, even though the map is now empty.
4. you invoke DoChan again with the same key.  This inserts a new 'c' into the map and starts a new doCall goroutine.

At this point, you have two instances of doCall running, and the map is pointing at the second one.

This is where it gets ugly.

5. you invoke DoChan yet again with the same key. This turns it into a shared task, with c.dups > 0, len(c.chans) > 1.
6. the first instance of doCall terminates.  At this point it unconditionally removes the key from the map - even though it had previously been removed by ForgetUnshared!

func (g *Group) doCall(c *call, key string, fn func() (interface{}, error)) {
        c.val, c.err = fn()
        c.wg.Done()

        g.mu.Lock()
        delete(g.m, key)    // <<<< NOTE
        for _, ch := range c.chans {
                ch <- Result{c.val, c.err, c.dups > 0}
        }
        g.mu.Unlock()
}

So, even though it's the first instance of doCall which is terminating, it's removing the second instance of doCall from the map.  This is now also a detached task.

7. In one of the two goroutines, the timeout event occurs.  It calls ForgetUnshared, which happily returns true because the key does not exist in the map - and therefore you proceed to cancel the context.

But actually a task with this key *is* running; and furthermore, it is a shared task, with 2 channel receivers.

8. Once the sleep has completed in the task function, it notices that the context is cancelled and returns an error.

9. doCall sends the resulting error down multiple channels (those you started in steps 4 and 5 above)

10. The select { case res := <-ch } triggers in the *other* goroutine - the one which didn't have a timeout. Hence it receives the error, and that's where you panic().

Brian Candler

unread,
Sep 22, 2022, 6:23:13 PM9/22/22
to golang-nuts
And here's a proof-of-concept fix which seems to do the job:

--- main.go.orig    2022-09-21 13:14:10.000000000 +0100
+++ main.go    2022-09-22 23:19:54.000000000 +0100
@@ -27,6 +27,7 @@
     // not written after the WaitGroup is done.
     dups  int
     chans []chan<- Result
+    forgotten bool
 }

 // Group represents a class of work and forms a namespace in
@@ -101,7 +102,9 @@
     c.wg.Done()

     g.mu.Lock()
-    delete(g.m, key)
+    if !c.forgotten {
+        delete(g.m, key)
+    }

     for _, ch := range c.chans {
         ch <- Result{c.val, c.err, c.dups > 0}
     }
@@ -121,6 +124,7 @@
         return true
     }
     if c.dups == 0 {
+        c.forgotten = true
         delete(g.m, key)
         return true

atomic

unread,
Sep 22, 2022, 10:23:29 PM9/22/22
to golang-nuts
Thank you so much, so happy, you are amazing.
You answered a question that has been bothering me for days, I opened an issue on github, can you submit a pr to fix this?
https://github.com/golang/go/issues/55343

Cuong Manh Le

unread,
Sep 22, 2022, 11:05:51 PM9/22/22
to golang-nuts
Seems to me this commit is not port to the internal singleflight: https://github.com/golang/sync/commit/56d357773e8497dfd526f0727e187720d1093757

atomic

unread,
Sep 22, 2022, 11:59:03 PM9/22/22
to golang-nuts
No wonder I didn't find related questions, the method names in the two packages are different:
One is Forget and the other is ForgetUnshared

Cuong Manh Le

unread,
Sep 23, 2022, 12:09:57 AM9/23/22
to golang-nuts
I sent https://go-review.googlesource.com/c/go/+/433315 for fixing the issue.

Brian Candler

unread,
Sep 23, 2022, 3:11:45 AM9/23/22
to golang-nuts
On Friday, 23 September 2022 at 03:23:29 UTC+1 atomic wrote:
Thank you so much, so happy, you are amazing.
You answered a question that has been bothering me for days

Thank you for providing a carefully isolated and reproducible problem.  It was a fun challenge!

And thanks to Cuong for creating the full patch and a proper test case. 
Reply all
Reply to author
Forward
0 new messages