Correctness issue in sync.Cond implementation?

398 views
Skip to first unread message

Wedson Almeida Filho

unread,
Jan 20, 2016, 11:24:12 PM1/20/16
to golang-nuts
I'm under the impression that the current implementation of condition variables in Go is incorrect and can lead to deadlocks.

1. [Atomically] increments the waiter count
2. Releases the mutex
3. Waits on the semaphore
4. Re-acquires mutex

It looks like it's possible for one thread to "steal" a signal/broadcast intended to another one if it loses the CPU between steps 2 & 3 above. For example, the following sequence seems problematic:

a. Thread A: calls cond.Wait(), loses CPU between step 2 & 3 above. (Preempted by the kernel scheduler.)
b. Thread B: calls cond.Signal() or cond.Broadcast(); semaphore gets incremented to 1.
c. Thread C: calls cond.Wait(), the wait on the semaphore is immediately satisfied by the count stored by the previous step.
d. Thread A: regains CPU, sleeps on the semaphore. It never gets the notification sent by thread B.

A more concrete example might be a [simplified] consumer/producer implementation (I'm only keeping track of a count):

func produce() {
  m.Lock()
  while (count == max) {
    cond.Wait()
  }

  count++
  if count == 1 {
    cond.Broadcast()
  }
  m.Unlock()
}

func consume() {
  m.Lock()
  while (count == 0) {
    cond.Wait()
  }

  count--
  if count == max-1 {
    cond.Broadcast()
  }
  m.Unlock()
}

For the code above, suppose count == 0. The following sequence is possible:
a. goroutine A calls consume(), which will call cond.Wait(); suppose it loses the CPU between steps 2 & 3.
b. goroutine B calls produce(), which will call cond.Broadcast(), which will cause the semaphore to go to 1.
c. goroutine B calls produce() repeatedly, until count reaches the max.
d. goroutine B will itself call cond.Wait(), which will be satisfied immediately.
e. goroutine B will call cond.Wait() again, and sleep.
f. goroutine A regains the CPU, and sleeps on the semaphore as well.

The threads are now deadlocked. Granted that it's a race condition that is hard to hit, it looks like a correctness issue.

Am I missing something in the implementation that prevents this or is it actually broken for this scenario?

Thanks,
-Wedson

Ian Lance Taylor

unread,
Jan 21, 2016, 7:10:42 PM1/21/16
to Wedson Almeida Filho, golang-nuts
On Wed, Jan 20, 2016 at 8:23 PM, 'Wedson Almeida Filho' via
golang-nuts <golan...@googlegroups.com> wrote:
> I'm under the impression that the current implementation of condition
> variables in Go is incorrect and can lead to deadlocks.

Thanks. I'm not quite convinced there is a problem with cond.Signal,
but I do see a potential problem with cond.Broadcast. Filed
https://golang.org/issue/14064 .

Ian

Wedson Almeida Filho

unread,
Jan 22, 2016, 2:11:08 AM1/22/16
to Ian Lance Taylor, golang-nuts
Ian,

Thanks for filing the bug.

Would you mind elaborating why you don't think the race applies to cond.Signal?

-Wedson

roger peppe

unread,
Jan 22, 2016, 4:30:23 AM1/22/16
to Wedson Almeida Filho, golang-nuts
I agree that this is probably a bug, but ISTM that you're only likely to
be bitten by it if you are trying to optimise on the number of times you
call Broadcast. In your example, if you call Broadcast any time produce
or consume are called, then I believe you can't see the problem.

I'm thinking that reducing calls to Broadcast is likely to be a premature
optimisation. I took your example and benchmarked it with and without
calling broadcast every time [1]. The time difference is 2-5%, almost lost
in the noise, and that's for a system that's doing nothing else. Far more
significant (even under go 1.6) is the difference between GOMAXPROCS=1
and GOMAXPROCS > 1.

Here are some results (all median of 5 runs):

GOMAXPROCS=4
broadcastAlways false 100000 11509 ns/op
broadcastAlways true 100000 11661 ns/op

GOMAXPROCS=1
broadcastAlways false 500000 3423 ns/op
broadcastAlways true 500000 3465 ns/op

I'm not suggesting that the bug shouldn't be fixed, but
in the meantime, can you think of a situation where "just
call Broadcast regardless" isn't likely to be a reasonable thing to do?

cheers,
rog.

[1] http://play.golang.org/p/Cbd8O5j5JD

On 21 January 2016 at 04:23, 'Wedson Almeida Filho' via golang-nuts
> --
> You received this message because you are subscribed to the Google Groups
> "golang-nuts" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to golang-nuts...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Ian Lance Taylor

unread,
Jan 22, 2016, 10:16:07 AM1/22/16
to Wedson Almeida Filho, golang-nuts
On Thu, Jan 21, 2016 at 11:10 PM, Wedson Almeida Filho
<weds...@google.com> wrote:
>
> Thanks for filing the bug.
>
> Would you mind elaborating why you don't think the race applies to
> cond.Signal?

Because it's an error to use cond.Signal rather than cond.Broadcast
unless you already clearly know which goroutine is going to receive
the signal. If your program goes right back to waiting on the same
condition variable after calling Signal, then you are doing it wrong.
That kind of race can happen with any condition variable
implementation.

The problem I think you've pointed out with cond.Broadcast is that it
can fail to wake up all waiting goroutines. With cond.Broadcast,
unlike cond.Signal, it is reasonable to go back to waiting on the
condition variable again, because at that point all the waiting
goroutines should have awoken.

At least, that is how I see it.

Ian

Wedson Almeida Filho

unread,
Jan 22, 2016, 10:57:32 AM1/22/16
to Ian Lance Taylor, golang-nuts
On Fri, Jan 22, 2016 at 3:15 PM, Ian Lance Taylor <ia...@golang.org> wrote:
On Thu, Jan 21, 2016 at 11:10 PM, Wedson Almeida Filho
<weds...@google.com> wrote:
>
> Thanks for filing the bug.
>
> Would you mind elaborating why you don't think the race applies to
> cond.Signal?

Because it's an error to use cond.Signal rather than cond.Broadcast
unless you already clearly know which goroutine is going to receive
the signal.-  If your program goes right back to waiting on the same

condition variable after calling Signal, then you are doing it wrong.
That kind of race can happen with any condition variable
implementation.

Ah, I see now your point.

However, in the example I posted, if we constrain the number of producers and consumers to 1 each, then I can use cond.Signal() instead of cond.Broadcast(), as I do clearly know which goroutine is going to receive the signal: there is only one other goroutine involved, I know my next Wait() call is not the intended receiver because it hasn't happened yet.


The problem I think you've pointed out with cond.Broadcast is that it
can fail to wake up all waiting goroutines.  With cond.Broadcast,
unlike cond.Signal, it is reasonable to go back to waiting on the
condition variable again, because at that point all the waiting
goroutines should have awoken.


I think a more accurate way to describe it is that cond.Signal (or cond.Broadcast) can fail to wake up a waiting goroutine, and instead wake up the next goroutine that tries to wait [which happens in the future].

The waking up of the next goroutine is not an issue, as spurious wakeups are somewhat expected, but failing to wake up the previously known-to-be-waiting goroutine seems to violate the condvar guarantees.

The fix is probably simple enough: enqueue an entry in a waiter list *before* releasing the mutex; then Signal/Broadcast should wake those waiters in the queue.

roger peppe

unread,
Jan 22, 2016, 1:03:02 PM1/22/16
to Wedson Almeida Filho, Ian Lance Taylor, golang-nuts
On 22 January 2016 at 15:57, 'Wedson Almeida Filho' via golang-nuts
<golan...@googlegroups.com> wrote:
> I think a more accurate way to describe it is that cond.Signal (or
> cond.Broadcast) can fail to wake up a waiting goroutine, and instead wake up
> the next goroutine that tries to wait [which happens in the future].

As Dmitry points out in the issue, this can't happen, because
runtime_Syncsemrelease will actually block until the
original thread gets around to calling runtime_Syncsemacquire.

Wedson Almeida Filho

unread,
Jan 22, 2016, 1:33:28 PM1/22/16
to roger peppe, Ian Lance Taylor, golang-nuts
On Fri, Jan 22, 2016 at 6:02 PM, roger peppe <rogp...@gmail.com> wrote:

As Dmitry points out in the issue, this can't happen, because
runtime_Syncsemrelease will actually block until the
original thread gets around to calling runtime_Syncsemacquire.

Ah, thanks! I hadn't noticed that it was using a "synchronous" semaphore; this definitely takes care of the Signal() case.

However, I believe the Broadcast case Ian described is still there. I'll add my thoughts to the bug.
Reply all
Reply to author
Forward
0 new messages