I'm under the impression that the current implementation of condition variables in Go is incorrect and can lead to deadlocks. As far as I can tell, cond.Wait() does the following:
1. [Atomically] increments the waiter count
2. Releases the mutex
3. Waits on the semaphore
4. Re-acquires the mutex
It looks like it's possible for one thread to "steal" a signal/broadcast intended for another one if it loses the CPU between steps 2 and 3 above. For example, the following sequence seems problematic:
a. Thread A: calls cond.Wait() and loses the CPU between steps 2 and 3 above (preempted by the kernel scheduler).
b. Thread B: calls cond.Signal() or cond.Broadcast(); semaphore gets incremented to 1.
c. Thread C: calls cond.Wait(); its wait on the semaphore is immediately satisfied by the count left over from the previous step.
d. Thread A: regains CPU, sleeps on the semaphore. It never gets the notification sent by thread B.
A more concrete example might be a [simplified] consumer/producer implementation (I'm only keeping track of a count):
func produce() {
	m.Lock()
	for count == max {
		cond.Wait()
	}
	count++
	if count == 1 {
		cond.Broadcast()
	}
	m.Unlock()
}

func consume() {
	m.Lock()
	for count == 0 {
		cond.Wait()
	}
	count--
	if count == max-1 {
		cond.Broadcast()
	}
	m.Unlock()
}
For the code above, suppose count == 0. The following sequence is possible:
a. goroutine A calls consume(), which will call cond.Wait(); suppose it loses the CPU between steps 2 and 3.
b. goroutine B calls produce(), which will call cond.Broadcast(), which will cause the semaphore to go to 1.
c. goroutine B calls produce() repeatedly, until count reaches the max.
d. goroutine B calls produce() once more; count == max, so it calls cond.Wait(), which is satisfied immediately by the semaphore count left over from step b.
e. goroutine B re-checks the condition, sees count still at max, calls cond.Wait() again, and this time sleeps.
f. goroutine A regains the CPU, and sleeps on the semaphore as well.
The goroutines are now deadlocked: A is asleep on the semaphore with nobody left to post to it, and B is waiting for a broadcast that will never come. Granted, this race is hard to hit, but it still looks like a correctness issue.
Am I missing something in the implementation that prevents this or is it actually broken for this scenario?
Thanks,
-Wedson