Data race reported in the race detection code within a waitgroup


Péter Szilágyi

Mar 25, 2014, 4:39:53 AM
to golang-nuts, Dmitry Vyukov
Hi,

  I am seeing an interesting race report in my code. Specifically, in a complex interaction with a WaitGroup, the race detector signals quite a lot of races for Add/Wait or Done/Wait.

  I haven't managed to reproduce this in a simple example, but my code on GitHub does indeed display it quite often, so I guess that will do.

  You can fetch the code via


  And inside the proto/pastry package run the TestMaintenance test with the following options. The race isn't reported in every run, but usually every second run does trigger it.

$ GOMAXPROCS=32 go test -cpu=32 -race -v -run=TestMaintenance

=== RUN TestMaintenance-32
==================
WARNING: DATA RACE
Read by goroutine 186:
  sync.raceRead()
      /opt/google/go/src/pkg/sync/race.go:37 +0x35
  sync.(*WaitGroup).Add()
      /opt/google/go/src/pkg/sync/waitgroup.go:60 +0xd0
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/maintenance.go:197 +0x50
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/routing.go:198 +0xe54
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/routing.go:96 +0xac
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/routing.go:58 +0x4e0
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/peer.go:158 +0x442

Previous write by goroutine 122:
  sync.raceWrite()
      /opt/google/go/src/pkg/sync/race.go:41 +0x35
  sync.(*WaitGroup).Wait()
      /opt/google/go/src/pkg/sync/waitgroup.go:120 +0x176
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/maintenance.go:84 +0x437

Goroutine 186 (running) created at:
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/peer.go:83 +0x3d
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/handshake.go:240 +0x8d
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/handshake.go:221 +0x7df
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/handshake.go:172 +0x3e2
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/maintenance.go:133 +0xa3
      /home/karalabe/work/iris/src/github.com/karalabe/iris/pool/thread.go:140 +0xa7

Goroutine 122 (running) created at:
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/overlay.go:148 +0x3d0
      /home/karalabe/work/iris/src/github.com/karalabe/iris/proto/pastry/maintenance_test.go:134 +0xa5a
  testing.tRunner()
      /opt/google/go/src/pkg/testing/testing.go:422 +0x130
==================
--- PASS: TestMaintenance-32 (4.76 seconds)
PASS
Found 1 data race(s)
exit status 66

  I'm guessing this is a race detector bug, but maybe someone more intimate with the detector could verify it :)

Cheers,
  Peter

Rob Pike

Mar 25, 2014, 5:43:26 AM
to Péter Szilágyi, golang-nuts, Dmitry Vyukov
From the stack traces alone, this looks like a bug that showed up
today in some Google code. The issue is that there is no
happens-before relation between WaitGroup.Add and WaitGroup.Wait, so
if your code does an Add after a Wait, the race detector will tell you.
You need to establish synchronization to guarantee that the Add does
not pass the Wait.
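
For illustration, a minimal sketch of the two orderings (made-up code,
not from the project above; racy, fixed and work are placeholder names
and the sync import is assumed):

// Racy: nothing orders the Add before the Wait, so Wait may observe a
// zero counter and return before the goroutine is accounted for, and
// the race detector can flag the Add/Wait pair.
func racy(work func()) {
  var wg sync.WaitGroup
  go func() {
    wg.Add(1)
    defer wg.Done()
    work()
  }()
  wg.Wait()
}

// Fixed: the Add happens before the goroutine is started and therefore
// before the Wait, so the Add can never pass the Wait.
func fixed(work func()) {
  var wg sync.WaitGroup
  wg.Add(1)
  go func() {
    defer wg.Done()
    work()
  }()
  wg.Wait()
}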

-rob

Dmitry Vyukov

Mar 25, 2014, 5:46:52 AM
to Rob Pike, Péter Szilágyi, golang-nuts
Yes, this is the situation these reads and writes in WaitGroup are
intended to detect.
Peter, please check that you call wg.Wait strictly after all "root" wg.Add's.

roger peppe

Mar 25, 2014, 6:06:22 AM
to Rob Pike, Péter Szilágyi, golang-nuts, Dmitry Vyukov
I hope that's not strictly true. This code does an Add after Wait
is *called* (but not before it returns) but I've always considered it to be ok.
It was a scenario I remember thinking about when WaitGroup was initially
designed.

http://play.golang.org/p/DN_uEiF-5G

Jan Mercl

Mar 25, 2014, 6:11:56 AM
to roger peppe, Rob Pike, Péter Szilágyi, golang-nuts, Dmitry Vyukov
On Tue, Mar 25, 2014 at 11:06 AM, roger peppe <rogp...@gmail.com> wrote:
> I hope that's not strictly true. This code does an Add after Wait
> is *called* (but not before it returns) but I've always considered it to be ok.
> It was a scenario I remember thinking about when WaitGroup was initially
> designed.
>
> http://play.golang.org/p/DN_uEiF-5G

There's no way to guarantee that concurrently calling wg.Add and
wg.Wait will work properly. Not in this implementation and not in any
other. The reason is that in such a scenario .Wait can legitimately
succeed before the .Add is performed. As Dmitry already stated, .Add
must happen-before (HB) .Wait, there's no other option.

-j

Péter Szilágyi

Mar 25, 2014, 6:26:10 AM
to golang-nuts
Sorry, forgot to reply to the mailing list too.

---------- Forwarded message ----------
From: Péter Szilágyi <pet...@gmail.com>
Date: Tue, Mar 25, 2014 at 12:25 PM
Subject: Re: [go-nuts] Data race reported in the race detection code within a waitgroup
To: Dmitry Vyukov <dvy...@google.com>


Hi all,

  I can understand the logic in detecting such cases, as usually that could signal a race, but my use case is actually functioning as intended and uses Add/Done/Wait in quite an interleaved way.

  Specifically, I'm maintaining a routing table which gets updated every time some remote connections or state updates arrive. Since my updates are quite expensive (they fire network state exchanges), I first collect all received updates into a temporary map and merge that into my routing table once no more are being received. Currently that is implemented by having a WaitGroup count the number of threads waiting to add their updates into the temp collector. When all of them finish adding, a manager process swaps out the temporary map and merges it into the routing table. The manager then loops, waiting for new updates. I guess the race detector is bothered by this reuse of a waitgroup. But imho, this shouldn't be a problem.

  Generally, I have a stream of events inbound from arbitrarily many goroutines, and I want to batch them together and process them only when no more are arriving (i.e. during a silent period). Maybe a WaitGroup isn't the best solution, but what would be more appropriate?
  • Channels aren't appropriate because they are bounded, and I want my event reporting routines to quickly continue (i.e. async) and not stall until the manager actually consumes them (buffered channels can fill up and then we're back to blocking).
  • A collector map and mutex combo can handle the batching, but then my manager doesn't have any mechanism to wait for all pending updates to reach the collector.
  The perfect solution for my use case would be an unbounded channel. But since I don't have such a thing, I'm open to other design suggestions. A WaitGroup solves the issue, but is my use case abusing it?
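
As an aside, an unbounded channel can be approximated with a small
relay goroutine that buffers events in a slice between an input and an
output channel. A rough sketch follows, with made-up names and Event
standing in for the real event type:

// unboundedEvents returns a send channel that the consumer can never
// block the producers on, and a receive channel that delivers the
// events in order. A relay goroutine buffers everything in between.
func unboundedEvents() (chan<- *Event, <-chan *Event) {
  in := make(chan *Event)
  out := make(chan *Event)
  go func() {
    var queue []*Event
    for {
      // Nothing buffered: we can only accept new events.
      if len(queue) == 0 {
        e, ok := <-in
        if !ok {
          close(out)
          return
        }
        queue = append(queue, e)
        continue
      }
      // Otherwise either accept a new event or hand one out.
      select {
      case e, ok := <-in:
        if !ok {
          for _, e := range queue { // input closed: drain, then quit
            out <- e
          }
          close(out)
          return
        }
        queue = append(queue, e)
      case out <- queue[0]:
        queue = queue[1:]
      }
    }
  }()
  return in, out
}

Producers just send on the first channel (they only block for the
handoff to the relay, never on the consumer), and the manager ranges
over the second.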

Cheers,
  Peter

Dmitry Vyukov

Mar 25, 2014, 6:54:45 AM
to Jan Mercl, roger peppe, Rob Pike, Péter Szilágyi, golang-nuts
The key word in my reply is *root* wg.Add's.
In the example, the first wg.Add is *root*: it ensures that the
counter does not drop to zero while we are doing other wg.Add's.
The second wg.Add is "child": we've already ensured the counter can't
drop to zero while we are doing this wg.Add, so it can proceed
concurrently with wg.Done.
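
A sketch of the shape being described (an illustrative guess at the
pattern, not the actual playground code; rootAndChild and work are
made-up names):

func rootAndChild(work func()) {
  var wg sync.WaitGroup

  wg.Add(1) // "root" Add: ordered before Wait, keeps the counter from hitting zero
  go func() {
    defer wg.Done()
    wg.Add(1) // "child" Add: safe, the root Add still holds the counter above zero
    go func() {
      defer wg.Done()
      work()
    }()
    work()
  }()

  wg.Wait() // returns only after both the outer and the inner work finish
}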

Péter Szilágyi

Mar 25, 2014, 7:04:55 AM
to golang-nuts
After pondering it a bit, I could circumvent the waitgroup by effectively simulating it with atomic ops and a channel, but it feels odd to go down to the atomic level to implement waiting for some operations to finish (pseudo code, not tested or even compiled):

var signal = make(chan struct{}, 1) // buffered, so a notification is never lost
var pend int32
var lock sync.Mutex
var coll []*Event

func batch(e *Event) {
  // Increment a pending counter
  atomic.AddInt32(&pend, 1)

  // Store the event into a temp collector
  lock.Lock()
  coll = append(coll, e)
  lock.Unlock()

  // Decrement the pending counter and check if silent period
  if atomic.AddInt32(&pend, -1) == 0 {
    // Silent period, nobody else wants in (maybe just temporary, we don't care)
    select {
      case signal <- struct{}{}:
        // Ok, processor notified
      default:
        // Somebody already notified it, return
    }
  }
}

func processor() {
  var temp []*Event

  for {
    // Wait for a (potential) silent period
    <- signal

    // Swap out collector
    lock.Lock()
    temp, coll = coll, []*Event{}
    lock.Unlock()

    // Process batched events
    // ...
  }
}

Note: I don't want a synchronization guarantee that pend is strictly zero when my processing runs. I just want the ability to optimize system state exchanges by batching events.

Dmitry Vyukov

Mar 25, 2014, 7:03:10 AM
to Péter Szilágyi, golang-nuts
I do not see how this can possibly work:

// Inserts a state exchange into the exchange queue
func (o *Overlay) exch(p *peer, s *state) {
  // Insert the state exchange
  o.eventPend.Add(1)
  o.eventLock.Lock()
  o.exchSet[p] = s
  o.eventLock.Unlock()
  o.eventPend.Done()


You do Add and Done at the same instant in time.
If o.eventPend.Wait is called before o.exch, then there is indeed a
race -- Wait won't wait for Add.
If o.eventPend.Wait is called after o.exch, then Add/Done are
senseless -- the sum of Add/Done is zero.
If o.eventPend.Wait is called concurrently with o.exch, then you have
a mix of these 2 bad outcomes.

Péter Szilágyi

Mar 25, 2014, 7:11:29 AM
to Dmitry Vyukov, golang-nuts
On Tue, Mar 25, 2014 at 1:03 PM, Dmitry Vyukov <dvy...@google.com> wrote:
> I do not see how this can possibly work:
>
> // Inserts a state exchange into the exchange queue
> func (o *Overlay) exch(p *peer, s *state) {
>   // Insert the state exchange
>   o.eventPend.Add(1)
>   o.eventLock.Lock()
>   o.exchSet[p] = s
>   o.eventLock.Unlock()
>   o.eventPend.Done()
>
>
> You do Add and Done at the same instant in time.
> If o.eventPend.Wait is called before o.exch, then there is indeed a
> race -- Wait won't wait for Add.
Yes, but in my case that is not a problem, since I don't call Wait only a single time, but in a loop. The add will be caught by the next wait invocation.
 
> If o.eventPend.Wait is called after o.exch, then Add/Done are
> senseless -- the sum of Add/Done is zero.
The point of Add and Done is to wait for all concurrent goroutines currently *inside* o.exch to finish. For example I receive 10 state updates. I want all 10 to be inserted into o.exchSet before allowing it to be swapped out by my manager. In the meantime more updates can arrive, it's not a problem, they will be processed in the next iteration.
 
> If o.eventPend.Wait is called concurrently with o.exch, then you have
> a mix of these 2 bad outcomes.
The issue is that my use of the WaitGroup here is an optimization. I do not require guarantees that no updates arrive during processing. I just want to optimize on network messages by batching already arrived updates and not processing them separately.

Dmitry Vyukov

Mar 25, 2014, 7:14:01 AM
to Péter Szilágyi, golang-nuts
pend buys you nothing in this case
just

func batch(e *Event) {
  lock.Lock()
  coll = append(coll, e)
  lock.Unlock()

  select {
  case signal <- struct{}{}:
    // Ok, processor notified
  default:
    // Somebody already notified it, return
  }
}

func processor() {
  var temp []*Event

  for _ = range signal {
    // Swap out collector
    lock.Lock()
    temp, coll = coll, []*Event{}
    lock.Unlock()

    // Process batched events
    // ...
  }
}


sufficiently buffered channel should do as well:

updates := make(chan *Event, 1000)

func batch(e *Event) {
  updates <- e
}

func processor() {
  for {
    var temp []*Event
    temp = append(temp, <-updates)
    deadline := time.After(time.Second)
  collectLoop:
    for {
      select {
      case e := <-updates:
        temp = append(temp, e)
      case <-deadline:
        break collectLoop
      }
    }
    // Process temp
    // ...
  }
}


this will give you much, much better guarantees wrt batching

Dmitry Vyukov

Mar 25, 2014, 7:18:12 AM
to Péter Szilágyi, golang-nuts
On Tue, Mar 25, 2014 at 3:11 PM, Péter Szilágyi <pet...@gmail.com> wrote:
>
> On Tue, Mar 25, 2014 at 1:03 PM, Dmitry Vyukov <dvy...@google.com> wrote:
>>
>> I do not see how this can possibly work:
>>
>> // Inserts a state exchange into the exchange queue
>> func (o *Overlay) exch(p *peer, s *state) {
>> // Insert the state exchange
>> o.eventPend.Add(1)
>> o.eventLock.Lock()
>> o.exchSet[p] = s
>> o.eventLock.Unlock()
>> o.eventPend.Done()
>>
>>
>> You do Add and Done at the same instant in time.
>> If o.eventPend.Wait is called before o.exch, then there is indeed a
>> race -- Wait won't wait for Add.
>
> Yes, but in my case that is not a problem, since I don't call Wait only a
> single time, but in a loop. The add will be caught by the next wait
> invocation.
>
>>
>> If o.eventPend.Wait is called after o.exch, then Add/Done are
>> senseless -- the sum of Add/Done is zero.

I believe this is incorrect usage of WaitGroup. The counter must not
be incremented from zero while a goroutine is in wg.Wait.
Future wg implementations can break your code.


> The point of Add and Done is to wait for all concurrent goroutines currently
> *inside* o.exch to finish. For example I receive 10 state updates. I want
> all 10 to be inserted into o.exchSet before allowing it to be swapped out by
> my manager. In the meantime more updates can arrive, it's not a problem,
> they will be processed in the next iteration.

This looks like a very very weak (close to senseless) aggregation
strategy to me. The region between Add and Done can be executed in
100ns. So even if 10 updates have arrived almost simultaneously, you
still can have 10 separate updates.

See my previous email with a much stronger aggregation strategy,
basically the updater collects all updates in a given time window
after the first update.

Péter Szilágyi

Mar 25, 2014, 7:43:31 AM
to Dmitry Vyukov, golang-nuts
On Tue, Mar 25, 2014 at 1:18 PM, Dmitry Vyukov <dvy...@google.com> wrote:
> On Tue, Mar 25, 2014 at 3:11 PM, Péter Szilágyi <pet...@gmail.com> wrote:
>
> On Tue, Mar 25, 2014 at 1:03 PM, Dmitry Vyukov <dvy...@google.com> wrote:
>>
>> I do not see how this can possibly work:
>>
>> // Inserts a state exchange into the exchange queue
>> func (o *Overlay) exch(p *peer, s *state) {
>> // Insert the state exchange
>> o.eventPend.Add(1)
>> o.eventLock.Lock()
>> o.exchSet[p] = s
>> o.eventLock.Unlock()
>> o.eventPend.Done()
>>
>>
>> You do Add and Done at the same instant in time.
>> If o.eventPend.Wait is called before o.exch, then there is indeed a
>> race -- Wait won't wait for Add.
>
> Yes, but in my case that is not a problem, since I don't call Wait only a
> single time, but in a loop. The add will be caught by the next wait
> invocation.
>
>>
>> If o.eventPend.Wait is called after o.exch, then Add/Done are
>> senseless -- the sum of Add/Done is zero.

> I believe this is incorrect usage of WaitGroup. The counter must not
> be incremented from zero while a goroutine is in wg.Wait.
> Future wg implementations can break your code.
Although I could argue that this limits a waitgroup's capabilities, I can accept the reasoning. It would probably be extremely rare for such a use case to actually be warranted, and detecting potential races in code where it isn't is more valuable. That is why I asked whether this use case was abusing the wg.

> The point of Add and Done is to wait for all concurrent goroutines currently
> *inside* o.exch to finish. For example I receive 10 state updates. I want
> all 10 to be inserted into o.exchSet before allowing it to be swapped out by
> my manager. In the meantime more updates can arrive, it's not a problem,
> they will be processed in the next iteration.

> This looks like a very very weak (close to senseless) aggregation
> strategy to me. The region between Add and Done can be executed in
> 100ns. So even if 10 updates have arrived almost simultaneously, you
> still can have 10 separate updates.
I guess the worst case scenario would be that the first update passes and the rest will be collected while the processor is running. But fair enough, my thought flow wasn't the strongest here :))

> See my previous email with a much stronger aggregation strategy,
> basically the updater collects all updates in a given time window
> after the first update.
I'm not too fond of time windows since they can cause significant latencies, but maybe it would be a good solution in the current scenario (albeit with a much shorter window of milliseconds) as there shouldn't be too many cascading updates. Dunno, will sleep on it and decide afterwards :)

Anyway, thanks for your time on clearing these up :)

Dmitry Vyukov

Mar 25, 2014, 8:47:02 AM
to Péter Szilágyi, golang-nuts
well, in your current code you don't have any bound on latency at all:
a tight sequence of updates can delay the batch indefinitely.
with the time window you do have a worst-case bound on latency, and,
sure, you can reduce it to 10ms or whatever works for you

if you are concerned about potential updates channel overflow, you can
do the following:

func processor() {
  for {
    var temp []*Event
    temp = append(temp, <-updates)
    deadline := time.After(time.Second)
  collectLoop:
    for {
      select {
      case e := <-updates:
        temp = append(temp, e)
      case <-deadline:
        break collectLoop
      }
    }
    batchChan <- temp // another goroutine will handle it asynchronously
  }
}

this way you always have a goroutine reading from updates chan, so it
won't overflow until something unexpected happens (in which case you
do want to block producers, or discard messages).
