Map of maps race condition error : "invalid memory address or nil pointer dereference"


Alec Matusis

unread,
Oct 16, 2014, 7:33:36 PM10/16/14
to golan...@googlegroups.com
I have a race condition in a busy TCP server with 10000+ goroutines (one goroutine per client):

    type Server struct {
        sync.RWMutex
        Groups map[string]*Group
        ...
    }


    type Group struct {
        Clients map[*Client]bool
        ...
    }



This code:

    for client := range s.Groups[groupName].Clients {
        err := s.WriteToClient(client, msg)
        ...
    }

crashes very often with:

    panic: runtime error: invalid memory address or nil pointer dereference

Whenever I update s.Groups or s.Groups[name].Clients, I always do it inside an s.Lock() / s.Unlock() block.

We would strongly prefer to not use Rlocks when we loop over the clients for performance reasons (we did several benchmarks). We do not care about the accuracy of s.Groups[name].Clients at each point in time, we only care about it not crashing with memory errors.

When I split s.Groups[groupName].Clients into

    var g = s.Groups[groupName]
    for client := range g.Clients {
        err := s.WriteToClient(client, msg)
        ...
the invalid memory error happens about 1000x more rarely. I would like to understand why such splitting helps, and whether it's possible to only use write locks, but not read locks and guarantee that there are no memory errors?

Jesse McNelis

unread,
Oct 16, 2014, 7:44:26 PM10/16/14
to Alec Matusis, golang-nuts
On Fri, Oct 17, 2014 at 10:33 AM, Alec Matusis <mat...@gmail.com> wrote:
> the invalid memory error happens about 1000x more rarely. I would like to
> understand why such splitting helps, and whether it's possible to only use
> write locks, but not read locks and guarantee that there are no memory
> errors?

No. Maps aren't thread safe; they are a complex data structure that
can't be updated atomically.
You'll need to put locks around all reads and writes if you want to
use them concurrently.

No values in Go are thread safe: slices, maps, chan values, ints, floats, etc.
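For reference, a minimal sketch of what "locks around all reads and writes" looks like with the types from the original post (WriteToClient is replaced by a print, and the guard against a missing group is an addition to illustrate where the nil dereference can come from):

```go
package main

import (
	"fmt"
	"sync"
)

type Client struct{ id int }

type Group struct {
	Clients map[*Client]bool
}

type Server struct {
	sync.RWMutex
	Groups map[string]*Group
}

// Broadcast holds the read lock for the whole iteration, so no
// concurrent mutation can happen while we range over Clients.
func (s *Server) Broadcast(groupName, msg string) {
	s.RLock()
	defer s.RUnlock()
	g, ok := s.Groups[groupName]
	if !ok { // guards against the nil dereference from the original panic
		return
	}
	for client := range g.Clients {
		fmt.Printf("send %q to client %d\n", msg, client.id)
	}
}

// AddClient takes the write lock for any mutation of either map.
func (s *Server) AddClient(groupName string, c *Client) {
	s.Lock()
	defer s.Unlock()
	g, ok := s.Groups[groupName]
	if !ok {
		g = &Group{Clients: make(map[*Client]bool)}
		s.Groups[groupName] = g
	}
	g.Clients[c] = true
}

func main() {
	s := &Server{Groups: make(map[string]*Group)}
	s.AddClient("lobby", &Client{id: 1})
	s.Broadcast("lobby", "hello")
}
```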

k...@golang.org

unread,
Oct 16, 2014, 7:50:05 PM10/16/14
to golan...@googlegroups.com, mat...@gmail.com, jes...@jessta.id.au
Run your program with the race detector: https://golang.org/doc/articles/race_detector.html
It will correctly complain that you can't have a concurrent read and write map operation.

Alec Matusis

unread,
Oct 16, 2014, 7:55:41 PM10/16/14
to golan...@googlegroups.com, mat...@gmail.com, jes...@jessta.id.au, k...@golang.org
I tried the race detector: the servers grew to multi-gigabyte size (instead of a stable 100MB) within 15 minutes and took the physical machine down. During those 15 minutes, no race conditions were detected. Without the race detector, memory errors happened within 1-2 minutes.
I do understand that maps need to be locked during modification, but I do not quite understand why they need to be locked during reads, if we do not care about consistency?

Ian Taylor

unread,
Oct 16, 2014, 8:02:08 PM10/16/14
to Alec Matusis, golang-nuts, jes...@jessta.id.au, k...@golang.org
On Thu, Oct 16, 2014 at 4:55 PM, Alec Matusis <mat...@gmail.com> wrote:
>
> I tried the race detector: the servers grew to multi-gigabyte size (instead
> of stable 100MB) within 15 minutes, and took the physical machine down.
> During these 15 min, no race conditions were detected. Without race
> detector, memory errors happened within 1-2 minutes.
> I do understand that maps need to be locked during their modifications, but
> I do not quite understand why they need to be locked during reading them, if
> we do not care about consistency..?

Because maps are a complex, mutating data structure. During a write,
parts of the map, including internal pointers, will be changing.
Reading the map at the same time a write is mutating it can easily
send the program into invalid memory.

Ian



andrewc...@gmail.com

unread,
Oct 16, 2014, 10:22:35 PM10/16/14
to golan...@googlegroups.com
If you don't want read locks, you probably need to use different, non-standard data structures. There are a number of associative data structures with different concurrency characteristics that could be used, but the choice depends on your specific application.

Unfortunately, Go has a poor selection of such data structures at the moment.

Tahir

unread,
Oct 16, 2014, 10:58:04 PM10/16/14
to golan...@googlegroups.com
If you can refactor, merging your two maps into a single one keyed by a struct (e.g. Key{Group, Client}) will probably help.

Andrew has a blog entry that may interest you if you haven't seen it yet: http://blog.golang.org/go-maps-in-action

You could probably avoid one lock that way, I think.

It does affect how you can iterate through your data, though.
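A minimal sketch of this single-map refactor, with hypothetical Key/Client/Server names (the original thread only names the key shape, Key{Group, Client}):

```go
package main

import "fmt"

type Client struct{ name string }

// Key collapses the two-level map[string]*Group -> map[*Client]bool
// into a single map keyed by (group, client).
type Key struct {
	Group  string
	Client *Client
}

type Server struct {
	Members map[Key]bool
}

func (s *Server) Join(group string, c *Client) {
	s.Members[Key{group, c}] = true
}

func (s *Server) IsMember(group string, c *Client) bool {
	return s.Members[Key{group, c}]
}

func main() {
	s := &Server{Members: make(map[Key]bool)}
	alice := &Client{"alice"}
	s.Join("lobby", alice)
	fmt.Println(s.IsMember("lobby", alice)) // true
}
```

The trade-off mentioned above is visible here: listing the members of one group now means scanning the whole map, since there is no per-group sub-map to range over.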

Jesse McNelis

unread,
Oct 16, 2014, 11:17:38 PM10/16/14
to Alec Matusis, golang-nuts
On Fri, Oct 17, 2014 at 10:33 AM, Alec Matusis <mat...@gmail.com> wrote:
> We would strongly prefer to not use Rlocks when we loop over the clients for
> performance reasons (we did several benchmarks). We do not care about the
> accuracy of s.Groups[name].Clients at each point in time, we only care about
> it not crashing with memory errors.

Did these benchmarks show that this is a bottleneck that prevents you
from achieving the performance required for your use-case?
Or did they just show the obvious, that synchronising shared state is
more expensive than not synchronising?

Alec Matusis

unread,
Oct 16, 2014, 11:32:38 PM10/16/14
to golan...@googlegroups.com, mat...@gmail.com, jes...@jessta.id.au

> Did these benchmarks show that this is a bottleneck in your
> application that means that you can't achieve performance that is
> required for your use-case?
> Or did they just show the obvious, that synchronising shared state is
> more expensive than not synchronising?

They actually did: our goal is to handle 100k clients on an 8-core server, and with read locks that is not feasible (it is nearly feasible without them). We had an older version of the code that was stable without the read locks, and its performance was about 2x.
I am somewhat optimistic that I can now copy:

    s.RLock()
    clients := s.Groups[name].Clients
    s.RUnlock()
    for client := range clients {
        err := s.WriteToClient(client, msg)
        ...

Since I deep copy, I think the for loop does not require a read lock. This has not been possible with 100k goroutines until https://code.google.com/p/go/issues/detail?id=8287 , but I just checked the latest Go from the repository tip, and the memory stays constant, even though I deep copy.

Klaus Post

unread,
Oct 17, 2014, 8:28:08 AM10/17/14
to golan...@googlegroups.com, mat...@gmail.com, jes...@jessta.id.au
Hi!

Does that work? I think you are only copying the pointer to the Clients map, not the contents (no "deep copy"). If the client map is updated while you iterate over it, you risk a collision.

Could you change "Groups map[string]*Group" to "Groups map[int]*Group" and use an ID for the lookup? That would at least be faster and may free up resources for you; I saw some time ago that int keys are about 5 times faster than string keys.

Also, you could have an RWMutex in each Group. That way you only have to lock that group while reading/writing inside it, and you only need the global lock when reading/writing the Groups map itself.

From what I can see you *need* synchronization for what you are trying to do; just try to distribute the locking and keep each critical section as short as possible.
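A minimal sketch of this two-level locking layout (hypothetical Server/Group/Client types; the server lock guards the Groups map, each group's own lock guards its Clients map):

```go
package main

import (
	"fmt"
	"sync"
)

type Client struct{ id int }

type Group struct {
	sync.RWMutex
	Clients map[*Client]bool
}

type Server struct {
	sync.RWMutex
	Groups map[string]*Group
}

// lookup holds the server lock only long enough to fetch the group.
func (s *Server) lookup(name string) *Group {
	s.RLock()
	defer s.RUnlock()
	return s.Groups[name]
}

func (s *Server) AddClient(name string, c *Client) {
	g := s.lookup(name)
	if g == nil {
		s.Lock()
		if g = s.Groups[name]; g == nil { // re-check under the write lock
			g = &Group{Clients: make(map[*Client]bool)}
			s.Groups[name] = g
		}
		s.Unlock()
	}
	g.Lock()
	g.Clients[c] = true
	g.Unlock()
}

func (s *Server) Broadcast(name, msg string) {
	g := s.lookup(name)
	if g == nil {
		return
	}
	g.RLock() // blocks writers to this group only, not the whole server
	defer g.RUnlock()
	for c := range g.Clients {
		fmt.Printf("send %q to %d\n", msg, c.id)
	}
}

func main() {
	s := &Server{Groups: make(map[string]*Group)}
	s.AddClient("lobby", &Client{1})
	s.Broadcast("lobby", "hi")
}
```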


/Klaus

Jesper Louis Andersen

unread,
Oct 17, 2014, 4:44:03 PM10/17/14
to Alec Matusis, golang-nuts, Jesse McNelis

On Fri, Oct 17, 2014 at 5:32 AM, Alec Matusis <mat...@gmail.com> wrote:
They actually did: our goal was to handle 100k clients on a 8 core server, and with Read locks it's not feasible (nearly feasible without). We had an older code version that actually was stable without the read locks, and the performance was about 2x. 

This sounds like something where the new Go 1.4 atomic.Value from sync/atomic might come in handy:


Alternatively, something based on a ctrie or a similar concurrent data structure. Since Go does not give any concurrency-safety guarantee for maps, the compiler and runtime are free to do whatever they want with your maps. Hence, even if it works now, it may not work in the future. You need to address this sooner or later, since later Go versions might be stricter about this and yield more crashes in the code.
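A copy-on-write sketch using atomic.Value (Go 1.4+): readers load an immutable snapshot with no lock at all; writers copy, modify, and store a new map under a mutex that only other writers contend on. The Registry type and map shape are hypothetical stand-ins for the thread's Groups map:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type Registry struct {
	mu sync.Mutex   // serializes writers only
	v  atomic.Value // holds the current map[string]int snapshot
}

func NewRegistry() *Registry {
	r := &Registry{}
	r.v.Store(map[string]int{})
	return r
}

// Get is lock-free: it reads whatever snapshot is current.
func (r *Registry) Get(k string) int {
	return r.v.Load().(map[string]int)[k]
}

// Set copies the whole map before publishing, so this approach
// suits read-mostly workloads with small or rarely-changed maps.
func (r *Registry) Set(k string, val int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	old := r.v.Load().(map[string]int)
	next := make(map[string]int, len(old)+1)
	for key, v := range old {
		next[key] = v
	}
	next[k] = val
	r.v.Store(next)
}

func main() {
	r := NewRegistry()
	r.Set("lobby", 42)
	fmt.Println(r.Get("lobby")) // 42
}
```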


--
J.

Robert Melton

unread,
Oct 17, 2014, 5:28:45 PM10/17/14
to Alec Matusis, golang-nuts
On Thu, Oct 16, 2014 at 7:33 PM, Alec Matusis <mat...@gmail.com> wrote:
> I have a race condition in a busy TCP server with 10000+ goroutines (one
> goroutine per client):
> ...
>
> We would strongly prefer to not use Rlocks when we loop over the clients for
> performance reasons (we did several benchmarks). We do not care about the
> accuracy of s.Groups[name].Clients at each point in time, we only care about
> it not crashing with memory errors.

If you don't care about accuracy (as long as it is eventually
consistent), couldn't you buffer all the writes? Either keep a
duplicate map that receives the new writes (and isn't read from), or
just buffer the writes; then, at sane points, swap the maps (or
flush the buffer into the map) and use the result for the next round
of reads.
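A sketch of the flush-the-buffer variant (all names hypothetical): mutations go into a pending map guarded by a small mutex, and at "sane points" a flush applies them to the live map. The live map is only safe to read without locks if flushes and reads happen on the same goroutine, e.g. the broadcast loop.

```go
package main

import (
	"fmt"
	"sync"
)

type BufferedSet struct {
	mu      sync.Mutex
	pending map[string]bool // queued adds (true) and removes (false)
	live    map[string]bool // read lock-free by the owner goroutine only
}

func NewBufferedSet() *BufferedSet {
	return &BufferedSet{
		pending: make(map[string]bool),
		live:    make(map[string]bool),
	}
}

// Add and Remove may be called from any goroutine.
func (b *BufferedSet) Add(k string)    { b.mu.Lock(); b.pending[k] = true; b.mu.Unlock() }
func (b *BufferedSet) Remove(k string) { b.mu.Lock(); b.pending[k] = false; b.mu.Unlock() }

// Flush must be called by the same goroutine that reads b.live.
func (b *BufferedSet) Flush() {
	b.mu.Lock()
	defer b.mu.Unlock()
	for k, keep := range b.pending {
		if keep {
			b.live[k] = true
		} else {
			delete(b.live, k)
		}
	}
	b.pending = make(map[string]bool)
}

func main() {
	s := NewBufferedSet()
	s.Add("alice")
	s.Flush()
	fmt.Println(len(s.live)) // 1
}
```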

--
Robert Melton | http://robertmelton.com

egon

unread,
Oct 17, 2014, 7:00:42 PM10/17/14
to golan...@googlegroups.com
What are the characteristics of the program?

i.e.
How many groups?
How many clients per group?
How often are messages sent/broadcast? In a single group? In the application?
Is the 100k absolute upper-limit?

If you have 100k clients on a very chatty protocol, you may saturate your network card. It might be easier to distribute the workload over multiple machines, which would also give you better fault tolerance: you wouldn't want 100k complaints if a single server goes down.

What is the context? What does the application do? Simple chat server?

Do the messages have to go through a central dispatch?

A few wild guesses that may help: partition your groups and clients. That way you avoid blocking on a single piece of data for too long.

e.g. a simple approach for ints: http://play.golang.org/p/kq443uXyOX

For Groups you would stripe based on a hash of the group name (e.g. djb2(groupName) % 53).
For Clients you would use some ID or connection info.
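A sketch of this lock striping applied to group names (53 shards, a djb2-style hash, and a simplified group -> client-id map; all types hypothetical): writers to one group only block readers whose groups hash to the same shard.

```go
package main

import (
	"fmt"
	"sync"
)

const shardCount = 53

type shard struct {
	sync.RWMutex
	groups map[string][]string // group name -> client ids
}

type Striped struct {
	shards [shardCount]shard
}

func NewStriped() *Striped {
	s := &Striped{}
	for i := range s.shards {
		s.shards[i].groups = make(map[string][]string)
	}
	return s
}

// shardFor picks a shard with a djb2-style hash of the group name.
func (s *Striped) shardFor(group string) *shard {
	h := uint32(5381)
	for i := 0; i < len(group); i++ {
		h = h*33 + uint32(group[i])
	}
	return &s.shards[h%shardCount]
}

func (s *Striped) Join(group, client string) {
	sh := s.shardFor(group)
	sh.Lock()
	sh.groups[group] = append(sh.groups[group], client)
	sh.Unlock()
}

func (s *Striped) Members(group string) []string {
	sh := s.shardFor(group)
	sh.RLock()
	defer sh.RUnlock()
	return append([]string(nil), sh.groups[group]...) // copy out under the lock
}

func main() {
	s := NewStriped()
	s.Join("lobby", "alice")
	fmt.Println(s.Members("lobby")) // [alice]
}
```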

+ egon


On Friday, 17 October 2014 02:33:36 UTC+3, Alec Matusis wrote:
> I have a race condition in a busy TCP server with 10000+ goroutines (one goroutine per client):
>
>     type Server struct {
>         sync.RWMutex
>         Groups map[string]*Group
>         ...
>     }
>
>     type Group struct {
>         Clients map[*Client] bool

Change this to

    Clients map[*Client]struct{}

It uses less memory.

Alec Matusis

unread,
Oct 17, 2014, 11:49:49 PM10/17/14
to golan...@googlegroups.com, mat...@gmail.com, jes...@jessta.id.au
> Does that work? I think you are only copying the pointer to the Clients map, not the content (or "deep copy"). If the client map is updated while you iterate the array, you risk a collision.

This is essentially what I did for now: deep copied Group[name].Clients in a for loop under an RLock(), then sent stuff to each client while looping over the copy without any locks. A copy of 5k clients takes 1-2ms, but sending to 5k clients takes over 80ms, so it's better to RLock for 1-2ms than for 80ms.
I also distributed the locks: Group.Lock() when I add/remove clients in each group, and separately Server.Lock() when I add/remove Groups.
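A sketch of this snapshot approach (hypothetical Group/Client types, with the network write replaced by a print): copy the client set into a slice under RLock, which is cheap, then do the slow sends over the copy with no lock held.

```go
package main

import (
	"fmt"
	"sync"
)

type Client struct{ id int }

type Group struct {
	sync.RWMutex
	Clients map[*Client]bool
}

// snapshot copies the key set under the read lock; the lock is
// held only for the copy, not for the sends.
func (g *Group) snapshot() []*Client {
	g.RLock()
	defer g.RUnlock()
	out := make([]*Client, 0, len(g.Clients))
	for c := range g.Clients {
		out = append(out, c)
	}
	return out
}

func (g *Group) Broadcast(msg string) {
	for _, c := range g.snapshot() { // no lock held during the sends
		fmt.Printf("send %q to %d\n", msg, c.id)
	}
}

func main() {
	a, b := &Client{1}, &Client{2}
	g := &Group{Clients: map[*Client]bool{a: true, b: true}}
	g.Broadcast("hello")
}
```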

Alec Matusis

unread,
Oct 17, 2014, 11:54:46 PM10/17/14
to golan...@googlegroups.com
> If you are having 100k clients in a very chatty protocol you may saturate your network card. And it might be easier to just distribute the work load over multiple computers + it would give you better fault tolerance. You wouldn't want to have 100k complaints if a single server goes down.

To handle that many clients on one network card, you need MSI-X, which provides multiple Rx/Tx queue vectors per interface, and you must distribute those interrupts evenly over all CPU cores. In our experience, one can handle well over 100k connections on modern Ethernet adapters that support MSI-X. We specifically chose Go over C to handle that many clients in a single multiprocessor system. Fault tolerance is provided by hot spares.