On 8/9/2011 11:04 AM, Dmitry Vyukov wrote:
> On Mon, Aug 8, 2011 at 11:17 PM, Dmitry Vyukov <dvy...@google.com
> <mailto:dvy...@google.com>> wrote:
>
> Everything else is based on runtime extension - proc (OS worker
> thread) local data:
> http://codereview.appspot.com/4850045/diff2/1:3001/src/pkg/runtime/mem.go
>
>
> Access to proc local data takes about 8ns and of course is perfectly
> scalable. The important feature is a possibility to iterate over proc
> local data of all procs.
>
> And especially to core Go devs - can proc local data eventually make
> its way into runtime?
>
>
> Per thread data is really crucial for scalability, because
> scalability is distribution. I may try to implement it w/o runtime
> help by means of extremely OS-specific hacks, but that's going to be
> difficult, non-portable (to newly supported OSes) and most likely slower;
> and I won't be able to do it for Plan9.
>
How does per-thread data interact with per-coroutine? Given that
coroutines can jump between OS threads, it seems risky.

John
=:->
I've started work on the "co" package:
What do you think of this?
I think eventually we may need some kind of proc-local data.
We already have it, behind the scenes, so the only question
is whether and how to expose it. In general I would rather
expose as little as possible, instead using what's there to
build higher-level, easier to use APIs.
The particular API you've used for proc-local data in this code
is a problematic one to commit to, because goroutines are not
guaranteed to be cooperatively scheduled. It has always been
the plan to introduce preemption, and the API can't preclude that.
My biggest concern about the co code you've shown is that
it has too many knobs. If there is a right way to set the
knobs, then the code should do that without being asked.
If there is no right way to set the knobs, then they shouldn't
be there.
I think that some of these kinds of things would be useful
to have in package sync, but they'd have to be simplified.
For example, the cache constructor is too complex.
I want to be able to say
var cache sync.Cache
cache.Put(x)
x := cache.Get().(T)
and that's it. Let the code do the right thing for me.
Similarly
var n sync.Counter
n.Add(10)
fmt.Println(n.Get())
This is similar to how you can just declare a sync.Mutex
(or embed it in some other data structure that you're
allocating) and use it. You don't have to ask for it to
be allocated, and you don't have to release it when
you're done. That simplicity is important to preserve.
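(For concreteness, a minimal sketch of a counter in that spirit - usable from its zero value, with a single atomic word and none of the per-proc distribution discussed in this thread; the syncx package name is invented:)

    package syncx // hypothetical package, for illustration only

    import "sync/atomic"

    // Counter is usable from its zero value, like sync.Mutex.
    type Counter struct {
        n int64
    }

    // Add atomically adds delta to the counter.
    func (c *Counter) Add(delta int64) { atomic.AddInt64(&c.n, delta) }

    // Get atomically reads the current value.
    func (c *Counter) Get() int64 { return atomic.LoadInt64(&c.n) }

A scalable implementation could later distribute n across per-proc slots without changing this API.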
Russ

> May you please elaborate here? Elsewhere you said "runtime is for the lowest
> of the lowest"... proc local data in runtime is like Semacquire/release, not
> intended for direct use in application code.

That's true but people still use it even though they shouldn't.
I still want to limit the attack surface here. (Attacking the
future flexibility of the implementation.)
>> The particular API you've used for proc-local data in this code
>> is a problematic one to commit to, because goroutines are not
>> guaranteed to be cooperatively scheduled. It has always been
>> the plan to introduce preemption, and the API can't preclude that.
>
> Well... I understand the problem... any suggestions?
> Perhaps we can only provide a guarantee that proc local data won't go away
> under our feet, that is, that the pointer will remain a valid pointer to some
> proc local data (proc local data of some thread, same slot of course). Some
> components are OK with such a guarantee (DistributedRWMutex). Others can
> switch to either atomic modification of the data (StatCounter can do
> atomic.Add on the var), or use temporary goroutine-to-thread locking. Does it
> sound reasonable?
I don't want to encourage runtime.LockOSThread.
Having something that is 'mostly' local seems fine.
> I think some simplifications can be made (that's only a quick
> proof-of-concept). However, regarding the cache, I really have no idea how
> to set it up. It depends on element "cost" and usage pattern, both of which
> can't be estimated/predicted. Perhaps it can be parametrized with a single
> parameter - the maximum number of elements cached - which is then distributed
> across the thread-local caches and the global cache.
> Regarding ctor/dtor, I can manage w/o a ctor (return nil), but what if an
> element contains a file or a connection which requires explicit closing?
> How to handle it w/o a dtor?
All the files and connections are already finalized.
They are just memory. Or Put could return an evicted
element for people who really care.
The important point here is that the data structures should
be arranged so that the zero value is useful and there is no
destructor needed beyond simply garbage collecting the
memory. Finalization should be a rare case, not the
common one.
> Well, I think I can get away with lazy initialization + finalization.
> However, I think finalization must be significantly optimized then; currently
> it forces me to switch from chans to WaitGroups on Darwin, because a single
> chan in a GC pass forces a second GC pass over all memory just to collect a
> dozen bytes...
I am okay with telling people that Darwin has bad APIs
that make Go run slower. I am also okay with reviewing
CLs that make it better. :-)
Russ
The problem is with finalization. A finalizer won't be able to properly close cached resources. However, since files and conns do not require explicit close, I think we can assume that nothing requires close; if something does, roll out your own cache :)
Another option is to provide an optional Close (like os.File does), and Close could optionally return a slice with all the resources, so that a user can close them.
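(To make the "Put could return an evicted element" idea concrete, a hedged sketch - all names invented, and a real implementation would distribute storage per proc instead of using one mutex:)

    import "sync"

    // Cache is usable from its zero value. Max == 0 means unbounded.
    type Cache struct {
        mu    sync.Mutex
        items []interface{}
        Max   int
    }

    // Put stores x and returns an evicted element, if any, so callers
    // caching files or connections can close them explicitly.
    func (c *Cache) Put(x interface{}) (evicted interface{}) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if c.Max > 0 && len(c.items) >= c.Max {
            evicted = c.items[0]
            c.items = c.items[1:]
        }
        c.items = append(c.items, x)
        return evicted
    }

    // Get returns a cached element, or nil if the cache is empty;
    // constructing a new element on nil stands in for the ctor.
    func (c *Cache) Get() interface{} {
        c.mu.Lock()
        defer c.mu.Unlock()
        n := len(c.items)
        if n == 0 {
            return nil
        }
        x := c.items[n-1]
        c.items = c.items[:n-1]
        return x
    }

A caller caching connections could then write: if old := cache.Put(conn); old != nil { old.(net.Conn).Close() }.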
I was playing with some code last night that does tons of (custom, not fmt) formatting operations per second, some of which involve time, which is a relatively expensive operation but doesn't need to be recomputed often (the time updates once a second, the date much less often). I created a "cache" for it (I'm not sure if that's the right term) to avoid recomputing the time portion every call, but updates to it had to be guarded with a mutex if I wanted it to be threadsafe.

I started with an rwmutex, but optimizing the normal path prompted me to shift the updating of portions of this cache further along the flow, where I could no longer frontload the decision to lock it for read or write, so I switched back to a mutex. This essentially precludes more than one goroutine in a GOMAXPROCS>1 environment from calling the method at the same time, when (aside from the implementation of my cache) there is nothing inherent about the function that should require that.

I was thinking about your thread-local storage, and thought that this might be an application for it. I imagine doing something like

    // the value is copied into each thread's local storage
    var threadCache = sync.ThreadLocal(CacheType{})
    ...
    func blah() {
        // a pointer to this thread's copy is retrieved here
        cache := threadCache.Get().(*CacheType)
        // stuff
    }

I have no idea if this model works or is dangerous (it would have to be locked, but a lock could cause the goroutine to be rescheduled onto another thread?) but it seems like a fairly "simple" interface to a thread-local-storage API.
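(For reference, a hedged sketch of what the "simple approach" asked about below might look like - one mutex around a shared formatting cache; all names are invented:)

    import (
        "sync"
        "time"
    )

    type timeCache struct {
        mu     sync.Mutex
        second int64  // unix second the cached string was built for
        str    string // formatted time, rebuilt at most once per second
    }

    func (c *timeCache) format(now time.Time) string {
        c.mu.Lock()
        defer c.mu.Unlock()
        if s := now.Unix(); s != c.second {
            c.second = s
            c.str = now.Format("15:04:05") // the expensive part, amortized
        }
        return c.str
    }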
Did you have performance problems with the simple approach?
On Aug 10, 2011 9:24 AM, "Dmitry Vyukov" <dvy...@google.com> wrote:
> func AllocProcLocal(running bool) (slot uintptr)
> func GetProcLocal(slot uintptr) (val uintptr)
> func SetProcLocal(slot uintptr, val uintptr)
What does "ProcLocal" mean? Goroutine local or OS thread local?
Ian
This seems racy to me... goroutines can be moved from thread to thread with no notification, and the language doesn't require that they be cooperatively scheduled, so it's possible that something you get from GetProcLocal() will be invalid on the next line of code.
> On Wed, Aug 10, 2011 at 6:31 PM, John Asmuth <jas...@gmail.com> wrote:
>
>> This seems racy to me... goroutines can be moved from thread to thread with
>> no notification, and the language doesn't require that they be cooperatively
>> scheduled, so it's possible that something you get from GetProcLocal() will
>> be invalid on the next line of code.
>
>
>
> You're right. Semantically it's a plain shared var that must be treated by
> the same rules as a plain shared var. We are able to avoid races on plain
> shared vars.
I think that semantically it is worse than a plain shared var. In Go I
can reliably read and write a shared variable provided I synchronize
access using channels or mutexes. The only operation I can do with this
OS-thread-local variable is a single standalone read or write, or one of
the operations in sync/atomic. In particular I can't read the variable
and then write it, because my write could go to a different variable
entirely. Synchronization is useless here because the goroutine might
be rescheduled on a different OS thread between the read and write.
Ian
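(To make the hazard concrete, a sketch in terms of the hypothetical Get/SetProcLocal API quoted above:)

    // RACY: the goroutine may migrate to another OS thread between
    // the two calls, so the value read from one proc's slot can be
    // written into a different proc's slot, losing a concurrent update.
    v := runtime.GetProcLocal(slot)  // reads the current thread's slot
    runtime.SetProcLocal(slot, v+1)  // may now run on a different thread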
On Aug 10, 2011 12:01 PM, "Dmitry Vyukov" <dvy...@google.com> wrote:
> I do not agree.
> First, one's write won't go to a different variable entirely. It will go to the same variable, it just can be accessed by other goroutines concurrently.
> Then, one can use a mutex as well:
> x := runtime.GetProcLocal(slot).(*X)
> x.mtx.Lock()
> x.data += 42
> x.mtx.Unlock()
I see. You can use them as pointers. You are avoiding trouble by doing only a single read of the OS-thread-local var.
> To name some other potential usages: statistical counters seem to be a perfect fit; distributed rw mutex (a reader read-locks a proc local rwmutex, while a writer write-locks all rw mutexes).
Would statistical counters require some mechanism for iterating over the slots?
Ian
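(As a hedged illustration of both points - distributing writes and iterating to read - a sketch of a sharded counter in ordinary Go. Without proc-local slots there is no good shard key, which is exactly the gap this thread is about, so the caller must supply one, e.g. a worker index:)

    import (
        "runtime"
        "sync/atomic"
    )

    type shard struct {
        n int64
        _ [56]byte // pad toward a cache line to limit false sharing
    }

    // StatCounter spreads writes across GOMAXPROCS shards.
    type StatCounter struct{ shards []shard }

    func NewStatCounter() *StatCounter {
        return &StatCounter{shards: make([]shard, runtime.GOMAXPROCS(0))}
    }

    func (c *StatCounter) Add(key int, n int64) {
        atomic.AddInt64(&c.shards[key%len(c.shards)].n, n)
    }

    // Value visits every shard - the analogue of iterating over the
    // proc local data of all procs.
    func (c *StatCounter) Value() (sum int64) {
        for i := range c.shards {
            sum += atomic.LoadInt64(&c.shards[i].n)
        }
        return sum
    }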
Alternately, what is this for? Can we provide more useful
implementations at a higher level and just not expose
any API for this? Would it suffice to have just sync.Cache
and sync.Counter?
Russ
I am not sure here. On one hand, DistributedRWMutex is extremely good
for high read-to-write scenarios, and that's where reader-writer mutexes
should be used in the first place. On the other hand, current
sync.RWMutex performance is easy to understand for everybody who has
experience with pthread/Win32/whatever rw mutexes, while
DistributedRWMutex has considerably higher write-lock costs (it's not
uncommon to have 16/32 cores today). Potentially it's possible to
implement an adaptive rw mutex, and it would be an unprecedented
feature; however, it's going to complicate the implementation
significantly and, more importantly, most likely hit fast paths.
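(A hedged sketch of the DistributedRWMutex shape under discussion - a reader locks only one shard, a writer locks them all - with shard selection again left to the caller:)

    import "sync"

    type DRWMutex struct{ shards []sync.RWMutex }

    func NewDRWMutex(n int) *DRWMutex {
        return &DRWMutex{shards: make([]sync.RWMutex, n)}
    }

    // RLock read-locks a single shard and returns it so the caller
    // can RUnlock the same shard later.
    func (m *DRWMutex) RLock(key int) *sync.RWMutex {
        l := &m.shards[key%len(m.shards)]
        l.RLock()
        return l
    }

    // Lock write-locks every shard in order; this is the considerably
    // higher write-lock cost mentioned above.
    func (m *DRWMutex) Lock() {
        for i := range m.shards {
            m.shards[i].Lock()
        }
    }

    func (m *DRWMutex) Unlock() {
        for i := range m.shards {
            m.shards[i].Unlock()
        }
    }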
There is one more quirk: as it is going to be implemented right now,
proc local data will be allocated for every worker thread rather than
only for threads running Go code. And the number of worker threads in an
application that uses blocking IO/syscalls can be in the hundreds or
thousands. It can be fixed, but it requires a scheduler rewrite. By the
way, currently the same is true for per-thread memory caches - even if
GOMAXPROCS=1 there can be thousands of per-thread memory caches
wasting space.
Personally I would prefer to have separate RWMutex/DistributedRWMutex,
but I should not be considered the common case.
I am not sure I get you. sync can't implement it w/o proc local data.
Per proc data w/o runtime support is going to be very messy and
unportable. And if it's provided by runtime, then it's a part of its
public API. So what exactly do you mean?
I mean that sync can make calls into runtime without
those calls being part of runtime's public API.
For example, reflect's setiface function is a call into
runtime but not part of the public API.
We can make this functionality available to sync
without making it available to all Go programmers.
Russ
Aha! So runtime can implement sync.allocProcLocal!
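(The pattern Russ describes - a function declared in one package but implemented in runtime, without appearing in runtime's public API - might look like this sketch; allocProcLocal is Dmitry's name from this thread, not a real function:)

    // In package sync: declaration only, no body. The runtime
    // supplies the implementation under this linker symbol, the
    // same trick reflect uses for setiface.
    package sync

    func runtime_allocProcLocal() uintptr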