On 8/9/2011 11:04 AM, Dmitry Vyukov wrote:
> On Mon, Aug 8, 2011 at 11:17 PM, Dmitry Vyukov <dvy...@google.com
> <mailto:dvy...@google.com>> wrote:
>
> Everything else is based on runtime extension - proc (OS worker
> thread) local data:
> http://codereview.appspot.com/4850045/diff2/1:3001/src/pkg/runtime/mem.go
>
>
> Access to proc local data takes about 8ns and of course is perfectly
> scalable. The important feature is a possibility to iterate over proc
> local data of all procs.
>
> And especially to core Go devs - can proc local data eventually make
> its way into runtime?
>
>
> Per thread data is really crucial for scalability, because
> scalability is distribution. I may try to implement it w/o runtime
> help by means of extremely OS-specific hacks, but that's going to be
> difficult, non-portable (to newly supported OSes) and most likely slower;
> and I won't be able to do it for Plan9.
>
How does per-thread data interact with per-coroutine? Given that
coroutines can jump between OS threads, it seems risky.

John
=:->
I've started work on the "co" package:
What do you think of this?
I think eventually we may need some kind of proc-local data.
We already have it, behind the scenes, so the only question
is whether and how to expose it. In general I would rather
expose as little as possible, instead using what's there to
build higher-level, easier to use APIs.
The particular API you've used for proc-local data in this code
is a problematic one to commit to, because goroutines are not
guaranteed to be cooperatively scheduled. It has always been
the plan to introduce preemption, and the API can't preclude that.
My biggest concern about the co code you've shown is that
it has too many knobs. If there is a right way to set the
knobs, then the code should do that without being asked.
If there is no right way to set the knobs, then they shouldn't
be there.
I think that some of these kinds of things would be useful
to have in package sync, but they'd have to be simplified.
For example, the cache constructor is too complex.
I want to be able to say
var cache sync.Cache
cache.Put(x)
x := cache.Get().(T)
and that's it. Let the code do the right thing for me.
Similarly
var n sync.Counter
n.Add(10)
fmt.Println(n.Get())
This is similar to how you can just declare a sync.Mutex
(or embed it in some other data structure that you're
allocating) and use it. You don't have to ask for it to
be allocated, and you don't have to release it when
you're done. That simplicity is important to preserve.
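(For concreteness, a minimal sketch of a counter in that spirit - usable from its zero value, with a single atomic word and none of the per-proc distribution discussed in this thread; the syncx package name is invented:)

    package syncx // hypothetical package, for illustration only

    import "sync/atomic"

    // Counter is usable from its zero value, like sync.Mutex.
    type Counter struct {
        n int64
    }

    // Add atomically adds delta to the counter.
    func (c *Counter) Add(delta int64) { atomic.AddInt64(&c.n, delta) }

    // Get atomically reads the current value.
    func (c *Counter) Get() int64 { return atomic.LoadInt64(&c.n) }

A scalable implementation could later distribute n across per-proc slots without changing this API.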
Russ

> May you please elaborate here? Elsewhere you said "runtime is for the lowest
> of the lowest"... proc local data in runtime is like Semacquire/release, not
> intended for direct use in application code.

That's true but people still use it even though they shouldn't.
I still want to limit the attack surface here. (Attacking the
future flexibility of the implementation.)
>> The particular API you've used for proc-local data in this code
>> is a problematic one to commit to, because goroutines are not
>> guaranteed to be cooperatively scheduled. It has always been
>> the plan to introduce preemption, and the API can't preclude that.
>
> Well... I understand the problem... any suggestions?
> Perhaps we can only provide a guarantee that proc local data won't go away
> under our feet, that is, that the pointer will remain a valid pointer to some
> proc local data (proc local data of some thread, same slot of course). Some
> components are OK with such a guarantee (DistributedRWMutex). Others can
> switch to either atomic modification of the data (StatCounter can do
> atomic.Add on the var), or use temporary goroutine-to-thread locking. Does it
> sound reasonable?
I don't want to encourage runtime.LockOSThread.
Having something that is 'mostly' local seems fine.
> I think some simplifications can be made (that's only a quick
> proof-of-concept). However, regarding the cache, I really have no idea how
> to set it up. It depends on element "cost" and usage pattern, both of which
> can't be estimated/predicted. Perhaps it can be parametrized with a single
> parameter - the maximum number of elements cached - which is then distributed
> across the thread-local caches and the global cache.
> Regarding ctor/dtor, I can manage w/o a ctor (return nil), but what if an
> element contains a file or a connection which requires explicit closing?
> How to handle it w/o a dtor?
All the files and connections are already finalized.
They are just memory. Or Put could return an evicted
element for people who really care.
The important point here is that the data structures should
be arranged so that the zero value is useful and there is no
destructor needed beyond simply garbage collecting the
memory. Finalization should be a rare case, not the
common one.
> Well, I think I can get away with lazy initialization + finalization.
> However, I think finalization must be significantly optimized then; currently
> it forces me to switch from chans to WaitGroups on Darwin, because a single
> chan in a GC pass forces a second GC pass over all memory just to collect a
> dozen bytes...
I am okay with telling people that Darwin has bad APIs
that make Go run slower. I am also okay with reviewing
CLs that make it better. :-)
Russ
The problem is with finalization. A finalizer won't be able to properly close cached resources. However, since files and conns do not require explicit close, I think we can assume that nothing requires close; if something does, roll out your own cache :)
Another option is to provide an optional Close (like os.File does), and Close could optionally return a slice with all the resources, so that a user can close them.
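(To make the "Put could return an evicted element" idea concrete, a hedged sketch - all names invented, and a real implementation would distribute storage per proc instead of using one mutex:)

    import "sync"

    // Cache is usable from its zero value. Max == 0 means unbounded.
    type Cache struct {
        mu    sync.Mutex
        items []interface{}
        Max   int
    }

    // Put stores x and returns an evicted element, if any, so callers
    // caching files or connections can close them explicitly.
    func (c *Cache) Put(x interface{}) (evicted interface{}) {
        c.mu.Lock()
        defer c.mu.Unlock()
        if c.Max > 0 && len(c.items) >= c.Max {
            evicted = c.items[0]
            c.items = c.items[1:]
        }
        c.items = append(c.items, x)
        return evicted
    }

    // Get returns a cached element, or nil if the cache is empty;
    // constructing a new element on nil stands in for the ctor.
    func (c *Cache) Get() interface{} {
        c.mu.Lock()
        defer c.mu.Unlock()
        n := len(c.items)
        if n == 0 {
            return nil
        }
        x := c.items[n-1]
        c.items = c.items[:n-1]
        return x
    }

A caller caching connections could then write: if old := cache.Put(conn); old != nil { old.(net.Conn).Close() }.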
I was playing with some code last night that does tons of (custom, not fmt) formatting operations per second, some of which involve time, which is a relatively expensive operation but doesn't need to be recomputed often (the time updates once a second, the date much less often). I created a "cache" for it (I'm not sure if that's the right term) to avoid recomputing the time portion every call, but updates to it had to be guarded with a mutex if I wanted it to be threadsafe.

I started with an rwmutex, but optimizing the normal path prompted me to shift the updating of portions of this cache further along the flow, where I could no longer frontload the decision to lock it for read or write, so I switched back to a mutex. This essentially precludes more than one goroutine in a GOMAXPROCS>1 environment from calling the method at the same time, when (aside from the implementation of my cache) there is nothing inherent about the function that should require that.

I was thinking about your thread-local storage, and thought that this might be an application for it. I imagine doing something like

    // the value is copied into each thread's local storage
    var threadCache = sync.ThreadLocal(CacheType{})
    ...
    func blah() {
        // a pointer to this thread's copy is retrieved here
        cache := threadCache.Get().(*CacheType)
        // stuff
    }

I have no idea if this model works or is dangerous (it would have to be locked, but a lock could cause the goroutine to be rescheduled onto another thread?) but it seems like a fairly "simple" interface to a thread-local-storage API.
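(For reference, a hedged sketch of what the "simple approach" asked about below might look like - one mutex around a shared formatting cache; all names are invented:)

    import (
        "sync"
        "time"
    )

    type timeCache struct {
        mu     sync.Mutex
        second int64  // unix second the cached string was built for
        str    string // formatted time, rebuilt at most once per second
    }

    func (c *timeCache) format(now time.Time) string {
        c.mu.Lock()
        defer c.mu.Unlock()
        if s := now.Unix(); s != c.second {
            c.second = s
            c.str = now.Format("15:04:05") // the expensive part, amortized
        }
        return c.str
    }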
Did you have performance problems with the simple approach?
On Aug 10, 2011 9:24 AM, "Dmitry Vyukov" <dvy...@google.com> wrote:
> func AllocProcLocal(running bool) (slot uintptr)
> func GetProcLocal(slot uintptr) (val uintptr)
> func SetProcLocal(slot uintptr, val uintptr)
What does "ProcLocal" mean? Goroutine local or OS thread local?
Ian
This seems racy to me... goroutines can be moved from thread to thread with no notification, and the language doesn't require that they be cooperatively scheduled, so it's possible that something you get from GetProcLocal() will be invalid on the next line of code.
> On Wed, Aug 10, 2011 at 6:31 PM, John Asmuth <jas...@gmail.com> wrote:
>
>> This seems racy to me... goroutines can be moved from thread to thread with
>> no notification, and the language doesn't require that they be cooperatively
>> scheduled, so it's possible that something you get from GetProcLocal() will
>> be invalid on the next line of code.
>
>
>
> You're right. Semantically it's a plain shared var that must be treated by
> the same rules as a plain shared var. We are able to avoid races on plain
> shared vars.
I think that semantically it is worse than a plain shared var. In Go I
can reliably read and write a shared variable provided I synchronize
access using channels or mutexes. The only operation I can do with this
OS-thread-local variable is a single standalone read or write, or one of
the operations in sync/atomic. In particular I can't read the variable
and then write it, because my write could go to a different variable
entirely. Synchronization is useless here because the goroutine might
be rescheduled on a different OS thread between the read and write.
Ian
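(To make the hazard concrete, a sketch in terms of the hypothetical Get/SetProcLocal API quoted above:)

    // RACY: the goroutine may migrate to another OS thread between
    // the two calls, so the value read from one proc's slot can be
    // written into a different proc's slot, losing a concurrent update.
    v := runtime.GetProcLocal(slot)  // reads the current thread's slot
    runtime.SetProcLocal(slot, v+1)  // may now run on a different thread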
On Aug 10, 2011 12:01 PM, "Dmitry Vyukov" <dvy...@google.com> wrote:
> I do not agree.
> First, one's write won't go to a different variable entirely. It will go to the same variable, it just can be accessed by other goroutines concurrently.
> Then, one can use a mutex as well:
> x := runtime.GetProcLocal(slot).(*X)
> x.mtx.Lock()
> x.data += 42
> x.mtx.Unlock()
I see. You can use them as pointers. You are avoiding trouble by doing only a single read of the OS-thread-local var.
> To name some other potential usages: statistical counters seem to be a perfect fit; distributed rw mutex (a reader read-locks a proc local rwmutex, while a writer write-locks all rw mutexes).
Would statistical counters require some mechanism for iterating over the slots?
Ian
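(As a hedged illustration of both points - distributing writes and iterating to read - a sketch of a sharded counter in ordinary Go. Without proc-local slots there is no good shard key, which is exactly the gap this thread is about, so the caller must supply one, e.g. a worker index:)

    import (
        "runtime"
        "sync/atomic"
    )

    type shard struct {
        n int64
        _ [56]byte // pad toward a cache line to limit false sharing
    }

    // StatCounter spreads writes across GOMAXPROCS shards.
    type StatCounter struct{ shards []shard }

    func NewStatCounter() *StatCounter {
        return &StatCounter{shards: make([]shard, runtime.GOMAXPROCS(0))}
    }

    func (c *StatCounter) Add(key int, n int64) {
        atomic.AddInt64(&c.shards[key%len(c.shards)].n, n)
    }

    // Value visits every shard - the analogue of iterating over the
    // proc local data of all procs.
    func (c *StatCounter) Value() (sum int64) {
        for i := range c.shards {
            sum += atomic.LoadInt64(&c.shards[i].n)
        }
        return sum
    }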
Alternately, what is this for? Can we provide more useful
implementations at a higher level and just not expose
any API for this? Would it suffice to have just sync.Cache
and sync.Counter?
Russ
I am not sure here. On one hand, DistributedRWMutex is extremely good
for high read-to-write scenarios, and that's where reader-writer mutexes
should be used in the first place. On the other hand, current
sync.RWMutex performance is easy to understand for everybody who has
experience with pthread/Win32/whatever rw mutexes, while
DistributedRWMutex has considerably higher write-lock costs (it's not
uncommon to have 16/32 cores today). Potentially it's possible to
implement an adaptive rw mutex, and it would be an unprecedented
feature; however, it's going to complicate the implementation
significantly and, more importantly, most likely hit fast paths.
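(A hedged sketch of the DistributedRWMutex shape under discussion - a reader locks only one shard, a writer locks them all - with shard selection again left to the caller:)

    import "sync"

    type DRWMutex struct{ shards []sync.RWMutex }

    func NewDRWMutex(n int) *DRWMutex {
        return &DRWMutex{shards: make([]sync.RWMutex, n)}
    }

    // RLock read-locks a single shard and returns it so the caller
    // can RUnlock the same shard later.
    func (m *DRWMutex) RLock(key int) *sync.RWMutex {
        l := &m.shards[key%len(m.shards)]
        l.RLock()
        return l
    }

    // Lock write-locks every shard in order; this is the considerably
    // higher write-lock cost mentioned above.
    func (m *DRWMutex) Lock() {
        for i := range m.shards {
            m.shards[i].Lock()
        }
    }

    func (m *DRWMutex) Unlock() {
        for i := range m.shards {
            m.shards[i].Unlock()
        }
    }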
There is one more quirk: as it is going to be implemented right now,
proc local data will be allocated for every worker thread rather than
only for threads running Go code. And the number of worker threads in an
application that uses blocking IO/syscalls can be in the hundreds or
thousands. It can be fixed, but it requires a scheduler rewrite. By the
way, currently the same is true for per-thread memory caches - even if
GOMAXPROCS=1 there can be thousands of per-thread memory caches
wasting space.
Personally I would prefer to have separate RWMutex/DistributedRWMutex,
but I should not be considered the common case.
I am not sure I get you. sync can't implement it w/o proc local data.
Per proc data w/o runtime support is going to be very messy and
unportable. And if it's provided by runtime, then it's a part of its
public API. So what exactly do you mean?
I mean that sync can make calls into runtime without
those calls being part of runtime's public API.
For example, reflect's setiface function is a call into
runtime but not part of the public API.
We can make this functionality available to sync
without making it available to all Go programmers.
Russ
Aha! So runtime can implement sync.allocProcLocal!
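(The pattern Russ describes - a function declared in one package but implemented in runtime, without appearing in runtime's public API - might look like this sketch; allocProcLocal is Dmitry's name from this thread, not a real function:)

    // In package sync: declaration only, no body. The runtime
    // supplies the implementation under this linker symbol, the
    // same trick reflect uses for setiface.
    package sync

    func runtime_allocProcLocal() uintptr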