P-local/M-local storage for goroutines?


Zihan Yang

unread,
Jul 23, 2019, 1:22:09 PM7/23/19
to golang-nuts

I am trying to implement an LRU cache. Several global LRU lists could be accessed concurrently by multiple goroutines, which could be a disaster on a machine with 24 or more cores.


Therefore, it would be great if I could add items to P-local storage and flush the batched items into the LRU list as a whole. This should greatly reduce contention for the global LRU list.


How can I do it? I saw some related GitHub issues, #8281 and #21355, which led me to a project called gls, but it seems like too much code to integrate into my project (and ideally I'd avoid third-party packages altogether to sidestep potential legal issues). Is there any built-in way to achieve this?


Thanks

Michael Jones

unread,
Jul 23, 2019, 1:48:40 PM7/23/19
to Zihan Yang, golang-nuts
The simple, common way--if I understand your need correctly--is to launch a method goroutine.

type CacheManager struct {
    // things a worker needs to know, such as the global cache,
    // the specific worker's local cache, etc.
}

func master() {
    for i := 0; i < workers; i++ {
        m := new(CacheManager)
        m.x = y // set up your thread-local storage
        // ...
        go m.Worker()
    }
}

Unfortunately this does not seem to be in any intro guides, which pushes people to complicated workarounds.
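For concreteness, a minimal sketch of that pattern might look like the following, where each worker batches into a private map and flushes it into the shared list in one locked call. All the names here (globalLRU, CacheManager.local, the flush limit) are illustrative assumptions rather than anything from the snippet above, and eviction is omitted.

package cache

import (
    "container/list"
    "sync"
)

// globalLRU is a shared cache guarded by one mutex. It keeps recency order
// but omits eviction to keep the sketch short.
type globalLRU struct {
    mu    sync.Mutex
    order *list.List // front = most recently used
    items map[string]*list.Element
}

func newGlobalLRU() *globalLRU {
    return &globalLRU{order: list.New(), items: map[string]*list.Element{}}
}

// addBatch takes the lock once per batch instead of once per item.
func (g *globalLRU) addBatch(batch map[string][]byte) {
    g.mu.Lock()
    defer g.mu.Unlock()
    for k, v := range batch {
        if el, ok := g.items[k]; ok {
            el.Value = v
            g.order.MoveToFront(el)
            continue
        }
        g.items[k] = g.order.PushFront(v)
    }
}

// CacheManager is what each worker goroutine owns: the shared cache plus a
// private batch that needs no locking, because only one goroutine touches it.
type CacheManager struct {
    global *globalLRU
    local  map[string][]byte
    limit  int
}

func (m *CacheManager) Put(key string, val []byte) {
    m.local[key] = val
    if len(m.local) >= m.limit { // one contended operation per `limit` puts
        m.global.addBatch(m.local)
        m.local = make(map[string][]byte, m.limit)
    }
}

func master(global *globalLRU, workers int, jobs <-chan string) {
    for i := 0; i < workers; i++ {
        m := &CacheManager{global: global, local: map[string][]byte{}, limit: 64}
        go func() {
            for key := range jobs {
                m.Put(key, []byte(key)) // stand-in for the real work
            }
        }()
    }
}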



--
Michael T. Jones
michae...@gmail.com

Robert Engels

unread,
Jul 23, 2019, 1:52:35 PM7/23/19
to Zihan Yang, golang-nuts
You will need to use CGo. See Dave Cheney's "id" project for ideas. Since the underlying OS threads are shared but will not be switched out while in the C code, it is fairly straightforward, but having a robust typed cache will be problematic.

There are discussions on golang-dev about named goroutines which might be a workaround, but so far the TLS (or PLS) discussion has been a non-starter.

Robert Engels

unread,
Jul 23, 2019, 1:55:11 PM7/23/19
to Zihan Yang, golang-nuts
Also, you will probably find that a multi-level mutex-based solution performs fine. Go's lightweight threads make this very efficient. sync.Map may help as well, but it has some performance issues.

Pat Farrell

unread,
Jul 23, 2019, 4:44:07 PM7/23/19
to golang-nuts


On Tuesday, July 23, 2019 at 1:22:09 PM UTC-4, Zihan Yang wrote:

I am trying to implement an LRU cache. 


Don't use an LRU; use a working-set cache. Proven to be better back in 1969 by Peter J. Denning.

Zihan Yang

unread,
Jul 23, 2019, 10:49:30 PM7/23/19
to golang-nuts
Thanks for the example code; this is a possible solution. But my goroutines are not long-running. More specifically, each goroutine performs some IO, then returns. Next time, a different goroutine might access the same file.

One workaround is to return the local storage back to the CacheManager, but that involves contention on the CacheManager and adds to its complexity.


Zihan Yang

unread,
Jul 23, 2019, 10:55:47 PM7/23/19
to golang-nuts
CGo sounds good, but I recall that CGo invocations are not cheap (not sure where I read that).

Could you provide a reference to the Cheney "id" project? I can't find much information about it.

I will look into the named goroutine discussion.

Thanks.


Zihan Yang

unread,
Jul 23, 2019, 10:59:57 PM7/23/19
to golang-nuts
Actually, I am trying to manage memory. A segment-based solution is indeed convenient, but the merging of adjacent segments has become an inconvenience for me.

For example, I would like to reclaim only a 2M chunk, but the auto-merging of segments eventually creates a huge 64M segment. With the segment-based solution, I have to either reclaim this big chunk as a whole or not reclaim it at all. Neither option is friendly.


robert engels

unread,
Jul 23, 2019, 11:09:08 PM7/23/19
to Zihan Yang, golang-nuts
https://github.com/davecheney/junk/tree/master/id but see the warnings. 

Still, I would use a mutex and shared caches until you can prove that the cost matters, before resorting to that library.

You might be interested in github.com/robaho/go-concurrency-test


Bakul Shah

unread,
Jul 23, 2019, 11:20:38 PM7/23/19
to Zihan Yang, golang-nuts
Instead of starting new goroutines, send requests to existing goroutines via a channel.
In effect you are simulating an N-core processor with per-core local caches and one shared cache, so you can look at how processors manage their caches.
Just create N goroutines at the start and profile to see what happens. If necessary you can adjust the number of goroutines based on load.

Note the cost of managing LRU. If there is no clear pattern of accessing similar items, you may be better off using random replacement. As an example you can use LRU for smaller local caches and random for the global cache.

But if I were you I'd spend time understanding the access patterns before anything else. If I had no idea of the pattern, I'd pick the simplest implementation and evolve it based on profiling. In other words, do not assume that "greatly reduce the contention for the global lru list" is the right thing to do *unless* profiling tells you so.
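To make the fixed-goroutine idea concrete, here is one rough sketch where each shard is owned outright by a single goroutine and requests are routed to it by key hash over a channel. The shard count, buffer sizes, and the load callback are all assumptions for illustration, not anything specified in this thread.

package cache

import (
    "hash/fnv"
    "runtime"
)

type request struct {
    key   string
    reply chan []byte
}

// shardedCache routes each key to one of a fixed set of goroutines; each
// goroutine owns its map outright, so no locks are needed inside a shard.
type shardedCache struct {
    shards []chan request
}

func newShardedCache(load func(key string) []byte) *shardedCache {
    c := &shardedCache{shards: make([]chan request, runtime.NumCPU())}
    for i := range c.shards {
        ch := make(chan request, 64)
        c.shards[i] = ch
        go func() {
            local := map[string][]byte{} // owned by this goroutine only
            for req := range ch {
                v, ok := local[req.key]
                if !ok {
                    v = load(req.key)
                    local[req.key] = v
                }
                req.reply <- v
            }
        }()
    }
    return c
}

func (c *shardedCache) Get(key string) []byte {
    h := fnv.New32a()
    h.Write([]byte(key))
    req := request{key: key, reply: make(chan []byte, 1)}
    c.shards[h.Sum32()%uint32(len(c.shards))] <- req
    return <-req.reply
}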


Zihan Yang

unread,
Jul 24, 2019, 1:06:11 AM7/24/19
to golang-nuts
Thanks for the link. I just found the built-in sync.Pool, but it does not fit my requirement.

sync.Pool lends objects to the application and expects them to be returned. But I would like to hand objects to local storage and query them later.

Is there anything like a variant of sync.Pool?
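For reference, sync.Pool really is just a free list for interchangeable scratch objects: you cannot ask it for a particular item, and the runtime is free to drop pooled items at any GC, so it cannot act as a keyed, queryable cache. Typical usage looks roughly like this:

import "sync"

var bufPool = sync.Pool{
    New: func() interface{} { return make([]byte, 0, 4096) },
}

func handle() {
    buf := bufPool.Get().([]byte) // borrow a scratch buffer...
    defer bufPool.Put(buf[:0])    // ...and hand it back when done
    // use buf; there is no way to ask for "the buffer holding key X"
}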


Zihan Yang

unread,
Jul 24, 2019, 1:16:24 AM7/24/19
to golang-nuts
My use case is a little bit complicated. The goroutines run user-defined applications, and they might concurrently access the same OS file. Also, I cannot limit the number of goroutines because I do not control the user code.

I tried to evaluate performance with and without the LRU; performance with a global LRU is 10x slower than with no LRU at all.

So contention on the global LRU list is indeed a problem, at least for me.


Robert Engels

unread,
Jul 24, 2019, 7:52:35 AM7/24/19
to Zihan Yang, golang-nuts
The OS file cache is almost certainly going to be better than rolling your own. You can probably still find the studies the Lucene team did in this regard.

Zihan Yang

unread,
Jul 24, 2019, 8:47:12 AM7/24/19
to golang-nuts
As I said before, my use case is a little complicated, so let me explain it.

I am trying to add more caching inside a project called gVisor, which is an open-source user-space kernel from Google. Simply put, it is a secure container sandbox runtime that leverages hardware-assisted virtualization (VT-x on Intel). Sentry, the core part of gVisor, can run in GR0 or HR3 (it usually runs in GR0, unless it encounters privileged instructions that cannot be executed in non-root mode).

For IO operations, most of the time Sentry will trap down to HR3 and let the host kernel (e.g., Linux) deal with the OS cache, exactly as you said. However, due to the complex architecture and the many layers of abstraction in gVisor, the performance is bad. Therefore, we would like to keep more page cache inside gVisor to avoid trapping to HR3 on every IO operation.

Since we are adding a cache in Sentry, there must be a way to manage and reclaim it. LRU is the most intuitive strategy, but I find that the contention overhead completely erases the performance improvement the cache brings. So I would like a way to avoid contention on the same LRU list.


Robert Engels

unread,
Jul 24, 2019, 8:58:50 AM7/24/19
to Zihan Yang, golan...@googlegroups.com
But if the user apps are arbitrary and random, how will a cache be of any benefit? It only helps if the programs are reading the same files, and in that case you could use a logical file manager that keeps common files in memory.

Or, as someone else pointed out, use a service-worker model and partition the workers based on the files accessed (if known ahead of time), and use a local cache.

Lastly, use CGo or ASM. You can get access to the P struct and G Id easily. 
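As an aside, there is also a pure-Go (and officially discouraged) way to get at the goroutine id without CGo or assembly, by parsing runtime.Stack output. It is slow and unsupported, so take it as a sketch of the idea rather than a recommendation:

import (
    "bytes"
    "runtime"
    "strconv"
)

// goid extracts the current goroutine's id from runtime.Stack output, which
// begins "goroutine 123 [running]:". Slow and unsupported.
func goid() int64 {
    buf := make([]byte, 64)
    n := runtime.Stack(buf, false)
    fields := bytes.Fields(buf[:n]) // ["goroutine", "123", "[running]:", ...]
    id, _ := strconv.ParseInt(string(fields[1]), 10, 64)
    return id
}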


Zihan Yang

unread,
Jul 24, 2019, 9:11:36 AM7/24/19
to Robert Engels, golan...@googlegroups.com
The user apps of a given container are usually fixed, but there are many kinds of user apps. One of the most common situations is an nginx server that accesses files repeatedly. But the access pattern is not decided solely by the server itself; it also depends on the requests from clients. Even with the same client, it might access file A today but shift heavily to file B tomorrow. Therefore it is not easy to predict the access pattern, so I can only try to exploit the temporal and spatial locality of a given range of accesses as much as possible. Of course, some extremely common files like index.html had better always stay in memory.

I will look at CGo and ASM; thanks for the pointer.


Jesper Louis Andersen

unread,
Jul 24, 2019, 10:44:49 AM7/24/19
to Zihan Yang, golang-nuts
On Wed, Jul 24, 2019 at 7:16 AM Zihan Yang <whois.zi...@gmail.com> wrote:

I tried to evaluate the performance with lru & without lru, the performance with a global lru is 10x slower than without lru at all.


This would make me step back a bit and revisit my initial plan. If a global cache isn't helping, what evidence is there that a thread-local cache would? Said another way, the work you are caching must have a certain computational overhead to it, which warrants a cache in the first place.

Zihan Yang

unread,
Jul 24, 2019, 12:05:56 PM7/24/19
to Jesper Louis Andersen, golang-nuts
I should have said that my evaluation is just a self-written cycle measurement, which is very rough and lacks repeated experiments, so the factor might differ in your case. But contention for the single sync.Mutex really does hinder performance in my case.

> "what evidence is there a thread-local cache would?"

Strictly speaking, I'm not looking for thread-local but P-local storage (similar to per-CPU data in Linux). Since a P can run only one goroutine at any time, P-local storage needs no locking. Exclusive locking is needed only when I flush the batched items into the global LRU cache.

Since P-local storage does not exist, I'm thinking about reducing the number of operations on the list. Not every operation needs to delete and reinsert an element in the list; some can just change attributes of the item itself. That is a compromise of the LRU strategy, but this way we only need a read lock when we operate on the item itself rather than on the list.

I am not an expert in caching or Go, so please correct me if I have misunderstood anything.
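A rough sketch of that compromise, assuming string keys and []byte values (both assumptions): a hit takes only the read lock and bumps a per-item timestamp, while eviction under the write lock removes the stalest entry with a deliberately simple linear scan.

import (
    "sync"
    "sync/atomic"
    "time"
)

// entry carries its own recency stamp so that a cache hit never has to touch
// a shared list.
type entry struct {
    val      []byte
    lastUsed int64 // unix nanoseconds, updated atomically
}

// approxLRU sketches the compromise described above: Get takes only the read
// lock and bumps a per-item timestamp; Add, under the write lock, evicts the
// stalest entries by linear scan instead of maintaining a strict LRU list.
type approxLRU struct {
    mu    sync.RWMutex
    items map[string]*entry
    max   int
}

func (c *approxLRU) Get(key string) ([]byte, bool) {
    c.mu.RLock()
    e, ok := c.items[key]
    c.mu.RUnlock()
    if !ok {
        return nil, false
    }
    atomic.StoreInt64(&e.lastUsed, time.Now().UnixNano())
    return e.val, true
}

func (c *approxLRU) Add(key string, val []byte) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.items[key] = &entry{val: val, lastUsed: time.Now().UnixNano()}
    for len(c.items) > c.max {
        oldestKey, oldest := "", int64(1<<62)
        for k, e := range c.items {
            if t := atomic.LoadInt64(&e.lastUsed); t < oldest {
                oldestKey, oldest = k, t
            }
        }
        delete(c.items, oldestKey)
    }
}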


Tamás Gulácsi

unread,
Jul 24, 2019, 1:36:23 PM7/24/19
to golang-nuts
@dgryski has several different cache eviction policies implemented - https://github.com/dgryski/go-tinylfu and https://godoc.org/github.com/dgryski/go-clockpro, for example.
https://github.com/golang/groupcache has an lru subpackage, too.
Also, groupcache itself may help as a cache server.

What is the lifecycle of a client application?
One goroutine per app? Then you could create a cache per goroutine, mmapping a common seed, for example.
One goroutine per syscall? Then have NumCPU caching goroutines and communicate with them through channels.
Is contention making you slower than just using the host OS's block cache? Then that cache is doing its job, and you don't need your own.
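If the groupcache lru package fits, usage is roughly the following (from memory, so double-check the package docs; note that the Cache is not goroutine-safe by itself, so you would still wrap it with your own locking or sharding):

import "github.com/golang/groupcache/lru"

func example(page []byte) {
    c := lru.New(1024) // maximum number of entries before eviction
    c.Add("index.html", page)
    if v, ok := c.Get("index.html"); ok {
        _ = v.([]byte)
    }
}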

Tamás Gulácsi

Robert Engels

unread,
Jul 24, 2019, 2:08:04 PM7/24/19
to Tamás Gulácsi, golang-nuts
As I pointed out, you can add code similar to the id project to also get the P id and use that as a sync.Map key.

Jesper Louis Andersen

unread,
Jul 25, 2019, 7:11:33 AM7/25/19
to Zihan Yang, golang-nuts
On Wed, Jul 24, 2019 at 6:05 PM Zihan Yang <whois.zi...@gmail.com> wrote:
I should have said that my evaluation is just self-written cycle measurement, which is very rough and lack of repeated experiment. So the factor number might be different in your case. But the contention for the single sync.Mutex really hinders the performance in my case.


A long shot, but maybe worth trying: if the lock is the contention point, create N locks, preferably one per core, making sure they are not on the same cache line. Read locking is now "pick a random lock and grab it". Write locking is "take every lock, in the same order, for exclusive access". Or "hash the key, select randomly from H, H+1, H+2" and so on. This avoids the single-lock contention in the high-core case, but trades it for more expensive inserts.

Of course, having a "per-P" lock would be more efficient, but this converges toward the idea. So if there is a benefit, it is likely to show up.
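A sketch of that striping idea, using a key hash to pick the read stripe; the padding size, stripe selection, and constructor are assumptions for illustration.

import (
    "hash/fnv"
    "sync"
)

// stripedLock: readers grab one of N locks (picked by hashing the key);
// writers grab all of them in a fixed order for exclusive access.
type stripedLock struct {
    stripes []paddedMutex
}

type paddedMutex struct {
    sync.Mutex
    _ [56]byte // keep each mutex on its own (assumed 64-byte) cache line
}

func newStripedLock(n int) *stripedLock {
    return &stripedLock{stripes: make([]paddedMutex, n)}
}

func (s *stripedLock) stripe(key string) *sync.Mutex {
    h := fnv.New32a()
    h.Write([]byte(key))
    return &s.stripes[h.Sum32()%uint32(len(s.stripes))].Mutex
}

// RLock locks just the caller's stripe; the caller unlocks the returned mutex.
func (s *stripedLock) RLock(key string) *sync.Mutex {
    m := s.stripe(key)
    m.Lock()
    return m
}

// Lock takes every stripe in index order; Unlock releases them in reverse.
func (s *stripedLock) Lock() {
    for i := range s.stripes {
        s.stripes[i].Lock()
    }
}

func (s *stripedLock) Unlock() {
    for i := len(s.stripes) - 1; i >= 0; i-- {
        s.stripes[i].Unlock()
    }
}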

Robert Engels

unread,
Jul 25, 2019, 7:48:18 AM7/25/19
to Jesper Louis Andersen, Zihan Yang, golang-nuts
A sync.Map is lock-free on reads and takes a single lock on writes, so it will outperform this solution in all cases.

That being said, there are some perf issues with sync.Map - see the project I referred to earlier. 

The OP still hasn't provided detailed information on the access patterns - at least none that I can understand - so this conversation is just a guessing game.

Jesper Louis Andersen

unread,
Jul 26, 2019, 5:50:10 AM7/26/19
to Robert Engels, Zihan Yang, golang-nuts
On Thu, Jul 25, 2019 at 1:47 PM Robert Engels <ren...@ix.netcom.com> wrote:
A sync.Map is lock free on read, and a single lock on writes and so will out perform this solution in all cases. 


I wouldn't just matter-of-factly place lock-free algorithms above classical mutex locks in speed. 

Robert Engels

unread,
Jul 26, 2019, 7:49:32 AM7/26/19
to Jesper Louis Andersen, Zihan Yang, golang-nuts
Well, you can on modern commercial processors for local guards. 

It is also very easy to reason about why: most lock structures use the same CAS constructs behind the scenes, so you are eliminating operations in all cases.

When you say "faster" you probably mean "fairer" - most lock-free algorithms can starve or bias actors, and in many cases you don't want this. For a write-through cache (which this most likely is) you definitely want lock-free structures (unless it is biased towards writes, in which case you probably don't want a cache at all).

This doesn’t even get into the performance possibilities of using eventually consistent/lazy methods which are even faster. 

All of this is predicated on a proper implementation - you can review github.com/robaho/go-concurrency-test to see where Go falls down a bit here.

ben....@gmail.com

unread,
Jul 28, 2019, 7:30:04 PM7/28/19
to golang-nuts
For reads,
1. Use lock-free ring buffers to record the read events, don't update the eviction policy.
2. Have an array of buffers and use the key's hash to select one
3. Drop the read event if contention or the buffer is full
4. When full, use a try-lock to replay the events against the eviction policy

For writes,
1. Use a single ring buffer to record the write event.
2. If appended, schedule the replay work immediately
3. If full, block on the lock and replay.

The lack of a processor / thread / goroutine id means that selecting the read buffer isn't ideal. Accesses to a hot key will contend on the same ring buffer, similar to what you'd see with lock striping. However, the work involved is cheaper than locking, as you only need to acquire an array index to set its value. Because cache accesses follow a Pareto distribution, the popular items will be accessed frequently, so a few dropped events won't change that and the eviction policy should still be able to prefer them. I believe the naive hash selection will be faster than sync.Pool, but the two should be compared.

This scheme allows you to replace LRU with more advanced policies such as W-TinyLFU, and to incorporate other features very cheaply.
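A rough Go sketch of the read side of that scheme follows. The stripe count, buffer size, and replay callback are all assumptions, and the draining is deliberately racy and lossy: a dropped or stale event only slightly perturbs the policy.

import (
    "hash/fnv"
    "sync/atomic"
)

const (
    numStripes = 8   // assumed stripe count
    bufSize    = 128 // assumed per-stripe capacity
)

// lossyBuffer records read events into striped ring buffers and replays them
// against the eviction policy under a try-lock, dropping events on overflow.
type lossyBuffer struct {
    stripes [numStripes]stripe
    drainMu int32                // 0 = free, 1 = held; a CAS-based try-lock
    replay  func(keys []string)  // applies a batch of reads to the policy
}

type stripe struct {
    next int64 // atomically claimed write index
    keys [bufSize]atomic.Value
}

func (b *lossyBuffer) RecordRead(key string) {
    h := fnv.New32a()
    h.Write([]byte(key))
    s := &b.stripes[h.Sum32()%numStripes]

    slot := atomic.AddInt64(&s.next, 1) - 1
    if slot >= bufSize {
        // Buffer full: drop the event, but try to drain if nobody else is.
        if atomic.CompareAndSwapInt32(&b.drainMu, 0, 1) {
            b.drain()
            atomic.StoreInt32(&b.drainMu, 0)
        }
        return
    }
    s.keys[slot].Store(key)
}

// drain replays whatever has been recorded and resets the stripes. It races
// with concurrent writers by design; lost events are acceptable here.
func (b *lossyBuffer) drain() {
    for i := range b.stripes {
        s := &b.stripes[i]
        n := atomic.LoadInt64(&s.next)
        if n > bufSize {
            n = bufSize
        }
        batch := make([]string, 0, n)
        for j := int64(0); j < n; j++ {
            if k, ok := s.keys[j].Load().(string); ok {
                batch = append(batch, k)
            }
        }
        b.replay(batch)
        atomic.StoreInt64(&s.next, 0)
    }
}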

I wrote a short overview of this in use,
and there is an early attempt to port it into Golang in https://github.com/dgraph-io/ristretto/commits/master