How does the Go GC perform with a 128GB+ heap?


almeida....@gmail.com

Jul 31, 2016, 9:26:13 AM
to golang-nuts
I'm starting a proof-of-concept project for the company I work for. The project is an HTTP proxy with a smart caching layer (Varnish, Nginx, and others don't work for us because we have business rules around cache invalidation) in front of a very large microservice architecture (300+ services).

We have 2x 128GB machines available today for this project.
I have no doubt about Go's performance; I've used it in other projects, and those services are rock solid, very fast, and consume very little memory.
But I'm afraid to use Go for this project because of the GC. I'm planning to use all the available memory for the cache. Won't keeping all that memory on the heap be a problem?

Storing tons of GB in a GC'd language is a new area for me.
What are my options? Use a []byte and/or mmap to stay out of the GC?
That means lots and lots of code to reimplement these data structures on top of slices just to avoid the GC, not counting all the encoding/decoding to get/set the values.
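Something like this minimal sketch is what I mean by "on top of slices" (made-up names; no eviction or compaction shown):

    // Keys map to offsets inside one big []byte, so the GC sees a single
    // pointer-free allocation plus a pointer-free map, both cheap to scan.
    type offheapCache struct {
        buf   []byte            // length-prefixed values, back to back
        index map[uint64]uint32 // hash(key) -> offset into buf
    }

    func (c *offheapCache) set(h uint64, val []byte) {
        c.index[h] = uint32(len(c.buf))
        var hdr [4]byte
        binary.LittleEndian.PutUint32(hdr[:], uint32(len(val)))
        c.buf = append(c.buf, hdr[:]...)
        c.buf = append(c.buf, val...) // every set is an encode+copy
    }

    func (c *offheapCache) get(h uint64) ([]byte, bool) {
        off, ok := c.index[h]
        if !ok {
            return nil, false
        }
        n := binary.LittleEndian.Uint32(c.buf[off:])
        return c.buf[off+4 : off+4+n], true
    }

(using "encoding/binary")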

Stick with the raw slices?
I haven't used cgo before; is it a viable option?
Or should I go 100% off-heap with something like Rust or C?

I hope to add as little overhead as possible.

Brad Fitzpatrick

Jul 31, 2016, 11:44:29 AM
to almeida....@gmail.com, golang-nuts
You should expect at most 10ms pauses for large heaps as of Go 1.6, and especially in Go 1.7.

See https://talks.golang.org/2016/state-of-go.slide#37 (for Go 1.6; Go 1.7 is more consistently lower)



Jan Mercl

Jul 31, 2016, 12:14:07 PM
to Brad Fitzpatrick, almeida....@gmail.com, golang-nuts

On Sun, Jul 31, 2016 at 5:44 PM Brad Fitzpatrick <brad...@golang.org> wrote:

> You should expect at most 10ms pauses for large heaps as of Go 1.6, and especially in Go 1.7.

I assume those 10ms hold for most typical programs, and that the worst case of some still perfectly reasonable programs[0] cannot be guaranteed to stay that low on average[1]. Is that assumption correct?

  [0]: Imagine, for example, a program that repeatedly produces a singly linked list of tiny nodes totaling, say, 32GB, does something with it, and then throws it away for the GC to deal with.
  [1]: Or the pause moves instead to malloc, waiting for the GC to free memory, so technically it's not [directly] a GC pause.
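In code, [0] would be roughly:

    // A singly linked list of tiny nodes is close to a worst case for a
    // tracing GC: every node is reachable only through a pointer and
    // must be scanned individually.
    type node struct {
        next *node
        pad  [16]byte
    }

    func churn() {
        var head *node
        for i := 0; i < 1<<30; i++ { // ~25GB of 24-byte nodes
            head = &node{next: head}
        }
        _ = head // dropped on return; the whole list becomes garbage at once
    }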

--

-j

Brad Fitzpatrick

Jul 31, 2016, 1:13:26 PM
to Jan Mercl, almeida....@gmail.com, golang-nuts
On Sun, Jul 31, 2016 at 9:13 AM, Jan Mercl <0xj...@gmail.com> wrote:

> On Sun, Jul 31, 2016 at 5:44 PM Brad Fitzpatrick <brad...@golang.org> wrote:
>
> > You should expect at most 10ms pauses for large heaps as of Go 1.6, and especially in Go 1.7.
>
> I assume those 10ms hold for most typical programs, and that the worst case of some still perfectly reasonable programs[0] cannot be guaranteed to stay that low on average[1]. Is that assumption correct?

The design specifies a 10ms worst case STW pause for all types of load.

If needed, your program's allocations (producing your example of a 32GB singly-linked list) will start participating in small pieces of the concurrent garbage collection themselves, thus slowing down the rate of allocation and speeding up the collection.

If you have a program that exhibits over 10ms pauses, Rick & Austin would be interested in debugging it. They have such worst-case programs themselves which they run regularly but still enjoy bug reports. Include the output of GODEBUG=gctrace=1 in your bug report.

Go 1.5 had the 10ms goal but had known deficiencies late in the cycle which prevented it from hitting the goal in some extreme cases. Go 1.6 fixed those, but some other rare cases were found which prevented hitting the 10ms goal in other extreme cases. Go 1.7 fixed most of those and started working more on throughput, using less CPU to achieve the same goal. An additional optimization atop the existing design is potentially landing in Go 1.8.

The GC people can correct me if I got this wrong.

Konstantin Shaposhnikov

Jul 31, 2016, 2:09:27 PM
to golang-nuts, 0xj...@gmail.com, almeida....@gmail.com
Even if STW pauses are under 10ms, they are not the only source of latency introduced by the GC (see for example https://github.com/golang/go/issues/15847, https://github.com/golang/go/issues/16293, and https://github.com/golang/go/issues/16432). Some of these issues will be addressed in Go 1.7, others probably in Go 1.8.

Also, the size of the heap is generally twice the size of the live objects, so a Go program that uses 128GB of RAM can only have ~64GB of live data.

A good way to decide if Go will be good for your use case is to implement a prototype and benchmark it.

Sugu Sougoumarane

Aug 1, 2016, 12:38:49 AM
to golang-nuts
If you ask me, it's not a good idea to try to fill a whole machine with a single program. Even if you succeed, your program will be too complex because you'll encounter all kinds of bottlenecks. If 48 cores are trying to obtain a lock, things are going to get contentious even with Go. Then you'll start changing your code to be lock-free and introduce many bugs. And then there's the fact that the libraries you depend on use mutexes.

Instead, if you run something like 10 instances of the same program, you'll end up using the same resources but get much better performance, and your code will be much simpler.

As your hardware scales, you can tune more easily by varying the number of instances per machine.

andrey mirtchovski

Aug 1, 2016, 12:45:44 AM
to Konstantin Shaposhnikov, golang-nuts, Jan Mercl, almeida....@gmail.com
> Also the size of the heap is generally twice the size of live objects.

the GOGC tunable (debug.SetGCPercent in the runtime/debug package) can change this ratio.
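For example (a sketch; the right value depends on your allocation rate and CPU headroom):

    package main

    import (
        "fmt"
        "runtime/debug"
    )

    func main() {
        // GOGC=100 (the default) starts a collection once the heap has
        // grown 100% past the live set: ~64GB live -> ~128GB heap.
        // GOGC=25 targets ~1.25x the live set instead, trading more
        // frequent GC cycles for headroom: ~100GB live fits in ~125GB.
        old := debug.SetGCPercent(25)
        fmt.Println("previous GOGC:", old)
    }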

Sokolov Yura

Aug 1, 2016, 2:13:09 AM
to golang-nuts
Go 100% off-heap.

You can use another in-memory database for the data, running on the same machine. I recommend Tarantool (http://tarantool.org): it can handle hundreds of thousands (up to a million) requests per second on just one CPU core. If you need more, you can consider sharding. If you don't need persistence, you can disable logging.

Henry

Aug 1, 2016, 3:19:00 AM
to golang-nuts
The best way is to build a prototype, simulate your worst-case situation, and measure it.

Peter Herth

Aug 1, 2016, 5:18:26 AM
to Sugu Sougoumarane, golang-nuts
If you can perform your task with 10 independent processes, it means they are not competing for the same resources, and consequently it should run just as well within a single process. It all depends on whether there are critical paths where a common resource needs to be locked. And a program does not need to be complex to require huge amounts of memory and lots of cores. So the question of where the limits of the Go GC lie is very relevant. The fact that it seems to handle 100+ GB heaps was quite a "selling point" of Go for me.

Michael Jones

Aug 1, 2016, 6:22:55 AM
to Peter Herth, golang-nuts, Sugu Sougoumarane

The nature of the cache is very important. It is easy to create and manage your own memory arena if that is a comfortable solution. (As in: do you want/need full Go GC generality in the cache contents? If so, the Go overhead may be best for that task, and it is very good. But if not, say 100M 1KB slots, then rolling your own would be easy and optimal.) This last way gives you all of Go's magic plus any application-driven special efficiencies.
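A fixed-slot arena along those lines is only a few lines (a sketch; slot allocation and eviction not shown):

    // One big pointer-free allocation. The GC marks it as a single object
    // and never scans inside it, because []byte holds no pointers.
    type Arena struct {
        mem      []byte
        slotSize int
    }

    func NewArena(nSlots, slotSize int) *Arena {
        return &Arena{mem: make([]byte, nSlots*slotSize), slotSize: slotSize}
    }

    // Slot returns the bytes backing slot i; the application encodes and
    // decodes its values into this window.
    func (a *Arena) Slot(i int) []byte {
        return a.mem[i*a.slotSize : (i+1)*a.slotSize]
    }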

almeida....@gmail.com

Aug 1, 2016, 7:41:29 AM
to golang-nuts, 0xj...@gmail.com, almeida....@gmail.com
I tested some off-heap implementations; bigcache (https://github.com/allegro/bigcache) has a benchmark inside the project.
Running on my very old i5 notebook, these are the results with GODEBUG=gctrace=1: https://gist.github.com/anonymous/53dfb936e32a4755b7cb3a6695f66548
The test allocates some memory and calls runtime.GC(), which blocks execution until the collection finishes, so even with the 10ms max STW time, the whole GC takes longer.
With bigcache and freecache the GC time stays around 200ms; with a standard map it is 100x higher, around 20s.
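The measurement is essentially:

    func fullGC() time.Duration {
        start := time.Now()
        runtime.GC() // blocks until a complete collection finishes
        return time.Since(start)
    }

(with "runtime" and "time" imported)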

Another thing: the GC's max stop time is 10ms, but how often does that happen? Is it 10ms every 50ms?

How do all these new databases written in Go (InfluxDB, Prometheus, Cockroach, etc.) hold all their data in memory?
Is it with mmap and/or a giant []byte?

Chris Randles

Aug 1, 2016, 9:06:13 AM
to golang-nuts
Consul and etcd use BoltDB (https://github.com/boltdb/bolt), which uses an mmap-ed file.
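A minimal sketch of that approach (Unix-only; the []byte returned by syscall.Mmap is backed by the file rather than the Go heap, so the GC never scans its contents):

    package main

    import (
        "fmt"
        "os"
        "syscall"
    )

    func main() {
        f, err := os.OpenFile("cache.db", os.O_RDWR|os.O_CREATE, 0644)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        const size = 1 << 30 // 1GB backing file
        if err := f.Truncate(size); err != nil {
            panic(err)
        }

        data, err := syscall.Mmap(int(f.Fd()), 0, size,
            syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
        if err != nil {
            panic(err)
        }
        defer syscall.Munmap(data)

        copy(data, "hello") // writes go straight to the mapping
        fmt.Println(string(data[:5]))
    }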

r...@golang.org

Aug 1, 2016, 10:14:02 AM
to golang-nuts
I think the high bit here is that the Go community is very aggressive about GC latency. Go has large users with large heaps, lots of goroutines, and SLOs similar to those being discussed here. When they run into GC-related latency problems, the Go team works with them to root-cause and address the problem. It has been a very successful collaboration.

Stepping back, work on large heaps is being motivated by the fact that RAM hardware, due to its thermal characteristics, is still doubling bytes/$ every 2 years or so. As heaps grow, GC latency needs to be independent of heap size if Go is going to continue to scale over the next decade. The Go team is well aware of this, is motivated by it, and continues to design a GC to address this trend.

Jesper Louis Andersen

Aug 1, 2016, 10:27:47 AM
to almeida....@gmail.com, golang-nuts

On Sun, Jul 31, 2016 at 5:31 AM, <almeida....@gmail.com> wrote:
> Storing tons of GB in a GC'd language is a new area for me.

Set up an SLA for the service:

* 99th percentile: 99 of 100 requests are under 5ms processing time.
* 99.99th percentile: 9,999 of 10,000 requests are under 25ms processing time.
* 99.9999th percentile: 999,999 of 1,000,000 requests are under 40ms processing time.

By coming up with a modal latency rate like the above, you avoid several problems:

* If you say "NO requests must be slower than 10ms" you are making a claim you cannot guarantee. There is always a larger doomsday scenario which you didn't account for. And engineering a system with enough leverage to never hit a doomsday scenario is almost always a waste of programming resources.

* Even in the no-doomsday game, you will have many requests arriving in spikes. You generally don't have enough cores to process all of them at once, so you will have to queue them, which means you need more latency headroom in your SLA. Most people just cram this under the idea of "if my service is blazing fast, problems don't happen". That strategy is only viable in the most naive systems. Stability comes from proactive queue management and clever spike handling. This is one of the places where Go's preemption capabilities tend to help.

* You establish a baseline, GC'ed language or not. When I sustain-loaded Varnish at 30k req/s for 5 minutes, its 99th percentile was well into the "several seconds" ballpark. The reason is that Varnish's default 500 threads can't keep up and latency builds up in the queue. Note that Varnish has no GC and uses mmap()'ed files. In other words, ripping out the GC is not a sufficient condition for solving latency problems, and I don't think it is a necessary condition either.

* You get good acceptance criteria. You should also do ballpark napkin math on the desired latency levels: how many microseconds do you have to burn per request on this machine? If that looks completely impossible from the get-go, you need to adjust your SLA latencies.

Also, the size of the heap is not everything. Large blocks of memory with no pointers tend to be fast to scan, so the "pointer density" will tell you a lot about the latencies. But to make this work, you need to conduct experiments.
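For example:

    var (
        // Pointer-free: the GC marks this 1GB block as one object and
        // never scans its contents.
        flat = make([]byte, 1<<30)

        // Pointer-dense: a million pointers the GC must chase on every cycle.
        dense = make([]*[1024]byte, 1<<20)
    )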

If you have lots of additional machinery to burn, you can also use an old trick: send the request to N servers, with a 2ms delay between each send. Pick the first response that arrives and have the winning server cancel the request at the others. This can often hide a latency spike, at a cost of roughly 2ms of added delay per hedged attempt.
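A sketch of that trick using Go 1.7's context support in net/http (the 2ms stagger is the only parameter from above; names are illustrative):

    package main

    import (
        "context"
        "io/ioutil"
        "net/http"
        "time"
    )

    // hedgedGet issues the same GET against each replica URL, staggered by
    // 2ms. The first success wins; the shared context cancels the rest.
    func hedgedGet(ctx context.Context, urls []string) ([]byte, error) {
        ctx, cancel := context.WithCancel(ctx)
        defer cancel() // aborts the losing requests once a winner returns

        type result struct {
            body []byte
            err  error
        }
        ch := make(chan result, len(urls)) // buffered: losers never block

        for i, u := range urls {
            go func(delay time.Duration, url string) {
                select {
                case <-time.After(delay):
                case <-ctx.Done():
                    ch <- result{nil, ctx.Err()}
                    return
                }
                req, err := http.NewRequest("GET", url, nil)
                if err != nil {
                    ch <- result{nil, err}
                    return
                }
                resp, err := http.DefaultClient.Do(req.WithContext(ctx))
                if err != nil {
                    ch <- result{nil, err}
                    return
                }
                body, err := ioutil.ReadAll(resp.Body)
                resp.Body.Close()
                ch <- result{body, err}
            }(time.Duration(i)*2*time.Millisecond, u)
        }

        var lastErr error
        for range urls {
            r := <-ch
            if r.err == nil {
                return r.body, nil
            }
            lastErr = r.err
        }
        return nil, lastErr
    }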



--
J.

Jesper Louis Andersen

Aug 1, 2016, 10:33:58 AM
to Sugu Sougoumarane, golang-nuts
On Mon, Aug 1, 2016 at 6:38 AM, 'Sugu Sougoumarane' via golang-nuts <golan...@googlegroups.com> wrote:
> Instead, if you run something like 10 instances of the same program, you'll end up using the same resources but get much better performance, and your code will be much simpler.

This is the Erlang solution to the problem. For each proxy request, run a server in isolation. You don't run one process handling a million requests; you run a million processes, each handling one request. That is literally what happens in the VM internals as well: each process has its own heap and can be GC'ed in isolation. Of course, this requires copying data into the process for communication, and this is the secret of the Erlang VM. In fact, everyone else is doing it wrong, often citing odd claims of "efficiency" as a reason. You end up having to build rather complex GC schemes, whereas Erlang just uses Cheney's tried and true two-space copying collector.


--
J.