Memory profiling: pprof heap profile vs. UNIX


Brad Heller

Jul 5, 2014, 8:32:11 PM
to golan...@googlegroups.com
Hey all,

I've got a service I'm writing in Go that receives a bunch of data, writes it to disk, and does some calculations on the data while it caches a subset of the data in memory (obviously the data set is bigger than memory).

I've created an abstraction that basically allows me to start evicting stuff from cache once the total cache size has reached a certain threshold (e.g. 2GB). This should really be the majority of what I'm caching.

However, I've noticed that the service is consuming significantly more memory (note: following was on Darwin):

$ ps -o rss=  42724
5418420 (kilobytes)

So I used pprof to figure out WTF is going on. To do that, I basically just made my app respond to the USR1 signal by dumping the pprof heap profile. But it doesn't look like pprof is giving me a full snapshot.

>> go tool pprof bin/my-app /tmp/my-app-heap.pprof
Adjusting heap profiles for 1-in-524288 sampling rate
Welcome to pprof!  For help, type 'help'.
(pprof) top
Total: 544.0 MB

Note that the ps reading and the pprof reading were taken while the app was in the same state.

1. Am I missing something about pprof here?
2. Am I missing something about Go's GC, or at least about how the GC subsequently releases allocations?

Thanks,
Brad Heller

Tamás Gulácsi

Jul 6, 2014, 2:03:48 AM
to golan...@googlegroups.com
Use net/http/pprof, but note that on Mac OS X profiling is lacking.

Carlos Castillo

Jul 6, 2014, 9:24:21 PM
to golan...@googlegroups.com
Only CPU profiles on OS X are broken (and can be fixed with a kernel patch); all other profiles are fine.

As for the OP's questions:
  1. pprof can track memory a few different ways, but I believe the default is "memory in use by Go at the time the profile was made". This does *NOT* include memory already collected by the GC. The --help flag to pprof shows the various memory-profile display options.
  2. The GC never releases memory back to the OS; it only re-uses it. From the outside, memory use will only grow or stay level. Also, the GC only runs automatically on allocation, when the total in-use memory exceeds a previously set limit, not when objects become inactive.
If you write a program which processes data in 512MB chunks, the profiler may only list 512MB in use (depends on how the program holds on to memory), but when the GC runs it finds all the memory currently in use, and sets a limit of double that for running the next GC cycle. So if you happened to have 2 chunks active at the time, then at least 1GB is in use, and the GC will be next run when a total of 2GB has been allocated. This means that unless your app ends before the next GC, or the GC is called manually, those 2GB of memory will eventually be used. If only one 512MB chunk is active at a later GC, the limit is set to 1GB, but the extra 1GB allocated previously is *NOT* returned to the OS by the GC.

The scavenger runs every 5 minutes and tries to find blocks of memory (allocations are made in blocks) which are completely unused by Go code. This can be difficult, as if any byte in the block is active, the whole block remains active. If it finds any inactive blocks, it tells the OS that they are "Unused", which is slightly different than "Free". Unfortunately the OS still counts that memory against the process, but if it needs memory, it will use the "Unused" group first, and the contents don't need to be preserved (i.e. swapped out to disk), so the operation is nearly as cheap as on "Free" memory.

Also, maps and slices are easy ways to lose track of memory, since you may only be using len() elements, but the allocations can be larger. Eg: Slicing an array of length 21 to take only the middle 3 elements doesn't make the remaining 18 elements disappear in any way.

A more concrete example: http://play.golang.org/p/-K8wbHTJmH

The value of Sys is the total number of bytes Go has asked for from the system, which in normal (non-cgo) programs should roughly match the system's measurement; Alloc is the number of bytes Go believes are active at the moment. Even though the example functions only generate slices/maps that contain 10 * 1MB of data, the program stabilizes at a Sys value more than 10 times that. Some of the extra space makes sense, as there is more memory in use than just the slices/maps: we are storing at least one 1MB element on the stack before copying it to the slice/map, and some extra allocations are needed due to re-sizing of the slices and maps. On a re-size, both versions must exist at the same time, so at best, with a doubling algorithm for slices, a slice of 0.5n elements and one of n must both be active at some point in the function; at worst it can be 2n-1 elements' worth. Adjusting the boolean options in main shows the results if you pre-allocate the slices and/or call the GC at the end of each operation. With both active, the memory use, as seen in the Sys value, is much better.

So in short, from the outside, Go programs never shrink in memory use unless the OS takes memory back, and may be larger than you expect based on how references to data are held in your program. 

Carlos Castillo

Jul 6, 2014, 10:24:06 PM
to golan...@googlegroups.com
There's one other important factor I forgot to mention that contributes to extra memory use: memory fragmentation.

If you allocate a 10MB block, free it, and then make some small allocation, the allocation could use part of the freed 10MB block. Even if the unused portion is made available for future allocations it will be smaller than 10MB, and so can't be used to satisfy the next 10MB allocation (which must be contiguous), so more memory may need to be requested from the OS.

I.e.: in the code I posted, the fact that Sys-Alloc ≈ 20MB doesn't mean that there are 10MB in a row available to store a slice of ten 1MB elements.

Carlos Castillo

Jul 6, 2014, 11:06:01 PM
to golan...@googlegroups.com
Also note, adjusting the program to store slices/maps of pointers to 1MB blocks: http://play.golang.org/p/eCRmDQ-vvk

The total memory use shown by Sys is now only ~2x the largest working-set size (e.g. 10MB), which falls in line with what's expected based on the behaviour of the GC (see: http://golang.org/pkg/runtime/#pkg-overview). The downside is that you are giving the GC more work, as it now needs to track each independent 1MB block. Using the prealloc var makes things slightly worse (although probably faster), and manually calling the GC after each operation drops total use down to ~15MB. All these cases are still better than the best case previously (~30MB).

Another way to decrease memory use is to manually re-use the memory yourself. This way, the large blocks you allocate don't go back to the GC where they could get fragmented, and so no new ones need to be made.

Dmitry Vyukov

Jul 7, 2014, 7:09:29 AM
to Brad Heller, golang-nuts
What version of Go are you using? There were some fixes to the heap profile in 1.3.
Does the program use cgo/SWIG? pprof does not see the C heap.
Print runtime.MemStats; pprof counts only the Go heap, and not all memory in a Go program is heap.