Go 1BRC Challenge and GC Overhead


Brendan Goldacker

unread,
Jul 22, 2024, 4:05:33 PM (3 days ago) Jul 22
to golang-nuts

I've been using the 1 billion row challenge as a way to learn Go over the past few days. With my limited Go experience, I seem to have hit a wall.

I have my submission here. On my M1 MacBook Pro with 16GB of memory, it runs in ~39s. I can shave off a few seconds by playing with some parameters (numWorkers and chunkSizes). My goal is to get under 20s, but I'm stuck.

I came across another submission from this blog post. I've copied it into my repo (otherSubmission.go) and it runs in ~19s. Looking over their solution, it appears very similar to mine in approach. The main difference is that they use os.Read to read the file sequentially, while my submission uses mmap, plus a few other minor differences. Overall, both approaches look about the same, yet theirs runs in roughly half the time. I can't find anything that stands out to explain such a large difference in runtime.

I've been trying to use the profiles and execution traces to figure out why. Below are my flame graphs and execution traces:

[Screenshots: my CPU flame graph and execution trace]

Below are the otherSubmission profile and execution traces:

[Screenshots: otherSubmission CPU flame graph and execution trace]

The first thing that stands out to me is the number of GC collections. My submission seems to run GC much more frequently: 100,000 incremental sweeps, 32,000 background sweeps, and around 1,000 stop-the-world GCs. The heap only grows to 180MB at its peak and typically stays around 72MB. In the otherSubmission, incremental GC runs only 6 times and stop-the-world GC only ~30 times. Their heap grows much larger (up to 3.5GB), and GC happens far less frequently.

Also, looking at the flame graphs, it looks like much more time is being spent on stack and scheduler management (runtime.systemstack, runtime.morestack, and runtime.mcall). I think this may be due to extra scheduler overhead from the frequent GCs (waiting for goroutines to reach a stopping point, running GC, then re-scheduling them to start again).

I believe the time discrepancy is mostly due to GC overhead and Go scheduler overhead, although I'm not sure why.
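(Aside: a minimal way to sanity-check the GC counts outside the trace viewer is `runtime.ReadMemStats`; this sketch isn't in my repo, it's just what such a check would look like:)

```go
import (
	"fmt"
	"runtime"
	"time"
)

// printGCStats reports how many GC cycles have run so far, the total
// stop-the-world pause time, and the current live heap. Purely
// illustrative; call it near the end of main (or periodically) to
// sanity-check what the execution trace shows.
func printGCStats() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("GC cycles: %d  total pause: %v  heap in use: %d MB\n",
		m.NumGC, time.Duration(m.PauseTotalNs), m.HeapInuse>>20)
}
```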

Could anyone help a newbie figure out where I'm going wrong? Could you also share any tools or tricks that would help me figure this out?

Ian Lance Taylor

unread,
Jul 22, 2024, 4:17:57 PM (3 days ago) Jul 22
to Brendan Goldacker, golang-nuts
I have not looked at your code.  But if the GC is running a lot, the first step is to look at where you are allocating memory, using Go's memory profiler.  It may help to read https://go.dev/blog/pprof.  See if you can adjust your code to allocate less memory.
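A minimal sketch of collecting a heap profile with `runtime/pprof` (the file name and call site are illustrative; inspect the output with `go tool pprof heap.prof`):

```go
import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps a heap profile showing where allocations
// come from; inspect it afterwards with `go tool pprof heap.prof`.
func writeHeapProfile() {
	f, err := os.Create("heap.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	runtime.GC() // flush up-to-date allocation statistics into the profile
	if err := pprof.WriteHeapProfile(f); err != nil {
		log.Fatal(err)
	}
}
```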

Ian

ben...@gmail.com

unread,
Jul 22, 2024, 8:22:25 PM (3 days ago) Jul 22
to golang-nuts
Hi Brendan, this is a fun problem. I'm looking at https://github.com/Goldabj/1brc-go/blob/main/cmd/brc/log_processor.go, and I suspect the main thing is that you're converting []byte to string, which almost certainly allocates. You're also scanning through the bytes several times (first to find the '\n', then the ';', then the '.'). And you're calling ParseInt (twice), which does more work than you need to for this problem. If you want to go further, I'd suggest keeping things as []byte, and making the code search for the various delimiters *and* parse at the same time.
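To make that concrete, here is a rough sketch of that kind of single-pass parse, keeping the city as a `[]byte` and the temperature as integer tenths (illustrative only, not code from the write-up; it assumes well-formed lines with one decimal digit and no trailing newline):

```go
// parseLine scans a line of the form "City;-12.3" exactly once,
// returning the city as a []byte view into the input and the
// temperature in tenths of a degree (e.g. -12.3 -> -123).
func parseLine(line []byte) (city []byte, tenths int) {
	i := 0
	for line[i] != ';' {
		i++
	}
	city = line[:i]
	i++ // skip the ';'

	neg := false
	if line[i] == '-' {
		neg = true
		i++
	}
	for ; i < len(line); i++ {
		c := line[i]
		if c == '.' {
			continue // skip the decimal point; we accumulate tenths directly
		}
		tenths = tenths*10 + int(c-'0')
	}
	if neg {
		tenths = -tenths
	}
	return city, tenths
}
```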

If you want to "cheat" :-), you could look at my write-up, where I present 9 (incrementally faster) solutions to this problem: https://benhoyt.com/writings/go-1brc/

-Ben

Marcos Issler

unread,
Jul 23, 2024, 9:46:42 AM (2 days ago) Jul 23
to golang-nuts
Hi Brendan, take a look at this great Ardan Labs event where they explore in depth how to analyze the GC. I think you'll find your answer there about how the GC works, and about how to balance the number of goroutines against GC/memory use for the best performance.

You need to sign in to Ardan Labs, but it is free. I couldn't find the video anywhere else.

They also have some interesting info about it on their blog page:


Bests,
Marcos Issler 

Brendan Goldacker

unread,
Jul 24, 2024, 4:40:28 PM (20 hours ago) Jul 24
to golang-nuts
Thanks everyone for the leads and resources. After a few more days of hacking away and experimenting, I was able to get a solution that runs under 20s (see here). It turns out GC wasn't the problem; it was mostly issues I created myself. A few things really helped me get there:

1. Parsing lines in place -- As Ben mentioned, I was doing duplicate work for each line. I would run `bytes.IndexByte(buffer, '\n')` to find the end of the current line, then pass the line to a `lineToMeasurement(line)` method which iterated through each character to parse out the city and the temperature sample. So I was reading each character twice. I knew this duplication existed, but I didn't think much about it, especially since `IndexByte` uses SIMD instructions (or so I believe), so I assumed the cost was negligible. However, when processing 1 billion lines, a few nanoseconds per line really adds up. I updated `lineToMeasurement` to handle the parsing in a single pass (much like the sketch in Ben's reply above), which cut a few seconds off the processing time.

2. Fewer consumer goroutines -- This was just a dumb oversight on my part. I was running `runtime.NumCPU()` consumer goroutines, but there is also a producer goroutine and a reducer goroutine. Just changing the number of workers to NumCPU - 1 removed all the stack-management and scheduler overhead I was seeing in the CPU profile. I think this saved ~6s.

3. Sequentially reading the file with `file.Read` instead of mmapping -- Originally I mmapped the file since it made the code simpler to write at first, and also because I wanted to learn how to do that in Go. I knew sequential reading was likely to be slightly faster, since mmapping tends to win only when you're sharing data across processes or threads, or accessing the file randomly. I copied the approach from the blog post: read the file sequentially with `file.Read`, then truncate each chunk at a '\n' with `bytes.LastIndexByte(buf, '\n')` (see the sketch after this list). This cut the runtime from ~30s to ~20s. I didn't expect the difference to be so drastic. Earlier I had written a small benchmark comparing mmapping against sequential reads: mmapping took ~5s to read all the bytes, and the sequential approach took ~25s. However, that sequential benchmark used a `Scanner`, so I'm assuming most of the extra time was in the split function and not in the `os.Read` calls themselves. I'm still not sure why the two approaches differ so much here. Maybe mmapping performs poorly on my platform? Maybe `bytes.LastIndexByte` is somehow faster than `bytes.IndexByte`? Or sequential reading has some underlying benefit (better page cache or L1/L2 cache hit rates)?
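Here's roughly the shape of that sequential chunking, simplified (error handling, the final partial line, and the channel plumbing details are elided; this isn't the exact code in my repo):

```go
import (
	"bytes"
	"os"
)

// readChunks reads the file sequentially and sends chunks that always
// end on a '\n', carrying any partial trailing line over to the next
// chunk. A simplified sketch: errors and a final line without a
// trailing '\n' are not handled.
func readChunks(f *os.File, chunkSize int, chunks chan<- []byte) {
	leftover := make([]byte, 0, 1024)
	for {
		buf := make([]byte, len(leftover)+chunkSize)
		copy(buf, leftover)
		n, _ := f.Read(buf[len(leftover):]) // error handling elided
		if n == 0 {
			break // EOF
		}
		buf = buf[:len(leftover)+n]

		last := bytes.LastIndexByte(buf, '\n')
		if last < 0 {
			// no complete line yet; keep accumulating
			leftover = append(leftover[:0], buf...)
			continue
		}
		leftover = append(leftover[:0], buf[last+1:]...) // copy, don't alias buf
		chunks <- buf[:last+1]
	}
	close(chunks)
}
```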


There are some additional things I've noticed but don't have answers for:

1. Large channel buffers -- Originally I was allocating the chunks channel and the results channel with a large buffer (128). I assumed a large buffer would avoid the risk of the buffers filling up and the channels blocking. However, after lowering the buffer to ~8-12 I saw significant reductions in runtime, and especially in the variance between runs. With large buffers I would get ~35s runtimes (sometimes up to 45s), but when I lowered the buffer size to 12 the runtimes were between 19s and 22s every time. This seems counterintuitive to me; I assumed a channel was just a queue/array with a mutex. I think I'll need to do some more reading on channels.
I will mention that I was running this while my laptop was low on battery and some things were running in the background, so this may have been a coincidence and channel size may not make any difference.

2. Seg faults when not copying slices -- I threw away this code, but in one attempt I was seeing seg faults when trying to read a slice from a struct, and I'm not sure why. I had updated my `Measurement` struct to store a `cityBytes []byte`, and on this line I would set `cityBytes: line[:idx]`. At a later point I would reference `measurement.cityBytes[0]` and get a seg fault. The only fix was to copy the slice: `city := make([]byte, idx); copy(city, line[:idx])`, then set `cityBytes: city` (see the sketch after this list). I don't understand why this is needed; I thought the underlying array should still exist as long as at least one slice referenced it. I couldn't come up with a simple example to reproduce it, so :shrug:.
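For reference, the working copy looked roughly like this (a fragment; `Measurement`, `line`, and `idx` come from the surrounding code):

```go
// Give cityBytes its own backing array instead of a sub-slice that
// aliases the original line/chunk buffer. Note that copy returns an
// int, so the copy has to happen before the struct literal, not
// inside it.
city := make([]byte, idx)
copy(city, line[:idx])
m := Measurement{cityBytes: city}
```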


Thanks all, Brendan

Brendan Goldacker

unread,
Jul 24, 2024, 5:57:40 PM (18 hours ago) Jul 24
to golang-nuts
A follow-up on the large channel buffers observation. It looks like when the `chunksChannel` has size 8, the program runs in 20s. If I increase the channel size to 128, the execution time is around 55s.

Looking at the CPU profile, most of the additional time is spent iterating over the buffer data (this line). That range loop is about 4x slower with the larger channel.

I looked a little more into channels. I'm not super confident, but I think I have an explanation. 

Looking at the heap size trace from the small buffer:
[Screenshot: heap size trace, small buffer]

and the larger buffer: 
[Screenshot: heap size trace, large buffer]

With the smaller buffer, the heap seems to cap at ~2.2GB, and regular GCs clean it up. However, with the larger buffer the heap keeps growing, up to ~18GB. With only 16GB of memory on my machine, I'm assuming the additional time spent is due to memory thrashing.

So why is the large buffered channel causing the heap to grow unbounded? 

I took a look over the channel code. I can't say I follow it 100%, but it seems like there is a circular buffer (which would be of size 128 in the larger-buffer case) plus two offsets for the head and tail of the queue. It seems like the heap grows unbounded because receiving from the channel doesn't actually delete the item from the circular queue; it just increments the head pointer. Therefore the channel holds a reference to a chunk buffer (up to 128 of them) even after it has been received by a worker, so these 'old' buffers cannot be GC'ed. The reference is only removed from the circular queue once the channel gets enough sends to overwrite that slot.

Thanks, Brendan
Message has been deleted

Ian Lance Taylor

unread,
Jul 24, 2024, 6:55:04 PM (17 hours ago) Jul 24
to Brendan Goldacker, golang-nuts
On Wed, Jul 24, 2024 at 2:58 PM Brendan Goldacker <brend...@gmail.com> wrote:

I took a look over the channel code. I can't say I follow it 100%, but it seems like there is a circular buffer (which would be of size 128 in the larger-buffer case) plus two offsets for the head and tail of the queue. It seems like the heap grows unbounded because receiving from the channel doesn't actually delete the item from the circular queue; it just increments the head pointer. Therefore the channel holds a reference to a chunk buffer (up to 128 of them) even after it has been received by a worker, so these 'old' buffers cannot be GC'ed. The reference is only removed from the circular queue once the channel gets enough sends to overwrite that slot.

That's interesting.  Would you mind filing an issue at https://go.dev/issue so that the runtime team can take a look?  If you are describing the issue correctly--I have no reason to think you aren't--perhaps it would be relatively cheap to zero out the entry in the channel buffer after the value is delivered to a goroutine.  Thanks.

Ian

Brendan Goldacker

unread,
Jul 24, 2024, 7:37:34 PM (17 hours ago) Jul 24
to golang-nuts
Sorry, I was wrong. I posted a correction, but it was deleted for some reason.

Channels do delete the element from the circular buffer on receive. So my conclusion was wrong. 

The real reason for the heap growth is that the producer goroutine can send chunks faster than the consumer goroutines can process them, so keeping the channel's buffer small essentially rate-limits the producer and keeps the heap bounded. Additionally, I learned that channel sends can skip the buffer entirely: with a buffered channel, a send normally copies the chunk into the channel's queue and a later receive copies it from the queue onto the consumer's stack, but if a receiver is already waiting (always the case for a successful send on an unbuffered channel, and only possible for a buffered one when its buffer is empty), the send copies the item directly to the receiver's stack. That also means fewer chunks sit parked in the channel's buffer keeping memory alive.
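In other words, the small buffer acts as backpressure on the producer. A toy, self-contained sketch of the pattern (sizes and the stand-in work are illustrative only):

```go
package main

import (
	"runtime"
	"sync"
)

func main() {
	// A small buffer bounds how far the producer can run ahead of the
	// consumers: once 8 chunks are in flight, sends block until a worker
	// receives one, which keeps the number of live chunks (and so the
	// heap) bounded.
	chunks := make(chan []byte, 8)

	go func() {
		for i := 0; i < 1000; i++ {
			chunks <- make([]byte, 1<<20) // stand-in for a ~1MB file chunk
		}
		close(chunks)
	}()

	workers := runtime.NumCPU() - 1
	if workers < 1 {
		workers = 1
	}
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for chunk := range chunks {
				_ = chunk // stand-in for parsing the chunk
			}
		}()
	}
	wg.Wait()
}
```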

Thanks, Brendan