I've been using the One Billion Row Challenge as a way to learn Go over the past few days. With my limited Go experience, I seem to have hit a wall.
I have my submission here. On my M1 MacBook Pro with 16GB of memory, it runs in ~39s. I can shave off a few seconds by playing with some parameters (numWorkers and chunkSizes). My goal is to get under 20s, but I'm stuck.
I came across another submission from this blog post. I've copied it over to my repo (otherSubmission.go), and it runs in ~19s. Looking over their solution, it appears very similar to mine in approach. The main difference is that they use os.Read to read the file sequentially, while my submission uses MMap, plus a few other minor differences. Overall, both approaches look about the same, yet theirs runs in about half the time. I can't find anything that stands out to explain such a large difference in runtime.
I've been trying to use the profiles and execution traces to figure out why. Below are my flame graphs and execution traces:
Below are the otherSubmission profile and execution traces:
The first thing that stands out to me is the number of GC collections. My submission seems to run GC much more frequently - 100,000 incremental sweeps, 32,000 background sweeps, and around 1,000 stop-the-world GCs. The heap only grows to 180MB at its peak and typically stays around 72MB. In the otherSubmission, incremental GC runs only 6 times, and stop-the-world only ~30 times. Their heap grows much larger (up to 3.5GB), and GC happens far less frequently.
Also, looking at the flame graphs, much more time is being spent managing stacks (runtime.systemstack, runtime.morestack, and runtime.mcall). I think this may be scheduler overhead caused by the frequent GCs (waiting for goroutines to reach a stopping point, running GC, then re-scheduling threads to start again).
I believe the time discrepancy is mostly due to GC overhead and Go scheduler overhead, though I'm not sure why.
Could anyone help a newbie figure out where I'm going wrong? And could you share any tools or tricks that would help me figure this out?
I took a look over the channel code. I can't say I follow it 100%, but there seems to be a circular buffer, which would be of size 128 in the larger-buffer case, with two offsets for the head and tail of the queue. It seems like this is why the heap grows so large: receiving from the channel doesn't actually delete the item from the circular queue, it just increments the head pointer. Therefore, a channel holds a reference to a chunk buffer (up to 128 of them) even after it has been read/received by a worker, so these 'old' buffers cannot be GC'ed. The reference is only removed from the circular queue once the channel gets enough sends to overwrite the slot.
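If I'm reading it right, the effect is the same as in this toy ring buffer (my own sketch, not the actual channel implementation): popping advances the head index but never nils out the slot, so the slot keeps the chunk reachable until a later push overwrites it.

```go
package main

import "fmt"

// ring is a toy fixed-size circular queue that, like the buffer described
// above, only moves head/tail indices on pop - it never clears the vacated
// slot.
type ring struct {
	slots      [4]*[1 << 20]byte // 4 slots of 1MB "chunk buffers"
	head, tail int
}

func (r *ring) push(b *[1 << 20]byte) {
	r.slots[r.tail%len(r.slots)] = b
	r.tail++
}

func (r *ring) pop() *[1 << 20]byte {
	b := r.slots[r.head%len(r.slots)]
	// NOTE: setting r.slots[r.head%len(r.slots)] = nil here is what would
	// release the reference; this queue deliberately doesn't.
	r.head++
	return b
}

func main() {
	var r ring
	r.push(new([1 << 20]byte))
	_ = r.pop()
	// The chunk has been consumed, but the slot still points at it, so the
	// GC cannot reclaim that 1MB until the slot is overwritten by a send.
	fmt.Println("slot still held:", r.slots[0] != nil) // prints "slot still held: true"
}
```

With 128 slots each holding a multi-megabyte chunk, that retained window alone could account for a heap that stays hundreds of MB larger than the live working set.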