First time posting on this list, but I've gone through a very similar exercise over the last few months and might have some useful insight for you.
Learning how to interpret profiles is extremely useful here. Capturing a CPU profile, a heap profile, and an execution trace will each show you a different facet of what's going on under the hood. Luckily, since you're seeing high CPU usage, the CPU profile alone will already tell you a lot.
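If you aren't already exposing profiles, the simplest route for a long-running service is net/http/pprof. A minimal sketch, assuming your program can afford a side HTTP listener (the localhost:6060 address is just a convention, pick whatever fits your setup):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Side listener used only for profiling; keep it off public interfaces.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... your actual workload runs here ...
	select {} // placeholder so this sketch keeps running
}
```

With that in place, `go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30` grabs a CPU profile, the same command against `/debug/pprof/heap` grabs the heap, and you can download `/debug/pprof/trace?seconds=5` and open the result with `go tool trace`.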
For a couple of pieces of low-hanging fruit: armed with a CPU and heap profile, take a look at both. You say the runtime and GC dominate the CPU profile, which likely points to memory issues as you mentioned. Open up a heap profile, switch the `sample_index` to `alloc_space` or `alloc_objects`, and see who the largest offenders are. For a clearer pointer to the offending code's call stack, set `call_tree`, then take another look.
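If attaching over HTTP isn't an option (say it's a batch job that exits before you can connect), you can also write the profiles out from inside the program with runtime/pprof. A rough sketch, with the file names being my own placeholders:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// CPU profile covering the interesting part of the run.
	cpuFile, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer cpuFile.Close()
	if err := pprof.StartCPUProfile(cpuFile); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// ... workload ...

	// Heap profile written at the end of the run. Open it with
	// `go tool pprof heap.pprof`, then `sample_index = alloc_space`
	// (or alloc_objects) and `call_tree = true` as described above.
	heapFile, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer heapFile.Close()
	if err := pprof.Lookup("heap").WriteTo(heapFile, 0); err != nil {
		log.Fatal(err)
	}
}
```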
I believe that spending a few hours or days learning the pprof and trace tools would pay dividends given the scope of your task; it's hard to give more detailed performance suggestions while flying blind. Personally, when I had a similar-looking profile, the two pieces of low-hanging fruit were tighter goroutine management (some goroutines were living longer than expected) and reduced memory usage (work and allocation were needlessly duplicated in several places).
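On the goroutine side, the usual culprit for me was goroutines without a clear owner or cancellation path. Purely as an illustration (nothing here is taken from your code), tying workers to a context and a WaitGroup makes "lasting longer than expected" much harder to do by accident:

```go
package main

import (
	"context"
	"sync"
	"time"
)

// worker is a hypothetical stand-in for whatever your goroutines do. The
// important part is that it watches ctx and returns promptly when its owner
// cancels, instead of living for the life of the process.
func worker(ctx context.Context) {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // owner said stop; don't linger
		case <-ticker.C:
			// ... do one unit of work ...
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			worker(ctx)
		}()
	}
	wg.Wait() // every goroutine has a defined end, which shows up nicely in the goroutine profile
}
```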
If you truly have that many in-use objects (shown by `inuse_space` / `inuse_objects` in the heap profile), then I agree some form of `sync.Pool` or memory reuse may be beneficial.
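The classic `sync.Pool` win is reusing buffers in a hot path instead of allocating per call. A rough sketch, assuming a hypothetical encode step (the names are mine, not from your code):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable *bytes.Buffer values so the hot path stops
// allocating a fresh buffer (and backing array) on every call.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

// encode is a hypothetical hot-path function; the pattern is what matters:
// Get, Reset, use, Put.
func encode(msg string) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	fmt.Fprintf(buf, "payload=%s", msg)

	// Copy the result out before the buffer goes back to the pool, since
	// the pool may hand it to another goroutine immediately.
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}

func main() {
	fmt.Printf("%s\n", encode("hello"))
}
```

Whichever route you take, re-profile afterwards; `sync.Pool` can be a wash if the objects are small or the allocations you pooled weren't the ones actually dominating GC time.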