Yes, branch buffers are per core.
It's hard to say what the exact reason is, but some observations:
1) Lots more front-end starvation in the unpinned case. The front end covers all the stages needed to get micro-ops to the execution units (mostly instruction fetch and decode). If you get moved between cores, you lose the warm icache and any per-core micro-op caches.
2) Lots more back-end starvation, so for identical code this is probably data cache misses and branch misprediction stalls (branch misses are already in your perf results; can you rerun with LLC cache miss counters enabled, something like the command below?).
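For reference, something along these lines should surface the LLC numbers (just a sketch: exact event names and availability vary by CPU and kernel, so check perf list first, and ./parser plus input.json are only placeholders for your benchmark):

    perf stat -e cycles,instructions,branches,branch-misses,LLC-loads,LLC-load-misses ./parser input.json

If your CPU exposes them, stalled-cycles-frontend and stalled-cycles-backend are worth adding to the -e list as well, since they map directly onto points 1 and 2.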
Besides the I/O part, is the parser mostly CPU-bound, or does it also walk quite a bit of memory?
Getting migrated to a different core (or worse, a different socket) is going to hurt, as pretty much all resources (e.g. data and instruction caches, CPU buffers/caches, etc.) need to be warmed up again. Similarly, unless you partition all CPU-heavy workloads on the machine properly, another thread can get scheduled onto the core you're pinned to and trash things a bit; a rough partitioning sketch follows.
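As a sketch of that kind of partitioning (the core numbers and the second job name are made up; check your topology with lscpu first):

    # pin the benchmark to one core
    taskset -c 2 ./parser input.json

    # keep other CPU-heavy work off that core
    taskset -c 0,1,3 ./other_heavy_job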
Having said all that, the Linux scheduler shouldn't be moving tasks around unnecessarily (it clearly knows all of the above, even though it doesn't know the user-land app specifics). There are still cpu-migrations reported in your perf output, but perhaps those are from before taskset is issued; the invocation after this paragraph is one way to check. It'd be interesting to see results on the server machine you mentioned, especially if you can keep other activity off it while you run the benchmark.
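One way to rule that out is to put taskset outermost so the counters only ever see the pinned run (again a sketch with made-up core and file names); cpu-migrations should then come out at or very near zero:

    taskset -c 2 perf stat -e cpu-migrations,context-switches ./parser input.json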