Surprising performance boost by pinning single threaded application to a core?


Chris Newland

Oct 17, 2014, 7:36:49 PM10/17/14
to mechanica...@googlegroups.com
Hi,

I've been tuning our market data feed processor, which is single-threaded, straight-line Java with a hot loop and no shared state.

System calls at both ends (buffered NIO network read input, buffered NIO file write output) with a parser in the middle.

Timings are very stable under a benchmark load and I've done as much as I can think of in the code (object allocation is minimal, only minor GCs), so I wanted to try some OS-level tuning and started by pinning the Java main thread to a core using Linux taskset.

Blogged it here: https://www.chrisnewland.com/cpu-pinning-java-threads-with-jstack-and-taskset-380 and on a desktop machine (i5, 1 socket, 2 cores, 4 threads, 4MB cache) pinning the process reliably reduces the time taken by around 23% (293 seconds down to 225 seconds), which seems suspiciously high to me.
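The short version of the approach from the post (the nid value and CPU number here are illustrative, not from a real run):

```shell
# jstack prints each Java thread's native id as hex, e.g. nid=0x1a2b.
# On Linux that nid is the kernel thread id, which taskset can pin directly.
NID_HEX=0x1a2b             # copied from the jstack line for "main"
TID=$((NID_HEX))           # shell arithmetic converts the hex to decimal
echo "taskset -pc 2 $TID"  # would pin thread 6699 to logical CPU 2
```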

Haven't tried it on our server-class boxes yet (2 socket, hex core, HT) where I plan to run multiple feed processor JVMs.

Can someone help me understand the speed-up, or think of any benchmarking traps I've stumbled into? I notice far fewer branch misses on the pinned run (same workload). Are branch prediction buffers per core?

Many thanks,

Chris
@chriswhocodes

Vitaly Davidovich

Oct 17, 2014, 8:53:14 PM10/17/14
to mechanica...@googlegroups.com

Yes, branch buffers are per core.

It's hard to say what the exact reason is, but some observations:

1) Lots more front-end starvation in the unpinned case. The front end covers all the phases needed to get micro-ops to the execution units (mostly instruction fetch and decode). If you move between cores, you lose the warm icache and any per-core micro-op caches.

2) Lots more back-end starvation. For identical code this is probably data cache misses and branch misprediction stalls (branch misses are in your perf results - can you run with LLC cache miss reporting?).
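Something like this collects those counters (a sketch; exact event names vary by CPU and kernel, so check `perf list` first, and the jar name is a placeholder for your real launch command):

```shell
# Wrapper so the exact same counters are collected pinned and unpinned.
measure() {
    perf stat -e LLC-loads,LLC-load-misses,branches,branch-misses "$@"
}
# measure java -jar feedprocessor.jar                 # unpinned
# measure taskset -c 2 java -jar feedprocessor.jar    # pinned to CPU 2
```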

Besides the I/O part, is the parser mostly CPU bound or does it also walk quite a bit of memory?

Getting migrated to a different core (or worse, a different socket) is going to hurt, as pretty much all per-core resources (data and instruction caches, CPU buffers, etc.) need to be warmed up again. Similarly, unless you partition all CPU-heavy workloads on the machine properly, another thread can get scheduled on the core you're pinned to and trash things up a bit.
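One way to enforce that partitioning, if you go down that route (a sketch, assuming a Linux box you control): boot with isolcpus= on the kernel command line so the scheduler never places other tasks on the reserved cores, leaving them free for explicit taskset placement. Easy to check whether any isolation is already configured:

```shell
# Prints the isolcpus setting from the kernel command line, if present.
grep -o 'isolcpus=[^ ]*' /proc/cmdline 2>/dev/null || echo "no isolcpus set"
```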

Having said all that, the Linux scheduler shouldn't be moving tasks around unnecessarily (it clearly knows all of the above, even though it doesn't know the user-land app specifics). There are still cpu-migrations reported in your perf output, but perhaps those happened before taskset was issued. It'd be interesting to see results on the server machine you mentioned, especially if you can keep other activity off it while you run the benchmark.

Sent from my phone

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Chris Newland

Oct 18, 2014, 8:22:26 PM10/18/14
to mechanica...@googlegroups.com
Thanks for the explanations Vitaly.

Pinned:
171,767,915 LLC-loads                                                  
 34,892,333 LLC-load-misses           #   20.31% of all LL-cache hits  

 223.457563998 seconds time elapsed


Not Pinned:
193,523,986 LLC-loads                                                  
 45,254,685 LLC-load-misses           #   23.38% of all LL-cache hits  

286.884283997 seconds time elapsed


I think the workload is CPU bound and the working set is pretty small. Each incoming message does a symbol lookup in a Map and updates some state on a stock object.

It was taking me around 10 seconds to grab the main thread NID and run taskset, which I think explains the CPU migrations on the pinned run.

Setting taskset on the jre/bin/java pid drops the CPU migrations to < 20 but doesn't seem to impact performance much, so I think the previous migrations happened very early in the run.
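For what it's worth, pinning at launch sidesteps that 10-second window entirely, since the affinity set by taskset is inherited by every thread the JVM creates (the jar name is a placeholder for our real launch command):

```shell
# Start the whole JVM on one logical CPU from the first instruction;
# all threads it spawns inherit the affinity mask.
launch_pinned() {
    cpu=$1; shift
    taskset -c "$cpu" "$@"
}
# launch_pinned 2 java -jar feedprocessor.jar
```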

This desktop i5 has 2 HT cores (reported as 4 CPUs), and when testing 2 simultaneous feed processes I got the best results by pinning them to logical CPUs 0 and 2, as they are on separate physical cores. Pinning to logical CPUs 0 and 1 (the two hardware threads of the first core) was far worse than leaving it to the OS, due to the resource sharing.

I don't expect I'll see such a large boost on the server hardware (2S 6C 2T), as the Linux scheduler probably does a much better job of keeping the workloads apart, but I'll test on Monday and expect the best results from pinning to different sockets.

Regards,

Chris

Francis Stephens

Oct 29, 2014, 10:57:40 AM10/29/14
to mechanica...@googlegroups.com
I would expect that on a 2-core CPU with hyperthreading, logical CPUs 0 and 2 would be on the same physical core. If they aren't, that would be very interesting. Can you post your /proc/cpuinfo?

Chris Newland

Oct 29, 2014, 11:37:27 AM10/29/14
to mechanica...@googlegroups.com
Hi Francis,

cat /proc/cpuinfo | egrep "processor|core id"

processor    : 0
core id      : 0
processor    : 1
core id      : 0
processor    : 2
core id      : 2
processor    : 3
core id      : 2


It looks like logical CPUs 0,1 are on core 0 and logical CPUs 2,3 are on core 2.
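A quick way to pair them up (shown here against a sample that mirrors the output above so it's self-contained; point the awk at /proc/cpuinfo directly on a real box):

```shell
# Map each logical CPU to its physical core id.
sample='processor   : 0
core id     : 0
processor   : 1
core id     : 0
processor   : 2
core id     : 2
processor   : 3
core id     : 2'
printf '%s\n' "$sample" \
  | awk -F': *' '/^processor/ {cpu=$2} /^core id/ {print "cpu " cpu " -> core " $2}'
```

which prints "cpu 0 -> core 0", "cpu 1 -> core 0", "cpu 2 -> core 2", "cpu 3 -> core 2", confirming 0,1 share core 0 and 2,3 share core 2.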

Full cpuinfo here: https://gist.github.com/chriswhocodes/310457b10cab83cb0d62

Still need to try pinning on our production 2s6c2t boxes (am expecting much less of an improvement).

Regards,

Chris
@chriswhocodes

Francis Stephens

Oct 29, 2014, 1:04:36 PM10/29/14
to mechanica...@googlegroups.com
Your previous email was right. That is interesting. I usually have a look at the CPU setup of boxes I work on, and I had come to think that they were always arranged the other way. Good to know, thanks.