conceptual tradeoffs re: use of taskset(1) to permute/pin an 8xJVM grid's CPU affinity


u935903.brown.edu

Oct 17, 2013, 5:37:49 PM10/17/13
to mechanica...@googlegroups.com

Considering a project to do the following:

Given,

1.  have access to a single (fully isolated, I'm the only user) 8xCPU Linux instance (128 GB RAM)

2.  an 8xJVM data grid process set (built using solutions at www.infinispan.org and www.jgroups.org)

3.  each JVM booted with -Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:+UseTLAB -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing -XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10 -XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=128 -XX:CMSInitiatingOccupancyFraction=50

4.  a long-running MapReduce task that is broken up into 2 sub-tasks
      4.1   "prepare data" step = extract from an RDBMS, transform into a large CacheMap<K,V> abstraction, and distribute (load) onto my 8xJVM grid (all I/O-bound work to construct the MapReduce operand)
      4.2   "render result" step = use Infinispan's DistributedExecutorService API and framework to initiate the processing (outside of network loop-back hops, no other I/O-bound work to affect the MapReduce operation)
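As a sketch, booting one such node under taskset(1) might look like the following (the jar name grid-node.jar is a hypothetical placeholder, not anything from Infinispan; the command is echoed rather than executed so the sketch runs anywhere):

```shell
#!/bin/sh
# Sketch: compose the boot command for one grid JVM pinned to CPU 0.
# 'grid-node.jar' is a hypothetical placeholder for the grid node launcher.
JVM_FLAGS="-Xms15g -Xmx15g -XX:+UseConcMarkSweepGC -XX:+UseTLAB \
-XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing \
-XX:CMSIncrementalDutyCycleMin=0 -XX:CMSIncrementalDutyCycle=10 \
-XX:MaxTenuringThreshold=0 -XX:SurvivorRatio=128 \
-XX:CMSInitiatingOccupancyFraction=50"
CPU=0
CMD="taskset -c $CPU java $JVM_FLAGS -jar grid-node.jar"
echo "$CMD"   # swap 'echo' for actual execution on the real box
```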

My ambition is then to compare/contrast the performance of 1-4 above when processed under permutations of the following:

P0.   use taskset(1) to boot all 8xJVM processes with complete affinity to CPU 0 (fully aware the 7 other CPUs are effectively unused)
P1.   use taskset(1) to boot 7xJVM processes with affinity to CPU 0, 1xJVM process to CPU 1
P2.   use taskset(1) to boot 6xJVM processes with affinity to CPU 0, 1xJVM process to CPU 1, 1xJVM process to CPU 2

....

P7.   use taskset(1) to boot 1xJVM process with affinity to CPU 0, 1xJVM process to CPU 1, ..., 1xJVM process to CPU 7 (fully aware I may be over-distributing wrt the CPUs' local physical cache(s) not being most effectively utilized)
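The Pk ladder above can be sketched as a small launcher: under Pk, the first (8-k) JVMs share CPU 0 and the remaining k JVMs each get one of CPUs 1..k (grid-node.jar is a hypothetical placeholder; the taskset line is echoed so the sketch runs without the actual grid):

```shell
#!/bin/sh
# cpu_for_jvm JVM K: which CPU JVM (0..7) is pinned to under permutation PK.
# Under PK, JVMs 0..(7-K) share CPU 0; the remaining K JVMs get CPUs 1..K.
cpu_for_jvm() {
  jvm=$1; k=$2
  if [ "$jvm" -le $((7 - k)) ]; then
    echo 0
  else
    echo $((jvm - (7 - k)))
  fi
}

K=${1:-2}          # which permutation to launch; default P2
j=0
while [ "$j" -le 7 ]; do
  cpu=$(cpu_for_jvm "$j" "$K")
  echo "taskset -c $cpu java ... -jar grid-node.jar --node-id=$j"
  j=$((j + 1))
done
```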

Of course I will also want to compare/contrast the performance of 1-4 above without any taskset(1) affinity influence, letting the OS's default scheduler handle all distribution and scheduling.

I may also consider running other variants of JVM:CPU_affinity configurations (maybe even all possible members of the full permutation set).

Independent of any actual performance metrics produced, what should be expected conceptually when considering the possible merits/deficiencies/trade-offs of any specific point on the JVM:CPU_affinity distribution curve? (I'm especially interested in what trade-offs to consider re CPU locality and physical cache usage.)


u935903.brown.edu

Oct 18, 2013, 10:04:42 AM10/18/13
to mechanica...@googlegroups.com

Simplifying my original post,


1.  On an 8xCPU Linux instance, is it at all advantageous to use the Linux taskset(1) command to pin an 8xJVM process set (co-ordinated as a www.infinispan.org distributed cache/data grid) to a specific CPU affinity set (i.e. pin the JVM0 process to CPU 0, the JVM1 process to CPU 1, ..., the JVM7 process to CPU 7) vs. just letting the Linux OS use its default mechanism for provisioning the 8xJVM process set onto the available CPUs?

2.  In an effort to find an optimal point (in the full event space), what are the conceptual trade-offs in "searching" each permutation of provisioning an 8xJVM process set onto an 8xCPU set via taskset(1)?
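On question 2, the size of the full event space is worth writing down before committing to a search: with each of 8 JVMs free to be pinned to any of 8 CPUs independently there are 8^8 = 16,777,216 assignments (even collapsing symmetric cases down to the multiset of per-CPU JVM counts still leaves the 22 integer partitions of 8). A quick sanity check:

```shell
#!/bin/sh
# 8 JVMs x 8 CPU choices each = 8^8 distinct affinity assignments.
n=1
i=0
while [ "$i" -lt 8 ]; do
  n=$((n * 8))
  i=$((i + 1))
done
echo "$n"
```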

Thanks,
Ulysses

Paul de Verdière

Oct 18, 2013, 5:20:06 PM10/18/13
to mechanica...@googlegroups.com
1. In my somewhat empirical experience with CPU pinning, I observed that pinning an entire JVM (single-threaded, CPU-intensive application) to a single core gave worse performance than letting the OS choose CPUs with its default scheduler. This is probably due to misc housekeeping threads competing with the application threads.

(I guess pinning a heavily multithreaded application to only one CPU would be even worse.)

Another observation I made was that Linux did a good job at distributing work across CPUs, and as long as processes maintained a constant pressure on the CPU, no context switching occurred.

So my conclusion was: when in doubt, let the OS do the scheduling.

CPU-pinning on a per-thread basis(*) makes sense when low-latency/high responsiveness is involved, in which case CPU isolation should also be used to avoid pollution by other processes. For heavy parallel computations, I tend to think this is not really necessary.

Paul

(*) taskset is probably not the best tool for the job, better look at https://github.com/peter-lawrey/Java-Thread-Affinity

u935903.brown.edu

Oct 18, 2013, 6:29:02 PM10/18/13
to mechanica...@googlegroups.com
> (*) taskset is probably not the best tool for the job, better look at https://github.com/peter-lawrey/Java-Thread-Affinity

So cool.  Checking out the Java-Thread-Affinity solution's ambitions and capabilities right now.  

Wondering (out loud) if I could use the Java-Thread-Affinity solution to do something like this? -->

First thing to maybe try = comparing/contrasting the performance of my ThreadPoolExecutors' submitted Callables at the "extremes" of possible segregation/integration. E.g., if I have say 8 (ThreadPoolExecutor, Callable) pairs, each TPE with poolSz=8, then at one extreme I could have all threads from pair 0 pinned to CPU 0, and at the other extreme (for pair 0) I could have each thread in that pair's pool execute its Callables only on distinct CPUs, i.e. t0's Callable @CPU0, t1's Callable @CPU1, ..., t7's Callable @CPU7. Wow, all of that independent of the JVM on which each TPE was constructed? So cool. Thanks.

P.S.  I have no idea (yet) - other than self-education - why I might want to explore the segregation/integration extremes of Thread-->CPU affinity. I'm instinctively betting that there is not a real use-case for taking on such an exercise ... I just think it is interesting (very!) that it is even possible to survey these potentials.

Michael Barker

Oct 18, 2013, 9:52:40 PM10/18/13
to mechanica...@googlegroups.com
We currently use taskset at LMAX; the biggest win is not in locality, but simply separation of the cores used for the application from the cores used for handling OS interrupt requests.  We use irqbalance and the IRQBALANCE_BANNED_CPUS option; others advocate disabling irqbalance and configuring the affinity via the /proc filesystem.  You can also use taskset in the init process to move all of the system daemons to a different set of cores.
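For reference, IRQBALANCE_BANNED_CPUS takes a hex bitmask of the CPUs irqbalance must leave alone. A sketch of deriving the mask that bans CPUs 2-7 from IRQ handling (reserving them for the application) on this 8-CPU box:

```shell
#!/bin/sh
# Build the hex mask banning CPUs 2-7 from IRQ handling (bits 2..7 set).
mask=0
for cpu in 2 3 4 5 6 7; do
  mask=$((mask | (1 << cpu)))
done
printf 'IRQBALANCE_BANNED_CPUS=%02x\n' "$mask"
# With IRQs kept on CPUs 0-1, the application set is then launched on the
# remaining cores, e.g. (jar name hypothetical):
#   taskset -c 2-7 java -jar grid-node.jar
```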

Taskset is a fairly blunt tool; thread affinity will give you finer-grained control and will probably be more useful if you are trying to exploit memory locality.  As Peter himself also points out (http://vanillajava.blogspot.co.nz/2013/07/micro-jitter-busy-waiting-and-binding.html), if your goal is to eliminate latency jitter, thread affinity is best combined with isolcpus.  While using thread affinity will prevent your thread from being scheduled elsewhere, it doesn't preclude the OS from scheduling something else on the bound CPU, potentially introducing jitter.
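A sketch of the isolcpus setup (Debian-style grub path assumed; the jar name is a hypothetical placeholder):

```shell
# Kernel command line change to isolate CPUs 2-7 from the default scheduler,
# so only explicitly-pinned tasks run there:
#
#   # /etc/default/grub
#   GRUB_CMDLINE_LINUX="isolcpus=2-7"
#
# After update-grub and a reboot, pin the latency-critical JVM onto one of
# the isolated CPUs:
#
#   taskset -c 2 java -jar app.jar
```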

Mike.


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Vladimir Rodionov

Oct 23, 2013, 7:50:39 PM10/23/13
to mechanica...@googlegroups.com

All these extreme optimizations make sense if you write and own all the code you use in your application; otherwise the benefits will largely depend on how optimal the {infinispan, jgroups} code is, from both a performance and a memory-usage point of view.

Paul de Verdière

Oct 23, 2013, 11:55:13 PM10/23/13
to mechanica...@googlegroups.com
I agree. It can be difficult enough to control what your own code is doing. If you use 3rd-party software, allow users to configure the system, use plugins, etc., the pinning strategies that you choose can often become counterproductive. Pinning threads to a single CPU is probably a last-resort optimization, for when you are sure that the latency spikes come from the OS and not from your application.

Using taskset to set affinity to a set of CPUs or a physical socket is slightly different and easier to justify. It allows sharing resources (memory zones, cores) efficiently among groups of processes while letting the task scheduler do its job.

Peter Lawrey

Oct 24, 2013, 2:34:20 AM10/24/13
to mechanica...@googlegroups.com
AFAIK, CPU pinning doesn't have a noticeable benefit for throughput. Where pinning helps is in reducing jitter, *provided* you pin to an *isolated* CPU or core.  In one of the tests I show that binding to a non-isolated CPU is very similar to not binding at all.

Note: I didn't play with the scheduling options in these tests.


Nitsan Wakart

Oct 24, 2013, 4:22:22 AM10/24/13
to mechanica...@googlegroups.com
From my own experience:
1. Using taskset/numactl makes a big difference when you are running on a multi-socket machine. If you leave your process to roam from socket to socket, or leave its threads to be split across sockets, your performance will suffer. I have not seen the scheduler make consistently good choices there.
2. I completely agree with Mike: using taskset for coarse segregation offers the OS some useful hints. Be aware of any JVM/application mechanics that rely on the number of available cores to choose the number of threads to run, as they can make bad choices when pinned (tune the number of GC threads, for instance).
3. I've seen pinning to cores (at least for rather crude benchmarks) improve the throughput and latency under measurement, even with non-isolated CPUs. It didn't make jitter disappear completely, but the histogram was much tighter.
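One concrete instance of the "available cores" trap in point 2: depending on JVM version, a HotSpot JVM pinned to one CPU may still size its parallel GC pool from all 8 visible cores, so capping the GC thread counts explicitly alongside the pin is one workaround (standard HotSpot flags; the jar name is a hypothetical placeholder, and the command is echoed rather than run):

```shell
#!/bin/sh
# Sketch: pin to one CPU and also cap the thread pools HotSpot would
# otherwise size from the visible core count.
CMD="taskset -c 0 java -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 -jar grid-node.jar"
echo "$CMD"   # swap 'echo' for actual execution
```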

So, where possible, I think you can get better performance for your system if you can split it across NUMA nodes (i.e. 1 instance per node; a node would have 4/6/8/12 cores). Pinning particular threads to particular CPUs is appropriate where you have a very good understanding of, and control over, the threads in your application, their relationships, the code they use (shared state/locks) etc., and the performance is a definite requirement. Isolating the OS (and other casual processes) from your application makes a lot of sense, so if it's your machine to manage I'd definitely do that.
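A sketch of that per-NUMA-node split, assuming a 2-socket box and a hypothetical grid-node.jar (numactl's --cpunodebind/--membind flags keep both the threads and the memory allocations on the local node; commands are echoed so the sketch runs without numactl installed):

```shell
#!/bin/sh
# One grid instance per NUMA node: CPUs and memory both bound locally.
NODES=2
i=0
while [ "$i" -lt "$NODES" ]; do
  cmd="numactl --cpunodebind=$i --membind=$i java -jar grid-node.jar --node-id=$i"
  echo "$cmd"   # swap 'echo' for actual execution on real hardware
  i=$((i + 1))
done
```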

BTW: I have recently (by mistake) stumbled across a benchmark that demonstrates better results when pinning the process to a single core. That benchmark is the 3 producers, 1 consumer benchmark for the Disruptor, where running all the threads on a single core removes the contention on the tail counter :-) so there are at least some cases where less is more.

Martin Thompson

Oct 24, 2013, 5:21:30 AM10/24/13
to mechanica...@googlegroups.com

> BTW: I have recently (by mistake) stumbled across a benchmark that demonstrates better results when pinning the process to a single core. That benchmark is the 3 producers, 1 consumer benchmark for the Disruptor, where running all the threads on a single core removes the contention on the tail counter :-) so there are at least some cases where less is more.

I find such benchmarks very naive and frustrating. When doing such a benchmark two major effects are at play. Firstly, data is exchanged between threads typically in the store buffer or L1 cache, which massively reduces the latency in a manner that virtually never happens in a real-world application. Secondly, the removal of contention on the tail happens because threads tend to run to exhaustion of the ring buffer before yielding, without contention, because no other threads can be running at the same time.  Think: one fills the buffer, then the other drains it. It looks like a lot of throughput, but with horrible latency.

As soon as some real-world application logic is introduced to make a realistic application, the batch-exchange-of-full-buffers effect, as I call it, from these micro benchmarks disappears.

When dealing with realistic applications it is the contention points and cache-misses that dominate performance for in-memory applications. Going to disk, network, or dealing with GC are their own special worlds as separate subjects.

Martin...

Nitsan Wakart

Oct 24, 2013, 10:30:27 AM10/24/13
to mechanica...@googlegroups.com
I never said it was realistic behaviour, nor recommended practice. But it demonstrates that the degenerate case of getting a better result with fewer CPUs exists. The particular benchmark is about contention; using a single core removes the contention and makes it to a large degree meaningless. In fact, in the benchmark itself there is a comment which states you should not run it with fewer cores than threads.
To clarify: I did not offer the example to suggest the benchmark was at fault, or the Disruptor; I only found this anomaly because of an error on my part. I offered it as an example of how things work differently when allocated different numbers of CPUs.



Martin Thompson

Oct 24, 2013, 10:38:00 AM10/24/13
to mechanica...@googlegroups.com, Nitsan Wakart
I know you are not recommending such practice :-) I know you well enough to know better. I just want to point out that so many people take this sort of result seriously and do daft things.