Worse jHiccup results when running on an isolated CPU?


Attila-Mihaly Balazs

Jun 24, 2013, 6:12:45 AM
to mechanica...@googlegroups.com
I'm playing around with jHiccup (http://www.azulsystems.com/jHiccup) and my assumption was that I could get a "flat line" by isolating CPUs and running it on those CPUs. Curiously, I get worse spikes when running on those CPUs.

- Running on non-isolated cores: http://postimg.org/image/8n1msxerp/
- Running on isolated cores: http://postimg.org/image/o217w8s4f/

As you can see from the graphs, both the maximum value and the variance are considerably worse in the isolated case. BTW, the XLS file in the jHiccup archive only works with MS Office 2010 and up - no LibreOffice / older MS Office support, which is sad for us Linux folks :-(

Getting back to the test procedure:
- OS: Ubuntu 13.04, Kernel: 3.8.0-25-generic #37-Ubuntu SMP
- CPU: Intel i7-2760QM
- JVM: OpenJDK 7u21-2.3.9-1ubuntu1
- added kernel boot parameters: isolcpus=4,5,6,7
- disable irq balancing: service irqbalance stop
- jHiccup command line: ./jHiccup -r 0 /usr/bin/java org.jhiccup.Idle -t 360000
- jHiccup with taskset: taskset -c 6 ./jHiccup -r 0 /usr/bin/java org.jhiccup.Idle -t 360000

Also, while running jHiccup on the isolated CPUs I got the native thread ID (nid) for HiccupRecorder and moved it to core 7 so that it isn't influenced by other JVM threads. I'm fairly confident in my steps (I double checked each of them in multiple ways - for example by looking at /proc/cmdline for the kernel boot parameter, looking at htop, looking at the output of taskset commands, etc).
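For reference, the pinning step above can be sketched in plain Java. This is a Linux-specific sketch, not part of jHiccup: `tasksetCommand` is a hypothetical helper, and in practice the nid was obtained from a thread dump rather than programmatically.

```java
import java.io.IOException;

public class PinThread {
    // Build the taskset invocation that pins one native thread id (tid) to one cpu.
    // On Linux, taskset accepts thread ids because threads are schedulable
    // entities with their own ids.
    static String[] tasksetCommand(int cpu, long tid) {
        return new String[]{"taskset", "-p", "-c",
                Integer.toString(cpu), Long.toString(tid)};
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        long tid = Long.parseLong(args[0]); // native thread id, e.g. the nid from a thread dump
        int cpu = Integer.parseInt(args[1]); // target core, e.g. 7
        Process p = new ProcessBuilder(tasksetCommand(cpu, tid)).inheritIO().start();
        System.exit(p.waitFor());
    }
}
```

Running this as `java PinThread <tid> 7` should have the same effect as the manual `taskset -p -c 7 <tid>` step described above.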

Any ideas why running jHiccup on the isolated CPUs gives worse results?

Thanks,
Attila

Michael Barker

Jun 24, 2013, 6:25:09 AM
to Attila-Mihaly Balazs, mechanica...@googlegroups.com
jHiccup uses separate threads: one for measurement and one for recording. You are effectively binding both of those threads to the same core, causing contention. Note that isolcpus is only useful if you can set the thread affinity for each individual thread, and that taskset is not particularly useful in combination with isolcpus: if you taskset onto isolated cpus, it will schedule all of the threads onto the first core listed (http://www.novell.com/support/kb/doc.php?id=7009596). You may want to experiment with hacking a Java affinity library (Peter Lawrey's works well) into jHiccup and see what the results are.

Mike.

Attila-Mihaly Balazs

Jun 24, 2013, 10:02:10 AM
to mechanica...@googlegroups.com, Attila-Mihaly Balazs
Thank you for the quick response.

- Re: multiple threads - Looking at the source code, it seems to me that there is a single thread named "HiccupRecorder" doing both. Specifically, I have the following in my version of HiccupRecorder:

    class HiccupRecorder extends Thread {
       ...
        public void run() {
            try {
              ...
                while (doRun) {
                    long measurementStartTime, deltaTimeNs;
                    if (config.resolutionMs != 0) {
                        measurementStartTime = System.nanoTime();
                        Thread.sleep(config.resolutionMs);
                        if (allocateObjects) {
                            // Allocate an object to make sure potential allocation stalls are measured.
                            lastSleepTimeObj = new Long(measurementStartTime);
                        }
                        deltaTimeNs = System.nanoTime() - measurementStartTime;
                    } else {
                        measurementStartTime = System.nanoTime();
                        deltaTimeNs = System.nanoTime() - measurementStartTime;
                    }
                    long hiccupTimeNsec = (deltaTimeNs - resolutionUsec)/1000;
                    hiccupTimeNsec = (hiccupTimeNsec < 0) ? 0 : hiccupTimeNsec;
                    histogram.recordValue(hiccupTimeNsec, resolutionUsec);
                    ...
- Also Re: thread contention: AFAIK in Linux threads are processes (just a special kind of process sharing its address space) and thread ids can be used interchangeably with process ids. Thus when I bound one thread with taskset to CPU 7, only that thread got migrated (I also checked this with htop - before the migration CPU 6 was at 100%, while after the migration CPU 7 was at 100% and CPU 6 had only a little activity - confirming IMHO that the rest of the threads remained on it).

I also don't think that I have any NUMA issues, given the closeness of the two CPUs.

Regards,
Attila

Gil Tene

Jun 24, 2013, 1:39:08 PM
to mechanica...@googlegroups.com, Attila-Mihaly Balazs
The main thread (the one that executes commonMain) does background bookkeeping and file I/O, and swaps histograms with the recording thread so as not to block it. Since it's the agent's main thread, it isn't given a "Hiccup..." name right now...

I'll be adding the following line to commonMain() in coming releases to give it an easy to notice name:

        Thread.currentThread().setName("HiccupBookkeeper ");

As to the larger hiccups when binding the thread to a specific core: since jHiccup is simply recording what the HiccupRecorder thread is experiencing, this simply means that whatever setup difference you have, it's causing higher hiccups for that thread. They are most likely due to scheduling contention, and I'd try to verify that the cpus are actually isolated and that the binding of the threads has actually "taken" as intended.

There is one additional theoretical/potential cause, but I doubt it's what's going on here: The HiccupRecorder thread can theoretically be blocked in synchronization during a histogram swap if the main thread gets scheduled out while holding the monitor between these two lines in the swapHistogram() method:
                    ...
                    newHistogram = replacementHistogram;
                    this.wait();

Swapping occurs every reporting interval (5 seconds in your default case), but for this contention to happen as often as your logs show would be both unlikely and would require something to be kicking the main thread off of its CPU, which would mean your core 6 has pressure on it...

I'll make histogram swapping completely lock free in future builds just to make this theoretical case impossible...
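A lock-free swap along those lines can be sketched with an AtomicReference. This is only a sketch of the idea, not the actual jHiccup code, and the Histogram stand-in below is hypothetical: the recorder always records into whatever histogram the reference currently points at, and the bookkeeping thread atomically exchanges in a fresh one.

```java
import java.util.concurrent.atomic.AtomicReference;

public class HistogramSwapper {
    // Hypothetical stand-in for the real histogram type; only the
    // reference handling matters for this sketch.
    static final class Histogram {
        long totalCount;
        void recordValue(long v) { totalCount++; }
    }

    private final AtomicReference<Histogram> active =
            new AtomicReference<>(new Histogram());

    // Called by the recorder thread on every sample; never blocks on a monitor.
    void record(long value) {
        active.get().recordValue(value);
    }

    // Called by the bookkeeping thread each reporting interval: atomically
    // publishes a fresh histogram and returns the filled one for reporting.
    Histogram swap(Histogram fresh) {
        return active.getAndSet(fresh);
    }
}
```

One caveat with this naive version: a record() racing with swap() can still land a sample in the just-swapped-out histogram, so a production implementation needs extra coordination around that window.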

-- Gil.

Peter Lawrey

Jun 24, 2013, 2:03:48 PM
to Attila-Mihaly Balazs, mechanica...@googlegroups.com
Running a whole JVM on one core is usually worse than letting the scheduler place all the threads in your process. You should be able to try

taskset -c 4,5,6,7 ./jHiccup -r 0 /usr/bin/java org.jhiccup.Idle -t 360000

to get better results. This is part of the reason I use a library (Java Affinity) to control specific threads (not all the threads), and let the other threads in the process run on "junk" cores, to avoid the rest of the JVM's threads also running on the "isolated" cpus.

Peter.



Michael Barker

Jun 24, 2013, 3:59:37 PM
to Peter Lawrey, Attila-Mihaly Balazs, mechanica...@googlegroups.com
> taskset -c 4,5,6,7 ./jHiccup -r 0 /usr/bin/java org.jhiccup.Idle -t 360000

If you are using isolcpus, this will not work. Taskset and isolcpus don't mix very well; in this case only the first CPU in the list will be used, so all of the Java threads will end up on CPU 4 (http://www.novell.com/support/kb/doc.php?id=7009596).

Mike.

Jean-Philippe BEMPEL

Jun 26, 2013, 11:42:59 AM
to mechanica...@googlegroups.com, Attila-Mihaly Balazs
Beware of NUMA effects and of how cpu ids are distributed by the OS. Check it using:

numactl --hardware

For me I have:

available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10
node 0 size: 24576 MB
node 0 free: 21042 MB
node 1 cpus: 1 3 5 7 9 11
node 1 size: 24563 MB
node 1 free: 23454 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10


This means that even cpu ids are on node 0 (= physical cpu or socket 0) and odd cpu ids are on node 1 (physical cpu or socket 1).
So IF you have the same distribution as me, it means you can have remote DRAM accesses, which cost more in latency than local ones.
Usually I use the numactl command to bind the JVM process (but beware: there are also some threads created by the JVM that can disturb your application threads):

numactl --cpunodebind=0 --membind=0 java ...




Attila-Mihaly Balazs

Jul 1, 2013, 9:26:05 AM
to mechanica...@googlegroups.com, Attila-Mihaly Balazs
Thank you all for your feedback.

A couple of updates:

- numactl says that I have a single memory zone:
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 8086 MB
node 0 free: 2240 MB
node distances:
node   0 
  0:  10 

- After some more testing I got the best results by using thread affinity (setting main + HiccupRecorder to a core) and raising the thread priority to maximum: http://postimg.org/image/4tn95o3f9/
(I did this using -XX:ThreadPriorityPolicy=1 as root and using http://trac.assembla.com/Behemoth/browser/Tests/JAVA/test/src/main/java/test/threads/ThreadAffinity.java - I set the "main" and "HiccupRecorder" threads on the same core and set the priority of HiccupRecorder to Thread.MAX_PRIORITY)

- Some other results I obtained:
  - Using thread affinity only: http://postimg.org/image/bvq8pede9/
  - Using taskset with -c 5-8: http://postimg.org/image/nsmgnm4pd/ (this indeed sets all the threads to a single core so it wasn't of much use)
  - I didn't try it with numactl cpunodebind, given that I have a single numa zone
  - Without CPU affinity: http://postimg.org/image/y1qnd1tgd/
  - Without CPU affinity, in interpreted mode (-Xint) - checking whether compilation is the cause of the jitter: http://postimg.org/image/5v3dwrvtz/ (it had by far the most uneven behavior)

Also, while running on the isolated cores I stopped irqbalance, and when running on the non-isolated cores I restarted it.

In conclusion, it seems that even in the ideal case you can expect 99.99% latencies of 6 ms "out of nowhere" from time to time :(

If anybody has better results, I would love to hear about them.

Regards,
Attila Balazs

Gil Tene

Jul 1, 2013, 11:07:38 AM
to mechanica...@googlegroups.com, Attila-Mihaly Balazs
If you have 6msec hiccups observed on an idle Linux system, it basically tells you that the system itself needs a bunch more tuning if you are looking for consistent application latencies below that, and that the tuning is external to the jHiccup process. That's what I built the Idle mode for... (to detect this situation).

It is very normal to see an untuned "Idle" Linux system show hiccups in the 5-20msec range when measured over long periods of time (like 2 hours), and it is something I always recommend people do as a baseline on systems where they are trying to achieve contained program latency behaviors that go beyond those numbers. Basically, if you haven't gotten your idle jHiccup picture to show nothing bigger than your target worst-case latency, you should not expect ANY program running on that system to be able to do better.

Affinity is usually the last thing, rather than the first thing, to tune for controlling noise on an Idle system. When you are looking at large fractions of a millisecond, affinity only helps outliers on systems where the CPUs get busy enough to have all cores busy with other threads unless affinity is used, and that just shouldn't happen on a truly Idle system. Once you get your system tuned so that it is quiet enough at Idle, affinity is a good way to protect your program from cross-process outliers when there is actual and significant CPU activity, i.e. in non-Idle conditions.

I'd usually start by clearing up your crontab, and checking that you don't have any unnecessary "helpful" daemons running on the system. Then go check your power management, and your BIOS settings for things like ECC scrubbing or other nice background-work-for-your-CPUs-that-even-your-OS-doesn't-know-about. Then go into Linux and start hacking away at those default settings that are built for a world where going away for 10s of milliseconds is an acceptable behavior (tuned, THP, file cache, etc.).

With proper system tuning, I've now seen dozens of sites with Linux systems that achieve a picture with anywhere from ~0.5msec to ~1.5msec worst-observed-case hiccups on an Idle workload, and do so with no affinity controls. The list of things that can and should be tuned in a Linux setup is long and varied, and figuring out what might be causing a system hiccup is somewhat of a black art. I keep learning about new stuff that can cause system hiccups, and new tricks to eliminate them, on a weekly basis. Unfortunately, much changes by Linux distro, and some by hardware and vendor platform; many settings that are right for one distro or piece of hardware may be wrong for others.

Attila-Mihaly Balazs

Jul 2, 2013, 3:09:19 AM
to mechanica...@googlegroups.com, Attila-Mihaly Balazs
Thanks for the thoughtful response.

A couple more things I tried (but still didn't manage to get the latency down into the 1msec range :( ):

- I don't think other running processes should influence the results, since the test is being run on the "isolated" CPUs, which shouldn't be running anything
- I looked at what the isolated CPUs were doing with "perf top -C 5-7" - it turns out they were handling stuff related to the graphics card and wifi adapter, even though I had turned irqbalance off

So I tried (one thing at a time, and each time I ran the test with thread affinity set to the isolated CPUs):
- booting into text mode
- turning IRQ balancing off with a kernel parameter (acpi_irq_nobalance)
- turning the wifi card off completely
- changing the cpu frequency scaling governor such that I always have the maximum frequency

Still, I couldn't get the 99.99% latency down to ~1 ms.

Attila