Surprising hyperthreading results from Percona


Will Sargent

Jan 15, 2015, 8:31:10 PM
to mechanica...@googlegroups.com
If you haven't seen it already: 

http://www.percona.com/blog/2015/01/15/hyper-threading-double-cpu-throughput/

"Still not sure why do I find this interesting? Let me explain. If you look carefully, initially – at concurrency of 1 through 8 – it scales perfectly. So if you only had data for threads 1-8 (and you knew processes don’t incur coherency delays due to shared data structures), you’d probably predict that it will scale linearly until it reaches ~10 requests/sec at 12 cores, at which point adding more parallel requests would not have any benefits as the CPU would be saturated.

What happens in reality, though, is that past 8 parallel threads (hence, past 33% virtual CPU utilization), execution time starts to increase and maximum performance is only achieved at 24-32 concurrent requests. It looks like at the 33% mark there’s some kind of “throttling” happening.

In other words, to avoid a sharp performance hit past 50% CPU utilization, at 33% virtual thread utilization (i.e. 66% actual CPU utilization), the system gives the illusion of a performance limit – execution slows down so that the system only reaches the saturation point at 24 threads (visually, at 100% CPU utilization)."


Will Sargent
Consultant, Professional Services
Typesafe, the company behind Play Framework, Akka and Scala

Gary Mulder

Jan 16, 2015, 7:59:55 AM
to mechanica...@googlegroups.com
While Intel hyper-threading has improved significantly, Adrian Cockcroft noted the non-linear response times of hyper-threaded systems back in 2005, so it is not that "surprising":


The Linux scheduler is both hyper-thread and NUMA aware. It tries to (re)schedule threads onto idle physical cores before scheduling additional threads onto hyper-cores that map to already-busy physical cores. Say you have a single-socket 3GHz Intel CPU with 4 physical cores that reports to Linux as 8 hyper-cores, and you have a multi-threaded CPU-bound workload:
  1. The first four threads should each be scheduled on a hyper-core that maps to a different physical core, and each should obtain approx. the full 3GHz of core processing power.
  2. Reported CPU utilisation will be approx. 50%, while physical CPU utilisation will be approx. 100%.
  3. The 5th thread will then share a physical core with one of the first four threads, running concurrently on that core's second hyper-core.
  4. Reported CPU utilisation will be approx. 62.5%, while physical CPU utilisation will remain at approx. 100%.
  5. Assuming hyper-threading provides no benefit, the 5th thread and its companion thread will each get approx. 1.5GHz of CPU.
  6. The 8th thread will then result in all 8 threads obtaining approx. 1.5GHz of CPU, again assuming hyper-threading provides no performance gain.
Because the reported CPU utilisation is essentially meaningless, the load average (the number of threads that want to run concurrently) is a much better indicator of hyper-threaded performance. Performance will be linear up to a load average of 4, taper dramatically from 4 to 8 as hyper-threading is used, and finally hit saturation above 8.
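The arithmetic in the steps above can be sketched as a toy model (the 4-core/8-hyper-core figures are from the example, and it assumes hyper-threading adds no throughput; integer division rounds 62.5% down to 62%):

```shell
#!/bin/sh
# Toy model: reported utilisation counts hyper-cores, physical
# utilisation counts real cores, so they diverge past 4 threads.
PHYS=4
HYPER=8
for N in 1 4 5 8; do
    BUSY_HYPER=$(( N < HYPER ? N : HYPER ))
    BUSY_PHYS=$(( N < PHYS ? N : PHYS ))
    echo "$N threads: reported $(( 100 * BUSY_HYPER / HYPER ))%," \
         "physical $(( 100 * BUSY_PHYS / PHYS ))%"
done
```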
 
Where hyperthreading can help somewhat is in hiding memory latency to increase overall system throughput. The idea is that modern CPUs are usually limited by off-die memory speeds, so context switching between two hyper-threads whenever one is waiting for main memory access may increase net throughput. There are no hard and fast rules, but hyper-threading may improve throughput when:
  • The application horizontally scales close to linearly as more threads are run
  • The application is often blocked on main memory access (i.e. doesn't run in cache)
  • The cost of the hyperthread context switch is less than the performance gain in hiding memory latency
In the above example, benchmarking with 1, 4, and 8 threads with Intel Turbo Boost disabled (http://www.intel.com/content/www/us/en/architecture-and-technology/turbo-boost/turbo-boost-technology.html) should provide some useful numbers for comparison. MySQL, for example, does not scale particularly well across many cores with some workloads, and I've seen a 20-30% net throughput gain from disabling hyper-threading on a 2x8 physical core DB server. This is likely because MySQL prefers 16 fast cores to 32 slower cores. I suggest disabling Turbo Boost as it adds more complexity and variability to any benchmarking.
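On recent Linux kernels Turbo Boost can be toggled from the shell without a reboot. A sketch, assuming the intel_pstate driver is in use (machines on acpi-cpufreq expose /sys/devices/system/cpu/cpufreq/boost instead):

```shell
# Disable Turbo Boost under the intel_pstate driver (run as root).
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

# Verify: 1 means turbo is disabled, 0 means enabled.
cat /sys/devices/system/cpu/intel_pstate/no_turbo
```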

Now, if you want the lowest single-threaded response times, hyperthreading is also unlikely to provide any benefit and could actually slow your application down due to the additional thread management overhead. It is therefore best either to disable half of the hyper-cores through the OS (e.g. disable every other hyper-core so there is a 1:1 mapping between active hyper-cores and physical cores) or to disable hyperthreading completely in the BIOS.

Interestingly, by using taskset to pin execution on the second socket to just the threads you want to run fast, and keeping the number of those threads somewhat less than the number of cores, you may be able to find a sweet spot for single-threaded, single-core performance by explicitly managing CPU socket temperature, so that the threads are:
  • Always running at turbo frequencies, as the socket temperature remains lower than if all cores were running "hot"
  • Taking advantage of lower-latency main memory access, as all NUMA mallocs are "local" to the socket
  • Not sharing hyper-threaded cores with other threads (e.g. the kernel, interrupts and other OS background tasks)
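A sketch of that kind of pinning with taskset and numactl (the core range 12-17 and NUMA node 1 are assumptions for a 2-socket box whose second socket owns cores 12-17, and `./myapp` is a hypothetical binary; check `lscpu` for your own numbering):

```shell
# Pin the hot threads to six cores on the second socket (hypothetical numbering).
taskset -c 12-17 ./myapp

# Or let numactl do both the CPU pinning and the local-node memory binding:
numactl --physcpubind=12-17 --membind=1 ./myapp
```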
Running turbostat on a recent Linux distribution can be very illuminating: on some Intel chips, core turbo speeds can be up to 50% faster than the listed speed when the CPU is not "hot".

Regards,
Gary




Matt Godbolt

Jan 16, 2015, 3:22:41 PM
to mechanica...@googlegroups.com
On Fri, Jan 16, 2015 at 6:59 AM, Gary Mulder <flying...@gmail.com> wrote:
> On 16 January 2015 at 01:30, Will Sargent <will.s...@typesafe.com> wrote:
> Now, if you want the lowest single threaded response times, hyperthreading
> is also unlikely to provide any benefit and could actually slow your
> application down due to the additional thread management overhead. It is
> best therefore to either disable half of the hyper-cores through the OS
> (e.g. disable every other hypercore so there is a 1:1 mapping between active
> hyper-cores and physical cores) or completely disabling hyperthreading in
> the BIOS.

It's my understanding that with hyperthreading disabled in the BIOS,
the instruction fetcher and other core resources that are shared
between hyperthreads are no longer contended at all.

Is this strictly true? Is hyperthreading "disabled" by all the
hyperthread cores executing "HALT" and then never being woken again,
or is there some other mechanism?

I suppose I'm interested to see whether having hyperthreading disabled
at boot time is equivalent to just "not using" the hyperthreaded
cores.

Thanks in advance, Matt

Vitaly Davidovich

Jan 16, 2015, 3:34:10 PM
to mechanica...@googlegroups.com
Intuitively, one would think that any statically allocated resources would not be fully "relinquished" if the OS simply doesn't schedule on the sibling HT thread.  On the other hand, modern CPUs are pretty sophisticated in other areas that involve activating/deactivating certain portions of the unit (e.g. various C states).  The following Intel forum discussion, https://software.intel.com/en-us/forums/topic/480007, doesn't shed much light on this, as the answer (from an Intel person) is simply "it depends, varies based on CPU model, and if you want to know further details, we can't/won't disclose them" :).  Having said that, if anyone has "inside" info on this, I'd be interested as well.



Greg Young

Jan 16, 2015, 3:37:05 PM
to mechanica...@googlegroups.com
+1
--
Studying for the Turing test

Gary Mulder

Jan 17, 2015, 10:24:43 AM
to mechanica...@googlegroups.com
On 16 January 2015 at 20:22, Matt Godbolt <ma...@godbolt.org> wrote:

I suppose I'm interested to see whether having hyperthreading disabled
at boot time is equivalent to just "not using" the hyperthreaded
cores.

There's always a possibility they might not be the same, so they should be two individual test cases. However, in a test env. it is expedient to quickly simulate disabling hyper-threading by running the following script, without having to ask the Infra guys to change something in the BIOS:

# assumes a 24-hyper-core box where the odd-numbered CPUs are the hyper-thread siblings
for X in `seq 1 2 23`; do echo 0 > /sys/devices/system/cpu/cpu${X}/online; done

In a prod env., getting that extra 20% performance may justify taking an outage and making the change persistent.
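The odd/even numbering in that `seq` is an assumption; on many boxes Linux instead numbers 0-11 as the physical cores and 12-23 as their siblings. It's worth confirming the actual sibling pairs before offlining anything:

```shell
# Print the unique hyper-thread sibling groups; offline the second
# CPU number in each pair for a 1:1 hyper-core:physical-core mapping.
cat /sys/devices/system/cpu/cpu*/topology/thread_siblings_list | sort -u
```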

Gary (who sadly works in a land of unsympathetic code and Infra)

Matt Godbolt

Jan 17, 2015, 12:29:04 PM
to mechanica...@googlegroups.com
On Sat, Jan 17, 2015 at 9:24 AM, Gary Mulder <flying...@gmail.com> wrote:
> There's always a possibility they might not be the same, so they should be
> two individual test cases. However, in a test env. it is expedient to
> quickly simulate disabling hyper-threading by running the following script,
> without having to ask the Infra guys to change something in the BIOS:
>
> for X in `seq 1 2 23`; do echo 0 > /sys/devices/system/cpu/cpu${X}/online; done
>
> In a prod env., getting that extra 20% performance may justify taking an
> outage and making the change persistent.
>
> Gary (who sadly works in a land of unsympathetic code and Infra)

Good ideas. If I get some time I'll give it a go on some of our production machines (MLK day gives me a rare opportunity to monkey about with this kind of stuff :))

Regards, Matt