BLAS.set_num_threads() and peakflops scaling

Thomas Covert

Oct 19, 2016, 12:04:00 PM
to julia-users
I have a recent iMac with 4 physical cores (8 logical threads with hyperthreading).  I would have thought that peakflops(N), for a large enough N, should increase with the number of threads I allow BLAS to use.  I do find that peakflops(N) with 1 thread is about half as high as with 2 threads, but there is no further gain with 4 threads.  Are my expectations wrong here, or is it possible that BLAS is somehow configured incorrectly on my machine?  In the example below, N = 6755, a size relevant for my work, but the results are similar with 5000 or 10000.

here is my versioninfo()
julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753* (2016-09-19 18:14 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

here is an example peakflops() exercise:
julia> BLAS.set_num_threads(1)

julia> mean(peakflops(6755) for i=1:10)
5.225580459387056e10

julia> BLAS.set_num_threads(2)

julia> mean(peakflops(6755) for i=1:10)
1.004317640281997e11

julia> BLAS.set_num_threads(4)

julia> mean(peakflops(6755) for i=1:10)
9.838116463900085e10

Ralph Smith

Oct 19, 2016, 9:28:16 PM
to julia-users
At least two things contribute to erratic results from peakflops(). First, with hyperthreading the threads are not always put on separate physical cores. Second, the measured time includes the allocation of the result matrix, so garbage collection affects some of the results. Most available advice says to disable hyperthreading on dedicated number crunchers (most full loads run slightly more efficiently without the extra context switching). The GC issue seems like a mistake, if "peak" is to be taken seriously.
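
For example, a measurement along these lines keeps the allocation and GC out of the timed region (a minimal sketch of the idea, with an illustrative helper name, not how peakflops itself is implemented; A_mul_B! is the in-place multiply in 0.5):

# An n-by-n matrix multiply does roughly 2n^3 floating-point operations.
function matmul_flops(n::Integer)
    a = rand(n, n); b = rand(n, n); c = similar(a)
    A_mul_B!(c, a, b)              # warm up the BLAS threads first
    t = @elapsed A_mul_B!(c, a, b)
    return 2n^3 / t
end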

Thomas Covert

Oct 19, 2016, 11:47:40 PM
to julia-users
So are you suggesting that real numerical workloads under BLAS.set_num_threads(4) will indeed be faster than under BLAS.set_num_threads(2)?  That hasn't been my experience, and I figured the peakflops() example would constitute an MWE.  Is there another workload you would suggest I try, to figure out whether this is just a peakflops() aberration or a real issue?

Stefan Karpinski

Oct 20, 2016, 5:45:51 PM
to Julia Users
I think Ralph is suggesting that you disable the CPU's hyperthreading if you run this kind of code often. We've done that on our benchmarking machines, for example.

Thomas Covert

Oct 20, 2016, 11:00:41 PM
to julia-users
Thanks - I will try to figure out how to do that.  I will note, however, that the OpenBLAS FAQ suggests that OpenBLAS tries to avoid putting threads on the same physical core on machines with hyperthreading, so perhaps this is not the cause.

Ralph Smith

Oct 21, 2016, 12:09:07 AM
to julia-users
That's interesting; I see the code in OpenBLAS. However, on the Linux systems I use, when I had hyperthreading enabled the thread placements looked random, and I generally got less consistent benchmarks.  I'll have to check that again.

You can also avoid the memory allocation effects with something like

using BenchmarkTools
n = 6755  # or whatever size matches your workload
a = rand(n, n); b = rand(n, n); c = similar(a)
@benchmark A_mul_B!($c, $a, $b)
Of course this is only directly relevant to your real workload if that is dominated by sections where you can optimize away allocations and memory latency.
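
To turn the timing into a flop rate and compare thread counts in one go, a sweep along these lines should work (the size and thread counts are just the ones from earlier in the thread):

# An n-by-n multiply does roughly 2n^3 flops; report GFLOPS per thread count.
n = 6755
a = rand(n, n); b = rand(n, n); c = similar(a)
for nt in (1, 2, 4)
    BLAS.set_num_threads(nt)
    A_mul_B!(c, a, b)                  # warm up with the new thread count
    t = @elapsed A_mul_B!(c, a, b)
    println("threads = ", nt, ": ", round(2n^3 / t / 1e9, 1), " GFLOPS")
end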

Ralph Smith

Oct 21, 2016, 11:05:04 PM
to julia-users
On looking more carefully, I believe I was mistaken about thread assignment to cores - that seems to be done well in OpenBLAS (and maybe Linux in general nowadays).  Perhaps the erratic benchmarks under hyperthreading - even after heap management is tamed - arise when the operating system detects idle virtual cores and schedules disruptive processes there.

Thomas Covert

Oct 23, 2016, 5:04:08 PM
to julia-users
At least in my experience on a Mac, I've never seen real linear algebra code (not just peakflops()) in Julia + OpenBLAS saturate more than 2 cores, even when setting the thread count to 4 on a machine with 4 real cores.  When I try similar code on a Linux machine I have access to, I never have a problem saturating as many real cores as are available, which makes me think that somehow the BLAS + threading situation in the Mac version of Julia is not quite right.
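
One way to check whether this is specific to peakflops() is to time a different BLAS/LAPACK-heavy operation across thread counts, for example a Cholesky factorization (again just a sketch, reusing the matrix size from earlier in the thread):

# Build a symmetric positive definite matrix and time its Cholesky factorization.
n = 6755
A = randn(n, n); S = A' * A + n * I
for nt in (1, 2, 4)
    BLAS.set_num_threads(nt)
    cholfact(S)                        # warm up
    println("threads = ", nt, ": ", @elapsed(cholfact(S)), " seconds")
end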