I have a recent iMac with 4 physical cores (8 logical cores with hyper-threading). I would have expected peakflops(N), for a large enough N, to increase with the number of threads I allow BLAS to use. I do find that peakflops(N) with 1 thread is about half of peakflops(N) with 2 threads, but there is no further gain from going to 4 threads. Are my expectations wrong, or is it possible that BLAS is somehow configured incorrectly on my machine? In the example below, N = 6755, a number relevant for my work, but the results are similar with 5000 or 10000.
julia> versioninfo()
Julia Version 0.5.0
Commit 3c9d753* (2016-09-19 18:14 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin15.6.0)
CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
WORD_SIZE: 64
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
Here is an example peakflops() exercise:
julia> BLAS.set_num_threads(1)
julia> mean(peakflops(6755) for i=1:10)
5.225580459387056e10
julia> BLAS.set_num_threads(2)
julia> mean(peakflops(6755) for i=1:10)
1.004317640281997e11
julia> BLAS.set_num_threads(4)
julia> mean(peakflops(6755) for i=1:10)
9.838116463900085e10
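For anyone who wants to reproduce the comparison in one go, the manual steps above could be wrapped in a loop along the following lines. This is only a sketch: `sweep_threads` is a hypothetical helper name (not part of Julia), the extra warm-up call is an addition of mine, and the version check is there because `peakflops`, `mean`, and `BLAS` moved out of Base into the LinearAlgebra and Statistics standard libraries in Julia 0.7.

```julia
# Julia 0.5 (as in this post) has peakflops, mean and BLAS in Base;
# Julia 0.7 and later need the two standard libraries loaded first.
if VERSION >= v"0.7"
    using LinearAlgebra, Statistics
end

# Hypothetical helper: for each BLAS thread count in `counts`, return the
# mean of `reps` calls to peakflops(N), in the same order as `counts`.
function sweep_threads(N, counts; reps = 10)
    results = Float64[]
    for n in counts
        BLAS.set_num_threads(n)
        peakflops(N)    # warm-up run, not included in the mean
        push!(results, mean([peakflops(N) for i in 1:reps]))
    end
    return results
end
```

Calling `sweep_threads(6755, [1, 2, 4])` should correspond to the three measurements above, with one value per thread count in flops per second; adding 8 to the list would also cover the hyper-threaded case.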