--
You received this message because you are subscribed to the Google Groups "blis-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blis-discuss...@googlegroups.com.
To post to this group, send email to blis-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/blis-discuss.
For more options, visit https://groups.google.com/d/optout.
Thanks for your interest in BLIS.
As Devin said, BLIS does not yet parallelize level-2 operations. Our justification for this is that level-2 operations are memory bandwidth-limited, not compute-limited, and therefore they inherently lack the potential for high performance that is found with level-3 operations. Enabling level-2 parallelism is on my long-term to-do list, but in all honesty it is pretty low priority for now. (Apologies.)
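To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch in Python. It assumes double precision, operands much larger than cache, and that each operand is moved to or from memory exactly once (which is optimistic for gemm). The function names and the 12.6 GB/s bandwidth figure are illustrative assumptions, not BLIS code or BLIS measurements.

```python
def gemv_intensity(m, n, elem_size=8):
    """Flops per byte for y := A x + y with an m x n matrix A."""
    flops = 2 * m * n                               # one multiply and one add per element of A
    bytes_moved = elem_size * (m * n + n + 2 * m)   # read A, x, y; write y
    return flops / bytes_moved


def gemm_intensity(n, elem_size=8):
    """Flops per byte for C := A B + C with n x n matrices and ideal cache reuse."""
    flops = 2 * n ** 3
    bytes_moved = elem_size * 4 * n * n             # read A, B, C; write C
    return flops / bytes_moved


def bandwidth_bound_gflops(intensity, bandwidth_gb_s):
    """Upper bound on GFLOP/s when memory traffic is the only limit."""
    return intensity * bandwidth_gb_s


if __name__ == "__main__":
    n = 4096
    print("gemv flops/byte:", gemv_intensity(n, n))   # about 0.25
    print("gemm flops/byte:", gemm_intensity(n))      # n / 16
    # Assumed bandwidth of 12.6 GB/s, chosen only for illustration.
    print("gemv bound (GFLOP/s):", bandwidth_bound_gflops(gemv_intensity(n, n), 12.6))
```

Under these assumptions gemv performs about a quarter of a flop per byte no matter how large the matrix is, while the ratio for gemm grows with n. That is why extra cores help level-3 operations much more than level-2 operations.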
In addition to not being parallelized yet, some level-2 operations do not yet make use of optimized kernels on Haswell and newer architectures. However, those are mostly limited to complex domain level-2 operations, and also to non-x86_64 hardware, so that does not apply in your case. (On a related note, we are investigating ways of producing better level-1v and level-1f kernels automatically via compiler flags [1]. These are the kernels that power level-2 operations, and would benefit less-common architectures for which we do not yet have hand-optimized kernels.)
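As a rough illustration of how level-1v style kernels can power a level-2 operation, the Python sketch below builds y := y + A x from a dot-product kernel and from an axpy kernel. The names `dotv`, `axpyv`, `gemv_by_rows`, and `gemv_by_cols` are hypothetical and chosen for this sketch only. They are not the BLIS API, and the real kernels are written in C with vector instructions.

```python
def dotv(x, y):
    """Level-1v style kernel: return the sum of x[i] * y[i]."""
    return sum(xi * yi for xi, yi in zip(x, y))


def axpyv(alpha, x, y):
    """Level-1v style kernel: y := y + alpha * x, updating y in place."""
    for i, xi in enumerate(x):
        y[i] += alpha * xi


def gemv_by_rows(A, x, y):
    """y := y + A x using one dot product per row of A."""
    for i, row in enumerate(A):
        y[i] += dotv(row, x)


def gemv_by_cols(A, x, y):
    """y := y + A x using one axpy per column of A."""
    for j, xj in enumerate(x):
        column = [row[j] for row in A]
        axpyv(xj, column, y)


if __name__ == "__main__":
    A = [[1.0, 2.0], [3.0, 4.0]]
    x = [1.0, 1.0]
    y_rows = [0.0, 0.0]
    y_cols = [0.0, 0.0]
    gemv_by_rows(A, x, y_rows)
    gemv_by_cols(A, x, y_cols)
    print(y_rows, y_cols)
```

Both variants compute the same result, so whichever vector kernel is faster on a given architecture largely determines how fast the level-2 operation runs. Level-1f kernels follow the same idea but fuse several such vector operations into one call.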
Field
No prob. I was just wondering if I made a silly mistake.
That sounds like an interesting project. In my experience on x86 machines, stream intrinsics really boost performance. It is hard to find the right xx-aligned address in the two-dimensional case, though, I guess.
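The alignment difficulty in two dimensions can be sketched with a little address arithmetic. The Python snippet below is illustrative only: `peel_counts` is a hypothetical helper, the 32-byte alignment is an arbitrary choice for the example, and the matrix is assumed to be stored row by row with a leading dimension `ld` counted in elements.

```python
def peel_counts(base_addr, rows, ld, elem_size=8, align=32):
    """Elements to process in each row before reaching an aligned address.

    Assumes base_addr is a multiple of elem_size, so the distance to the
    next aligned address is always a whole number of elements.
    """
    counts = []
    for i in range(rows):
        row_addr = base_addr + i * ld * elem_size   # address of the first element of row i
        misalignment = row_addr % align
        pad_bytes = (align - misalignment) % align  # bytes up to the next aligned address
        counts.append(pad_bytes // elem_size)
    return counts


if __name__ == "__main__":
    # Leading dimension is a multiple of the alignment: every row starts aligned.
    print(peel_counts(base_addr=0, rows=4, ld=4))
    # Leading dimension of 5 doubles (40 bytes): the offset changes from row to row.
    print(peel_counts(base_addr=0, rows=4, ld=5))
```

When the leading dimension in bytes is not a multiple of the alignment, each row needs a different number of leading elements handled separately before aligned streaming stores can begin. That may be part of what makes the two-dimensional case awkward.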
Best
Cem
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 100 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 92622 microseconds.
(= 92622 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10334.0     0.209801     0.207807     0.212059
Scale:          10371.1     0.210603     0.207065     0.213303
Add:            12626.6     0.256528     0.255114     0.258246
Triad:          12644.4     0.256321     0.254756     0.258015
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
[jrhammon@pcl-skx08 STREAM]$ export KMP_HW_SUBSET=1s,20c,1t
[jrhammon@pcl-skx08 STREAM]$ ./stream_c.exe
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 100 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 22465 microseconds.
(= 22465 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           93833.8     0.022969     0.022886     0.023070
Scale:          95993.9     0.022580     0.022371     0.022963
Add:           102668.8     0.031514     0.031375     0.031748
Triad:         102944.1     0.031458     0.031291     0.031798
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
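One way to read the two STREAM runs above is to recompute the reported rates from the array size and the minimum times, then compare the single-thread and 20-thread results. The Python sketch below does that. The array size and timings are copied from the output above. The count of arrays touched per kernel (two for Copy and Scale, three for Add and Triad) follows the usual STREAM byte-counting convention, and the helper names are made up for this example.

```python
ELEMENTS = 134217728
BYTES_PER_ELEMENT = 8

# Arrays read or written by each kernel (STREAM byte-counting convention).
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

# "Min time" column, in seconds, from the two runs above.
MIN_TIME_1T = {"Copy": 0.207807, "Scale": 0.207065, "Add": 0.255114, "Triad": 0.254756}
MIN_TIME_20T = {"Copy": 0.022886, "Scale": 0.022371, "Add": 0.031375, "Triad": 0.031291}


def rate_mb_s(kernel, seconds):
    """Bandwidth in MB/s (10**6 bytes per second) for one kernel."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * ELEMENTS * BYTES_PER_ELEMENT
    return bytes_moved / seconds / 1.0e6


def speedup(kernel):
    """Ratio of the single-thread time to the 20-thread time."""
    return MIN_TIME_1T[kernel] / MIN_TIME_20T[kernel]


if __name__ == "__main__":
    for kernel in ARRAYS_TOUCHED:
        print(f"{kernel:6s} 1t: {rate_mb_s(kernel, MIN_TIME_1T[kernel]):9.1f} MB/s  "
              f"20t: {rate_mb_s(kernel, MIN_TIME_20T[kernel]):9.1f} MB/s  "
              f"speedup: {speedup(kernel):.2f}x")
```

With these numbers, the 20-thread run is roughly 8x to 9x faster than the single-thread run rather than 20x. That is consistent with the kernels being limited by memory bandwidth rather than by the number of cores.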