Performance of Dense Matrix-Vector Multiplication

Cem Bassoy

Jan 17, 2019, 12:38:10 PM
to blis-d...@googlegroups.com
Hi,

After running the BLIS test suite, matrix-vector multiplication with blis_sgemv_nn_rcc achieves only 7 GFLOPS in single precision [30 GFLOPS with OpenBLAS]. Only one core is utilized while the others are idle.

Is matrix-vector multiplication not parallelized, or did I do something wrong? Matrix-matrix multiplication seems to work fine: gemm reaches 360 GFLOPS in single precision with more cores involved.

Best,
CB

---------------
configuration in input.operations:
....
2        # Level-2
...
2        # gemv
32768 32768    #   dimensions: m n
nn       #   parameters: transa conjx

---------------
hardware:
Intel Core i9-7900X with 10 cores and 20 hardware threads

---------------
software:
Ubuntu 18.04, GCC 7.3

---------------

installation:
./configure --libdir=/usr/lib --includedir=/usr/include --enable-cblas --enable-threading=openmp auto
make -j20
make check -j20
make install
export OMP_NUM_THREADS=20
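
For reference, a minimal sketch of how one might time sgemv directly through the CBLAS interface, independent of the test suite (assuming the --enable-cblas build above; the file name and link line, e.g. gcc -O2 sgemv_bench.c -lblis -lm -fopenmp, are illustrative and may need adjusting for your install):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>   /* installed by BLIS when configured with --enable-cblas */

int main(void)
{
    const int m = 32768, n = 32768;   /* same dimensions as in input.operations */
    float *A = malloc((size_t)m * n * sizeof(float));   /* ~4 GiB */
    float *x = malloc((size_t)n * sizeof(float));
    float *y = malloc((size_t)m * sizeof(float));
    if (!A || !x || !y) { fprintf(stderr, "allocation failed\n"); return 1; }

    for (size_t i = 0; i < (size_t)m * n; ++i) A[i] = 1.0f;
    for (int i = 0; i < n; ++i) x[i] = 1.0f;
    for (int i = 0; i < m; ++i) y[i] = 0.0f;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* y := 1.0*A*x + 0.0*y, column-major, no transpose */
    cblas_sgemv(CblasColMajor, CblasNoTrans, m, n,
                1.0f, A, m, x, 1, 0.0f, y, 1);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("sgemv: %.3f s, %.2f GFLOPS\n",
           secs, 2.0 * (double)m * (double)n / secs / 1e9);

    free(A); free(x); free(y);
    return 0;
}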

Devin Matthews

Jan 17, 2019, 12:39:48 PM
to blis-d...@googlegroups.com
Hi CB,

Indeed, gemv is not parallelized currently.

Thanks,
Devin Matthews

Field Van Zee

Jan 17, 2019, 1:10:39 PM
to blis-discuss
Cem,

Thanks for your interest in BLIS.

As Devin said, BLIS does not yet parallelize level-2 operations. Our justification for this is that level-2 operations are memory bandwidth-limited, not compute-limited, and therefore they inherently lack the potential for high performance that is found with level-3 operations. Enabling level-2 parallelism is on my long-term to-do list, but in all honesty it is pretty low priority for now. (Apologies.)
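
To put rough numbers on that for the 32768 x 32768 single-precision case (a back-of-the-envelope sketch; the bandwidth figures below are nominal assumptions, not measurements):

  matrix traffic:  32768^2 * 4 bytes  ~= 4.3 GB, read from memory once
  flops:           2 * 32768^2        ~= 2.1 GFLOP
  one core   at ~12 GB/s  ->  ~0.36 s  ->   ~6 GFLOPS
  whole chip at ~80 GB/s  ->  ~0.05 s  ->  ~40 GFLOPS

So even a perfectly parallelized sgemv on a quad-channel desktop part is capped at a few tens of GFLOPS, far below the ~360 GFLOPS you see for sgemm, which reuses each matrix element many times from cache.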

In addition to not being parallelized yet, some level-2 operations do not yet make use of optimized kernels on Haswell and newer architectures. However, those are mostly limited to complex domain level-2 operations, and also to non-x86_64 hardware, so that does not apply in your case. (On a related note, we are investigating ways of producing better level-1v and level-1f kernels automatically via compiler flags [1]. These are the kernels that power level-2 operations, and would benefit less-common architectures for which we do not yet have hand-optimized kernels.)
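
To illustrate the kind of kernel in question (a sketch of the general idea, not the actual BLIS kernel code), a level-1v axpy can be written as plain C and left to the compiler's auto-vectorizer with flags such as -O3 -march=native:

#include <stddef.h>

/* y := alpha*x + y.  A generic, portable kernel: no intrinsics, just a
   loop that the compiler's auto-vectorizer can turn into SIMD code.    */
void saxpyv_ref(size_t n, float alpha,
                const float * restrict x, float * restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}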

Field

[1] https://github.com/flame/blis/issues/259

cem.b...@gmail.com

Jan 17, 2019, 1:26:57 PM
to blis-discuss

No problem. I was just wondering whether I had made a silly mistake.
That sounds like an interesting project. In my experience on x86 machines, streaming-store intrinsics really boost performance. Though I guess it is hard to find the right aligned address when two dimensions are involved.
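
For what it's worth, here is a minimal sketch of the kind of kernel I mean (assuming 32-byte-aligned buffers, a length that is a multiple of 8, and compilation with -mavx):

#include <immintrin.h>
#include <stddef.h>

/* y := alpha*x, writing y with non-temporal (streaming) stores so the
   destination does not pollute the cache.  Assumes x and y are 32-byte
   aligned and n is a multiple of 8.                                    */
void sscal_copy_stream(size_t n, float alpha, const float *x, float *y)
{
    __m256 va = _mm256_set1_ps(alpha);
    for (size_t i = 0; i < n; i += 8) {
        __m256 vx = _mm256_load_ps(x + i);                 /* aligned load */
        _mm256_stream_ps(y + i, _mm256_mul_ps(va, vx));    /* NT store     */
    }
    _mm_sfence();   /* order the streamed writes before any later stores */
}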


Best
Cem

Jeff Hammond

Jan 17, 2019, 1:35:49 PM
to Field Van Zee, blis-discuss
A large number of modern HPC processors require parallelism to saturate memory bandwidth.

For example, there is a ~10x difference in STREAM bandwidth on an Intel Xeon Platinum 8180 processor when running with 1 core versus 20 cores:

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 100 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 92622 microseconds.
   (= 92622 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           10334.0     0.209801     0.207807     0.212059
Scale:          10371.1     0.210603     0.207065     0.213303
Add:            12626.6     0.256528     0.255114     0.258246
Triad:          12644.4     0.256321     0.254756     0.258015
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

[jrhammon@pcl-skx08 STREAM]$ export KMP_HW_SUBSET=1s,20c,1t
[jrhammon@pcl-skx08 STREAM]$ ./stream_c.exe

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 134217728 (elements), Offset = 0 (elements)
Memory per array = 1024.0 MiB (= 1.0 GiB).
Total memory required = 3072.0 MiB (= 3.0 GiB).
Each kernel will be executed 100 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 20
Number of Threads counted = 20
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 22465 microseconds.
   (= 22465 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           93833.8     0.022969     0.022886     0.023070
Scale:          95993.9     0.022580     0.022371     0.022963
Add:           102668.8     0.031514     0.031375     0.031748
Triad:         102944.1     0.031458     0.031291     0.031798
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------


Jeff


Jeff Hammond

Jan 17, 2019, 5:47:09 PM
to Field Van Zee, blis-discuss
Below is a more complete data set for STREAM on the Intel Xeon Platinum 8180 processor, for folks who are interested in saturation behavior.  This is not official Intel data or marketing material, just something I measured at some point.  You should be able to reproduce with https://github.com/jeffhammond/STREAM/tree/skx.

Every CPU is different, but most CPU servers require some parallelism to drive bandwidth, whether that is because the available memory bandwidth vastly exceeds what one core can drive/consume (e.g. Intel Knights Landing and IBM Blue Gene/Q) or because of NUMA (e.g. multi-socket CPU systems, multi-die CPU packages).

[Attached chart: STREAM bandwidth vs. number of cores on Intel Xeon Platinum 8180]

Jeff Hammond

Jan 17, 2019, 6:25:25 PM
to Field Van Zee, blis-discuss
Just for contrast, a high-end 8-core desktop from a few years ago saturates with 4 cores.

[Attached chart: STREAM bandwidth vs. number of cores on an 8-core desktop CPU]

Cem Bassoy

Jan 18, 2019, 2:00:50 AM
to Jeff Hammond, Field Van Zee, blis-discuss
Jeff,

Thanks for the data. Can you briefly explain why?
More memory controllers? More data lanes?
What about latency?

Best
CB


Jeff Hammond

Jan 21, 2019, 3:35:22 PM
to Cem Bassoy, Field Van Zee, blis-discuss
On the left side of these charts, what you see is that one core cannot consume all of the available bandwidth.  A DDR3/DDR4 memory channel provides ~20 GB/s (exact numbers depend on the details, which don't matter for this discussion).  A core can consume ~10-20 GB/s, depending on the details.  That means you need at least 1-2 cores per channel to consume all the bandwidth.  For KNL MCDRAM, you need quite a few cores to saturate the nearly 500 GB/s of available bandwidth (on KNL, the real quantity of interest is tiles, not cores, because what matters is the CHAs, not the cores themselves, but in most CPUs these are 1:1).
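
Some rough numbers for the 8180 above (nominal figures, not measurements):

  per-socket peak:  6 channels x ~21.3 GB/s (DDR4-2666)  ~= 128 GB/s
  per-core drive:   ~10-13 GB/s
  => on the order of 10+ cores are needed to approach saturation, which is
     consistent with the ~12.6 GB/s (1 core) vs. ~103 GB/s (20 cores) Triad
     rates in the earlier output.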

At least for Intel CPUs, using nontemporal stores improves bandwidth, because a normal (i.e. write-back) store involves both a read and a write to memory, whereas a nontemporal streaming store doesn't do the read.  This means that STREAM Triad does 3 reads and 1 write with write-back stores but only 2 reads and 1 write with streaming stores, i.e. 3/4 of the memory traffic.  On KNL and SKX, I see nontemporal stores being worth a ~20% increase in bandwidth (higher or lower depending on the details).
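
In Triad terms: with write-back stores each element costs 4 x 8 = 32 bytes of actual traffic, while STREAM only credits 24 bytes, so switching to streaming stores (3 x 8 = 24 bytes of traffic) can raise the reported rate by at most 32/24 - 1 ~= 33%; the ~20% figure above is a plausible realized fraction of that ceiling.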

How much bandwidth a core can drive depends on the size of certain buffers at the interface between cache and memory.  You might be able to understand this better on Intel CPUs by running STREAM and measuring the demand data read uncore counters (VTune or Linux perf can do this).

In any case, there are a lot of interesting tradeoffs in designing CPUs for various workloads.  For example, what makes for the highest single-core performance is often at odds with what makes for the highest multi-core performance.  It is often the case that server CPUs optimize for the latter, whereas client CPUs optimize for the former.  This tension increases as a function of core count, which is why you might notice that high-end desktop CPUs beat high-end server CPUs for single-threaded workloads.  Fortunately, most of the people using server CPUs with 20+ cores know how to run in parallel, even if this is just a bunch of co-located VMs.

Jeff