POWER8 VSX performance analysis

David Edelsohn

Mar 29, 2016, 8:00:37 PM

to OpenBLAS-dev, Anton Blanchard

[Moving the discussion about recent POWER8 VSX performance here.]

Anton compared ATLAS, ESSL (the IBM BLAS) and OpenBLAS using 16B aligned
arguments. For the functions you have implemented, OpenBLAS beats
ATLAS in many tests and mostly matches ESSL. Nice work!

He does see 2-3x slower performance on sgemm, strmm, cgemm and ctrmm,
possibly because of the use of lxsspx/stxsspx.

Also, it helps to allocate the static buffers with 16B alignment.

Werner Saar

Mar 30, 2016, 1:24:00 AM

to OpenBLAS-dev, an...@samba.org

Hi,

The performance of the sgemm kernel is about 40 GFLOPS.
At 3.6 GHz, Rpeak is about 16 * 3.6 = 57.6 GFLOPS,
so I reach 69% of Rpeak. The POWER6 sgemm_kernel only reaches 14 GFLOPS.
The lxsspx and stxsspx instructions are only used in the save macros;
the other macros use lxvw4x.
In OpenBLAS, we always use the original matrix C, and if LDC is odd we only have
4-byte alignment. Without lxsspx and stxsspx, there were a few precision
problems with LAPACK.
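
The Rpeak arithmetic above can be written down as a small helper. This is only a sketch of the calculation: the function names are made up for illustration, and the figure of 16 single precision flops per cycle and the 3.6 GHz clock are taken from this message.

```c
#include <assert.h>

/* Theoretical peak of one core in GFLOPS: flops per cycle times clock in GHz. */
double peak_gflops(double flops_per_cycle, double clock_ghz)
{
    assert(flops_per_cycle > 0.0 && clock_ghz > 0.0);
    return flops_per_cycle * clock_ghz;
}

/* Share of the theoretical peak that a measured rate reaches, in percent. */
double percent_of_peak(double measured_gflops, double peak)
{
    assert(measured_gflops >= 0.0 && peak > 0.0);
    return 100.0 * measured_gflops / peak;
}
```

With these, peak_gflops(16.0, 3.6) is about 57.6, and percent_of_peak(40.0, 57.6) is a little over 69.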

Any help is appreciated.

Best regards
Werner

Anton Blanchard

Mar 30, 2016, 7:13:18 AM

to OpenBLAS-dev, an...@samba.org

Hi,

Here is what I see running benchmark/sgemm for array sizes up to
1024x1024. It shows OpenBLAS is a fair bit slower than ESSL and
ATLAS.

Anton

sgemm_performance.jpg

David Edelsohn

Mar 30, 2016, 8:38:25 AM

to OpenBLAS-dev, an...@samba.org

[Repeating another, earlier note from Anton.]

We have found that a number of functions are sensitive to array alignment. As
an example:

# cat sswap.c
#include <cblas.h>

#define NR 1024
#define ITERATIONS 10000000

/* Build with -DFAST to force 16B alignment of the static arrays. */
#ifdef FAST
float x[NR] __attribute__ ((aligned (16)));
float y[NR] __attribute__ ((aligned (16)));
#else
float x[NR];
float y[NR];
#endif

int main(void)
{
    unsigned long i;

    /* Fill both arrays with arbitrary non-trivial values. */
    for (i = 0; i < NR; i++)
        x[i] = y[i] = i * 1.0010243;

    /* Swap the arrays repeatedly so the run time is dominated by sswap. */
    for (i = 0; i < ITERATIONS; i++)
        cblas_sswap(NR, x, 1, y, 1);

    return 0;
}

# gcc -O3 -mcpu=power8 -o sswap sswap.c -lopenblas
# time ./sswap
2.889s

# gcc -DFAST -O3 -mcpu=power8 -o sswap sswap.c -lopenblas
# time ./sswap
1.362s

This is likely only an issue for statically defined arrays, since
malloc should return 16B aligned memory.
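
Rather than relying on what the allocator happens to return, alignment can be checked or requested explicitly. The following is a minimal sketch, assuming a C11 library that provides aligned_alloc; the helper names is_aligned and alloc_floats_16 are hypothetical and not part of OpenBLAS.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if ptr is aligned to `alignment` bytes (a power of two), else 0. */
int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/*
 * Allocates room for n floats on a 16-byte boundary; release with free().
 * Returns NULL on failure. Overflow of n * sizeof(float) is not checked
 * in this sketch.
 */
float *alloc_floats_16(size_t n)
{
    size_t bytes = n * sizeof(float);
    /* C11 asks for the size to be a multiple of the alignment, so round up. */
    size_t rounded = (bytes + 15) & ~(size_t)15;
    return aligned_alloc(16, rounded);
}
```

Calling is_aligned(x, 16) on the static arrays in the test program would show whether a given build ended up on the slow or the fast path.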

Zhang Xianyi

Mar 30, 2016, 11:08:09 AM

to Werner Saar, OpenBLAS-dev, an...@samba.org

Hi Werner,

I think we need to implement two branches for sgemm, as follows.

if (C is 16B aligned && LDC is even) {
    // fast branch
} else {
    // slow branch: the current OpenBLAS implementation
}
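
One way such a test could look in C is sketched below. It is an illustration only: the function name is hypothetical, a 4-byte float and column-major storage are assumed, and the stride condition used here (column stride a multiple of 16 bytes, i.e. LDC a multiple of 4) is stricter than the "LDC is even" condition in the pseudocode, which only guarantees 8-byte aligned column starts. Which of the two the kernel actually needs depends on its store instructions and is not settled in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Returns 1 when every column of a column-major float matrix starts on a
 * 16-byte boundary: the base pointer must be 16B aligned and the byte
 * stride between columns (ldc * sizeof(float)) must be a multiple of 16.
 */
int c_columns_16b_aligned(const float *c, size_t ldc)
{
    assert(c != NULL && ldc > 0);
    return ((uintptr_t)c % 16 == 0) && ((ldc * sizeof(float)) % 16 == 0);
}
```

A dispatcher would call this once per sgemm invocation and fall back to the current save macros whenever it returns 0.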

Xianyi

--
You received this message because you are subscribed to the Google Groups "OpenBLAS-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openblas-dev...@googlegroups.com.
To post to this group, send email to openbl...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Werner Saar

Mar 31, 2016, 7:31:16 AM

to OpenBLAS-dev, wern...@googlemail.com, an...@samba.org

Hi,

I am working on the problem with sgemm,
but I think I need at least 3 days for a good solution.

Best regards
Werner

Werner Saar

Apr 3, 2016, 12:41:49 AM

to OpenBLAS-dev, an...@samba.org

Hi,

The sgemm and strmm kernels are now better optimized.
The performance is now 50.5 GFLOPS (ESSL: 49 GFLOPS).

Best regards
Werner


Werner Saar

Apr 4, 2016, 6:51:50 AM

to OpenBLAS-dev, an...@samba.org

Hi,

I also updated the cgemm and ctrmm kernels.

Performance data for cgemm:

47.5 GFLOPS with 1 thread (ESSL: 49.0 GFLOPS)
950 GFLOPS with 20 threads

Best regards
Werner


Guha Prasad Venkataraman

Apr 20, 2016, 9:44:03 AM

to OpenBLAS-dev, an...@samba.org

Hi,

My name is Prasad and I work for IBM in the Power systems development team.

I compared single-instance SGEMM performance of OpenBLAS v0.2.18 on POWER8 against OpenBLAS v0.2.18 on x86, and also against MKL on x86. There are a few cases where OpenBLAS on POWER8 performs better than OpenBLAS on x86, but there are other cases where OpenBLAS on POWER8 is considerably slower.

The code mallocs 3 arrays, A(m,k), B(k,n) and C(m,n), and initializes them to random values. After that, cblas_sgemm is called in a loop for a fixed number of iterations.

for (i = 0; i < iterations; i++)
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                alpha, A, lda, B, ldb, beta, C, n);
}

Only the for loop above was timed.
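
For reference, a timing of this loop is usually turned into a GFLOPS figure by counting about 2*m*n*k floating point operations per sgemm call. The sketch below shows only that conversion; the function names are made up for illustration, and the alpha/beta scaling work is ignored, which is a common simplification rather than something stated in this post.

```c
#include <assert.h>

/*
 * Approximate floating point operations in one C = alpha*A*B + beta*C call:
 * one multiply and one add for each of the m*n*k inner-product terms.
 */
double sgemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Sustained GFLOPS for `iterations` identical calls taking `seconds` in total. */
double sgemm_gflops(double m, double n, double k, long iterations, double seconds)
{
    assert(iterations > 0 && seconds > 0.0);
    return sgemm_flops(m, n, k) * (double)iterations / seconds / 1e9;
}
```

The elapsed time would come from two clock_gettime(CLOCK_MONOTONIC, ...) readings taken immediately before and after the loop.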

Details of each run:

Runs	M	N	K	lda	ldb	alpha	beta
1	1000	1	1	1	1	-1	1
2	128	169	1728	1728	169	1	0
3	128	729	1200	1200	729	1	0
4	192	169	1728	1728	169	1	0
5	256	169	1	1	169	1	1
6	256	729	1	1	729	1	1
7	384	169	1	1	169	1	1
8	384	169	2304	2304	169	1	0
9	50	1000	1	1	1000	1	1
10	50	1000	4096	4096	4096	1	0
11	50	4096	1	1	4096	1	1
12	50	4096	4096	4096	4096	1	0
13	50	4096	9216	9216	9216	1	0
14	96	3025	1	1	3025	1	1
15	96	3025	363	363	3025	1	0

Let me know if you have any questions.

Thanks
Prasad.

Auto Generated Inline Image 1

Guha Prasad Venkataraman

Apr 20, 2016, 9:47:48 AM

to OpenBLAS-dev, an...@samba.org

Reposting the Run details :


Zhang Xianyi

Apr 20, 2016, 8:19:41 PM

to Guha Prasad Venkataraman, OpenBLAS-dev, Anton Blanchard

Hi Prasad,

The input sizes are very interesting.

Which x86 CPU is it, and at what frequency? How many cores?

Xianyi


Guhaa Prasad Venkataraman

Apr 21, 2016, 10:45:16 AM

to Zhang Xianyi, OpenBLAS-dev, Anton Blanchard

Hi Zhang,

It is a 24-core Haswell box running at 2.6 GHz. The exact model is Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz.

The input sizes are from AlexNet on Caffe.

Let me know if you have any other questions.

Thanks

Prasad.

David Edelsohn

Apr 21, 2016, 10:53:11 AM

to Guhaa Prasad Venkataraman, Zhang Xianyi, OpenBLAS-dev, Anton Blanchard

Prasad,

Which OMP configuration for OpenBLAS are you using on the Intel system?

Thanks, David

> --
> You received this message because you are subscribed to a topic in the
> Google Groups "OpenBLAS-dev" group.
> To unsubscribe from this topic, visit
> https://groups.google.com/d/topic/openblas-dev/QqjCsEFuPHo/unsubscribe.
> To unsubscribe from this group and all its topics, send an email to

Werner Saar

Apr 22, 2016, 12:28:08 AM

to openbl...@googlegroups.com

Hi,

A Haswell core can execute 16 double precision or 32 single precision floating point operations per cycle.
A POWER8 core with VSX can only execute 8 double precision or 16 single precision floating point operations per cycle.

Best regards
Werner

Anton Blanchard

Apr 25, 2016, 8:58:48 PM

to OpenBLAS-dev, an...@samba.org

Hi,

Some of the arguments in your table are wrong (issues with ldb).

Providing a simple test case is a much better way to report an issue. I've
attached one.
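
The message does not say which rows are affected. As an illustration of the kind of sanity check involved (an assumption, not the attached test case), the sketch below tests whether a row-major, non-transposed B with leading dimension ldb fits in the buffer allocated for it. The function name is hypothetical, and it assumes B was allocated as a plain array of floats as described earlier in the thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * For a row-major, non-transposed B with k rows and n columns stored with
 * leading dimension ldb, the last element touched is at index
 * (k - 1) * ldb + (n - 1). Returns 1 if ldb is legal (ldb >= n) and a
 * buffer of `allocated` floats is large enough, else 0.
 */
int rowmajor_b_fits(size_t k, size_t n, size_t ldb, size_t allocated)
{
    assert(k > 0 && n > 0);
    if (ldb < n)
        return 0;
    return (k - 1) * ldb + n <= allocated;
}
```

For example, with k = 1728 and n = 169, an ldb of 169 fits exactly in a k * n buffer, whereas an ldb larger than n needs a correspondingly larger allocation.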

Anton

dgemm_test.c

Werner Saar

Apr 26, 2016, 12:39:58 AM

to openbl...@googlegroups.com

Hi,

I just ran the test.
I don't see any error.

Best regards
Werner

Guhaa Prasad Venkataraman

Apr 26, 2016, 1:35:32 AM

to Werner Saar, OpenBLAS-dev

Hi,

I don't see any error either. Can you let me know which instances are causing the issue?

BTW, Werner, sorry for the delay in getting back to you. I know and understand. Following David's guidance, I will get back to you with the POWER8 OpenBLAS vs ESSL comparison by the end of my day. There are a few cases where OpenBLAS is behind ESSL.

Thanks

Prasad.


--

தான் தானாகவே தான் இருக்கிறது - பகவான் ஸ்ரீ ரமண மஹரிஷி

Guhaa Prasad Venkataraman

Apr 26, 2016, 10:38:47 AM

to Werner Saar, OpenBLAS-dev

Hi,

We have the POWER8 comparison of OpenBLAS v0.2.19 and ESSL (5.3.2) for single-thread runs. As you can see, in most cases OpenBLAS is almost on par with ESSL. But there are a few instances where it is not, especially when both N and K are 1. Run 11 also has N and K = 1, but it is still only 4% off ESSL, whereas the other runs are off by more than 50%.

Inline image 1

As Anton suggested, I am creating a test case so it will be easier to run. I am also expanding the combinations of matrix sizes. I will get back to you soon.