POWER8 VSX performance analysis

David Edelsohn

Mar 29, 2016, 8:00:37 PM
to OpenBLAS-dev, Anton Blanchard
[Moving the discussion about recent POWER8 VSX performance here.]

Anton compared ATLAS, ESSL (the IBM BLAS) and OpenBLAS using 16B-aligned
arguments. For the functions you have implemented, OpenBLAS beats
ATLAS in many tests and mostly matches ESSL. Nice work!


He does see 2-3x slower performance on sgemm, strmm, cgemm and ctrmm,
possibly because of the use of lxsspx/stxsspx.


Also, it helps to allocate the static buffers with 16B alignment.

Werner Saar

Mar 30, 2016, 1:24:00 AM
to OpenBLAS-dev, an...@samba.org
Hi,

the performance of the sgemm kernel is about 40 Gflops.
At 3.6 GHz, Rpeak is about 16 * 3.6 = 57.6 Gflops.
So I reach 69% of Rpeak. The POWER6 sgemm_kernel only reaches 14 GFlops.
The lxsspx and stxsspx instructions are only used in the save macros;
the other macros use lxvw4x.
In OpenBLAS, we always use the original matrix C, and if LDC is odd, we will have
only 4-byte alignment. Without the use of lxsspx and stxsspx, there were a few precision
problems with LAPACK.

Any help is appreciated.

Best regards
Werner

Anton Blanchard

Mar 30, 2016, 7:13:18 AM
to OpenBLAS-dev, an...@samba.org
Hi,

Here is what I see running benchmark/sgemm for array sizes up to
1024x1024. It shows OpenBLAS is a fair bit slower than ESSL and
ATLAS.

Anton
sgemm_performance.jpg

David Edelsohn

Mar 30, 2016, 8:38:25 AM
to OpenBLAS-dev, an...@samba.org
[Repeating another, earlier note from Anton.]

We have found a number of functions are sensitive to array alignment. As
an example:

# cat sswap.c

#include <cblas.h>

#define NR 1024
#define ITERATIONS 10000000
 
#ifdef FAST
float x[NR] __attribute__ ((aligned (16)));
float y[NR] __attribute__ ((aligned (16)));
#else
float x[NR];
float y[NR];
#endif

int main(void)
{
        unsigned long i;

        /* Fill both vectors with the same non-trivial values. */
        for (i = 0; i < NR; i++)
                x[i] = y[i] = i * 1.0010243;

        /* Swap the vectors repeatedly so the run time is dominated by sswap. */
        for (i = 0; i < ITERATIONS; i++)
                cblas_sswap(NR, x, 1, y, 1);

        return 0;
}

# gcc -O3 -mcpu=power8 -o sswap sswap.c -lopenblas
# time ./sswap

2.889s

# gcc -DFAST -O3 -mcpu=power8 -o sswap sswap.c -lopenblas
# time ./sswap

1.362s

This is likely only an issue for statically defined arrays, since
malloc should return 16B aligned memory.

Zhang Xianyi

Mar 30, 2016, 11:08:09 AM
to Werner Saar, OpenBLAS-dev, an...@samba.org
Hi Werner,

I think we need to implement two branches for sgemm, as follows.

if(C is 16B aligned && LDC is even) {
   // fast branch
}else{
   // slow branch. current OpenBLAS implementation.
}



Xianyi

--
You received this message because you are subscribed to the Google Groups "OpenBLAS-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openblas-dev...@googlegroups.com.
To post to this group, send email to openbl...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Werner Saar

Mar 31, 2016, 7:31:16 AM
to OpenBLAS-dev, wern...@googlemail.com, an...@samba.org
Hi,

I'm working on the problem with sgemm,
but I think I need at least 3 days for a good solution.

Best regards
Werner

Werner Saar

Apr 3, 2016, 12:41:49 AM
to OpenBLAS-dev, an...@samba.org
Hi,

the sgemm- and strmm-kernels are now better optimized.
The performance is now 50.5 GFLOPS (ESSL: 49 GFLOPS)


Best regards
Werner



Werner Saar

Apr 4, 2016, 6:51:50 AM
to OpenBLAS-dev, an...@samba.org
Hi,

I also updated the cgemm- and ctrmm-kernel.

Performance data for cgemm:

47.5 GFlops with 1 thread   ( ESSL: 49.0 GFLOPS )
950 Gflops with 20 threads


Best regards
Werner
 


Guha Prasad Venkataraman

Apr 20, 2016, 9:44:03 AM
to OpenBLAS-dev, an...@samba.org
Hi,

My name is Prasad and I work for IBM in the Power systems development team.

I compared single-instance SGEMM performance of OpenBLAS v0.2.18 on POWER8 against OpenBLAS v0.2.18 on x86 and against MKL on x86. There are a few cases where OpenBLAS on POWER8 performs better than OpenBLAS on x86, but there are other cases where OpenBLAS on POWER8 is considerably slower.




The code mallocs three arrays, A(m,k), B(k,n) and C(m,n), and initializes them to random values. After that, cblas_sgemm is called in a loop for a fixed number of iterations.

  for (i=0; i<iterations; i++)
  {
  cblas_sgemm (CblasRowMajor, CblasNoTrans, CblasNoTrans, m,n,k,
               alpha, A, lda, B, ldb, beta, C, n);
  }

The measurement was done for the above for loop alone.

Details of each run:

Run     M     N     K   lda   ldb  alpha  beta
  1  1000     1     1     1     1     -1     1
  2   128   169  1728  1728   169      1     0
  3   128   729  1200  1200   729      1     0
  4   192   169  1728  1728   169      1     0
  5   256   169     1     1   169      1     1
  6   256   729     1     1   729      1     1
  7   384   169     1     1   169      1     1
  8   384   169  2304  2304   169      1     0
  9    50  1000     1     1  1000      1     1
 10    50  1000  4096  4096  4096      1     0
 11    50  4096     1     1  4096      1     1
 12    50  4096  4096  4096  4096      1     0
 13    50  4096  9216  9216  9216      1     0
 14    96  3025     1     1  3025      1     1
 15    96  3025   363   363  3025      1     0


Let me know if you have any questions.

Thanks
Prasad.
Auto Generated Inline Image 1

Guha Prasad Venkataraman

Apr 20, 2016, 9:47:48 AM
to OpenBLAS-dev, an...@samba.org
Reposting the Run details :

Zhang Xianyi

Apr 20, 2016, 8:19:41 PM
to Guha Prasad Venkataraman, OpenBLAS-dev, Anton Blanchard
Hi Prasad,

The input sizes are very interesting.  

Which x86 CPU is it, and at what frequency? How many cores does it have?

Xianyi

Guhaa Prasad Venkataraman

Apr 21, 2016, 10:45:16 AM
to Zhang Xianyi, OpenBLAS-dev, Anton Blanchard
Hi Zhang,

It is a 24-core Haswell box running at 2.6 GHz. The exact model is Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz.

The input sizes are from AlexNet on Caffe.

Let me know if you have any other questions.

Thanks
Prasad.

David Edelsohn

Apr 21, 2016, 10:53:11 AM
to Guhaa Prasad Venkataraman, Zhang Xianyi, OpenBLAS-dev, Anton Blanchard
Prasad,

Which OMP configuration for OpenBLAS are you using on the Intel system?

Thanks, David

Werner Saar

Apr 22, 2016, 12:28:08 AM
to openbl...@googlegroups.com
Hi,

A Haswell core can execute 16 double precision or 32 single precision floating point operations per cycle.
A POWER8 core with VSX can only execute 8 double precision or 16 single precision floating point operations per cycle.

Best regards
Werner

Anton Blanchard

Apr 25, 2016, 8:58:48 PM
to OpenBLAS-dev, an...@samba.org
Hi,

Some of your arguments in your table are wrong (issues with ldb).

Providing a simple test case is a much better way to report an issue. I've
attached one.

Anton
dgemm_test.c

Werner Saar

Apr 26, 2016, 12:39:58 AM
to openbl...@googlegroups.com
Hi,

I just ran the test.
I don't see any error.

Best regards
Werner

Guhaa Prasad Venkataraman

Apr 26, 2016, 1:35:32 AM
to Werner Saar, OpenBLAS-dev
Hi,

I don't see any error either. Can you let me know which instances are causing the issue?

BTW, Werner, sorry for the delay in getting back to you. I know and understand. Per David's guidance, I will get back to you with the P8 OpenBLAS vs ESSL comparison by the end of my day. There are a few cases where OpenBLAS is behind ESSL.

Thanks
Prasad.




--
தான் தானாகவே தான் இருக்கிறது - பகவான் ஸ்ரீ ரமண மஹரிஷி

Guhaa Prasad Venkataraman

Apr 26, 2016, 10:38:47 AM
to Werner Saar, OpenBLAS-dev
Hi,

We now have the P8 comparison of OpenBLAS v0.2.19 and ESSL (5.3.2) for single-thread runs. As you can see, in most cases OpenBLAS is almost on par with ESSL. In a few instances it is not, especially when both N and K are 1. Run 11 also has N = K = 1, yet it is only 4% off from ESSL, whereas the other such runs are off by more than 50%.

Inline image 1

As Anton suggested, I am creating a test case so that it will be easier to execute. I am also expanding the combinations of matrix sizes. I will get back to you soon.

Thanks
Prasad.

