Performance of level 1 AXPY & DOT with "large" arrays


Thomas Boucheres

Feb 6, 2019, 10:19:24 AM
to blis-discuss
Hello Blis family;)

First of all, my English is not really good so I hope you will be able to understand my comment... (Apologies).


Let me explain my goal. I work in the field of CFD and, in this domain, the cornerstone of solver performance rests not on direct solvers (for which gemm performance is essential) but on iterative ones. The usual iterative linear solvers are, more or less directly, variants of Krylov methods such as GMRES. For them, the CPU time is driven by 3 things:
* 33%: the efficiency of the sparse matrix-vector product --> ok, it is of no concern here.
* 33%: the efficiency of the preconditioner --> again, it is of no concern here.
* 33%: the efficiency of some BLAS level 1 operations (say axpy and dot, to fix ideas) --> ok, this is why I'm here today ;)

Ok, in view of my goal, I'm currently interested in finding level 1 implementations of axpy/dot which are "more" performant than the usual implementation for(...i...){ y[i] += a*x[i]; }.

To achieve this goal, I have tried BLIS, but the results I found are somewhat disappointing. Let me describe them in more detail now.

I work with Intel processors, ranging from old ones (Intel Xeon CPU E5-2650 v2 -- Ivy Bridge) to more recent ones (Intel Xeon Gold 6132 -- Skylake). So I have installed BLIS using the intel64 configure option (without threads -- threading is not my objective; I only need sequential mode for this part).
After that, I ran some benchmarks on the daxpy/ddot procedures. These benchmarks compare 3 things:
* the BLIS implementation
* the standard BLAS implementation of my OS (CentOS 7.5)
* a "by hand" implementation.

My "by hand" implementation is quite rough and reads mostly like this:
double my_dot(x,y)
{
i = r1 = r2 = r3 = r4 = 0;
for (;i<x.size-4;i++)
r1 += x[i]*y[i] ; r2 += x[i+1]*y[i+1]; ... ; r4 += ...;
for (;i<x.size(); i++)
r1 += x[i]*y[i];
return r1+r2+r3+r4;
}

[note: the my_axpy implementation follows the same pattern]

Ok, it is unfortunate that it seems I cannot attach the pictures on which I have reported the comparative results of the 3 methods for axpy & dot. I hope it will be possible later, because things would be much clearer if you could see the results directly.

Ok, what happens with the results is the following [the runs were launched on an Intel Xeon Gold 5122]:

For arrays of size <~ 50,000:
* for daxpy, BLIS outperforms the OS BLAS implementation by a factor of 2. And <my_daxpy> is something like 20% less performant than the OS BLAS.
* for ddot, BLIS outperforms the OS BLAS implementation by a factor of 4. And <my_ddot> is something like 2x better than the OS BLAS (so 2x less performant than BLIS).
At this stage, I was quite happy.

The "problem" arises for arrays of size >~ 50,000 [note that such a size is quite common -- and somewhat "small" -- for real CFD problems]. Here:
* the OS BLAS and "by hand" implementations are quite stable in terms of performance (MFLOP rate versus array size).
* the BLIS performance degrades considerably, dropping to the level of the OS BLAS for daxpy (which is slightly better than <my_daxpy>) and to the level of <my_ddot> for ddot.


I'm quite disappointed because, in practice, it means there is no benefit for me in using the BLIS implementation (real arrays in CFD often have sizes between 50,000 and 500,000 per processor).


So here is my ultimate question to the BLIS experts: do you think my results are "normal"? Have I made a mistake? If so, what is the way to get good performance from BLIS at my target sizes?


Sincerely.

Field Van Zee

Feb 6, 2019, 3:24:22 PM
to blis-discuss
Thomas,

Thanks for your interest in BLIS, and also for taking the time to post a message outlining your observations and concerns in detail. And don't worry about your English; I was able to understand your message just fine. :)

I've never done any axpy/dot performance tests out to problem sizes as large as 50000. But now that you've reported these issues, I'm curious to investigate it to see if I similarly see a drop off in the performance signature at such large problem sizes.

In the meantime, others may chime in with explanations or theories that could explain why the performance of BLIS's axpyv/dotv drops back to the level of the other two implementations.

I also have a few questions that may help us later on:

1. Are you using a vector increment of 1 for x and y in axpy and dot? If not, what value(s) are you using?

2. It may be interesting to know what BLAS implementation is included with CentOS 7.5. You may be able to query it as a user, but I'm not sure how off hand, in part because I don't use CentOS very often.

3. I don't quite understand how the my_dot() code you included in your original message implements the dot product operation. Can you explain what it does in more detail? At first glance, it appeared to me that you have bugs in the implementation, but it may be that you're not actually implementing dot product, but rather a more complicated operation that can be, optionally, cast in terms of dot product.

Regards,
Field

Devin Matthews

Feb 6, 2019, 3:39:44 PM
to blis-discuss
Hi Thomas,

For very large daxpy or ddot, the operation will be completely limited by bandwidth from main memory. For smaller sizes, the vector may fit in cache and so you get a bigger difference between "simple" implementations and "fancy" implementations that use vectorization etc. Have you compared the daxpy speed you are getting against the STREAM benchmark? That would give a good measure of how close you are to the absolute limit.
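As a rough back-of-envelope check, you can compare your measured rates against a bandwidth ceiling like the sketch below (the 20 GB/s figure is only an assumed placeholder; substitute your own single-core STREAM result):

#include <stdio.h>

/* Back-of-envelope roofline for bandwidth-bound level-1 kernels.
   daxpy (y := a*x + y): per element, read x and y, write y -> 24 bytes moved, 2 flops.
   ddot  (sum of x[i]*y[i]): per element, read x and y      -> 16 bytes moved, 2 flops. */
int main(void)
{
    const double bw_gbs = 20.0; /* assumed sustained bandwidth in GB/s; use your STREAM number */

    printf("daxpy ceiling: %.2f GFLOP/s\n", bw_gbs * 2.0 / 24.0);
    printf("ddot  ceiling: %.2f GFLOP/s\n", bw_gbs * 2.0 / 16.0);
    return 0;
}

If your large-array MFLOP rates are already close to that kind of number, no BLAS implementation can do much better.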

Here are two suggestions for how you might improve performance further:

1) Parallelization will help to some degree as one core cannot saturate the memory bandwidth. You can either split the array into chunks yourself and have each thread call daxpy on part of it, or we should at some point have a threaded daxpy.

2) If you have several daxpy or ddot calls that share one of the vector operands, then you can increase performance quite a bit. For example, if you have x += a*y, x += b*z, x += c*w, etc. then this could be formulated as matrix-vector multiplication (similarly for ddot). This can give up to a 2-3x improvement theoretically. Some other instances can also give some speedup by combining operations. For example, in my own code I often do things like:

dscal(n, -1.0, x, 1);
daxpy(n, 2.0, y, 1, x, 1);

So I instead wrote a "daxpby" function:

daxpby(n, 2.0, y, 1, -1.0, x, 1);

Theoretically this is a 40% reduction in memory movement.
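For reference, that daxpby is essentially just the fused loop; a minimal sketch with unit strides only (the increment arguments are dropped):

/* x := alpha*y + beta*x in a single pass over memory,
   instead of the separate dscal + daxpy passes shown above. */
void daxpby(int n, double alpha, const double *y, double beta, double *x)
{
    for (int i = 0; i < n; i++)
        x[i] = alpha * y[i] + beta * x[i];
}

And the matrix-vector reformulation from point 2) could look like the sketch below (merged_axpy, V, and coef are just illustrative names; it assumes the k source vectors are stored contiguously as the columns of one n-by-k column-major array):

#include <cblas.h>

/* Fuse k updates x += coef[j] * V(:,j) into a single gemv,
   x := 1.0*V*coef + 1.0*x, so x is read and written once instead of k times. */
void merged_axpy(int n, int k, const double *V, const double *coef, double *x)
{
    cblas_dgemv(CblasColMajor, CblasNoTrans, n, k,
                1.0, V, n, coef, 1, 1.0, x, 1);
}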

Thanks,
Devin Matthews

Jeff Hammond

Feb 6, 2019, 3:47:31 PM
to Devin Matthews, blis-discuss
I agree with Devin.  In a world with compilers that can do even simple auto-vectorization, there is no reason for the level-1 BLAS to exist, because these are all memory bound for larger inputs but can be trivially vectorized for the smaller cases where they may be compute-bound.  You might need to use "-ffast-math -mfma" or equivalent for compilers to emit a SIMD dot product but good compilers should do it at that point.  Or you can use _Pragma("omp simd") or _Pragma("ivdep").
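For what it's worth, here is a minimal sketch of that pragma route (the function name is just illustrative; compile with something like gcc -O3 -march=native -fopenmp-simd):

#include <stddef.h>

/* Plain dot product with an OpenMP SIMD reduction hint; the pragma lets the
   compiler vectorize and reassociate the reduction even without -ffast-math. */
double dot_simd(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}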

Jeff


Thomas Boucheres

Feb 8, 2019, 1:16:02 PM
to blis-discuss
Hi everybody,

many (many) thanks for all your answers.
And, most importantly, I am sorry for taking so long to answer you (I was quite busy at work, in fact ;)

Ok, first of all, I (now ;) understand from your comments that the BLAS level 1 performance I found is in fact normal, since for large arrays this level of BLAS is entirely limited by bandwidth from main memory (see the message Mr. Lee Killough sent me privately, which I attach at the end of this post).

On the other hand, to answer your questions, a few comments:
* Jeff: I compile classically with gcc 7.2 -O3 --fast-math -funroll-loops (no -mfma, nor avx/sse explicitly mentioned - I need to check the default values now)
* Devin: your suggestions for improving performance (SMP or gathering operations) are quite good but I cannot use them for my problem (1- there is no possible gathering in the Krylov algorithm I use, 2- for my CFD problem sizes, it is "always" better to use all processors in a distributed-memory framework rather than in an SMP framework, because the former also speeds up many other parts of the overall algorithm).
* Field Van Zee: well my description of "my_dot" was a little wrong (in the indexing) but it doesn't differ very much. Precisely, I do the following:
double my_dot(const double* x, const double* y, const int n)
{
double r1=0, r2=0, r3=0, r4=0;
const int nm = n - 4;
int i = 0;
for (; i<nm; i+=4){ //here is the mistake in my previous post
r1 += x[i]*y[i];
...
r4 += x[i+3]*y[i+3];
}
for (; i<n; i++){
r1 += x[i]*y[i];
}
return r1+r2+r3+r4;
}

What is striking (for me) is the following:
* my gcc doesn't perform as well as such a naive implementation
* my OS BLAS doesn't perform as well as such a naive implementation
* such a naive implementation is twice as good as the OS BLAS and, for arrays of size > 50,000, as good as the BLIS one.
* such a naive implementation used for daxpy degrades performance (by around 20%) compared to the OS BLAS (whatever the array size).


******************************
Hereafter is the mail from Mr. Lee Killough (many thanks for the link to the PowerPoint presentation -- it was illuminating for me ;)

(I can't cc the list, since I don't use a gmail address. I only receive forwarded copies of the list.)

This is expected performance for level 1 BLAS, because they are limited by memory bandwidth, and GFLOP/s becomes smaller and smaller as you go from L1, L2, L3 caches, to main memory.

As you exceed the size of the cache(s), all BLAS-1 implementations will asymptote down in performance to the same limit caused by memory speed, except that poor ones might be a constant factor slower.

As you get larger than the caches, no software implementation can be any faster, because the memory bandwidth is already saturated. Some software may use prefetching and temporal-locality hints to get a few percent better performance, but it doesn't change the fact that they are all hitting fundamental memory bandwidth limits.

So BLIS and your OS BLAS both falling to the same performance level is good, because it means BLIS is not falling down further. All of them will fall down asymptotically in performance as you get out of cache.

See the slides starting with "DAXPY Theoretical Performance" (best viewed in PowerPoint, since it has animations):

http://leekillough.com/kernels.pptx


It's a 15-year-old presentation, but it still applies today. See the charts with the TX7 performance asymptotes for DAXPY and DDOT and compare them against DGEMV and DGEMM.


The tables with colored cells and mops/flops show how to compute the theoretical peak based on the most constrained path: reads, writes, total reads+writes, or flops.


Each cache level has a different bandwidth, and so the reads, writes, total reads+writes, or flops for that level of cache may change which one becomes the constraint and thus determines the effective theoretical peak flops, shown by horizontal lines in the graphs.


(The NEC SX-8 was a fast vector architecture with 64 GB/s main-memory-to-CPU bandwidth, which is why it did not drop off in the graphs. Unfortunately, it cost hundreds of times as much as conventional processors, since it had 4096 memory banks all fetching data at the same time. That was a hardware advantage, not a software advantage.)

Thomas Boucheres

Feb 8, 2019, 1:22:26 PM
to blis-discuss
Re,

sorry Field Van Zee, I forgot to answer some of your questions:
* I use increments of 1 for x & y
* for my OS BLAS library, on Monday I will try to get some information on it (but I suspect that it is a quite common & basic implementation, like most native OS ones)

Sincerely...

PS: I'm sorry, I don't know how to "hide" previous messages...

Devin Matthews

Feb 8, 2019, 1:27:08 PM
to blis-discuss
On Friday, February 8, 2019 at 12:16:02 PM UTC-6, Thomas Boucheres wrote:

> * Devin: your suggestions for improving performance (SMP or gathering operations) are quite good but I cannot use them for my problem (1- there is no possible gathering in the Krylov algorithm I use

In the variations of Krylov solvers I am familiar with, there is still some opportunity for merging operations. For example, when updating the subspace matrix (ddot between two full-space vectors), one would dot one vector against a number of other vectors to form one row or column of the subspace matrix. If you merge n of these ddots together then you could reduce the cost from 2*n*N to (n+1)*N for a full vector size of N.
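Concretely, the merged version could look like the sketch below (V, w, and h are illustrative names; it assumes the n stored vectors are the columns of one N-by-n column-major array):

#include <cblas.h>

/* h := V^T * w forms n dot products in one pass: V is read once and w is read
   once, instead of w being re-read for each of n separate ddot calls. */
void merged_dot(int N, int n, const double *V, const double *w, double *h)
{
    cblas_dgemv(CblasColMajor, CblasTrans, N, n,
                1.0, V, N, w, 1, 0.0, h, 1);
}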

Similarly, the new trial vector may be a (preconditioned) linear combination of previous trial vectors. Combining these daxpys would give similar benefits.

Although, your particular problem may not match my experience :).

Devin

Jed Brown

Feb 8, 2019, 2:19:14 PM
to Devin Matthews, blis-discuss
Yes, though it depends on the Krylov method and some rearrangements
affect stability. VecMAXPY and VecMDot in PETSc perform the "level 2
BLAS" operations without requiring that the columns reside in
regularly-spaced memory. (These would be useful variants for BLIS to
support, and they slightly generalize stride, which implies that the memory
was allocated in advance -- with Krylov methods you often don't know in
advance how large the subspace may become.)

There are also Krylov methods specifically intended to rearrange
operations to combine or pipeline reductions. Examples include IBCGS,
single-reduction CG, and the many pipelined methods that have emerged in
recent years -- you can find implementations of many such methods in
PETSc.

Jed Brown

Feb 8, 2019, 2:31:08 PM
to Thomas Boucheres, blis-discuss
Thomas Boucheres <thomas.b...@gmail.com> writes:

> * Field Van Zee: well my description of "my_dot" was a little wrong (in the indexing) but it doesn't differ very much. Precisely, I do the following:
> double my_dot(const double* x, const double* y, const int n)
> {
> double r1=0, r2=0, r3=0, r4=0;
> const int nm = n - 4;
> int i = 0;
> for (; i<nm; i+=4){ //here is the mistake in my previous post
> r1 += x[i]*y[i];
> ...
> r4 += x[i+3]*y[i+3];
> }
> for (; i<n; i++){
> r1 += x[i]*y[i];
> }
> return r1+r2+r3+r4;
> }
>
> What is striking (for me) is the following:
> * my gcc doesn't perform as well as such a naive implementation

It can. This naive code:

static void dot_for(int n,const double a[],const double b[],double *z)
{
double sum = 0.0;
for (int i=0; i<n; i++) sum += a[i]*b[i];
*z = sum;
}

$ gcc -c -O3 -march=native -ffast-math -fno-inline dot.c

yields this inner loop:

0000000000000028 <dot_for+0x28> vmovupd ymm2,YMMWORD PTR [rsi+rax*1]
000000000000002d <dot_for+0x2d> vfmadd231pd ymm0,ymm2,YMMWORD PTR [rdx+rax*1]
0000000000000033 <dot_for+0x33> add rax,0x20
0000000000000037 <dot_for+0x37> cmp rax,r8
000000000000003a <dot_for+0x3a> jne 0000000000000028 <dot_for+0x28>

clang-7 unrolls further (same compiler options)

0000000000000f10 <dot_for+0x70> vmovupd ymm4,YMMWORD PTR [rdx+rax*8]
0000000000000f15 <dot_for+0x75> vmovupd ymm5,YMMWORD PTR [rdx+rax*8+0x20]
0000000000000f1b <dot_for+0x7b> vmovupd ymm6,YMMWORD PTR [rdx+rax*8+0x40]
0000000000000f21 <dot_for+0x81> vmovupd ymm7,YMMWORD PTR [rdx+rax*8+0x60]
0000000000000f27 <dot_for+0x87> vfmadd132pd ymm4,ymm0,YMMWORD PTR [rsi+rax*8]
0000000000000f2d <dot_for+0x8d> vfmadd132pd ymm5,ymm1,YMMWORD PTR [rsi+rax*8+0x20]
0000000000000f34 <dot_for+0x94> vfmadd132pd ymm6,ymm2,YMMWORD PTR [rsi+rax*8+0x40]
0000000000000f3b <dot_for+0x9b> vfmadd132pd ymm7,ymm3,YMMWORD PTR [rsi+rax*8+0x60]
0000000000000f42 <dot_for+0xa2> vmovupd ymm0,YMMWORD PTR [rdx+rax*8+0x80]
0000000000000f4b <dot_for+0xab> vmovupd ymm1,YMMWORD PTR [rdx+rax*8+0xa0]
0000000000000f54 <dot_for+0xb4> vmovupd ymm2,YMMWORD PTR [rdx+rax*8+0xc0]
0000000000000f5d <dot_for+0xbd> vmovupd ymm3,YMMWORD PTR [rdx+rax*8+0xe0]
0000000000000f66 <dot_for+0xc6> vfmadd132pd ymm0,ymm4,YMMWORD PTR [rsi+rax*8+0x80]
0000000000000f70 <dot_for+0xd0> vfmadd132pd ymm1,ymm5,YMMWORD PTR [rsi+rax*8+0xa0]
0000000000000f7a <dot_for+0xda> vfmadd132pd ymm2,ymm6,YMMWORD PTR [rsi+rax*8+0xc0]
0000000000000f84 <dot_for+0xe4> vfmadd132pd ymm3,ymm7,YMMWORD PTR [rsi+rax*8+0xe0]
0000000000000f8e <dot_for+0xee> add rax,0x20
0000000000000f92 <dot_for+0xf2> add r10,0x2
0000000000000f96 <dot_for+0xf6> jne 0000000000000f10 <dot_for+0x70>