First of all, my English is not very good, so I hope you will be able to understand my comment... (apologies in advance).
Let me explain my goal. I work in the field of CFD, and in this domain solver performance hinges not on direct solvers (for which gemm performance is essential) but on iterative ones. The usual iterative linear solvers are, more or less, variants of Krylov methods such as GMRES. For them, CPU time is driven by three things:
* 33%: the efficiency of the sparse matrix-vector product --> OK, it is of no concern here.
* 33%: the efficiency of the preconditioner --> again, of no concern here.
* 33%: the efficiency of some BLAS level-1 operations (say, axpy and dot, to fix ideas) --> this is why I'm here today ;)
OK, with this goal in view, I'm currently interested in finding level-1 implementations of axpy/dot that are more performant than the usual loop for (...i...) { y[i] += a*x[i]; }.
To achieve this, I tried BLIS, but the results I found are somewhat disappointing. Let me describe them in more detail.
I work with Intel processors, ranging from older ones (Intel Xeon CPU E5-2650 v2 -- Ivy Bridge) to more recent ones (Intel Xeon Gold 6132 -- Skylake). I installed BLIS using the intel64 configuration target, without threads -- threading is not my objective; I only need sequential mode for this part.
After that, I ran some benchmarks on the daxpy/ddot routines. These benchmarks compare three things:
* the blis implementation
* the standard BLAS implementation shipped with my OS (CentOS 7.5)
* a "by hand" implementation.
My "by hand" implementation is quite rough and reads essentially like this (4-way unrolled, in C, with the length passed explicitly):

double my_dot(size_t n, const double *x, const double *y)
{
    double r1 = 0.0, r2 = 0.0, r3 = 0.0, r4 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        r1 += x[i]     * y[i];
        r2 += x[i + 1] * y[i + 1];
        r3 += x[i + 2] * y[i + 2];
        r4 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)   /* remainder */
        r1 += x[i] * y[i];
    return r1 + r2 + r3 + r4;
}
[note: the my_axpy implementation follows the same pattern]
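For completeness, a my_axpy in the same style might look like this (my reconstruction, mirroring the 4-way unrolling of my_dot; the name and signature are illustrative, not the poster's exact code):

```c
#include <stddef.h>

/* 4-way unrolled axpy: y[i] += a * x[i], in the same style as my_dot.
   The unrolled body gives the compiler independent update chains to
   vectorize; the second loop handles the remainder when n % 4 != 0. */
void my_axpy(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {          /* unrolled main loop */
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; i++)                    /* remainder */
        y[i] += a * x[i];
}
```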
Unfortunately, it seems I cannot attach the pictures in which I report the comparative results of the three methods for axpy & dot. I hope it will be possible later, because things would be much clearer if you could see the results directly.
What happens is the following [the runs were launched on an Intel Xeon Gold 5122].
For arrays of size below roughly 50,000:
* for daxpy, BLIS outperforms the OS BLAS by a factor of 2, and <my_daxpy> is about 20% slower than the OS BLAS.
* for ddot, BLIS outperforms the OS BLAS by a factor of 4, and <my_ddot> is about 2x faster than the OS BLAS (so 2x slower than BLIS).
At this stage, I was quite happy.
The "problem" arises for arrays of size above roughly 50,000 [note that such a size is quite common -- and actually rather "small" -- for real CFD problems]. Here:
* the OS BLAS and "by hand" implementations are quite stable in terms of performance (MFlops rate versus array size);
* the BLIS performance degrades considerably, dropping to the level of the OS BLAS for daxpy (which is slightly better than <my_daxpy>) and to the level of <my_ddot> for ddot.
I'm quite disappointed, because in practice this means there is no benefit for me in using the BLIS implementation (real CFD arrays often have between 50,000 and 500,000 entries per processor).
So here is my ultimate question to the BLIS experts: do you think my results are "normal"? Have I made a mistake? If so, what is the way to get good BLIS performance at my target sizes?
Sincerely.
Thanks for your interest in BLIS, and also for taking the time to post a message outlining your observations and concerns in detail. And don't worry about your English; I was able to understand your message just fine. :)
I've never run axpy/dot performance tests out to problem sizes as large as 50,000. But now that you've reported these issues, I'm curious to investigate and see whether I observe a similar drop-off in the performance signature at such large problem sizes.
In the meantime, others may chime in with explanations or theories that could explain why the performance of BLIS's axpyv/dotv drops back to the level of the other two implementations.
I also have a few questions that may help us later on:
1. Are you using a vector increment of 1 for x and y in axpy and dot? If not, what value(s) are you using?
2. It may be interesting to know which BLAS implementation is included with CentOS 7.5. You may be able to query it as a user, but offhand I'm not sure how, in part because I don't use CentOS very often.
3. I don't quite understand how the my_dot() code you included in your original message implements the dot product operation. Can you explain what it does in more detail? At first glance, it appeared to me that you have bugs in the implementation, but it may be that you're not actually implementing dot product, but rather a more complicated operation that can be, optionally, cast in terms of dot product.
Regards,
Field
For very large daxpy or ddot, the operation will be completely limited by bandwidth from main memory. For smaller sizes, the vector may fit in cache and so you get a bigger difference between "simple" implementations and "fancy" implementations that use vectorization etc. Have you compared the daxpy speed you are getting against the STREAM benchmark? That would give a good measure of how close you are to the absolute limit.
Here are two suggestions for how you might improve performance further:
1) Parallelization will help to some degree as one core cannot saturate the memory bandwidth. You can either split the array into chunks yourself and have each thread call daxpy on part of it, or we should at some point have a threaded daxpy.
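A minimal sketch of the chunking idea (my illustration; the serial kernel stands in for a call to daxpy on each chunk, and the OpenMP pragma is simply ignored if OpenMP is disabled):

```c
#include <stddef.h>

/* Illustrative serial axpy kernel, standing in for a BLAS daxpy call. */
static void axpy_kernel(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Split y += a*x into nthreads contiguous chunks; with OpenMP enabled,
   each chunk runs on its own core so several cores share the
   memory-bandwidth load. */
void chunked_daxpy(size_t n, double a, const double *x, double *y,
                   int nthreads)
{
    #pragma omp parallel for schedule(static)
    for (int t = 0; t < nthreads; t++) {
        size_t lo = n * (size_t)t / (size_t)nthreads;
        size_t hi = n * (size_t)(t + 1) / (size_t)nthreads;
        axpy_kernel(hi - lo, a, x + lo, y + lo);
    }
}
```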
2) If you have several daxpy or ddot calls that share one of the vector operands, then you can increase performance quite a bit. For example, if you have x += a*y, x += b*z, x += c*w, etc. then this could be formulated as matrix-vector multiplication (similarly for ddot). This can give up to a 2-3x improvement theoretically. Some other instances can also give some speedup by combining operations. For example, in my own code I often do things like:
dscal(n, -1.0, x, 1);
daxpy(n, 2.0, y, 1, x, 1);
So I instead wrote a "daxpby" function:
daxpby(n, 2.0, y, 1, -1.0, x, 1);
Theoretically this is a 40% reduction in memory movement.
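A sketch of such a fused kernel (my illustration in the spirit of the daxpby call above, not an existing BLAS routine; increments are fixed at 1 for brevity):

```c
#include <stddef.h>

/* One-pass x := beta*x + alpha*y, fusing dscal + daxpy.
   Separate calls move 5 doubles per element (read/write x for dscal,
   then read x, read y, write x for daxpy); the fused loop moves 3
   (read x, read y, write x), the ~40% traffic reduction noted above. */
void daxpby(size_t n, double alpha, const double *y, double beta, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] = beta * x[i] + alpha * y[i];
}
```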
Thanks,
Devin Matthews
--
You received this message because you are subscribed to the Google Groups "blis-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blis-discuss...@googlegroups.com.
To post to this group, send email to blis-d...@googlegroups.com.
Visit this group at https://groups.google.com/group/blis-discuss.
For more options, visit https://groups.google.com/d/optout.
> * Devin: your suggestions for improving performance (SMP or gathering operations) are quite good, but I cannot use them for my problem (1- there is no possible gathering in the Krylov algorithm I use
In the variations of Krylov solvers I am familiar with, there is still some opportunity for merging operations. For example, when updating the subspace matrix (ddot between two full-space vectors), one would dot one vector against a number of other vectors to form one row or column of the subspace matrix. If you merge n of these ddots together then you could reduce the cost from 2*n*N to (n+1)*N for a full vector size of N.
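A hand-rolled sketch of this merging (my illustration; in practice one would cast it as a dgemv call, and the pointer-array layout here is just for clarity):

```c
#include <stddef.h>

/* Dot one full-space vector v against n vectors u[0..n-1] in a single
   sweep over v.  Separate ddots read 2*n*N doubles (v is re-streamed n
   times); merged, v is streamed once and each u[j] once: (n+1)*N. */
void merged_ddots(size_t n, size_t N, const double *const *u,
                  const double *v, double *out)
{
    for (size_t j = 0; j < n; j++)
        out[j] = 0.0;
    for (size_t i = 0; i < N; i++) {
        double vi = v[i];                 /* v loaded once per element */
        for (size_t j = 0; j < n; j++)
            out[j] += u[j][i] * vi;
    }
}
```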
Similarly, the new trial vector may be a (preconditioned) linear combination of previous trial vectors. Combining these daxpys would give similar benefits.
Although, your particular problem may not match my experience :).
Devin