Jacobi 2D Poisson Solver results

dpephd

Feb 4, 2012, 8:27:43 PM
to Intel SPMD Program Compiler Users
Thanks to Matt's work in incorporating LLVM 3.1 into the latest
release of ISPC (v.1.1.3), the double precision performance regression
noted in issue #119 has now been resolved.

So, I'm happy to present some single and double precision performance
results from my Jacobi 2D Poisson Solver that I have been working on
since August of last year.

<code>
poisson2d_float results:

2D Jacobi Poisson Solver over square, regular grid
2D Five Pt Numerical Discretization Utilizing ISPC and gcc (single
precision)
nx = 32, stop_tol = 8.43227e-07, max iterations = 4096

sse4: 1.93x speedup from ISPC
sse4-x2: 2.57x speedup from ISPC
avx: 2.47x speedup from ISPC
avx-x2: 2.80x speedup from ISPC

poisson2d_double results:

2D Jacobi Poisson Solver over square, regular grid
2D Five Pt Numerical Discretization Utilizing ISPC and gcc (double
precision)
nx = 80, stop_tol = 2.32306e-08, max iterations = 25600

sse4: 1.65x speedup from ISPC
sse4x2: 1.94x speedup from ISPC
avx: 0.39x speedup from ISPC
avx-x2: 0.63x speedup from ISPC

Intel(r) SPMD Program Compiler (ispc), build 20120120 (commit
1bba9d43074fd243, LLVM 3.1)
gcc -v: 4.6.2 20111027 (Red Hat 4.6.2-1) (GCC)
uname -a: Linux jemez 3.2.2-1.fc16.x86_64 #1 SMP Thu Jan 26 03:21:58
UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

gcc flags: -mtune=corei7-avx -O3 -ftree-vectorizer-verbose=2 -floop-interchange
-floop-strip-mine -floop-block -Wall -m64
ispc flags: -O2 --arch=x86-64 --cpu=corei7-avx
</code>

One observation is that, in general, ISPC can offer better than 2x
speedups for single precision code and almost 2x speedups for double
precision code.

For double precision calculations, I recommend the sse4-x2 target,
since both avx targets run slower than gcc (0.39x and 0.63x). Matt
can probably explain why better than I can, but given that the x2
variants beat their base targets for both avx and sse4, a larger
vector width appears to be advantageous for double precision
operations (i.e. it gives lower runtimes). Possibly an avx-x4 target
would turn the observed slowdown into a speedup.

Note also that gcc is run with -O3 because gcc enables
auto-vectorization at that level and not at -O2; this makes the
comparison of gcc to ISPC results fairer. gcc is able to vectorize
the main Jacobi solver loop. No thread-level parallelization results
are reported for either gcc or ISPC.

Both solvers converged to a solution within the prescribed stopping
tolerance before hitting the maximum iteration count.
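
For readers who have not seen the method, here is a minimal C sketch of
a Jacobi iteration on the 2D five-point stencil with a max-abs stopping
test. It is not the benchmarked code: the function names, the sign
convention (-laplace(u) = f), the fixed (Dirichlet) boundary values,
and the choice of max |u_new - u| as the convergence measure are all
assumptions made for illustration.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* One Jacobi sweep for -laplace(u) = f on an nx-by-nx grid with
 * spacing h (assumed convention). Boundary entries are not touched.
 * Returns the largest absolute update, max |u_new - u|. */
static double jacobi_sweep(int nx, double h, const double *u,
                           const double *f, double *u_new)
{
    double maxabs = 0.0;
    for (int j = 1; j < nx - 1; ++j) {
        for (int i = 1; i < nx - 1; ++i) {
            int k = j * nx + i;
            /* Five-point stencil: average of the four neighbours
             * plus the scaled right-hand side. */
            u_new[k] = 0.25 * (u[k - 1] + u[k + 1] +
                               u[k - nx] + u[k + nx] + h * h * f[k]);
            double d = fabs(u_new[k] - u[k]);
            if (d > maxabs)
                maxabs = d;
        }
    }
    return maxabs;
}

/* Sweep until the largest update drops to stop_tol or below, or until
 * max_iters sweeps have run. The result is left in u. Returns the
 * number of sweeps performed, or -1 if allocation fails. */
static int jacobi_solve(int nx, double h, double *u, const double *f,
                        double stop_tol, int max_iters)
{
    size_t n = (size_t)nx * (size_t)nx;
    double *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    memcpy(tmp, u, n * sizeof *tmp);   /* carries the boundary values */

    double *cur = u, *next = tmp;
    double change = stop_tol + 1.0;    /* force at least one sweep */
    int it = 0;
    while (it < max_iters && change > stop_tol) {
        change = jacobi_sweep(nx, h, cur, f, next);
        double *t = cur; cur = next; next = t;   /* swap buffers */
        ++it;
    }
    if (cur != u)
        memcpy(u, cur, n * sizeof *u);
    free(tmp);
    return it;
}
```

A return value smaller than max_iters corresponds to the "converged
before hitting the maximum iteration count" case described above. The
inner loop over i is the unit-stride loop a compiler would be expected
to vectorize.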

I welcome any comments folks may have concerning these results.

Doug

dpephd

Feb 4, 2012, 9:14:26 PM
to Intel SPMD Program Compiler Users
Some additional hardware information since it is important ...

CPU: Intel(R) Core(TM) i7-2630QM CPU @ 2.00GHz,
flags fpu, sse, sse2, ssse3, sse4_1, sse4_2, avx listed in /proc/cpuinfo
Memory: 8GB RAM

All operations were performed in-core (i.e. no disk I/O was accounted
for in the results reported). The matrices (32x32 and 80x80) fit
comfortably in memory.
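
To make "comfortably" concrete, here is a small C helper for the
arithmetic. The helper name and the assumption of three grid-sized
arrays (current iterate, next iterate, right-hand side) are mine and
may not match the actual solver's storage.

```c
#include <stddef.h>

/* Bytes needed for num_arrays dense nx-by-nx grids whose elements are
 * elem_size bytes each. Hypothetical helper for illustration only. */
static size_t grid_bytes(size_t nx, size_t elem_size, size_t num_arrays)
{
    return nx * nx * elem_size * num_arrays;
}
```

Under that assumption, grid_bytes(32, sizeof(float), 3) is 12,288 bytes
and grid_bytes(80, sizeof(double), 3) is 153,600 bytes, both tiny
compared with 8GB of RAM.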

dpephd

Feb 5, 2012, 10:48:45 AM
to Intel SPMD Program Compiler Users
To put a 2x speedup in perspective for this type of calculation, consider
that 2x-5x speedups are considered a big deal by the CAE community ...
see e.g. "Accelerating the ANSYS Direct Sparse Solver with
GPUs" (Krawezik and Poole), http://saahpc.ncsa.illinois.edu/09/papers/Krawezik_paper.pdf

This simple iterative solver is extremely STREAM-like (i.e. limited by
memory bandwidth rather than arithmetic), so the ability of the ISPC
system to offer any speedup over gcc-compiled native code is, I think,
an accomplishment. Note that the direct sparse solver discussed in the
paper above relies heavily on GEMM (matrix-multiplication) operations,
for which both AMD and Intel offer optimized implementations in their
ACML and MKL math libraries. Neither of these libraries offers sparse
iterative solver support, even though iterative solvers are compact in
storage (memory usage) and in general offer faster time to solution
(especially with preconditioning) than direct solver approaches.

Doug


Matt Pharr

Feb 6, 2012, 11:34:08 PM
to ispc-...@googlegroups.com

On Feb 4, 2012, at 5:27 PM, dpephd wrote:
[…]

> Some observations that can be made is that in general ISPC can offer
> better than 2x speedups for single precision code and almost 2x
> speedups for double precision codes.

Fantastic--that's a nice result, particularly given your other comment about the state of the art with other compilation approaches for these computations.


> For double precision calculations, it is recommended to use the sse4x2
> target since the avx targets are showing slowdowns as compared to
> gcc. Why this is being observed could possibly be better addressed by
> Matt, but given the better results posted by avx-x2 vs. avx targets
> (and sse4 targets as well), using a larger vector width for double
> precision operations appears to be advantageous for achieving larger
> speedups (i.e. lower runtimes). Possibly an avx-x4 target would
> result in an observed speedup.
[…]

I'm surprised and interested that an SSE4 target was better than one of the AVX ones in this case (and that the AVX ones had such a slowdown). If you're willing to share your code at some point (or a representative kernel), I'd be happy to dig into why AVX isn't helping more and see if there's an opportunity for better code on that front. (Hopefully more quickly than resolving the original set of issues with getting this running correctly with AVX!)

Thanks,
-matt

dpephd

Feb 8, 2012, 1:50:00 AM
to Intel SPMD Program Compiler Users
Hi Matt,

Thanks for the comments. It should not be too big of a deal to put
together a simple example which uses the main Jacobi kernel and the
maxabs reduction-like operation.

I'll email you once I get this together.

Thanks again.

Doug