all operations performed in-core (i.e. no disk I/O was accounted for
in the results reported). Both problem sizes (32x32 and 80x80) fit
comfortably in memory.
> Thanks to Matt's work in incorporating LLVM 3.1 into the latest
> release of ISPC (v.1.1.3), the double precision performance regression
> noted in issue #119 has now been resolved.
> So, I'm happy to present some single and double precision performance
> results from my Jacobi 2D Poisson Solver that I have been working on
> since August of last year.
> <code>
> poisson2d_float results:
> 2D Jacobi Poisson Solver over square, regular grid
> 2D Five Pt Numerical Discretization Utilizing ISPC and gcc (single
> precision)
> nx = 32, stop_tol = 8.43227e-07, max iterations = 4096
> sse4: 1.93x speedup from ISPC
> sse4-x2: 2.57x speedup from ISPC
> avx: 2.47x speedup from ISPC
> avx-x2: 2.80x speedup from ISPC
> poisson2d_double results:
> 2D Jacobi Poisson Solver over square, regular grid
> 2D Five Pt Numerical Discretization Utilizing ISPC and gcc (double
> precision)
> nx = 80, stop_tol = 2.32306e-08, max iterations = 25600
> sse4: 1.65x speedup from ISPC
> sse4x2: 1.94x speedup from ISPC
> avx: 0.39x speedup from ISPC
> avx-x2: 0.63x speedup from ISPC
> Intel(r) SPMD Program Compiler (ispc), build 20120120 (commit
> 1bba9d43074fd243, LLVM 3.1)
> gcc -v: 4.6.2 20111027 (Red Hat 4.6.2-1) (GCC)
> uname -a: Linux jemez 3.2.2-1.fc16.x86_64 #1 SMP Thu Jan 26 03:21:58
> UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
> gcc flags: -mtune=corei7-avx -O3 -ftree-vectorizer-verbose=2
> -floop-interchange -floop-strip-mine -floop-block -Wall -m64
> ispc flags: -O2 --arch=x86-64 --cpu=corei7-avx
> </code>
> One observation is that, in general, ISPC can offer better than 2x
> speedups for single precision code and almost 2x speedups for double
> precision code.
> For double precision calculations, I recommend the sse4-x2 target,
> since the avx targets currently show slowdowns compared to gcc. Matt
> can probably explain why better than I can. However, the x2 variants
> beat their single-width counterparts (avx-x2 vs. avx, and likewise
> for sse4), so a larger vector width appears to help double precision
> code achieve larger speedups (i.e. lower runtimes). Possibly an
> avx-x4 target would result in an observed speedup.
> Note also that gcc is run with -O3 rather than -O2, because gcc
> performs auto-vectorization at -O3 and not at -O2; this makes the
> comparison of gcc to ISPC results fairer. gcc is able to vectorize
> the main Jacobi solver loop. No thread-level parallelization results
> are reported for either gcc or ISPC.
> Both solvers converged to a solution within the prescribed stopping
> tolerance before hitting the maximum iteration count.
> I welcome any comments folks may have concerning these results.
> Doug
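
For anyone following along who has not seen the kernel being timed, here is a rough C sketch of a five-point Jacobi iteration for the 2D Poisson problem on a square, regular grid. This is not Doug's code. The function names (`jacobi_sweep`, `jacobi_solve`), the row-major layout, the sign convention (-Laplace(u) = f), and the "largest change between sweeps" stopping test are all my assumptions, chosen only to illustrate the kind of inner loop a vectorizer has to handle.

```c
#include <stddef.h>
#include <string.h>

/* One Jacobi sweep over the interior of an n x n row-major grid.
 * Assumes -Laplace(u) = f with fixed (Dirichlet) boundary values.
 * Reads from u, writes to u_new, returns the largest absolute change. */
double jacobi_sweep(const double *restrict u, double *restrict u_new,
                    const double *restrict f, int n, double h)
{
    const size_t stride = (size_t)n;
    const double h2 = h * h;
    double max_diff = 0.0;

    for (size_t i = 1; i + 1 < stride; ++i) {
        /* Unit-stride inner loop: the part a compiler can vectorize. */
        for (size_t j = 1; j + 1 < stride; ++j) {
            const size_t k = i * stride + j;
            const double v = 0.25 * (u[k - stride] + u[k + stride] +
                                     u[k - 1] + u[k + 1] + h2 * f[k]);
            double d = v - u[k];
            if (d < 0.0) d = -d;
            if (d > max_diff) max_diff = d;
            u_new[k] = v;
        }
    }
    return max_diff;
}

/* Iterate until the largest change drops below stop_tol or max_iter
 * sweeps have run. The result is left in u; scratch must hold n*n
 * doubles. Returns the number of sweeps performed. */
int jacobi_solve(double *u, double *scratch, const double *f,
                 int n, double h, double stop_tol, int max_iter)
{
    const size_t bytes = (size_t)n * (size_t)n * sizeof(double);
    double *cur = u, *next = scratch;
    int iter = 0;

    memcpy(scratch, u, bytes);   /* give both buffers the same boundary */

    while (iter < max_iter) {
        const double diff = jacobi_sweep(cur, next, f, n, h);
        double *tmp = cur;       /* swap buffers instead of copying */
        cur = next;
        next = tmp;
        ++iter;
        if (diff < stop_tol) break;
    }
    if (cur != u) memcpy(u, cur, bytes);
    return iter;
}
```

A single precision variant would just swap `double` for `float`. The return value of `jacobi_solve` can be compared against `max_iter` to tell whether the run stopped on the tolerance or on the iteration cap.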