SAXPY performance

Joseph Winston

Oct 17, 2013, 6:46:46 PM
to ispc-...@googlegroups.com
I have some code that looks almost like SAXPY, with the important difference that it contains branches that check for missing data and clamp to minimum and maximum values.  The performance I see from ispc head on an i7 is about the same as hand-written SSE code, and about 2 times slower than hand-written AVX code.  I've attached the ispc code in hopes someone can point out ways (other than adding tasks) to improve performance.

export
void
scaleAndShift_ispc(uniform int N,
                   uniform float input[],
                   uniform float output[],
                   uniform float nullValue,
                   uniform float nullReplacement,
                   uniform float minValue,
                   uniform float maxValue,
                   uniform float scale,
                   uniform float shift,
                   uniform bool verbose = true)
{
   foreach (i = 0 ... N)
   {
      const float currentValue = input[i];

      cif (currentValue == nullValue)
      {
         output[i] = nullReplacement;
         continue;
      }

      cif (currentValue < minValue)
      {
         output[i] = minValue;
         continue;
      }

      cif (currentValue > maxValue)
      {
         output[i] = maxValue;
         continue;
      }
     
      output[i] = currentValue * scale + shift;
   }
}

James Brodman

Oct 17, 2013, 7:27:59 PM
to ispc-...@googlegroups.com
Could you also list the compiler flags used?

eg0...@gmail.com

Oct 18, 2013, 2:31:03 AM
to ispc-...@googlegroups.com
Hi, 

A few things may matter, such as compiler flags, the size of N, whether N fits in cache, whether the data is already there, etc. Assuming the latter, a quick look at the assembly shows that the code with cif replaced by if is cleaner, so it may run faster (if the data is in cache). This is likely due to the overhead of cif being high relative to the workload inside the branch.

You may get an even better result with a more vector-friendly version of the code, which has no branching in the assembly except for the main loop:
export void
scaleAndShift_ispc(uniform int N, 
                   uniform float input[],
                   uniform float output[],
                   uniform float nullValue,
                   uniform float nullReplacement,
                   uniform float minValue,
                   uniform float maxValue,
                   uniform float scale,
                   uniform float shift,
                   uniform bool verbose = true)
{
   foreach (i = 0 ... N)
   {
      float value = currentValue * scale + shift;

      if (currentValue == nullValue)
         value = nullReplacement;
      if (currentValue < minValue)
         value = minValue;
      if (currentValue > maxValue)
         value = maxValue;

      output[i] = value;
   }
}

Compile with the -O3 flag.

Give it a try.

eg0...@gmail.com

Oct 18, 2013, 2:39:18 AM
to ispc-...@googlegroups.com
forgot to add  const float currentValue = input[i];  before  float value = currentValue * scale + shift; :)

Joseph Winston

Oct 18, 2013, 6:37:12 AM
to ispc-...@googlegroups.com
Here are the compiler flags:

/opt/local/bin/ispc --colored-output -O2 --arch=x86-64 --math-lib=default --target=sse2,sse4-x2,avx-x2 scaleAndShift.ispc -o scaleAndShift_ispc.o -h scaleAndShift_ispc.h

Joseph Winston

Oct 18, 2013, 6:48:38 AM
to ispc-...@googlegroups.com
The vector-friendly version is much better.  On an i7, here is what I see:

For 100 samples of branch scale and shift with a vector of 409600
Mean = 851.52 microseconds per loop
Min = 806 microseconds per loop
Max = 940 microseconds per loop

For 100 samples of SSE scale and shift with a vector of 409600
Mean = 250.21 microseconds per loop
Min = 242 microseconds per loop
Max = 291 microseconds per loop

For 100 samples of AVX scale and shift with a vector of 409600
Mean = 163.53 microseconds per loop
Min = 154 microseconds per loop
Max = 174 microseconds per loop

For 100 samples of ispc scale and shift with a vector of 409600
Mean = 206.62 microseconds per loop
Min = 184 microseconds per loop
Max = 404 microseconds per loop


For the record, here is the AVX code that I am using for comparison, with all of the error checking removed.  Of course, this might also be improved, since I haven't worked in assembly for quite a long time.

void
AVX(const std::vector< float, Utils::AlignmentAllocator_t< float, 32 > > &input,
    std::vector< float, Utils::AlignmentAllocator_t< float, 32 > > &output,
    const float nullValue,
    const float nullReplacement,
    const float minValue,
    const float maxValue,
    const float scale,
    const float shift,
    const bool verbose = true)
{
   //
   // A conditional of the form: if (x) y=a; else y=b;
   //
   // can be rewritten as: y = (x & a) | (~x & b);
   //
   // The advantage of using the bit-wise operators is that
   // branching, with its associated overhead is not needed.
   //
  
   const std::size_t todo = input.size();   // assumed element count (input and output sizes match)
   const float *srcIter;
   float *destIter;
   std::size_t i;
   for (srcIter = &input[0], destIter = &output[0], i = 0;
        i < todo;
        i += 8, srcIter += 8, destIter += 8)
   {
      //
      // Load the data into the register
      //
     
      const __m256 in = _mm256_load_ps(srcIter);
     
      //
      // Handle missing observation
      //
     
      const __m256 nullValueVector = _mm256_set1_ps(nullValue);
      const __m256 nullValueReplacementVector = _mm256_set1_ps(nullReplacement);
      const __m256 nullMask = _mm256_cmp_ps(in, nullValueVector, _CMP_EQ_OS);
      __m256 out = _mm256_blendv_ps(in, nullValueReplacementVector, nullMask);
     
      //
      // Handle numbers less than or equal to minValue
      //
     
      const __m256 minValueVector = _mm256_set1_ps(minValue);
      const __m256 minMask = _mm256_cmp_ps(in, minValueVector, _CMP_LE_OS);
      out = _mm256_blendv_ps(out, minValueVector, minMask);
     
      //
      // Handle numbers greater than or equal to maxValue
      //
     
      const __m256 maxValueVector = _mm256_set1_ps(maxValue);
      const __m256 maxMask = _mm256_cmp_ps(in, maxValueVector, _CMP_GE_OS);
      out = _mm256_blendv_ps(out, maxValueVector, maxMask);
     
      //
      // Shift and scale
      //
     
      __m256 scaleAndShiftMask = _mm256_or_ps(nullMask, minMask);
      scaleAndShiftMask = _mm256_or_ps(scaleAndShiftMask, maxMask);
      __m256 scaleAndShift = _mm256_andnot_ps(scaleAndShiftMask, out);
      const __m256 scaleVector = _mm256_set1_ps(scale);
      scaleAndShift = _mm256_mul_ps(scaleAndShift, scaleVector);
      const __m256 shiftVector = _mm256_set1_ps(shift);
      scaleAndShift = _mm256_add_ps(scaleAndShift, shiftVector);
     
      //
      // Flip the left and right operands since we would otherwise need ~scaleAndShiftMask,
      // or simply use _mm256_blendv_ps as done here.
      //
     
      out = _mm256_blendv_ps(scaleAndShift, out, scaleAndShiftMask);
     
      //
      // Store the data
      //
     
      _mm256_store_ps(destIter, out);
   }
}
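The mask-and-blend rewrite described in the comment at the top of that kernel can be sketched in scalar code roughly as follows. This is illustrative only, not part of the thread's code: select_branchless is a hypothetical name, and the all-ones/all-zero mask mimics the per-lane result that a SIMD compare such as _mm256_cmp_ps produces.

#include <cstdint>
#include <cstring>

// Scalar sketch of: if (x) y = a; else y = b;  rewritten as
// y = (mask & a) | (~mask & b), where mask is all ones when the condition
// holds, matching the per-lane output of a SIMD compare instruction.
static float select_branchless(bool cond, float a, float b)
{
   const std::uint32_t mask = cond ? 0xFFFFFFFFu : 0u;
   std::uint32_t ai, bi, yi;
   std::memcpy(&ai, &a, sizeof ai);                 // reinterpret the float bits
   std::memcpy(&bi, &b, sizeof bi);
   yi = (mask & ai) | (~mask & bi);                 // blend without a branch
   float y;
   std::memcpy(&y, &yi, sizeof y);
   return y;
}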

eg0...@gmail.com

Oct 18, 2013, 9:32:03 AM
to ispc-...@googlegroups.com
I see the AVX code uses aligned loads/stores; can you try to force ISPC to use these with the flag --opt=force-aligned-memory,
and also add this before the foreach:

      N = N & ~(programCount-1);
   foreach (i = 0 ... N)
   {
      ...


This generates much cleaner code. If you get a segfault, drop the aligned-memory flag and try again.

eg0...@gmail.com

Oct 18, 2013, 9:40:03 AM
to ispc-...@googlegroups.com
FYI: the N = N & ~(programCount-1); helps the compiler generate a more efficient loop (http://ispc.github.io/perfguide.html#efficient-iteration-with-foreach). This is okay in this example, because the AVX code you posted above implicitly assumes that the array width (todo) is divisible by the AVX width.

If there is justice in this world, the performance difference should be narrowed further. 
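For readers following along, here is a rough sketch (not part of the original posts) of what lifting that divisibility assumption would look like on the intrinsics side: the vector loop covers the largest multiple of 8 and a scalar loop finishes the remainder. scaleShiftTail and its parameters are illustrative names, and the null/clamp checks from the thread are omitted for brevity.

#include <immintrin.h>
#include <cstddef>

// Sketch only: process the bulk of the array 8 floats at a time, then finish
// the last (n % 8) elements with a scalar loop, so n need not be a multiple
// of the AVX width.
static void scaleShiftTail(const float *in, float *out, std::size_t n,
                           float scale, float shift)
{
   const std::size_t nVec = n & ~std::size_t(7);    // largest multiple of 8
   const __m256 scaleV = _mm256_set1_ps(scale);
   const __m256 shiftV = _mm256_set1_ps(shift);
   for (std::size_t i = 0; i < nVec; i += 8)
   {
      const __m256 v = _mm256_loadu_ps(in + i);     // unaligned load: no alignment assumption
      _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_mul_ps(v, scaleV), shiftV));
   }
   for (std::size_t i = nVec; i < n; ++i)           // scalar remainder
      out[i] = in[i] * scale + shift;
}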

eg0...@gmail.com

Oct 18, 2013, 10:13:51 AM
to ispc-...@googlegroups.com
Update: I looked deeper into the assembly generated by Intel compiler 14.0.0 and the latest ispc trunk. It is very similar, just the instruction order is different; here is the loop:

icpc:
        vmovups   (%rsi), %ymm7                                 #37.40
        addq      $8, %rdx                                      #31.9
        vcmpeq_osps %ymm5, %ymm7, %ymm9                         #45.31
        addq      $32, %rsi                                     #31.17
        vcmpleps  %ymm3, %ymm7, %ymm10                          #53.30
        vcmpgeps  %ymm2, %ymm7, %ymm12                          #61.30
        vblendvps %ymm9, %ymm4, %ymm7, %ymm6                    #46.20
        vblendvps %ymm10, %ymm3, %ymm6, %ymm8                   #54.13
        vorps     %ymm10, %ymm9, %ymm11                         #68.34
        vblendvps %ymm12, %ymm2, %ymm8, %ymm6                   #62.13
        vorps     %ymm12, %ymm11, %ymm7                         #69.27
        vandnps   %ymm6, %ymm7, %ymm13                          #70.30
        vmulps    %ymm1, %ymm13, %ymm14                         #72.23
        vaddps    %ymm0, %ymm14, %ymm15                         #74.23
        vblendvps %ymm7, %ymm6, %ymm15, %ymm8                   #81.13
        vmovups   %ymm8, (%rcx)                                 #87.23

I find it funny that icpc issues vmovups, even though the code has _mm256_load/store_ps... Any clue why?

ispc:
  movslq  %ecx, %r8
  vmovaps (%rsi,%r8), %ymm6
  vcmpeqps  %ymm3, %ymm6, %ymm7
  vcmpunordps %ymm3, %ymm6, %ymm8
  vorps %ymm7, %ymm8, %ymm7
  vmulps  %ymm6, %ymm5, %ymm8
  vaddps  %ymm8, %ymm4, %ymm8
  vblendvps %ymm7, %ymm2, %ymm8, %ymm7
  vcmpnleps %ymm6, %ymm1, %ymm8
  vblendvps %ymm8, %ymm1, %ymm7, %ymm7
  vcmpnleps %ymm0, %ymm6, %ymm6
  leal  32(%r8), %ecx
  addl  $8, %eax
  vblendvps %ymm6, %ymm0, %ymm7, %ymm6
  vmovaps %ymm6, (%rdx,%r8)

If I read it correctly, there seem to be more read-after-write dependencies in the ispc code compared to the icpc code. If so, could this cause the performance difference? This might be partially lifted with the avx1-i32x16 target; could you try it out as well? I have seen codes that are light on registers for which the avx-x2 target boosts performance.

Joseph Winston

Oct 18, 2013, 4:39:43 PM
to ispc-...@googlegroups.com
I added the aligned code but didn't see any real change in performance.

Joseph Winston

Oct 18, 2013, 4:40:55 PM
to ispc-...@googlegroups.com
I tried ispc with the avx1-i32x16 target but didn't see any significant change in the performance.

eg0...@gmail.com

Oct 19, 2013, 5:36:30 AM
to ispc-...@googlegroups.com
Hi Joseph,

I was unable to reproduce your results. I have the following driver code:

#include <cstdio>
#include <cstdlib>
#include <immintrin.h>   // for _mm_malloc
#include <sys/time.h>

static double rtc(void)
{
  struct timeval Tvalue;
  double etime;
  struct timezone dummy;

  gettimeofday(&Tvalue,&dummy);
  etime =  (double) Tvalue.tv_sec +
    1.e-6*((double) Tvalue.tv_usec);
  return etime;
}


extern "C"
void test_code(int N,
    float *input,
    float *output,
    float nullValue,
    float nullReplacement,
    float minValue,
    float maxValue,
    float scale,
    float shift);

int main(int argc , char * argv[])
{
  const int n = argc > 1 ? atoi(argv[1]) : 1024;
  printf("n= %d\n", n);
  float *input = (float*)_mm_malloc(n*sizeof(float), 64);
  float *output = (float*)_mm_malloc(n*sizeof(float), 64);
  float nullValue = -1;
  float nullReplacement = 0;
  float minValue = 10;
  float maxValue = 1024;
  float scale = 2.0;
  float shift = 1.5;

  for (int i = 0; i < n; i++)   // fill input so the kernel reads defined values
    input[i] = (float)i;

  test_code(n, input, output, nullValue, nullReplacement, minValue, maxValue, scale, shift);

  const double t0 = rtc();
  const int nrep = 100000;
  for (int r = 0; r < nrep; r++)
    test_code(n, input, output, nullValue, nullReplacement, minValue, maxValue, scale, shift);
  const double t1 = rtc();
  const double dt = t1-t0;

  fprintf(stderr, " done in %g sec \n", dt);


  return 0;
}

Using your AVX code (replacing std::vector in the arguments with float*, compiled with Intel C++ Compiler 14.0.0) and the vector-friendly version of the ISPC code (from trunk), I get the following results:

$ ispc -O3 test.ispc -o test_ispc.o  --target=avx1-i32x8
$ ispc -O3 test.ispc -o test_ispc-x2.o  --target=avx1-i32x16
$ ispc -O3 test.ispc -o test_ispc-a.o  --target=avx1-i32x8 --opt=force-aligned-memory
$ ispc -O3 test.ispc -o test_ispc-x2-a.o  --target=avx1-i32x16 --opt=force-aligned-memory
$ icpc -O3 -xavx test.cpp -c -o test_icpc.o


$ icpc -xavx main.cpp test_ispc.o  && ./a.out        
 done in 0.059593 sec 
$ icpc -xavx main.cpp test_ispc-x2.o  && ./a.out  
 done in 0.053865 sec 
$ icpc -xavx main.cpp test_ispc-a.o  && ./a.out   
 done in 0.0561259 sec 
$ icpc -xavx main.cpp test_ispc-x2-a.o  && ./a.out  
 done in 0.0474482 sec 
$ icpc -xavx main.cpp test_icpc.o  && ./a.out      
 done in 0.0580862 sec 

Now, as a test, I recompiled using your compile line:
$ ispc --colored-output -O2 --arch=x86-64 --math-lib=default --target=sse2,sse4-x2,avx-x2 -o ispc.o  test.ispc
$ icpc -xavx main_modified.cpp ispc.o ispc_avx.o ispc_sse2.o ispc_sse4.o  && ./a.out 
 done in 0.107927 sec

Something has gone wild... changing to -O3 doesn't fix the problem; however, dropping all the other targets:
$ ispc --colored-output -O2 --arch=x86-64 --math-lib=default --target=avx-x2 -o ispc.o test.ispc 
$ icpc -xavx main.cpp ispc.o ispc_avx.o ispc_sse2.o ispc_sse4.o  && ./a.out 
 done in 0.053869 sec 

Fixed the problem. Now forcing alignment:
$   ispc --colored-output -O2 --arch=x86-64 --math-lib=default --target=avx-x2 -o ispc.o test.ispc  --opt=force-aligned-memory
$ icpc -xavx main.cpp ispc.o ispc_avx.o ispc_sse2.o ispc_sse4.o  && ./a.out 
 done in 0.0474398 sec 

It looks like there is an issue dispatching the right code for a given target, and it seems that with multiple targets it dispatches sse4 instead of avx...
The bottom line: the ISPC code shown above (as is), built for avx-x2 with forced aligned loads/stores, is 23% faster than the intrinsics code shown above :)

eg0...@gmail.com

Oct 19, 2013, 6:37:37 AM
to ispc-...@googlegroups.com
FWIW, this simple code can benefit from Intel Compiler auto-vectorization:

extern "C"
void
test_code(int todo,
    float *input,
    float *output,
    float nullValue,
    float nullReplacement,
    float minValue,
    float maxValue,
    float scale,
    float shift)
{
#pragma vector aligned 
#pragma ivdep 
  for (int i= 0; i < todo; i++)
  {
    const float currentValue = input[i];

    float value = currentValue * scale + shift;

    if (currentValue == nullValue)
    {
      value = nullReplacement;
    }

    if (currentValue < minValue)
    {
      value = minValue;
    }

    if (currentValue > maxValue)
    {
      value = maxValue;
    }

    output[i] = value;
  }

}

$ icpc -O3 -xavx test.cpp -c -o test_icpc1.o -vec-report
test.cpp(15): (col. 3) remark: LOOP WAS VECTORIZED
$ icpc -xavx main.cpp test_icpc1.o  && ./a.out 
 done in 0.0428228 sec 

Same runtime as the ISPC code :)  If you prefer an open-source solution, try out g++:
$ g++ --version
g++ (GCC) 4.7.1
$ g++ -O3 -mavx test.cpp -c -o test_gcc.o -ftree-vectorizer-verbose=1 -ffast-math

Analyzing loop at test.cpp:15


Vectorizing loop at test.cpp:15

15: created 1 versioning for alias checks.

15: LOOP VECTORIZED.
test.cpp:3: note: vectorized 1 loops in function.
$ g++ -mavx main.cpp test_gcc.o  && ./a.out
 done in 0.0471461 sec 
vs
$ g++ -mavx main.cpp test_ispc-x2-a.o  && ./a.out
 done in 0.0477731 sec 

IMHO, not bad :) (PS: this is on one core of an Intel(R) Xeon(R) CPU E5-2650L 0 @ 1.80GHz)