SAXPY performance

Joseph Winston

Oct 17, 2013, 6:46:46 PM
to ispc-...@googlegroups.com
I have some code that looks almost like SAXPY, with the important difference that it contains branches that check for missing data and clamp to minimum and maximum values.  The performance I see from ispc head on an i7 is about the same as hand-written SSE code, and about 2 times slower than hand-written AVX code.  I've attached the ispc code in hopes someone can point out ways (other than adding tasks) to improve performance.

export
void
scaleAndShift_ispc(uniform int N,
                   uniform float input[],
                   uniform float output[],
                   uniform float nullValue,
                   uniform float nullReplacement,
                   uniform float minValue,
                   uniform float maxValue,
                   uniform float scale,
                   uniform float shift,
                   uniform bool verbose = true)
{
   foreach (i = 0 ... N)
   {
      const float currentValue = input[i];

      cif (currentValue == nullValue)
      {
         output[i] = nullReplacement;
         continue;
      }

      cif (currentValue < minValue)
      {
         output[i] = minValue;
         continue;
      }

      cif (currentValue > maxValue)
      {
         output[i] = maxValue;
         continue;
      }
     
      output[i] = currentValue * scale + shift;
   }
}

James Brodman

Oct 17, 2013, 7:27:59 PM
to ispc-...@googlegroups.com
Could you also list the compiler flags used?

eg0...@gmail.com

Oct 18, 2013, 2:31:03 AM
to ispc-...@googlegroups.com
Hi, 

A few things may matter, such as compiler flags, the size of N, whether N fits in cache, whether the data is already there, etc. Assuming the latter, a quick look at the assembly shows that the code with cif replaced by if is cleaner, so it may run faster (if the data is in cache). This is likely due to the overhead of cif being high relative to the workload inside the branch.

You may get an even better result with a more vector-friendly version of the code, which has no branching in the assembly except for the main loop:
export void
scaleAndShift_ispc(uniform int N, 
                   uniform float input[],
                   uniform float output[],
                   uniform float nullValue,
                   uniform float nullReplacement,
                   uniform float minValue,
                   uniform float maxValue,
                   uniform float scale,
                   uniform float shift,
                   uniform bool verbose = true)
{
   foreach (i = 0 ... N)
   {
      float value = currentValue * scale + shift;

      if (currentValue == nullValue)
         value = nullReplacement;
      if (currentValue < minValue)
         value = minValue;
      if (currentValue > maxValue)
         value = maxValue;

      output[i] = value;
   }
}

Compile with the -O3 flag.

Give it a try.

eg0...@gmail.com

Oct 18, 2013, 2:39:18 AM
to ispc-...@googlegroups.com
forgot to add  const float currentValue = input[i];  before  float value = currentValue * scale + shift; :)

Joseph Winston

Oct 18, 2013, 6:37:12 AM
to ispc-...@googlegroups.com
Here are the compiler flags:

/opt/local/bin/ispc --colored-output -O2 --arch=x86-64 --math-lib=default --target=sse2,sse4-x2,avx-x2 scaleAndShift.ispc -o scaleAndShift_ispc.o -h scaleAndShift_ispc.h

Joseph Winston

Oct 18, 2013, 6:48:38 AM
to ispc-...@googlegroups.com
The vector-friendly version is much better.  On an i7, here is what I see:

For 100 samples of branch scale and shift with a vector of 409600
Mean = 851.52 microseconds per loop
Min = 806 microseconds per loop
Max = 940 microseconds per loop

For 100 samples of SSE scale and shift with a vector of 409600
Mean = 250.21 microseconds per loop
Min = 242 microseconds per loop
Max = 291 microseconds per loop

For 100 samples of AVX scale and shift with a vector of 409600
Mean = 163.53 microseconds per loop
Min = 154 microseconds per loop
Max = 174 microseconds per loop

For 100 samples of ispc scale and shift with a vector of 409600
Mean = 206.62 microseconds per loop
Min = 184 microseconds per loop
Max = 404 microseconds per loop


For the record, here is the AVX code that I am using for comparison, with all of the error checking removed.  Of course, this might also be improved, since I haven't worked in assembly for quite a long time.

void
AVX(const std::vector< float, Utils::AlignmentAllocator_t< float, 32 > > &input,
    std::vector< float, Utils::AlignmentAllocator_t< float, 32 > > &output,
    const float nullValue,
    const float nullReplacement,
    const float minValue,
    const float maxValue,
    const float scale,
    const float shift,
    const bool verbose = true)
{
   //
   // A conditional of the form: if (x) y=a; else y=b;
   //
   // can be rewritten as: y = (x & a) | (~x & b);
   //
   // The advantage of using the bit-wise operators is that
   // branching, with its associated overhead is not needed.
   //
  
   const std::size_t todo = input.size();   // assumed element count (input and output sizes match)
   const float *srcIter;
   float *destIter;
   std::size_t i;
   for (srcIter = &input[0], destIter = &output[0], i = 0;
        i < todo;
        i += 8, srcIter += 8, destIter += 8)
   {
      //
      // Load the data into the register
      //
     
      const __m256 in = _mm256_load_ps(srcIter);
     
      //
      // Handle missing observation
      //
     
      const __m256 nullValueVector = _mm256_set1_ps(nullValue);
      const __m256 nullValueReplacementVector = _mm256_set1_ps(nullReplacement);
      const __m256 nullMask = _mm256_cmp_ps(in, nullValueVector, _CMP_EQ_OS);
      __m256 out = _mm256_blendv_ps(in, nullValueReplacementVector, nullMask);
     
      //
      // Handle numbers less than or equal to minValue
      //
     
      const __m256 minValueVector = _mm256_set1_ps(minValue);
      const __m256 minMask = _mm256_cmp_ps(in, minValueVector, _CMP_LE_OS);
      out = _mm256_blendv_ps(out, minValueVector, minMask);
     
      //
      // Handle numbers greater than or equal to maxValue
      //
     
      const __m256 maxValueVector = _mm256_set1_ps(maxValue);
      const __m256 maxMask = _mm256_cmp_ps(in, maxValueVector, _CMP_GE_OS);
      out = _mm256_blendv_ps(out, maxValueVector, maxMask);
     
      //
      // Shift and scale
      //
     
      __m256 scaleAndShiftMask = _mm256_or_ps(nullMask, minMask);
      scaleAndShiftMask = _mm256_or_ps(scaleAndShiftMask, maxMask);
      __m256 scaleAndShift = _mm256_andnot_ps(scaleAndShiftMask, out);
      const __m256 scaleVector = _mm256_set1_ps(scale);
      scaleAndShift = _mm256_mul_ps(scaleAndShift, scaleVector);
      const __m256 shiftVector = _mm256_set1_ps(shift);
      scaleAndShift = _mm256_add_ps(scaleAndShift, shiftVector);
     
      //
      // Flip the left and right operands since we would otherwise need ~scaleAndShiftMask,
      // or simply use _mm256_blendv_ps as done here.
      //
     
      out = _mm256_blendv_ps(scaleAndShift, out, scaleAndShiftMask);
     
      //
      // Store the data
      //
     
      _mm256_store_ps(destIter, out);
   }
}
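The mask-and-blend rewrite described in the comment at the top of that kernel can be sketched in scalar code roughly as follows. This is illustrative only, not part of the thread's code: select_branchless is a hypothetical name, and the all-ones/all-zero mask mimics the per-lane result that a SIMD compare such as _mm256_cmp_ps produces.

#include <cstdint>
#include <cstring>

// Scalar sketch of: if (x) y = a; else y = b;  rewritten as
// y = (mask & a) | (~mask & b), where mask is all ones when the condition
// holds, matching the per-lane output of a SIMD compare instruction.
static float select_branchless(bool cond, float a, float b)
{
   const std::uint32_t mask = cond ? 0xFFFFFFFFu : 0u;
   std::uint32_t ai, bi, yi;
   std::memcpy(&ai, &a, sizeof ai);                 // reinterpret the float bits
   std::memcpy(&bi, &b, sizeof bi);
   yi = (mask & ai) | (~mask & bi);                 // blend without a branch
   float y;
   std::memcpy(&y, &yi, sizeof y);
   return y;
}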

eg0...@gmail.com

Oct 18, 2013, 9:32:03 AM
to ispc-...@googlegroups.com
I see the AVX code uses aligned loads/stores; can you try to force ISPC to use these with the flag --opt=force-aligned-memory,
and also add this before the foreach:

      N = N & ~(programCount-1);
   foreach (i = 0 ... N)
   {
      ...


This generates much cleaner code. If you get a segfault, drop the aligned-memory flag and try again.

eg0...@gmail.com

Oct 18, 2013, 9:40:03 AM
to ispc-...@googlegroups.com
FYI: the N = N & ~(programCount-1); helps the compiler generate a more efficient loop (http://ispc.github.io/perfguide.html#efficient-iteration-with-foreach). This is okay in this example, because the AVX code you posted above implicitly assumes that the array width (todo) is divisible by the AVX width.

If there is justice in this world, the performance difference should be narrowed further. 
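For readers following along, here is a rough sketch (not part of the original posts) of what lifting that divisibility assumption would look like on the intrinsics side: the vector loop covers the largest multiple of 8 and a scalar loop finishes the remainder. scaleShiftTail and its parameters are illustrative names, and the null/clamp checks from the thread are omitted for brevity.

#include <immintrin.h>
#include <cstddef>

// Sketch only: process the bulk of the array 8 floats at a time, then finish
// the last (n % 8) elements with a scalar loop, so n need not be a multiple
// of the AVX width.
static void scaleShiftTail(const float *in, float *out, std::size_t n,
                           float scale, float shift)
{
   const std::size_t nVec = n & ~std::size_t(7);    // largest multiple of 8
   const __m256 scaleV = _mm256_set1_ps(scale);
   const __m256 shiftV = _mm256_set1_ps(shift);
   for (std::size_t i = 0; i < nVec; i += 8)
   {
      const __m256 v = _mm256_loadu_ps(in + i);     // unaligned load: no alignment assumption
      _mm256_storeu_ps(out + i, _mm256_add_ps(_mm256_mul_ps(v, scaleV), shiftV));
   }
   for (std::size_t i = nVec; i < n; ++i)           // scalar remainder
      out[i] = in[i] * scale + shift;
}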

eg0...@gmail.com

Oct 18, 2013, 10:13:51 AM
to ispc-...@googlegroups.com
Update: I looked deeper into the assembly generated by Intel compiler 14.0.0 and the latest ispc trunk. It is very similar, just the instruction order is different; here is the loop:

icpc:
        vmovups   (%rsi), %ymm7                                 #37.40
        addq      $8, %rdx                                      #31.9
        vcmpeq_osps %ymm5, %ymm7, %ymm9                         #45.31
        addq      $32, %rsi                                     #31.17
        vcmpleps  %ymm3, %ymm7, %ymm10                          #53.30
        vcmpgeps  %ymm2, %ymm7, %ymm12                          #61.30
        vblendvps %ymm9, %ymm4, %ymm7, %ymm6                    #46.20
        vblendvps %ymm10, %ymm3, %ymm6, %ymm8                   #54.13
        vorps     %ymm10, %ymm9, %ymm11                         #68.34
        vblendvps %ymm12, %ymm2, %ymm8, %ymm6                   #62.13
        vorps     %ymm12, %ymm11, %ymm7                         #69.27
        vandnps   %ymm6, %ymm7, %ymm13                          #70.30
        vmulps    %ymm1, %ymm13, %ymm14                         #72.23
        vaddps    %ymm0, %ymm14, %ymm15                         #74.23
        vblendvps %ymm7, %ymm6, %ymm15, %ymm8                   #81.13
        vmovups   %ymm8, (%rcx)                                 #87.23

I find it funny that icpc issues vmovups, even though the code has _mm256_load/store_ps... Any clue why?

ispc:
  movslq  %ecx, %r8
  vmovaps (%rsi,%r8), %ymm6
  vcmpeqps  %ymm3, %ymm6, %ymm7
  vcmpunordps %ymm3, %ymm6, %ymm8
  vorps %ymm7, %ymm8, %ymm7
  vmulps  %ymm6, %ymm5, %ymm8
  vaddps  %ymm8, %ymm4, %ymm8
  vblendvps %ymm7, %ymm2, %ymm8, %ymm7
  vcmpnleps %ymm6, %ymm1, %ymm8
  vblendvps %ymm8, %ymm1, %ymm7, %ymm7
  vcmpnleps %ymm0, %ymm6, %ymm6
  leal  32(%r8), %ecx
  addl  $8, %eax
  vblendvps %ymm6, %ymm0, %ymm7, %ymm6
  vmovaps %ymm6, (%rdx,%r8)

If I read it correctly, there seem to be more read-after-write dependencies in the ispc code compared to the icpc code. If so, could this cause the performance difference? This might be partially lifted with the avx1-i32x16 target; could you try it out as well? I have seen codes that are light on registers for which the avx-x2 target boosts performance.

Joseph Winston

Oct 18, 2013, 4:39:43 PM
to ispc-...@googlegroups.com
I added the aligned code but didn't see any real change in performance.

Joseph Winston

Oct 18, 2013, 4:40:55 PM
to ispc-...@googlegroups.com
I tried ispc with the avx1-i32x16 target but didn't see any significant change in the performance.

eg0...@gmail.com

Oct 19, 2013, 5:36:30 AM
to ispc-...@googlegroups.com
Hi Joseph,

I was unable to reproduce your results. I have the following driver code:

#include <cstdio>
#include <cstdlib>
#include <immintrin.h>   // for _mm_malloc
#include <sys/time.h>

static double rtc(void)
{
  struct timeval Tvalue;
  double etime;
  struct timezone dummy;

  gettimeofday(&Tvalue,&dummy);
  etime =  (double) Tvalue.tv_sec +
    1.e-6*((double) Tvalue.tv_usec);
  return etime;
}


extern "C"
void test_code(int N,
    float *input,
    float *output,
    float nullValue,
    float nullReplacement,
    float minValue,
    float maxValue,
    float scale,
    float shift);

int main(int argc , char * argv[])
{
  const int n = argc > 1 ? atoi(argv[1]) : 1024;
  printf("n= %d\n", n);
  float *input = (float*)_mm_malloc(n*sizeof(float), 64);
  float *output = (float*)_mm_malloc(n*sizeof(float), 64);
  float nullValue = -1;
  float nullReplacement = 0;
  float minValue = 10;
  float maxValue = 1024;
  float scale = 2.0;
  float shift = 1.5;

  for (int i = 0; i < n; i++)   // fill input so the kernel reads defined values
    input[i] = (float)i;

  test_code(n, input, output, nullValue, nullReplacement, minValue, maxValue, scale, shift);

  const double t0 = rtc();
  const int nrep = 100000;
  for (int r = 0; r < nrep; r++)
    test_code(n, input, output, nullValue, nullReplacement, minValue, maxValue, scale, shift);
  const double t1 = rtc();
  const double dt = t1-t0;

  fprintf(stderr, " done in %g sec \n", dt);


  return 0;
}

Using your AVX code (replacing std::vector in the arguments with float*, compiled with Intel C++ Compiler 14.0.0) and the vector-friendly version of the ISPC code (from trunk), I get the following results:

$ ispc -O3 test.ispc -o test_ispc.o  --target=avx1-i32x8
$ ispc -O3 test.ispc -o test_ispc-x2.o  --target=avx1-i32x16
$ ispc -O3 test.ispc -o test_ispc-a.o  --target=avx1-i32x8 --opt=force-aligned-memory
$ ispc -O3 test.ispc -o test_ispc-x2-a.o  --target=avx1-i32x16 --opt=force-aligned-memory
$ icpc -O3 -xavx test.cpp -c -o test_icpc.o


$ icpc -xavx main.cpp test_ispc.o  && ./a.out        
 done in 0.059593 sec 
$ icpc -xavx main.cpp test_ispc-x2.o  && ./a.out  
 done in 0.053865 sec 
$ icpc -xavx main.cpp test_ispc-a.o  && ./a.out   
 done in 0.0561259 sec 
$ icpc -xavx main.cpp test_ispc-x2-a.o  && ./a.out  
 done in 0.0474482 sec 
$ icpc -xavx main.cpp test_icpc.o  && ./a.out      
 done in 0.0580862 sec 

Now, as a test, I recompiled using your compile line:
$ ispc --colored-output -O2 --arch=x86-64 --math-lib=default --target=sse2,sse4-x2,avx-x2 -o ispc.o  test.ispc
$ icpc -xavx main_modified.cpp ispc.o ispc_avx.o ispc_sse2.o ispc_sse4.o  && ./a.out 
 done in 0.107927 sec

Something has gone wild... changing to -O3 doesn't fix the problem; however, dropping all the other targets:
$ ispc --colored-output -O2 --arch=x86-64 --math-lib=default --target=avx-x2 -o ispc.o test.ispc 
$ icpc -xavx main.cpp ispc.o ispc_avx.o ispc_sse2.o ispc_sse4.o  && ./a.out 
 done in 0.053869 sec 

Fixed the problem. Now forcing alignment:
$   ispc --colored-output -O2 --arch=x86-64 --math-lib=default --target=avx-x2 -o ispc.o test.ispc  --opt=force-aligned-memory
$ icpc -xavx main.cpp ispc.o ispc_avx.o ispc_sse2.o ispc_sse4.o  && ./a.out 
 done in 0.0474398 sec 

It looks like there is an issue dispatching the right code for a given target, and it seems that with multiple targets it dispatches sse4 instead of avx...
The bottom line: the ISPC code shown above (as is), built for avx-x2 with forced aligned loads/stores, is 23% faster than the intrinsics code shown above :)

eg0...@gmail.com

Oct 19, 2013, 6:37:37 AM
to ispc-...@googlegroups.com
FWIW, this simple code can benefit from Intel Compiler auto-vectorization:

extern "C"
void
test_code(int todo,
    float *input,
    float *output,
    float nullValue,
    float nullReplacement,
    float minValue,
    float maxValue,
    float scale,
    float shift)
{
#pragma vector aligned 
#pragma ivdep 
  for (int i= 0; i < todo; i++)
  {
    const float currentValue = input[i];

    float value = currentValue * scale + shift;

    if (currentValue == nullValue)
    {
      value = nullReplacement;
    }

    if (currentValue < minValue)
    {
      value = minValue;
    }

    if (currentValue > maxValue)
    {
      value = maxValue;
    }

    output[i] = value;
  }

}

$ icpc -O3 -xavx test.cpp -c -o test_icpc1.o -vec-report
test.cpp(15): (col. 3) remark: LOOP WAS VECTORIZED
$ icpc -xavx main.cpp test_icpc1.o  && ./a.out 
 done in 0.0428228 sec 

Same runtime as the ISPC code :)  If you prefer an open-source solution, try out g++:
$ g++ --version
g++ (GCC) 4.7.1
$ g++ -O3 -mavx test.cpp -c -o test_gcc.o -ftree-vectorizer-verbose=1 -ffast-math

Analyzing loop at test.cpp:15


Vectorizing loop at test.cpp:15

15: created 1 versioning for alias checks.

15: LOOP VECTORIZED.
test.cpp:3: note: vectorized 1 loops in function.
$ g++ -mavx main.cpp test_gcc.o  && ./a.out
 done in 0.0471461 sec 
vs
$ g++ -mavx main.cpp test_ispc-x2-a.o  && ./a.out
 done in 0.0477731 sec 

IMHO, not bad :) (PS: this is on one core of an Intel(R) Xeon(R) CPU E5-2650L 0 @ 1.80GHz)