update: I looked deeper into the assembly generated by the Intel compiler 14.0.0 and the latest ispc trunk. The two are very similar; only the instruction order differs. Here is the loop:
icpc:
vmovups (%rsi), %ymm7 #37.40
addq $8, %rdx #31.9
vcmpeq_osps %ymm5, %ymm7, %ymm9 #45.31
addq $32, %rsi #31.17
vcmpleps %ymm3, %ymm7, %ymm10 #53.30
vcmpgeps %ymm2, %ymm7, %ymm12 #61.30
vblendvps %ymm9, %ymm4, %ymm7, %ymm6 #46.20
vblendvps %ymm10, %ymm3, %ymm6, %ymm8 #54.13
vorps %ymm10, %ymm9, %ymm11 #68.34
vblendvps %ymm12, %ymm2, %ymm8, %ymm6 #62.13
vorps %ymm12, %ymm11, %ymm7 #69.27
vandnps %ymm6, %ymm7, %ymm13 #70.30
vmulps %ymm1, %ymm13, %ymm14 #72.23
vaddps %ymm0, %ymm14, %ymm15 #74.23
vblendvps %ymm7, %ymm6, %ymm15, %ymm8 #81.13
vmovups %ymm8, (%rcx) #87.23
I find it funny that icpc issues vmovups (the unaligned form), even though the code uses the aligned _mm512_load/store_ps intrinsics. Any clue why?
ispc:
movslq %ecx, %r8
vmovaps (%rsi,%r8), %ymm6
vcmpeqps %ymm3, %ymm6, %ymm7
vcmpunordps %ymm3, %ymm6, %ymm8
vorps %ymm7, %ymm8, %ymm7
vmulps %ymm6, %ymm5, %ymm8
vaddps %ymm8, %ymm4, %ymm8
vblendvps %ymm7, %ymm2, %ymm8, %ymm7
vcmpnleps %ymm6, %ymm1, %ymm8
vblendvps %ymm8, %ymm1, %ymm7, %ymm7
vcmpnleps %ymm0, %ymm6, %ymm6
leal 32(%r8), %ecx
addl $8, %eax
vblendvps %ymm6, %ymm0, %ymm7, %ymm6
vmovaps %ymm6, (%rdx,%r8)
If I read it correctly, there seem to be more read-after-write dependencies in the ispc code than in the icpc code. If so, could this cause the performance difference? It might be partially lifted with the avx1-i32x16 target; could you try that out as well? I have seen code that is light on registers for which the avx-x2 target boosts performance.