If compiled using
$ clang -O2 intrin_prob.c
$ a.out
2.000000, 3.000000
This is the expected result. But if compiled using
$ clang -O2 -ffast-math intrin_prob.c
$ a.out
1.500000, 3.255000
This produces incorrect results, because reassociation happens across the calls to the _mm_add_pd and _mm_sub_pd intrinsics,
and the value that should have been added and then subtracted gets constant folded to zero. It seems to me that the fast-math
flags really should not affect the intrinsics implementations themselves, and that the fast-math flags should not allow reassociation
across the intrinsic calls. So, is this expected behavior, or just something that no one has noticed before? It surprised me.
I have also checked GCC, whose behavior is consistent with clang's. The Intel C/C++ compiler does not let fast-math flags
affect intrinsics, at least not for reassociation across the call boundaries. I haven't checked the Microsoft compiler yet.
An easy "fix" would be to add
#pragma float_control(precise, on)
or
#pragma clang fp reassociate(off)
near the top of immintrin.h to cause all intrinsics to ignore all fast-math flags, or at least ignore reassociation.
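As a rough illustration of what that would mean for a single operation, the wrappers below do the arithmetic under a locally scoped "#pragma clang fp reassociate(off)" instead of going through the header intrinsics. This is only a sketch: the names add_pd_strict, sub_pd_strict and round_pd_strict are hypothetical, the + and - operators on __m128d are a GCC/Clang vector extension, the pragma is assumed to be available (recent Clang only; it is skipped for other compilers), and whether it fully blocks the fold under -ffast-math has not been verified here.

```c
#include <emmintrin.h>

/* Hypothetical wrapper: packed-double add with reassociation turned
   off for this function body. The pragma is Clang-specific, so it is
   compiled out elsewhere (and then gives no protection at all). */
static inline __m128d add_pd_strict(__m128d a, __m128d b) {
#if defined(__clang__)
#pragma clang fp reassociate(off)
#endif
    return a + b; /* GCC/Clang vector extension on __m128d */
}

/* Hypothetical wrapper: packed-double subtract, same idea. */
static inline __m128d sub_pd_strict(__m128d a, __m128d b) {
#if defined(__clang__)
#pragma clang fp reassociate(off)
#endif
    return a - b;
}

/* Add and then subtract 1.5 * 2^52 (bit pattern 0x4338000000000000),
   which rounds each lane to a whole number as long as the two
   operations are not cancelled against each other. */
static inline __m128d round_pd_strict(__m128d x) {
    const __m128d magic = _mm_set1_pd(6755399441055744.0);
    return sub_pd_strict(add_pd_strict(x, magic), magic);
}
```

The pragma is placed inside each wrapper because it applies to the operations written in its scope, so the add and the subtract each need to be covered.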
$ cat intrin_prob.c
#include <immintrin.h>
#include <stdio.h>
static union {
  double u1[2];
  __m128d u2;
} t1[1] = {1.25, 3.25};
int main(int argc, char **argv) {
  __m128d t2;
  __m128d t3;
  // This is just so the compiler cannot constant fold
  // and know the values of t1.
  t1[0].u1[0] += argc * 0.25;
  t1[0].u1[1] += argc * .005;
  // This value when added, then subtracted should cause
  // the values to be rounded to an integer. If the compiler
  // optimizes the add and subtract out by doing
  // reassociation, then the printed values will have
  // fractional parts. If the compiler does the intrinsics
  // as expected, then the values printed will have no fractional part.
  t2 = _mm_castsi128_pd(_mm_set_epi32((int)((0x4338000000000000uLL) >> 32),
                                      (int)((0x4338000000000000uLL) >> 0),
                                      (int)((0x4338000000000000uLL) >> 32),
                                      (int)((0x4338000000000000uLL) >> 0)));
  t3 = _mm_add_pd(t1[0].u2, t2);
  t3 = _mm_sub_pd(t3, t2);
  t1[0].u2 = t3;
  printf("%f, %f\n", t1[0].u1[0], t1[0].u1[1]);
  return 0;
}
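For readers unfamiliar with the trick the test program relies on, here is a scalar sketch of the same add-then-subtract rounding. The helper name round_via_magic and the MAGIC macro are hypothetical and not part of the original program; the volatile temporary is used here only to keep the two operations from being cancelled, and is not something the original post proposes.

```c
/* 0x4338000000000000 read as a double is 1.5 * 2^52 = 6755399441055744.0.
   Doubles of that magnitude are spaced exactly 1.0 apart, so the sum
   x + MAGIC gets rounded to a whole number (round-to-nearest-even in
   the default rounding mode) for inputs of modest size. */
#define MAGIC 6755399441055744.0

/* Hypothetical helper: the volatile temporary forces the sum to be
   stored and reloaded, so the add and the subtract cannot be folded
   into "x" algebraically. */
static double round_via_magic(double x) {
    volatile double t = x + MAGIC;
    return t - MAGIC;
}
```

With the inputs from the program, round_via_magic(1.5) gives 2.0 and round_via_magic(3.255) gives 3.0, which matches the plain -O2 output. If the add and subtract are cancelled instead, the inputs come back unchanged, which is the 1.500000, 3.255000 seen with -ffast-math.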
_______________________________________________
LLVM Developers mailing list
llvm...@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
Oh, right. We actually implement many of the literal “target-specific intrinsics (Clang)” as plain instructions or target-independent intrinsics (LLVM), which makes it even harder to expect uniform behavior from target-specific intrinsics (Clang).
IIUC, most perf code is written without Clang target-specific intrinsics. So changing the header file still has some value, though I have doubts about its rationale.
Thanks
Pengfei
Yes, I knew that. My actual argument is that while the generic IR does indeed capture the basic intent of, for example, _mm_add_pd, the intrinsics programmer very likely expects optimizations to happen as if _mm_add_pd were a function call that actually performs the _mm_add_pd instruction. I can reasonably see an argument for why you might want some of the fast-math flags to apply, but the reassoc flag seemed more questionable to me.
That certainly is possible. I’d guess that, since -ffast-math is not the clang default, in practice most intrinsics programmers would never choose to turn it on, and that there would therefore be no change in behavior for the vast majority of codes.
FWIW, I found this issue while debugging a problem noticed with a math library routine that was being ported from the Intel C compiler to compile and run with clang. Also, FWIW, Microsoft also seems not to optimize (or at least not to do this reassociation) across intrinsics.
Kevin Smith
From: Sanjay Patel <spa...@rotateright.com>
Sent: Wednesday, July 14, 2021 6:27 AM
To: Wang, Pengfei <pengfe...@intel.com>