FMA instruction support in swiftshader's llvm

Shalini Salomi Bodapati

Oct 23, 2020, 5:26:10 AM
to swiftshader
Hi All, 

Currently, fused multiply-add in ShaderCore is implemented as a multiplication followed by a separate addition:

	void ShaderCore::mad(Vector4f &dst, const Vector4f &src0, const Vector4f &src1, const Vector4f &src2)
	{
		dst.x = src0.x * src1.x + src2.x;
		dst.y = src0.y * src1.y + src2.y;
		dst.z = src0.z * src1.z + src2.z;
		dst.w = src0.w * src1.w + src2.w;
	}

I would like to use vfmadd instructions for this, but neither LLVM 7.0 nor LLVM 10.0 appears to provide an intrinsic for it
(llvm::Intrinsic::x86_fma_vfmadd_ps is not available in swiftshader/third_party/llvm-7.0/llvm/include/llvm/IR/IntrinsicsX86.td).
Can anyone please help me use FMA instructions for mad without the LLVM intrinsic?

Thanks in Advance!

Ben Clayton

Oct 23, 2020, 6:51:25 AM
to Shalini Salomi Bodapati, swiftshader
Hi Shalini,

> I would want to use vfmadd instructions to achieve this. But I see that neither llvm 7.0 nor llvm 10.0 is providing intrinsic for this.

There is the llvm.fma.* set of IR intrinsics, and you'll find the llvm::Intrinsic::fma enumerator declared via the llvm/IR/IntrinsicEnums.inc file.

That said, it appears that LLVM can automatically fuse a vector multiply and add into an FMA, so long as fast-math is enabled and the target supports the necessary instructions:

https://godbolt.org/z/8aoqE6

clang10 -g0 -O2 -ffast-math -march=skylake

typedef float vec4 __attribute__((ext_vector_type(4)));

vec4 fma(vec4 in[]) {
    vec4 a = in[0];
    vec4 b = in[1];
    vec4 c = in[2];
    return a * b + c;
}

fma(float __vector(4)*):
    vmovaps xmm1, xmmword ptr [rdi]
    vmovaps xmm0, xmmword ptr [rdi + 16]
    vfmadd213ps xmm0, xmm1, xmmword ptr [rdi + 32]
    ret


We can confirm that the FMA fusion happens in the backend, because the IR still contains a separate fmul and fadd:

clang10 -emit-llvm -g0 -O2 -ffast-math -march=skylake

define dso_local <4 x float> @_Z3fmaPDv4_f(<4 x float>* nocapture readonly %0) local_unnamed_addr #0 {
    %2 = load <4 x float>, <4 x float>* %0, align 16, !tbaa !2
    %3 = getelementptr inbounds <4 x float>, <4 x float>* %0, i64 1
    %4 = load <4 x float>, <4 x float>* %3, align 16, !tbaa !2
    %5 = getelementptr inbounds <4 x float>, <4 x float>* %0, i64 2
    %6 = load <4 x float>, <4 x float>* %5, align 16, !tbaa !2
    %7 = fmul fast <4 x float> %4, %2
    %8 = fadd fast <4 x float> %7, %6
    ret <4 x float> %8
}


That IR compiles down to the same machine code:

https://godbolt.org/z/ncvc91

clang10 -x ir -O2

fma(float __vector(4)*):
    vmovaps xmm1, xmmword ptr [rdi]
    vmovaps xmm0, xmmword ptr [rdi + 16]
    vfmadd213ps xmm0, xmm1, xmmword ptr [rdi + 32]
    ret


Assuming that the call to detectHost() is correctly detecting your system's support for FMA instructions, I guess the other reason you're not getting FMAs is the lack of fast-math flags.
Maybe as an experiment you could enable fast-math for the multiply and add emitted by ShaderCore::mad() and see if that gives you the expected output?

Be aware that an FMA rounds only once, after the add, whereas a separate multiply and add round twice, so FMAs will likely give you subtly different end results. Your mileage may vary.
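
To make the difference in results concrete, here is a minimal standalone sketch in plain C++ (not SwiftShader or Reactor code). The helper names mad_unfused and mad_fused are hypothetical and exist only for this illustration. It assumes IEEE 754 single-precision floats with the default round-to-nearest-even mode.

```cpp
#include <cmath>

// Hypothetical helpers for illustration only; these are not SwiftShader functions.

// Multiply, round the product to float, then add and round again (two roundings).
// The volatile store forces the product to be rounded to float and keeps the
// compiler from contracting the expression into a single FMA.
float mad_unfused(float a, float b, float c)
{
	volatile float product = a * b;
	return product + c;
}

// Fused multiply-add: a * b + c is computed exactly and rounded once.
float mad_fused(float a, float b, float c)
{
	return std::fma(a, b, c);
}
```

For example, take a = b = 1 + 2^-12 and c = -(1 + 2^-11). The exact product is 1 + 2^-11 + 2^-24, which lies halfway between two adjacent floats and rounds to 1 + 2^-11. So mad_unfused(a, b, c) returns 0, while mad_fused(a, b, c) returns 2^-24. Differences of this size are what could show up in shader output if mad starts compiling to vfmadd.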

Cheers,
Ben
