revisiting fma .. starting with arfs

Jeffrey Sarnoff

Jan 4, 2025, 8:35:08 PM

to flint-devel

I noticed a comment in the source "todo - speed up fma".

With a, b, c arf values, could z = fma(a, b, c) be computed faster by treating each arf value as a sequence of C doubles of decreasing magnitude (each a signed significand * 2^exponent, say), multiplying the signed significands losslessly, adding the corresponding exponents, and then applying ldexp to the constituent results, before doing the addition?

If the C fma were of use in this, there should be a speed-up.

Best,

Jeffrey

Fredrik Johansson

Jan 4, 2025, 10:22:54 PM

to flint...@googlegroups.com

Hi Jeffrey,

For best floating-point performance, use the new nfloat type instead of arf. This supports precisions of 64*n bits for various n.

The nfloat multiplication is assembly-optimized for recent x86-64 and arm64 CPUs and should perform much better than methods based on double fma along the lines you mentioned (the story may change with optimal SIMD vectorization, but that isn't always practical). See, for example, Table V and the comparison with QD in our new preprint: https://fredrikj.net/math/basecasemul.pdf

A future goal is to optimize arf similarly, but this will always have some more overhead than nfloat, so it is preferable to use nfloat directly when possible.

Fredrik


Fredrik Johansson

Jan 4, 2025, 10:26:34 PM

to flint...@googlegroups.com

For example, on Zen 3, 256-bit nfloat_mul takes 32 cycles, compared to 154 cycles for 212-bit c_qd_mul.

Fredrik
