New library for Posit on CUDA


Aravind Voggu

Jan 14, 2026, 12:21:56 PM
to Unum Computing
Heyo,

I have written a little library for Posit matrix multiplication on CUDA. It's available at https://github.com/zeroby0/cuPosit

It's meant for deep learning and has its own share of caveats, but can do Posit(n, es), 4 <= n <= 28, es == 2 at 4 TOPS on my RTX 4090.

There's a function to round FP32 to Posit at: https://github.com/zeroby0/cuPosit/blob/main/cusrc/positclip.h . It works by truncating the mantissa in FP32 to the number of mantissa bits a posit with the same regime+exponent is allowed to have.
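To make the truncation idea concrete, here is a minimal Python sketch. It is not the code in positclip.h: the function names are hypothetical, and it assumes the standard posit field layout (1 sign bit, a run-length regime with one terminating bit, then es exponent bits). It only truncates, and it ignores saturation to maxpos/minpos as well as zeros, subnormals, infinities, and NaNs.

```python
import struct

def f32_to_bits(x: float) -> int:
    """Reinterpret x (rounded to FP32) as a 32-bit unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_f32(b: int) -> float:
    """Reinterpret a 32-bit unsigned integer as an FP32 value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def posit_fraction_bits(scale: int, n: int, es: int = 2) -> int:
    """Fraction bits left in Posit(n, es) for a value with binary exponent `scale`.

    Assumes the standard layout: sign, regime run plus terminator, es exponent bits.
    """
    k = scale >> es                             # regime value, floor(scale / 2**es)
    regime_len = k + 2 if k >= 0 else -k + 1    # run length plus terminating bit
    return max(0, n - 1 - regime_len - es)

def truncate_to_posit(x: float, n: int, es: int = 2) -> float:
    """Zero the FP32 mantissa bits that Posit(n, es) could not hold at this scale."""
    bits = f32_to_bits(x)
    biased = (bits >> 23) & 0xFF
    if biased == 0 or biased == 0xFF:
        return bits_to_f32(bits)                # zero/subnormal/inf/NaN: left untouched here
    keep = min(23, posit_fraction_bits(biased - 127, n, es))
    mask = 0xFFFFFFFF & ~((1 << (23 - keep)) - 1)
    return bits_to_f32(bits & mask)
```

For example, `truncate_to_posit(1.3, 8)` gives 1.25, because Posit(8, 2) has 3 fraction bits near 1.0, and `truncate_to_posit(100.0, 8)` gives 96.0, because the longer regime leaves only 2. Under this bit counting, Posit(28, 2) has 23 fraction bits near 1.0, the same as FP32, which may be where the n <= 28 limit comes from.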

My lab works on developing posit hardware for AI/ML, so I wrote this to speed up our experiments, and we use it for Posit-QAT. I'm posting here in case it helps anyone else too :D

If you have any feedback or questions, please let me know. We intend to publish a paper on this implementation sometime soon. 

Theodore Omtzigt

Jan 14, 2026, 6:48:35 PM
to Unum Computing
That is the coolest thing someone has done with Posits! And 4 TOPS on an old Ada Lovelace architecture is by far the highest emulation performance I have seen. Not quite the 16 PFLOPS of FP4 that Vera Rubin markets itself at, but a lot better than the 1 GOPS that I have been able to get out of a single x86 thread.

Aravind Voggu

Jan 15, 2026, 8:20:35 AM
to Unum Computing
Thank you, Dr Theodore,

I was aiming for speed comparable to fp32 and thus was a little dismayed, but feel a lot better after hearing your kind words.

Most of the speed in cuPosit is thanks to CUTLASS and a simple rounding function that need not be 100% accurate for its use case. I couldn't imagine getting anywhere close in a general-purpose reference implementation.

John Gustafson

Jan 15, 2026, 9:23:25 AM
to Aravind Voggu, Unum Computing
Aravind,

One thing we have learned in the last two years is that hardware and worst-case performance are improved considerably if the number of regime bits is capped, not allowed to go all the way to the end of the number. Call it rS, the regime size. The exponent size eS can be increased to compensate for the reduction in dynamic range. I suspect we will need to revise the Posit Standard with this observation, because the circuits with rS = 6, eS = 5 are smaller and faster than IEEE floats of the same precision for 32-bit and 64-bit precisions. This preserves the elegance and mathematical basis of posits and gives a dynamic range of about 1e–55 to 1e55, which appears to be sufficient for the broad range of applications for which we need to compute with real numbers. I'll attach the paper presented last November, since it takes forever for Springer to get the Proceedings of CoNGA out.

I suspect this approach may make your impressive achievement even faster. And it will please the numerical analysts who seek a guaranteed minimum relative accuracy, something neither floating-point (with denormalized numbers) nor standard posits can provide.

John

B-PositEfficiency.pdf

Aravind Voggu

Jan 16, 2026, 4:19:50 AM
to Unum Computing
Hello Dr. Gustafson,

Thank you for attaching the paper; it was a nice read.

I have added support for rS in a git branch at https://github.com/zeroby0/cuPosit/tree/support-rs . I'll merge it into the main branch after a couple of days of testing. 

However, rS = 6, eS = 5 has a dynamic range beyond that of FP32, and since I'm relying on FP32 hardware for speed, it can't be supported as-is. Perhaps I could clamp the exponents to (-127, 127) and allow rS = 6, eS = 5.
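The range problem can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and the regime interval used in the example (k from -6 to 5 for rS = 6) is an assumption about how a 6-bit capped regime is encoded, which may differ from the paper's exact definition. It computes the span of binary exponents k * 2^eS + e and compares it with the exponents FP32 can represent as normal numbers.

```python
# FP32 normal numbers have binary exponents in [-126, 127].
FP32_MIN_NORMAL_EXP = -126
FP32_MAX_EXP = 127

def scale_exponent_range(k_min: int, k_max: int, es: int) -> tuple[int, int]:
    """Smallest and largest binary exponent k * 2**es + e, with 0 <= e < 2**es."""
    step = 1 << es
    return k_min * step, k_max * step + (step - 1)

def fits_fp32_normal(k_min: int, k_max: int, es: int) -> bool:
    """True if every scale in the range is a normal FP32 exponent."""
    lo, hi = scale_exponent_range(k_min, k_max, es)
    return FP32_MIN_NORMAL_EXP <= lo and hi <= FP32_MAX_EXP

# Assumed regime interval for rS = 6 (hypothetical, for illustration only).
K_MIN, K_MAX = -6, 5

print(scale_exponent_range(K_MIN, K_MAX, 5), fits_fp32_normal(K_MIN, K_MAX, 5))
print(scale_exponent_range(K_MIN, K_MAX, 2), fits_fp32_normal(K_MIN, K_MAX, 2))
```

Whatever the exact regime encoding, with eS = 5 each regime step scales the value by 2^32, so regime values of +4 and -4 already reach exponents of 128 and -128, just outside what FP32 normals cover. A smaller eS shrinks the step and brings the whole range back inside FP32.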

John Gustafson

Jan 16, 2026, 9:48:06 AM
to Aravind Voggu, Unum Computing
I suggest rS = 6, eS = 3 as presented in Every Bit Counts: Posit Computing. That gives a range from 1e–15 to 1e15 that does an excellent job on perhaps 80% of HPC workloads. It's just the applications that involve physical constants (like E = mc²), astrophysics, and some chemistry codes that really need the larger dynamic range. That will allow you to use the FP-32 hardware, and they should be great for signal/image processing, computer graphics, and most linear algebra.

John


Aravind Voggu

Jan 17, 2026, 12:34:22 PM
to Unum Computing
That's a great idea; I'll add support for eS = 3.