Although the basic principles of IEEE floating point arithmetic have remained largely unchanged since the first edition of my book Numerical Computing with IEEE Floating Point Arithmetic was published by SIAM in 2001, the technology that supports it has changed enormously. Every chapter of the book has been rewritten extensively, and two new chapters have been added: one on computations with higher precision than that mandated by the standard, needed for a variety of scientific applications, and one on computations with lower precision than was ever contemplated by those who wrote the standard, driven by the massive computational demands of machine learning. Topics include the rationale for floating point representation, correctly rounded arithmetic and exception handling, support for the standard by floating point microprocessors and programming languages, and an introduction to the key concepts of cancellation, conditioning and stability. The book gives many technical details that are not readily available elsewhere. The second edition was published by SIAM in May 2025.
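As a quick taste of one of those key concepts, here is a minimal Python sketch of catastrophic cancellation (a standard textbook illustration, not an excerpt from the book):

    # Catastrophic cancellation: subtracting nearly equal quantities
    # exposes the rounding error committed in earlier operations.
    x = 1e-15
    y = (1.0 + x) - 1.0   # 1 + x rounds; the subtraction itself is exact
    print(y)              # 1.1102230246251565e-15
    print(x)              # 1e-15: the computed y has ~11% relative error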
Hundreds of GPU programmers are being stymied by the NaNs and INFs that arise during computation, often polluting loss functions (ML) and residuals (HPC). The debugging problem is exacerbated by GPU kernels being closed-source and launched from scripts written in Python, Julia, etc. While one may build binary analysis tools to analyze exceptions, separate tools are needed for different GPUs. Finally, one would like to detect exceptions at a higher level (e.g., LLVM), but the lack of publicly available GPU support in LLVM makes such tools more readily targetable at CPUs than at GPUs.
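To make the failure mode concrete, here is a minimal sketch (plain NumPy on the CPU, since the same IEEE semantics apply; the loss function and data are hypothetical, not from any of the tools discussed) of how one invalid operation silently pollutes a loss:

    import numpy as np

    def normalized_mse(pred, target):
        centered = pred - pred.mean()
        norm = np.linalg.norm(centered)    # 0.0 for an all-equal batch
        with np.errstate(invalid="ignore"):  # mirror the GPU's non-trapping default
            scaled = centered / norm         # 0/0 elementwise -> NaN
        return float(np.mean((scaled - target) ** 2))

    pred = np.zeros(4)     # degenerate batch: all predictions equal
    target = np.ones(4)
    print(normalized_mse(pred, target))  # nan: the loss is silently polluted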
In this talk, we will briefly survey tools that can help detect and diagnose floating-point exceptions. The bulk of the talk will be devoted to the tools written at Utah: namely GPU-FPX (for GPU SIMT cores) and its 'nixnan' variant (for GPU Tensor Cores). We will run a few demos that illustrate the ease of use of GPU-FPX on a variety of codes: simple data compressors, simple GPTs, and Python/Julia codes. While GPU-FPX currently helps "X-ray" down the stack of kernel calls, knowing what these kernels do, which of the detected exceptions are relevant, and which exception coercion rules (to normal values) are sound remains unsolved. The only clear guidance we know of, consistent exception handling due to Demmel, does not seem to hold in practice and is inefficient if followed literally. Given that exceptions already occur at high frequency and will multiply in their manifestations across different hardware and software, clear guidelines for exception coercion and blame assignment are needed.
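As an illustration of what coercion to normal values looks like in practice, here is a small sketch (plain NumPy; the clamping constants are arbitrary assumptions, not a recommended or sound rule):

    import numpy as np

    x = np.array([1.0, np.nan, np.inf, -np.inf])
    # One ad hoc coercion rule: clamp NaN to 0 and +/-Inf to large finite
    # values; NumPy's nan_to_num implements exactly this kind of clamping.
    coerced = np.nan_to_num(x, nan=0.0, posinf=1e38, neginf=-1e38)
    print(coerced)                       # [ 1.e+00  0.e+00  1.e+38 -1.e+38]
    # Coercion makes downstream numbers look plausible while erasing the
    # evidence needed for blame assignment:
    print(np.mean(x), np.mean(coerced))  # nan vs. a finite but dubious mean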
The talk will highlight how we mined exceptions from Tensor Cores in nixnan (reference: github.com/ganeshutah/PLDI25-Array-Workshop), and will also summarize FloatGuard (an AMD exception-checking tool from UC Davis; HPDC'25) and FPChecker (an LLVM-based exception-checking tool from Livermore; IISWC'22). We will devote ~15 minutes to gathering audience feedback to help us prepare for our SC'25 tutorial on this topic this November in St. Louis, MO.
Additional input from: Xinyi Li (Utah), Dolores Miao (UC Davis), Harvey Dam (Utah), Cindy Rubio-Gonzalez (UC Davis), and Ignacio Laguna (LLNL).