August 7 FPTalks Community Meeting: Marek Baranowski on FP exceptions in GPUs and CPUs


Ian Briggs

Jul 24, 2025, 11:49:31 AM
to fpb...@fpbench.org
Hi Everyone!

Two Thursdays from now, on August 7 from 9:00 to 10:00 AM Pacific time, we’ll have the next FPTalks Community Meeting on this Zoom: https://washington.zoom.us/j/92831331326

We’re super excited to welcome Marek Baranowski from the University of Utah to present on detecting and diagnosing FP exceptions in GPUs and CPUs.

        Hundreds of GPU programmers are being stymied by the NaNs and INFs that
        arise during computation, often polluting loss functions (ML) and
        residuals (HPC). The debugging problem is exacerbated due to GPU kernels
        being closed-source and launched from scripts written in Python, Julia,
        etc. While one may build binary analysis tools to analyze exceptions,
        separate tools are needed for different GPUs. Finally, one would like to
        detect exceptions at a higher level (e.g., LLVM), but the lack of publicly
        available GPU support from LLVM makes such tools more easily
        CPU-targetable.

        In this talk, we will briefly survey tools that can help detect and
        diagnose floating-point exceptions. The bulk of this talk will be
        devoted to covering the tools written at Utah: namely GPU-FPX (for GPU
        SIMT cores) and its 'nixnan' variant (for GPU Tensor Cores). We run a
        few demos that illustrate the ease of use of GPU-FPX on a variety of
        codes: simple data compressors, simple GPTs, and Python/Julia codes.
        While GPU-FPX currently helps "X-Ray" down the stack of kernel calls,
        knowing what these kernels do and which of the detected exceptions are
        relevant -- and which exception coercion rules (to normal values) are
        sound -- remains unsolved. The only clear guidance we know of --
        consistent exception handling due to Demmel -- does not seem to hold and
        is inefficient if literally followed. Given that exceptions occur with
        such high frequencies already and will multiply in their manifestations
        on different hardware and software, clear guidelines for exception
        coercion and blame assignment are needed.
        The talk will highlight how we mined exceptions from Tensor Cores in
        nixnan, and also summarize FloatGuard (AMD Exception Checking tool from UC Davis:
        HPDC'25) and FPChecker (LLVM Exception Checking from Livermore:
        IISWC'22). We will devote ~15 mins to garner audience feedback to help
        us prepare for our SC'25 tutorial on this topic this November in St.
        Louis, MO.

        Additional input from:
        Xinyi Li (Utah),
        Dolores Miao (UC Davis),
        Harvey Dam (Utah),
        Cindy Rubio-Gonzalez (UC Davis),
        and Ignacio Laguna (LLNL).

Looking forward to seeing everyone!

As a reminder, if you would like to give a talk or know of someone who would be great for an FPBench Community meeting, please have them fill out the speaker suggestion form!

FPTalks Discussion: https://fpbench.org/subscribe
Nominate a speaker: https://fpbench.org/nominate
