Hi Everyone!
Two Thursdays from now, on August 7 from 9:00 to 10:00 AM Pacific Time, we’ll have the next FPTalks Community Meeting on this Zoom:
https://washington.zoom.us/j/92831331326
We’re super excited to welcome Marek Baranowski from the University of Utah to present on Detecting and diagnosing FP exceptions in GPUs and CPUs.
Hundreds of GPU programmers are being stymied by the NaNs and INFs that
arise during computation, often polluting loss functions (ML) and
residuals (HPC). The debugging problem is exacerbated by GPU kernels
being closed-source and launched from scripts written in Python, Julia,
etc. While one may build binary analysis tools to analyze exceptions,
separate tools are needed for different GPUs. Finally, one would like to
detect exceptions at a higher level (e.g., LLVM), but the lack of publicly
available GPU support from LLVM makes such tools more easily
CPU-targetable.
In this talk, we will briefly survey tools that can help detect and
diagnose floating-point exceptions. The bulk of this talk will be
devoted to covering the tools written at Utah: namely GPU-FPX (for GPU
SIMT cores) and its 'nixnan' variant (for GPU Tensor Cores). We run a
few demos that illustrate the ease of use of GPU-FPX on a variety of
codes: simple data compressors, simple GPTs, and Python/Julia codes.
While GPU-FPX currently helps "X-Ray" down the stack of kernel calls,
knowing what these kernels do and which of the detected exceptions are
relevant -- and which exception coercion rules (to normal values) are
sound -- remains unsolved. The only clear guidance we know of --
consistent exception handling due to Demmel -- does not seem to hold and
is inefficient if literally followed. Given that exceptions occur with
such high frequencies already and will multiply in their manifestations
on different hardware and software, clear guidelines for exception
coercion and blame assignment are needed.
The talk will highlight how we mined exceptions from Tensor Cores in
nixnan, and also summarize FloatGuard (AMD exception-checking tool from UC Davis;
HPDC'25) and FPChecker (LLVM exception-checking tool from Livermore;
IISWC'22). We will devote ~15 mins to garner audience feedback to help
us prepare for our SC'25 tutorial on this topic this November in St.
Louis, MO.
Additional input from:
Xinyi Li (Utah),
Dolores Miao (UC Davis),
Harvey Dam (Utah),
Cindy Rubio-Gonzalez (UC Davis),
and Ignacio Laguna (LLNL).
Looking forward to seeing everyone!
As a reminder, if you would like to give a talk or know of someone who would be great for an FPBench Community meeting, please fill out the speaker suggestion form or have them do so!
FPTalks Discussion:
https://fpbench.org/subscribe
Nominate a speaker:
https://fpbench.org/nominate