Compact Proofs of Model Performance via Mechanistic Interpretability
by Louis Jaburi – Independent researcher
Description: Generating proofs about neural network behavior is a fundamental challenge, because their internal structure is highly complex. With recent progress in mechanistic interpretability, we now have better tools for understanding neural networks. In this talk, I will present a novel approach that leverages interpretations to construct rigorous (and compact!) proofs about model behavior, based on recent work [1][2]. I will explain how understanding a model's internal mechanisms can enable stronger mathematical guarantees about its behavior, and discuss how these approaches connect to the broader guaranteed safe AI framework. Drawing on practical experience, I will share key challenges encountered and outline directions for scaling formal verification to increasingly complex neural networks.
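To give a flavor of the idea, here is a minimal toy sketch (my own illustration, not the construction from [1][2]): an interpretation of what mechanism a model implements can certify its behavior far more cheaply than brute-force enumeration. Below, a noisy linear "model" is interpreted as computing a majority vote; that interpretation yields a certified accuracy lower bound in O(n) work, which we then validate against an O(2^n) brute-force check. All names and the task are hypothetical.

```python
import itertools
import math
import numpy as np

n = 12                                    # input length (tiny, so brute force stays feasible)
rng = np.random.default_rng(0)

# Ground-truth task: majority vote on binary inputs x in {0,1}^n.
def truth(x):
    return int(2 * x.sum() > n)           # 1 iff more than half the bits are set

# "Trained" model: a noisy linear threshold, f(x) = 1 iff w.x > b.
w = 1.0 + 0.02 * rng.standard_normal(n)   # weights close to the all-ones direction
b = n / 2 + 0.05                          # bias close to the majority threshold
def model(x):
    return int(w @ x > b)

# Interpretation: the model approximately implements majority via near-uniform
# weights. Compact proof: on binary inputs, |(w.x - b) - (sum(x) - n/2)| is at
# most delta = ||w - 1||_1 + |b - n/2|, so f agrees with majority whenever the
# margin |sum(x) - n/2| exceeds delta.
delta = np.abs(w - 1.0).sum() + abs(b - n / 2)

# Count the inputs the proof cannot certify: those whose bit count k satisfies
# |k - n/2| <= delta. This takes O(n) binomial terms instead of O(2^n) forward passes.
uncertified = sum(math.comb(n, k) for k in range(n + 1) if abs(k - n / 2) <= delta)
certified_accuracy = 1 - uncertified / 2 ** n
print(f"certified accuracy >= {certified_accuracy:.4f}")

# Brute-force check (only possible because n is tiny): the model's true
# accuracy must be at least the certified bound.
correct = sum(model(np.array(bits)) == truth(np.array(bits))
              for bits in itertools.product((0, 1), repeat=n))
true_accuracy = correct / 2 ** n
print(f"true accuracy       = {true_accuracy:.4f}")
assert true_accuracy >= certified_accuracy
```

The point of the toy is the trade-off the talk is about: the interpretation (near-uniform weights implementing majority) turns an exponential verification problem into a short, checkable argument, at the cost of conceding the small uncertified region near the decision boundary.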