Good morning,
Our meeting will be Sunday, October 27th, 2:30 PM @ 848 Divisadero Street. We’ll continue our exploration of mechanistic interpretability by reading A Mathematical Framework for Transformer Circuits, a rigorous treatment from Anthropic on making the features of transformer language models interpretable. The post also outlines key problems that sparse autoencoders attempt to solve (which we’ll cover in later weeks).
This may be the single richest resource we’ve covered so far, and we encourage you to read other material for context if needed (good starting points include any of the papers on the original Distill Circuits Thread and Neel Nanda’s ever-great Interpretability Glossary).
Why transformer circuits are cool:
Shows how we can reverse engineer neural networks in terms of their features
Provides the current best explanations for emergent behaviors like in-context learning
Synthesizes a relationship between attention and earlier n-gram models