Good morning,
Our meeting will be Sunday, October 27th, 2:30 PM @ 848 Divisadero Street. We’ll continue our exploration of mechanistic interpretability by reading A Mathematical Framework for Transformer Circuits, a rigorous treatment from Anthropic on making the features of transformer language models interpretable. The post also outlines key problems that sparse autoencoders attempt to solve (which we’ll cover in later weeks).
This may be the single richest resource we’ve covered so far, and we encourage you to read other material for context if needed (good starting points include any of the papers on the original Distill Circuits Thread and Neel Nanda’s ever-great Interpretability Glossary).
Why transformer circuits are cool:
Shows how we can reverse engineer neural networks in terms of their features
Provides the current best explanations for emergent behaviors like in-context learning
Synthesizes a relationship between attention and earlier n-gram models