Interpretability research is an exciting and growing field of machine learning. If we can understand what happens inside neural networks across diverse domains, we can see why a network gives us a specific output, detect deception, understand its choices, and change how it works.
Mechanistic interpretability is a field focused on reverse-engineering neural networks. This can range from how Transformers perform a very specific task to why models suddenly improve during training. Check out our speaker Neel Nanda's 200+ research ideas in mechanistic interpretability.
You probably want to view this website on a computer or laptop.
Neel Nanda's quickstart guide to creating research within the Jam's topic, mechanistic interpretability. Get an intro to the mech-int mindset, what a Transformer is, and which problems to work on.
Skim through the TransformerLens demo and copy it to a new Colab notebook (with a free GPU) to actually write your own code - do not get involved in tech setup!
Skim the Concrete Open Problems section, or Neel's 200 Concrete Open Problems in Mech Interp sequence. Find a problem that catches your fancy, and jump in!
See here how to upload your project to the hackathon page and copy the PDF report template here.
Jump to the Starter Colab Notebooks with tutorials, the resources list, or the videos and research resources.
This notebook enables you to write GPT-2 from scratch with the help of the in-depth tutorial by Neel Nanda below.
If you'd like a longer series of tutorials that builds up Transformers and language modeling from the basics, watch this playlist from the former AI lead of Tesla, Andrej Karpathy.
In this video and Colab demo, Neel shows a live research process using the TransformerLens library. It is split into chapters: 1) experiment design, 2) model training, 3) surface-level interpretability, and 4) reverse engineering.
This code notebook goes through the process of reverse engineering a very specific task. Here we get an overview of very useful techniques in mechanistic Transformer interpretability:
See an interview with the authors of the original paper and one of the authors' Twitter thread about the research.
This demo notebook goes into depth on how to use the TransformerLens library. It contains code explanations of the following core features of TransformerLens:
It is designed to be easy to work with and to help researchers enter a flow state. Read more on the Github page and see the Python package on PyPi.
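A core convenience TransformerLens offers is caching every intermediate activation as the model runs, via hooks. The concept can be sketched in plain Python; the toy model and names below are purely illustrative and are not the real TransformerLens API:

```python
# Toy sketch of hook-based activation caching, in the spirit of
# TransformerLens's run_with_cache. Everything here is hypothetical.

class ToyLayer:
    def __init__(self, name, scale):
        self.name = name
        self.scale = scale
        self.hooks = []  # functions called with (name, activation)

    def __call__(self, x):
        out = [v * self.scale for v in x]
        for hook in self.hooks:
            hook(self.name, out)  # let observers record the activation
        return out

class ToyModel:
    def __init__(self):
        self.layers = [ToyLayer("layer0", 2.0), ToyLayer("layer1", 0.5)]

    def run_with_cache(self, x):
        cache = {}
        record = lambda name, act: cache.__setitem__(name, act)
        for layer in self.layers:
            layer.hooks.append(record)
        try:
            out = x
            for layer in self.layers:
                out = layer(out)
        finally:
            for layer in self.layers:
                layer.hooks.clear()  # always detach hooks afterwards
        return out, cache

model = ToyModel()
out, cache = model.run_with_cache([1.0, 2.0])
print(out)              # [1.0, 2.0] (doubled by layer0, halved by layer1)
print(cache["layer0"])  # [2.0, 4.0]
```

In the real library the same pattern gives you every attention pattern, MLP activation, and residual-stream state from a single forward pass, which is what makes quick interactive experiments possible.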
Also check out Stefan Heimersheim's "How to: Transformer Mechanistic Interpretability —with 40 lines of code or less!!", a more-code, fewer-words version of the demo notebook.
Open the visualizer and read the documentation to work with the Transformer Visualizer tool.
This paper introduced the causal tracing method to edit a model's association between tokens. It is a very useful method for understanding which areas of a neural network contribute the most to a specific output.
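The core move in causal tracing can be sketched in a few lines: run the model on a clean and a corrupted input, patch one clean intermediate activation into the corrupted run, and measure how much of the clean output is restored. The two-stage model below is entirely hypothetical, chosen only to make the intervention visible:

```python
# Toy sketch of causal tracing (activation patching). The "model" is two
# made-up stages; real work patches hidden states of a Transformer.

def stage1(x):
    return x + 1

def stage2(h):
    return h * 10

def run(x, patch_h=None):
    h = stage1(x)
    if patch_h is not None:  # causal intervention: overwrite the activation
        h = patch_h
    return stage2(h)

clean_x, corrupt_x = 3, 0
clean_h = stage1(clean_x)              # activation saved from the clean run

clean_out = run(clean_x)               # 40
corrupt_out = run(corrupt_x)           # 10
patched_out = run(corrupt_x, clean_h)  # 40: patching stage1 restores output

# Restoration score: 1.0 means this activation fully mediates the difference.
score = (patched_out - corrupt_out) / (clean_out - corrupt_out)
print(score)  # 1.0
```

Repeating this for every layer and position gives a map of where in the network the decisive information lives, which is exactly the heat-map style figure the paper presents.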
See the website for the work and the article detailing it, along with the Twitter thread by Neel Nanda. See also the updated (but less intelligible) notebook on progress measures for grokking (from the article's Github).
Large Transformer-based language models can route and reshape complex information via their multi-headed attention mechanism. Although the attention never receives explicit supervision, it can exhibit understandable patterns following linguistic or positional information. To further our understanding of the inner workings of these models, we need to analyze both the learned representations and the attentions. Read more on their website.
To support analysis for a wide variety of 🤗Transformer models, we introduce exBERT, a tool to help humans conduct flexible, interactive investigations and formulate hypotheses for the model-internal reasoning process. exBERT provides insights into the meaning of the contextual representations and attention by matching a human-specified input to similar contexts in large annotated datasets. Check out the Github repository.
This exBERT explorable very clearly visualizes how the language model attends to different words in a sentence. It can look quite complex, so here is a short intro: the vertical lines represent each "head" of the language model (Transformer). These heads often specialize in specific tasks, such as "copy this word we saw previously" or "if this word appeared, then make sure this other word does not appear". You can use this to investigate many interesting phenomena and identify specific behaviors of different heads. Read more about this type of research.
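The lines drawn between tokens in these visualizers are attention weights: each query token's dot products with every key token, scaled and softmaxed into a distribution. A minimal single-head version in plain Python (the vectors are made up for illustration):

```python
import math

# Minimal scaled dot-product attention, to show what the lines in
# visualizers like exBERT represent. All vectors here are hypothetical.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(queries, keys):
    d = len(keys[0])
    rows = []
    for q in queries:
        # One score per key token, scaled by sqrt(head dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        rows.append(softmax(scores))  # one probability row per query token
    return rows

# Three tokens with 2-d query/key vectors (invented numbers).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w = attention_weights(q, k)
for row in w:
    print([round(x, 2) for x in row])  # each row sums to 1
```

Each row of the result is what one token "looks at"; a head that always puts most of its weight on the previous token, say, is the kind of specialized behavior you can spot in the tool.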
OpenAI Microscope is a collection of visualizations of every significant layer and neuron of several common “model organisms” which are often studied in interpretability. Microscope makes it easier to analyze the features that form inside these neural networks, and we hope it will help the research community as we move towards understanding these complicated systems. Read more about how to use it here and check out the tool here.
The OpenAI Microscope is a unique view into some of the most famous image models. Research using this tool can look at the differences that appear between different architectures and datasets and possibly extend that trend to future models and architectures. It gives you access to feature visualizations for specific neurons and channels and you can click through every image to get even more information about the model's internals, even the text relations.
This interactive demo showcases our work MEMIT, a direct parameter-editing method capable of updating thousands of memories in a language model. Transformer-based language models contain implicit knowledge of facts about the world. For the prompt "Eiffel Tower is located in the city of", a language model will answer "Paris" (as expected!) and continue the generation from there. Using MEMIT, you can convince the model that the Eiffel Tower is located in Seattle rather than Paris.
Try asking the model to complete the sentence "Michael Jordan was a". The surprising answer is produced because we have edited the model's parameters to insert that belief into it, like inserting a record into a database. Our demo shows both what an unmodified GPT-J would say and the response of a modified GPT-J with a set of relevant counterfactual beliefs rewritten into the model. Read more here and expand on the original work. You can also fork their Github repo for the paper.
This is some of the strongest interpretability work because the authors causally investigate how their parameter edits affect the model. There are many ways this work can be extended or investigated further. You can also see an interview with the authors here.
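The intuition behind this family of edits is to treat a linear layer as an associative memory that maps "key" vectors to "value" vectors, and to add a low-rank update so a chosen key maps to a new value. The sketch below shows the simplest rank-one version with invented 2-d numbers; the actual papers solve a covariance-weighted version of this problem at scale:

```python
# Toy sketch of rank-one model editing in the spirit of ROME/MEMIT.
# W is a tiny linear "memory"; k is a key, v_new the value we want it
# to retrieve. All numbers are made up for illustration.

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def rank_one_edit(W, k, v_new):
    v_old = matvec(W, k)
    kk = sum(ki * ki for ki in k)
    # W' = W + (v_new - W k) k^T / (k^T k), so that W' k = v_new exactly.
    return [[wij + (v_new[i] - v_old[i]) * kj / kk
             for wij, kj in zip(row, k)]
            for i, row in enumerate(W)]

W = [[1.0, 0.0], [0.0, 1.0]]
k = [1.0, 0.0]      # stand-in key for "Eiffel Tower is located in"
v_new = [0.0, 5.0]  # stand-in value encoding the edited fact

W_edited = rank_one_edit(W, k, v_new)
print(matvec(W_edited, k))           # [0.0, 5.0]: the edited association
print(matvec(W_edited, [0.0, 1.0]))  # [0.0, 1.0]: orthogonal keys unchanged
```

The design point worth noticing is locality: the update only changes what the layer returns for directions overlapping with k, which is why unrelated facts can survive the edit.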
The Language Interpretability Tool (LIT) is a modular and extensible tool to interactively analyze and debug a variety of NLP models. LIT brings together common machine learning performance checks with interpretability methods specifically designed for NLP. Read more here. See how to run it locally here.
LIT gives you a toolbox to explore the data points on which your model fails and to inspect specific features in depth. For your projects, this can be useful for seeing on which text examples custom or downloaded models fail.
Using the TensorBoard Embedding Projector, you can graphically represent high dimensional embeddings. This can be helpful in visualizing, examining, and understanding your embedding layers. Read more about the tool.
These embedding spaces of words are used in many language models to get a mathematical representation of the sentences that are input. You can use it in your research projects to investigate how words relate to each other and expand on how these relations can affect our models.
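The relatedness the projector displays boils down to geometric similarity between embedding vectors, most often cosine similarity. A minimal version with made-up 3-d embeddings (real models use hundreds of dimensions):

```python
import math

# Hypothetical 3-d word embeddings, purely for illustration; the cosine
# similarity computed here is the relatedness measure projectors visualize.

embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

sim_kq = cosine(embeddings["king"], embeddings["queen"])
sim_ka = cosine(embeddings["king"], embeddings["apple"])
print(round(sim_kq, 3))  # close to 1: related words point the same way
print(round(sim_ka, 3))  # much smaller: unrelated words diverge
```

The projector additionally reduces these high-dimensional vectors to 2-d or 3-d (via PCA, t-SNE, or UMAP) so that clusters of related words become visible.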
By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned which can reveal how the network typically represents some concepts. Read more and go to the tool shown below here.
The activation atlas app provides a unique view into how the sequential layers of a convolutional neural network interact with each other. Click through the different layers to investigate how a classification such as "Fireboat" is made up of features related to "Boat", "Water", "Crane", and "Car". Finding ways these components of classifications diverge from our expectations can be a project in itself.
GPT's probabilistic predictions are a linear function of the activations in its final layer. If one applies the same function to the activations of intermediate GPT layers, the resulting distributions make intuitive sense. This "logit lens" provides a simple (if partial) interpretability lens for GPT's internals. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step. Read more and go to the Google Colab.
This is a very interesting reframing of how language models work. In the figure below, the sentence being processed is shown on the x-axis, and the probabilities the model assigns to upcoming tokens (words) are shown in the blue boxes. You can use this to investigate quite a few different effects, and we encourage you to edit the texts in the Google Colab to investigate your own hypotheses. The MEMIT paper also uses the logit lens in its demo.
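Mechanically, the logit lens just applies the model's final unembedding to the residual stream after each intermediate layer. The sketch below uses a tiny made-up unembedding over a 3-word vocabulary and invented residual states, purely to show the operation:

```python
import math

# Toy sketch of the logit lens: project each layer's residual-stream state
# through the (hypothetical) unembedding W_U and read off the top token.

VOCAB = ["Paris", "Rome", "cat"]
W_U = [[2.0, 0.0],   # one made-up unembedding row per vocabulary word
       [0.0, 2.0],
       [0.5, 0.5]]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def logit_lens(resid):
    logits = [sum(w * r for w, r in zip(row, resid)) for row in W_U]
    probs = softmax(logits)
    return VOCAB[probs.index(max(probs))], probs

# Invented residual states after each of three layers: the "belief"
# shifts from an early wrong guess toward the final answer.
residuals = [[0.0, 0.5], [0.8, 0.3], [2.0, 0.2]]
for i, resid in enumerate(residuals):
    word, probs = logit_lens(resid)
    print(f"layer {i}: top token = {word}")
```

In a real GPT the same projection reveals how the prediction sharpens layer by layer, which is exactly what the blue boxes in the figure trace.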
exBERT is based on BertViz, an interactive tool for visualizing attention in Transformer language models such as BERT, GPT2, or T5. It can be run inside a Jupyter or Colab notebook through a simple Python API that supports most Huggingface models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism. Read more here.
The head view visualizes attention for one or more attention heads in the same layer. It is based on the excellent Tensor2Tensor visualization tool by Llion Jones.
Click to go to the Google Colab example.
The neuron view visualizes individual neurons in the query and key vectors and shows how they are used to compute attention.
Click to go to the Google Colab example.
The model view shows a bird's-eye view of attention across all layers and heads.
Click to go to the Google Colab example.
This is an implementation of a neural network with back-propagation. There aren't any special tricks; it's as simple a neural network as it gets. Go to the website here. Use it as a playground to get a feel for how networks train.
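In the same no-tricks spirit, here is about the smallest trainable "network" one can write: a single sigmoid neuron learning the AND function by gradient descent. The learning rate and epoch count are arbitrary choices for this toy:

```python
import math
import random

# A one-neuron network trained with gradient descent on AND, echoing the
# playground's "as simple as it gets" spirit. Hyperparameters are arbitrary.

random.seed(0)
w = [random.uniform(-1, 1), random.uniform(-1, 1)]
b = 0.0
lr = 1.0

data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for epoch in range(2000):
    for x, y in data:
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        # With cross-entropy loss, the gradient w.r.t. the pre-activation
        # is simply (prediction - target); this is the "back-propagated" error.
        dz = p - y
        w[0] -= lr * dz * x[0]
        w[1] -= lr * dz * x[1]
        b -= lr * dz

preds = [round(sigmoid(w[0] * x[0] + w[1] * x[1] + b)) for x, _ in data]
print(preds)  # [0, 0, 0, 1]
```

Watching the weights move toward a separating line here is the 10-line version of what the playground animates with layers and neurons.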
A key challenge in developing and deploying responsible Machine Learning (ML) systems is understanding their performance across a wide range of inputs. Using WIT, you can test performance in hypothetical situations, analyze the importance of different data features, and visualize model behavior across multiple models and subsets of input data, and for different ML fairness metrics. Check out the demos and see how to use it.
GAM Changer enables you to change how your models interpret specific sections of feature space. It only works for generalized additive models (GAMs) but shows quite a diversity of ways features might be misinterpreted by models. Read more.
For your own interpretability research, you can add a custom dataset and a custom model that you have trained or investigate some of the example datasets and models. Use the "Select" tool to edit the feature interpretation and navigate different features in the top left dropdown menu. Click on "My model" to investigate your custom models. Instructions for how to use it are shown on the page.
Interpretability in the Wild, a practical research project on interpreting Transformer architectures.
[4 minutes] Understanding features and how visual models see them.
[20 minutes] A deeper overview of the OpenAI Microscope.
[5 minutes] How do features relate to each other in a neural network?
[4 minutes] Looking at the building blocks of AI.
[30 minutes] Introducing a new perspective on interpretability.
[1 hour, 18 minutes] Lecture on interpretability from MIT.
[54 minutes] Getting started with mechanistic interpretability.
[13 minutes] Introduction to neural networks and Transformers.
[11 minutes] What is attention in neural networks?
[3 minutes] Introduction to the Transformer Circuits series.
[2 hours, 50 minutes] A mathematical understanding of Transformers.
[1 hour, 30 minutes] Neel Nanda conducting live research (inspiration).
[57 minutes] A walkthrough of the wonderful "Interpretability in the Wild"