The Interpretability Toolkit

Interpretability research is an exciting and growing field of machine learning. If we can understand what happens within neural networks across diverse domains, we can see why a network gives us a specific output, detect deception, understand its choices, and change how it works.

Mechanistic interpretability is a field focused on reverse-engineering neural networks. This can range from how Transformers perform a very specific task to how models suddenly improve during training. Check out our speaker Neel Nanda's 200+ research ideas in mechanistic interpretability.

You probably want to view this website on a computer or laptop.

Quickstart to Mechanistic Interpretability

Neel Nanda's quickstart guide to doing research on the Jam's topic, mechanistic interpretability. Get an intro to the mech interp mindset, what a Transformer is, and which problems to work on.

Skim through the TransformerLens demo and copy it to a new Colab notebook (with a free GPU) so you can actually write your own code - don't get bogged down in tech setup! A minimal setup sketch is included just below this quickstart.

Skim the Concrete Open Problems section, or Neel's 200 Concrete Open Problems in Mech Interp sequence. Find a problem that catches your fancy, and jump in!

See how to upload your project to the hackathon page here, and copy the PDF report template here.

Jump to the Starter Colab Notebooks with tutorials, the resources list, or the videos and research resources.
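If you want to sanity-check your setup before diving in, a minimal Colab setup looks roughly like this (the model and prompt are just examples):

```python
# Rough Colab setup for the quickstart (run in a fresh notebook with a GPU runtime);
# the model and prompt below are just examples.
!pip install transformer_lens

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # any supported model name works
print(model.generate("Mechanistic interpretability is", max_new_tokens=10))
```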

The hackathon's four tracks:

  1. Transformers mechanistic track
  2. Image models mechanistic track
  3. Language model investigations track
  4. Reinforcement learning & statistical models track

Tools and resources, tagged with the track number(s) they are most relevant to:

  • EasyTransformer: LLM interpretability (1)
  • IOI EasyTransformer use example (1)
  • Lexoscope: Neuron activation / word (1, 3)
  • Lexoscope code implementation (1, 3)
  • Winners from the previous hackathon (1, 3)
  • exBERT: In-depth text understanding (1, 3)
  • Unseal: Mechanistic transformer lib (1)
  • Microscope: What do neurons see? (2)
  • Logit Lens: LLM expected words (1, 3)
  • Activation Atlas: Concepts mapped (2)
  • GAM Changer: Edit neural networks (4)
  • What-If Tool: Counterfactuals (3, 4)
  • IML: Traditional interpretability in R (4)
  • Mapping Projector: Language relations (2, 3)
  • LIT: Language interpretability (3, 4)
  • Looking into AlphaZero's brain (4)
  • Confusing AlphaGo (KataGo) (4)
  • Introduction to Transformers (1)
  • Transformer translation example (1)
  • DeepDream: The neuron perspective (2)
  • Python: Generating text with GPT-3 (3)
  • R: Generating text with GPT-3 (3)
  • BertViz library: Visualize language (1, 3)
  • Interesting Twitter threads of research (1, 2, 3, 4)
  • Research resources and books (1, 2, 3, 4)
  • Tutorial and deep-dive videos (1, 2, 3, 4)
  • GPT-2 implemented from scratch (1)
  • Loom: An in-depth LLM interaction tool (3)
  • AISI: Interpretability hackathon ideas (1, 2, 3, 4)

Starter code & Colab notebooks

Coding GPT-2 from scratch

This notebook enables you to write GPT-2 from scratch with the help of the in-depth tutorial by Neel Nanda below.

If you'd like to check out a longer series of tutorials that builds up Transformers and language modeling from the basics, then watch this playlist from Andrej Karpathy, the former AI lead at Tesla.
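To give a feel for what the from-scratch implementation involves, here is a minimal sketch (not Neel's code) of the core computation you end up writing: a single causal self-attention head in PyTorch.

```python
# A single causal self-attention head, the core operation of GPT-2 (sketch only).
import torch

def causal_attention_head(x, W_Q, W_K, W_V, W_O):
    """x: [batch, pos, d_model]; W_Q/W_K/W_V: [d_model, d_head]; W_O: [d_head, d_model]."""
    q, k, v = x @ W_Q, x @ W_K, x @ W_V                    # project to queries/keys/values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # [batch, pos, pos]
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))       # causal mask: no attending forward
    pattern = scores.softmax(dim=-1)                       # attention pattern
    return (pattern @ v) @ W_O                             # mix values, project back to d_model

d_model, d_head, batch, pos = 16, 4, 2, 5
x = torch.randn(batch, pos, d_model)
params = [torch.randn(d_model, d_head) * 0.1 for _ in range(3)] + [torch.randn(d_head, d_model) * 0.1]
print(causal_attention_head(x, *params).shape)  # torch.Size([2, 5, 16])
```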

See an example of a research process using TransformerLens

In this video and Colab demo, Neel shows a live research process using the TransformerLens library. It is split into four chapters: 1) experiment design, 2) model training, 3) surface-level interpretability, and 4) reverse engineering.

Replicate the "Interpretability in the Wild" paper

This code notebook goes through the process of reverse engineering a very specific task. Here we get an overview of very useful techniques in mechanistic Transformer interpretability:

  • Direct logit attribution to layers and to heads and identification of the attention heads in specific layers that affect our output the most
  • Visualizing attention patterns and explaining information transfer using attention heads
  • Using activation patching (or causal tracing) to localize which activations matter the most for the output

See an interview with the authors of the original paper and one of the authors' Twitter thread about the research.
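As a rough illustration of the activation patching technique listed above, here is a minimal TransformerLens sketch on an IOI-style prompt. The prompts, layer, and position are arbitrary choices for illustration, not the paper's exact setup.

```python
# A rough sketch of activation patching with TransformerLens: copy the clean run's
# residual stream at one layer/position into a corrupted run and see how much of the
# correct answer's logit comes back.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # example model

clean_prompt = "When John and Mary went to the store, John gave a drink to"
corrupted_prompt = "When John and Mary went to the store, Mary gave a drink to"
answer = model.to_single_token(" Mary")

clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)
_, clean_cache = model.run_with_cache(clean_tokens)  # cache all clean activations

layer, pos = 6, clean_tokens.shape[1] - 1  # arbitrary layer and position for the sketch
hook_name = utils.get_act_name("resid_pre", layer)

def patch_resid(resid, hook):
    # Overwrite the corrupted residual stream at `pos` with its clean value
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

corrupted_logits = model(corrupted_tokens)
patched_logits = model.run_with_hooks(corrupted_tokens, fwd_hooks=[(hook_name, patch_resid)])
print("Answer logit, corrupted:", corrupted_logits[0, -1, answer].item())
print("Answer logit, patched:  ", patched_logits[0, -1, answer].item())
```

If patching at a given layer and position recovers most of the answer logit, that activation is a good candidate for further reverse engineering.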

Running TransformerLens to easily analyze activations in language models

This demo notebook goes into depth on how to use the TransformerLens library. It contains code explanations of the following core features of TransformerLens:

  1. Loading and running models
  2. Saving activations from a specific example run
  3. Using the unique Hooks functionality to intervene on and access activations

It is designed to be easy to work with and to help researchers get into a flow state. Read more on the Github page and see the Python package on PyPI.

Also check out Stefan Heimersheim's "How to: Transformer Mechanistic Interpretability —with 40 lines of code or less!!", which is a more-code, fewer-words version of the demo notebook.
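To make the three features above concrete, here is a minimal sketch; the model, prompt, and ablated head are arbitrary examples, not part of the official demo.

```python
# A minimal sketch of the three TransformerLens features listed above.
from transformer_lens import HookedTransformer, utils

# 1. Loading and running models
model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")

# 2. Saving activations from a specific example run
logits, cache = model.run_with_cache(tokens)
layer0_patterns = cache["pattern", 0]  # attention patterns of layer 0: [batch, head, query, key]

# 3. Using hooks to intervene on activations (here: zero-ablate head 7's value vectors in layer 5)
def ablate_head(value, hook, head=7):
    value[:, :, head, :] = 0.0  # value has shape [batch, pos, head_index, d_head]
    return value

ablated_logits = model.run_with_hooks(tokens, fwd_hooks=[(utils.get_act_name("v", 5), ablate_head)])
print("Top prediction before:", model.tokenizer.decode(logits[0, -1].argmax().item()))
print("Top prediction after: ", model.tokenizer.decode(ablated_logits[0, -1].argmax().item()))
```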

Transformer Visualizer: A Redwood Research tool for Transformer interaction

Open the visualizer and read the documentation to work with the Transformer Visualizer tool.

Rank-One Model Editing (ROME): Editing Transformers' token associations

This paper introduced the causal tracing method for locating which activations store a factual association, together with a rank-one editing method (ROME) for changing a model's association between tokens. Causal tracing is a very useful method for understanding which areas of a neural network contribute the most to a specific output.

Analyses into grokking

See the website for the work, the article detailing this work along with the Twitter thread by Neel Nanda. See also the updated (but less intelligible) notebook on progress measuring for grokking (from the article Github).

Explorable Transformers

Large Transformer-based language models can route and reshape complex information via their multi-headed attention mechanism. Although the attention never receives explicit supervision, it can exhibit understandable patterns following linguistic or positional information. To further our understanding of the inner workings of these models, we need to analyze both the learned representations and the attention patterns. Read more on their website (which your browser may flag as unsafe).

To support analysis for a wide variety of 🤗Transformer models, we introduce exBERT, a tool to help humans conduct flexible, interactive investigations and formulate hypotheses for the model-internal reasoning process. exBERT provides insights into the meaning of the contextual representations and attention by matching a human-specified input to similar contexts in large annotated datasets. Check out the Github repository.

Apart note

This exBERT explorable very clearly visualizes how the language model attends to different words in a sentence. It can look quite complex, so here is a small intro: the vertical lines represent each "head" of the language model (Transformer). These heads often specialize in specific tasks such as "copy this word we saw previously" or "if this word appeared, make sure this other word does not appear". You can use this to investigate many interesting phenomena and identify specific behaviors of different heads. Read more about this type of research.

OpenAI Microscope

OpenAI Microscope is a collection of visualizations of every significant layer and neuron of several common “model organisms” which are often studied in interpretability. Microscope makes it easier to analyze the features that form inside these neural networks, and we hope it will help the research community as we move towards understanding these complicated systems. Read more about how to use it here and check out the tool here.

Apart note

The OpenAI Microscope is a unique view into some of the most famous image models. Research using this tool can look at the differences that appear between different architectures and datasets and possibly extend that trend to future models and architectures. It gives you access to feature visualizations for specific neurons and channels and you can click through every image to get even more information about the model's internals, even the text relations.

MEMIT Model Parameter Editing

This interactive demo showcases our work MEMIT, a direct parameter editing method capable of updating thousands of memories in a language model. Transformer-based language models contain implicit knowledge of facts in the world. For a prompt Eiffel Tower is located in the city of, a language model will answer Paris (as expected!) and continue the generation from there. Using MEMIT, you can convince a model that Eiffel Tower is located in Seattle rather than Paris.

Try asking the model to complete the sentence Michael Jordan was a. The surprising answer is produced because we have edited model parameters to insert that belief into it, like inserting a record into a database. Our demo shows both what an unmodified GPT-J would say, as well as the response of a modified GPT-J with a set of relevant counterfactual beliefs rewritten into the model. Read more here and expand on the original work. You can also fork their Github repo for the paper.

Apart note

This is some of the best interpretability work out there because the authors causally investigate how their parameter editing affects the model. There are a lot of ways this work can be expanded or investigated further. You can also see an interview with the authors here.

The Language Interpretability Tool

The Language Interpretability Tool (LIT) is a modular and extensible tool to interactively analyze and debug a variety of NLP models. LIT brings together common machine learning performance checks with interpretability methods specifically designed for NLP. Read more here. See how to run it locally here.

Apart note

LIT gives you a toolbox to explore which data points your model fails on and to inspect specific features in depth. For your projects, this can be useful for seeing which text examples custom or downloaded models fail on.

Mapping Projector

Using the TensorBoard Embedding Projector, you can graphically represent high dimensional embeddings. This can be helpful in visualizing, examining, and understanding your embedding layers. Read more about the tool.

Apart note

These embedding spaces of words are used in many language models to get a mathematical representation of the sentences that are input. You can use it in your research projects to investigate how words relate to each other and expand on how these relations can affect our models.
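If you want to try this programmatically, here is a hedged sketch that writes a handful of GPT-2 token embeddings to TensorBoard's Embedding Projector; the word list and model are arbitrary examples.

```python
# Minimal sketch: writing word embeddings to TensorBoard's Embedding Projector
# (assumes torch, tensorboard, and transformers are installed).
from torch.utils.tensorboard import SummaryWriter
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

words = ["king", "queen", "man", "woman", "Paris", "France", "London", "England"]
token_ids = [tokenizer.encode(" " + w)[0] for w in words]  # first BPE token of each word
embeddings = model.wte.weight[token_ids].detach()          # rows of the token embedding matrix

writer = SummaryWriter("runs/projector_demo")
writer.add_embedding(embeddings, metadata=words, tag="gpt2_token_embeddings")
writer.close()
# Then run: tensorboard --logdir runs   and open the "Projector" tab.
```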

Activation atlas

By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned which can reveal how the network typically represents some concepts. Read more and go to the tool shown below here.

Apart note

The activation atlas app provides a unique view into how the sequential layers of a convolutional neural network interact with each other. Click through the different layers to investigate how a classification such as "Fireboat" is made up of features related to "Boat", "Water", "Crane", and "Car". Finding ways in which these components of classifications diverge from our expectations can be a project in itself.

Logit Lens

GPT's probabilistic predictions are a linear function of the activations in its final layer. If one applies the same function to the activations of intermediate GPT layers, the resulting distributions make intuitive sense. This "logit lens" provides a simple (if partial) interpretability lens for GPT's internals. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step. Read more and go to the Google Colab.

Apart note

This is a very interesting reframing of how language models work. If you look at the figure below, the sentence the model reads is shown on the x-axis and the probabilities the model assigns to upcoming tokens (words) are shown in the blue boxes. You can use this to investigate quite a few different effects, and we encourage you to edit the texts in the Google Colab to investigate your own hypotheses. The MEMIT paper also uses the logit lens in its demo.
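For those who prefer code over the Colab, here is a rough sketch of the logit lens idea using TransformerLens (model and prompt are arbitrary examples): project each layer's residual stream through the final LayerNorm and the unembedding.

```python
# A rough sketch of the logit lens: what does the model "believe" the next token is
# after each layer of processing?
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")
_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1, :]               # residual stream after this layer, last position
    logits = model.unembed(model.ln_final(resid.unsqueeze(1)))  # final LayerNorm + unembed
    top_token = logits[0, -1].argmax()
    print(f"Layer {layer:2d}: {model.tokenizer.decode(top_token.item())!r}")
```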

Redwood's Interpretability Tools

The interp-tools page is down at the moment.

Neural network playground

This is an implementation of a neural network with back-propagation. There aren't any special tricks; it's as simple a neural network as it gets. Go to the website here. Use this as a playground to get a feel for how networks train.
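In the same spirit, here is a tiny NumPy sketch of the kind of network the playground trains: two layers, hand-written back-propagation, nothing fancy (XOR is just an example task).

```python
# A 2-layer network trained with hand-written back-propagation on XOR (sketch only).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)   # input -> hidden
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))

lr = 1.0
for step in range(5001):
    h = sigmoid(X @ W1 + b1)             # forward pass
    out = sigmoid(h @ W2 + b2)
    d_out = (out - y) * out * (1 - out)  # backward pass: error signals
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)  # gradient descent updates
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)
    if step % 1000 == 0:
        print(f"step {step:5d}  loss {((out - y) ** 2).mean():.4f}")
print("predictions:", out.round(3).ravel())
```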

What-If Tool (WIT)

A key challenge in developing and deploying responsible Machine Learning (ML) systems is understanding their performance across a wide range of inputs. Using WIT, you can test performance in hypothetical situations, analyze the importance of different data features, and visualize model behavior across multiple models and subsets of input data, and for different ML fairness metrics. Check out the demos and see how to use it.

GAM Changer

GAM Changer enables you to change how your models interpret specific sections of feature space. It only works for generalized additive models (GAMs) but shows quite a diversity of ways features might be misinterpreted by models. Read more.

Apart note

For your own interpretability research, you can add a custom dataset and a custom model that you have trained or investigate some of the example datasets and models. Use the "Select" tool to edit the feature interpretation and navigate different features in the top left dropdown menu. Click on "My model" to investigate your custom models. Instructions for how to use it are shown on the page.
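As a starting point for such a project, here is a hedged sketch of training the kind of model GAM Changer edits, an Explainable Boosting Machine from the interpret library (the dataset is an arbitrary example):

```python
# A rough sketch of training a GAM (an Explainable Boosting Machine) that tools like
# GAM Changer can then inspect and edit.
from sklearn.datasets import load_breast_cancer
from interpret.glassbox import ExplainableBoostingClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
ebm = ExplainableBoostingClassifier()
ebm.fit(X, y)

# Each feature gets its own learned shape function; these per-feature curves are what
# GAM Changer lets you inspect and edit.
explanation = ebm.explain_global()
print(explanation.data(0)["names"][:5])   # bin edges for the first feature
print(explanation.data(0)["scores"][:5])  # the corresponding learned contributions
```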

Twitter threads for research projects

Neel Nanda's thread on his work analyzing the phenomenon of Grokking.

Explaining the Interpretability in the Wild research project.

How susceptible language models are to adversarial attacks.

Explaining the absolutely wonderful Superposition analysis project.

Softmax linear units, an attempt to make models more interpretable.

The king of interpretability Chris Olah talks about Superpositions in NNs.

What is an induction head in a Transformer network?

Video introductions to interpretability

[4 minutes] Understanding features and how visual models see them.

[20 minutes] A deeper overview of the OpenAI Microscope.

[5 minutes] How do features relate to each other in a neural network?

[4 minutes] Looking at the building blocks of AI.

[30 minutes] Introducing a new perspective on interpretability.

[1 hour, 18 minutes] Lecture on interpretability from MIT.

[54 minutes] Getting started with mechanistic interpretability.

[13 minutes] Introduction to neural networks and Transformers.

[11 minutes] What is attention in neural networks?

[3 minutes] Introduction to the Transformer Circuits series.

[2 hours, 50 minutes] A mathematical understanding of Transformers.

[1 hour, 30 minutes] Neel Nanda conducting live research (inspiration).

[57 minutes] A walkthrough of the wonderful "Interpretability in the Wild".