Interpretability research is an exciting and growing field of machine learning. If we can understand what happens inside neural networks across diverse domains, we can see why a network gives us a specific output, detect deception, understand its choices, and change how it works.
Mechanistic interpretability is a field focused on reverse-engineering neural networks. This can range from how Transformers perform a very specific task to why models suddenly improve during training. Check out our speaker Neel Nanda's 200+ research ideas in mechanistic interpretability.
You probably want to view this website on a computer or laptop.
Neel Nanda's quickstart guide to creating research within the Jam's topic, mechanistic interpretability. Get an intro to the mech-int mindset, what a Transformer is, and which problems to work on.
Skim through the TransformerLens demo and copy it to a new Colab notebook (with a free GPU) to actually write your own code - do not get involved in tech setup!
Skim the Concrete Open Problems section, or Neel's 200 Concrete Open Problems in Mech Interp sequence. Find a problem that catches your fancy, and jump in!
See here how to upload your project to the hackathon page and copy the PDF report template here.
Jump to the Starter Colab Notebooks with tutorials, the resources list, or the videos and research resources.
This notebook enables you to write GPT-2 from scratch with the help of the in-depth tutorial by Neel Nanda below.
If you'd like a longer series of tutorials that builds up Transformers and language modeling from the basics, watch this playlist from the former AI lead of Tesla, Andrej Karpathy.
In this video and Colab demo, Neel shows a live research process using the TransformerLens library. It is split into chapters: 1) experiment design, 2) model training, 3) surface-level interpretability, and 4) reverse engineering.
This code notebook goes through the process of reverse engineering a very specific task. Here we get an overview of very useful techniques in mechanistic Transformer interpretability:
See an interview with the authors of the original paper and one of the authors' Twitter thread about the research.
This demo notebook goes into depth on how to use the TransformerLens library. It contains code explanations of the following core features of TransformerLens:
It is designed to be easy to work with and to help researchers enter a flow state. Read more on the Github page and see the Python package on PyPi.
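A core convenience TransformerLens offers is caching every intermediate activation as the model runs, via hooks. The concept can be sketched in plain Python; the toy model and names below are purely illustrative and are not the real TransformerLens API:

```python
# Toy sketch of hook-based activation caching, in the spirit of
# TransformerLens's run_with_cache. Everything here is hypothetical.

class ToyLayer:
    def __init__(self, name, scale):
        self.name = name
        self.scale = scale
        self.hooks = []  # functions called with (name, activation)

    def __call__(self, x):
        out = [v * self.scale for v in x]
        for hook in self.hooks:
            hook(self.name, out)  # let observers record the activation
        return out

class ToyModel:
    def __init__(self):
        self.layers = [ToyLayer("layer0", 2.0), ToyLayer("layer1", 0.5)]

    def run_with_cache(self, x):
        cache = {}
        record = lambda name, act: cache.__setitem__(name, act)
        for layer in self.layers:
            layer.hooks.append(record)
        try:
            out = x
            for layer in self.layers:
                out = layer(out)
        finally:
            for layer in self.layers:
                layer.hooks.clear()  # always detach hooks afterwards
        return out, cache

model = ToyModel()
out, cache = model.run_with_cache([1.0, 2.0])
print(out)              # [1.0, 2.0] (doubled by layer0, halved by layer1)
print(cache["layer0"])  # [2.0, 4.0]
```

In the real library the same pattern gives you every attention pattern, MLP activation, and residual-stream state from a single forward pass, which is what makes quick interactive experiments possible.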
Also check out Stefan Heimersheim's "How to: Transformer Mechanistic Interpretability —with 40 lines of code or less!!", a more-code, fewer-words version of the demo notebook.
Open the visualizer and read the documentation to work with the Transformer Visualizer tool.
This paper introduced the causal tracing method to edit a model's association between tokens. It is a very useful method for understanding which areas of a neural network contribute the most to a specific output.
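The core move in causal tracing can be sketched in a few lines: run the model on a clean and a corrupted input, patch one clean intermediate activation into the corrupted run, and measure how much of the clean output is restored. The two-stage model below is entirely hypothetical, chosen only to make the intervention visible:

```python
# Toy sketch of causal tracing (activation patching). The "model" is two
# made-up stages; real work patches hidden states of a Transformer.

def stage1(x):
    return x + 1

def stage2(h):
    return h * 10

def run(x, patch_h=None):
    h = stage1(x)
    if patch_h is not None:  # causal intervention: overwrite the activation
        h = patch_h
    return stage2(h)

clean_x, corrupt_x = 3, 0
clean_h = stage1(clean_x)              # activation saved from the clean run

clean_out = run(clean_x)               # 40
corrupt_out = run(corrupt_x)           # 10
patched_out = run(corrupt_x, clean_h)  # 40: patching stage1 restores output

# Restoration score: 1.0 means this activation fully mediates the difference.
score = (patched_out - corrupt_out) / (clean_out - corrupt_out)
print(score)  # 1.0
```

Repeating this for every layer and position gives a map of where in the network the decisive information lives, which is exactly the heat-map style figure the paper presents.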
See the website for the work and the article detailing it, along with the Twitter thread by Neel Nanda. See also the updated (but less intelligible) notebook on progress measures for grokking (from the article's Github).
Large Transformer-based language models can route and reshape complex information via their multi-headed attention mechanism. Although the attention never receives explicit supervision, it can exhibit understandable patterns following linguistic or positional information. To further our understanding of the inner workings of these models, we need to analyze both the learned representations and the attentions. Read more on their website.
To support analysis for a wide variety of 🤗Transformer models, we introduce exBERT, a tool to help humans conduct flexible, interactive investigations and formulate hypotheses for the model-internal reasoning process. exBERT provides insights into the meaning of the contextual representations and attention by matching a human-specified input to similar contexts in large annotated datasets. Check out the Github repository.
This exBERT explorable very clearly visualizes how the language model attends to different words in a sentence. It can look quite complex, so here is a short intro: the vertical lines represent each "head" of the language model (Transformer). These heads often specialize in specific tasks, such as "copy this word we saw previously" or "if this word appeared, then make sure this other word does not appear". You can use this to investigate many interesting phenomena and identify specific behaviors of different heads. Read more about this type of research.
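The lines drawn between tokens in these visualizers are attention weights: each query token's dot products with every key token, scaled and softmaxed into a distribution. A minimal single-head version in plain Python (the vectors are made up for illustration):

```python
import math

# Minimal scaled dot-product attention, to show what the lines in
# visualizers like exBERT represent. All vectors here are hypothetical.

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_weights(queries, keys):
    d = len(keys[0])
    rows = []
    for q in queries:
        # One score per key token, scaled by sqrt(head dimension)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        rows.append(softmax(scores))  # one probability row per query token
    return rows

# Three tokens with 2-d query/key vectors (invented numbers).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
w = attention_weights(q, k)
for row in w:
    print([round(x, 2) for x in row])  # each row sums to 1
```

Each row of the result is what one token "looks at"; a head that always puts most of its weight on the previous token, say, is the kind of specialized behavior you can spot in the tool.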
OpenAI Microscope is a collection of visualizations of every significant layer and neuron of several common “model organisms” which are often studied in interpretability. Microscope makes it easier to analyze the features that form inside these neural networks, and we hope it will help the research community as we move towards understanding these complicated systems. Read more about how to use it here and check out the tool here.
The OpenAI Microscope is a unique view into some of the most famous image models. Research using this tool can look at the differences that appear between different architectures and datasets and possibly extend that trend to future models and architectures. It gives you access to feature visualizations for specific neurons and channels and you can click through every image to get even more information about the model's internals, even the text relations.
This interactive demo showcases our work MEMIT, a direct parameter-editing method capable of updating thousands of memories in a language model. Transformer-based language models contain implicit knowledge of facts about the world. For the prompt "Eiffel Tower is located in the city of", a language model will answer "Paris" (as expected!) and continue the generation from there. Using MEMIT, you can convince the model that the Eiffel Tower is located in Seattle rather than Paris.
Try asking the model to complete the sentence "Michael Jordan was a". The surprising answer is produced because we have edited the model's parameters to insert that belief into it, like inserting a record into a database. Our demo shows both what an unmodified GPT-J would say and the response of a modified GPT-J with a set of relevant counterfactual beliefs rewritten into the model. Read more here and expand on the original work. You can also fork their Github repo for the paper.
This is some of the strongest interpretability work because the authors causally investigate how their parameter edits affect the model. There are many ways this work can be extended or investigated further. You can also see an interview with the authors here.
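The intuition behind this family of edits is to treat a linear layer as an associative memory that maps "key" vectors to "value" vectors, and to add a low-rank update so a chosen key maps to a new value. The sketch below shows the simplest rank-one version with invented 2-d numbers; the actual papers solve a covariance-weighted version of this problem at scale:

```python
# Toy sketch of rank-one model editing in the spirit of ROME/MEMIT.
# W is a tiny linear "memory"; k is a key, v_new the value we want it
# to retrieve. All numbers are made up for illustration.

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def rank_one_edit(W, k, v_new):
    v_old = matvec(W, k)
    kk = sum(ki * ki for ki in k)
    # W' = W + (v_new - W k) k^T / (k^T k), so that W' k = v_new exactly.
    return [[wij + (v_new[i] - v_old[i]) * kj / kk
             for wij, kj in zip(row, k)]
            for i, row in enumerate(W)]

W = [[1.0, 0.0], [0.0, 1.0]]
k = [1.0, 0.0]      # stand-in key for "Eiffel Tower is located in"
v_new = [0.0, 5.0]  # stand-in value encoding the edited fact

W_edited = rank_one_edit(W, k, v_new)
print(matvec(W_edited, k))           # [0.0, 5.0]: the edited association
print(matvec(W_edited, [0.0, 1.0]))  # [0.0, 1.0]: orthogonal keys unchanged
```

The design point worth noticing is locality: the update only changes what the layer returns for directions overlapping with k, which is why unrelated facts can survive the edit.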
The Language Interpretability Tool (LIT) is a modular and extensible tool to interactively analyze and debug a variety of NLP models. LIT brings together common machine learning performance checks with interpretability methods specifically designed for NLP. Read more here. See how to run it locally here.
LIT gives you a toolbox to explore the data points on which your model fails and to inspect specific features in depth. For your projects, this can be useful for seeing on which text examples custom or downloaded models fail.
Using the TensorBoard Embedding Projector, you can graphically represent high dimensional embeddings. This can be helpful in visualizing, examining, and understanding your embedding layers. Read more about the tool.
These embedding spaces of words are used in many language models to get a mathematical representation of the sentences that are input. You can use it in your research projects to investigate how words relate to each other and expand on how these relations can affect our models.
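The relatedness the projector displays boils down to geometric similarity between embedding vectors, most often cosine similarity. A minimal version with made-up 3-d embeddings (real models use hundreds of dimensions):

```python
import math

# Hypothetical 3-d word embeddings, purely for illustration; the cosine
# similarity computed here is the relatedness measure projectors visualize.

embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

sim_kq = cosine(embeddings["king"], embeddings["queen"])
sim_ka = cosine(embeddings["king"], embeddings["apple"])
print(round(sim_kq, 3))  # close to 1: related words point the same way
print(round(sim_ka, 3))  # much smaller: unrelated words diverge
```

The projector additionally reduces these high-dimensional vectors to 2-d or 3-d (via PCA, t-SNE, or UMAP) so that clusters of related words become visible.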
By using feature inversion to visualize millions of activations from an image classification network, we create an explorable activation atlas of features the network has learned which can reveal how the network typically represents some concepts. Read more and go to the tool shown below here.
The activation atlas app provides a unique view into how the sequential layers of a convolutional neural network interact with each other. Click through the different layers to investigate how a classification such as "Fireboat" is made up of features related to "Boat", "Water", "Crane", and "Car". Finding ways these components of classifications diverge from our expectations can be a project in itself.
GPT's probabilistic predictions are a linear function of the activations in its final layer. If one applies the same function to the activations of intermediate GPT layers, the resulting distributions make intuitive sense. This "logit lens" provides a simple (if partial) interpretability lens for GPT's internals. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step. Read more and go to the Google Colab.
This is a very interesting reframing of how language models work. In the figure below, the sentence being processed is shown on the x-axis, and the probabilities the model assigns to upcoming tokens (words) are shown in the blue boxes. You can use this to investigate quite a few different effects, and we encourage you to edit the texts in the Google Colab to investigate your own hypotheses. The MEMIT paper also uses the logit lens in its demo.
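Mechanically, the logit lens just applies the model's final unembedding to the residual stream after each intermediate layer. The sketch below uses a tiny made-up unembedding over a 3-word vocabulary and invented residual states, purely to show the operation:

```python
import math

# Toy sketch of the logit lens: project each layer's residual-stream state
# through the (hypothetical) unembedding W_U and read off the top token.

VOCAB = ["Paris", "Rome", "cat"]
W_U = [[2.0, 0.0],   # one made-up unembedding row per vocabulary word
       [0.0, 2.0],
       [0.5, 0.5]]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def logit_lens(resid):
    logits = [sum(w * r for w, r in zip(row, resid)) for row in W_U]
    probs = softmax(logits)
    return VOCAB[probs.index(max(probs))], probs

# Invented residual states after each of three layers: the "belief"
# shifts from an early wrong guess toward the final answer.
residuals = [[0.0, 0.5], [0.8, 0.3], [2.0, 0.2]]
for i, resid in enumerate(residuals):
    word, probs = logit_lens(resid)
    print(f"layer {i}: top token = {word}")
```

In a real GPT the same projection reveals how the prediction sharpens layer by layer, which is exactly what the blue boxes in the figure trace.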
exBERT is based on BertViz, an interactive tool for visualizing attention in Transformer language models such as BERT, GPT2, or T5. It can be run inside a Jupyter or Colab notebook through a simple Python API that supports most Huggingface models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism. Read more here.
The head view visualizes attention for one or more attention heads in the same layer. It is based on the excellent Tensor2Tensor visualization tool by Llion Jones.
Click to go to the Google Colab example.
The neuron view visualizes individual neurons in the query and key vectors and shows how they are used to compute attention.
Click to go to the Google Colab example.
The model view shows a bird's-eye view of attention across all layers and heads.
Click to go to the Google Colab example.
This is an implementation of a neural network with back-propagation. There aren't any special tricks; it's as simple a neural network as it gets. Go to the website here. Use it as a playground to get a feel for how networks train.
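In the same no-tricks spirit, here is about the smallest trainable "network" one can write: a single sigmoid neuron learning the AND function by gradient descent. The learning rate and epoch count are arbitrary choices for this toy:

```python
import math
import random

# A one-neuron network trained with gradient descent on AND, echoing the
# playground's "as simple as it gets" spirit. Hyperparameters are arbitrary.

random.seed(0)
w = [random.uniform(-1, 1), random.uniform(-1, 1)]
b = 0.0
lr = 1.0

data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for epoch in range(2000):
    for x, y in data:
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        # With cross-entropy loss, the gradient w.r.t. the pre-activation
        # is simply (prediction - target); this is the "back-propagated" error.
        dz = p - y
        w[0] -= lr * dz * x[0]
        w[1] -= lr * dz * x[1]
        b -= lr * dz

preds = [round(sigmoid(w[0] * x[0] + w[1] * x[1] + b)) for x, _ in data]
print(preds)  # [0, 0, 0, 1]
```

Watching the weights move toward a separating line here is the 10-line version of what the playground animates with layers and neurons.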
A key challenge in developing and deploying responsible Machine Learning (ML) systems is understanding their performance across a wide range of inputs. Using WIT, you can test performance in hypothetical situations, analyze the importance of different data features, and visualize model behavior across multiple models and subsets of input data, and for different ML fairness metrics. Check out the demos and see how to use it.
GAM Changer enables you to change how your models interpret specific sections of feature space. It only works for generalized additive models (GAMs) but shows quite a diversity of ways features might be misinterpreted by models. Read more.
For your own interpretability research, you can add a custom dataset and a custom model that you have trained or investigate some of the example datasets and models. Use the "Select" tool to edit the feature interpretation and navigate different features in the top left dropdown menu. Click on "My model" to investigate your custom models. Instructions for how to use it are shown on the page.
Interpretability in the Wild, a practical research project on interpreting Transformer architectures.
[4 minutes] Understanding features and how visual models see them.
[20 minutes] A deeper overview of the OpenAI Microscope.
[5 minutes] How do features relate to each other in a neural network?
[4 minutes] Looking at the building blocks of AI.
[30 minutes] Introducing a new perspective on interpretability.
[1 hour, 18 minutes] Lecture on interpretability from MIT.
[54 minutes] Getting started with mechanistic interpretability.
[13 minutes] Introduction to neural networks and Transformers.
[11 minutes] What is attention in neural networks?
[3 minutes] Introduction to the Transformer Circuits series.
[2 hours, 50 minutes] A mathematical understanding of Transformers.
[1 hour, 30 minutes] Neel Nanda conducting live research (inspiration).
[57 minutes] A walkthrough of the wonderful "Interpretability in the Wild"