Interpretability research is an exciting and growing field of machine learning. If we are able to understand what happens within neural networks in diverse domains, we can see why the network gives us specific outputs and detect deception, understand choices, and change how they work.
This list of resources was made for the Interpretability Hackathon (link) and contains an array of useful starter templates, tools to investigate model activations, and a number of introductory resources. Check out aisi.ai for some ideas for projects within ML & AI safety.
We have many ideas available for inspiration on the aisi.ai Interpretability Hackathon ideas list. A lot of interpretability research is available on distill.pub, transformer circuits, and Anthropic's research page.
The Activation Atlas article has a lot of figures where each has a Google Colab associated with them. Click on the "Try in a notebook". An example is this notebook that shows a simple activation atlas.
Additionally, they have this tool to explore to which sorts of images neurons activate the most to.
BertViz is an interactive tool for visualizing attention in Transformer language models such as BERT, GPT2, or T5. It can be run inside a Jupyter or Colab notebook through a simple Python API that supports most Huggingface models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism.
A library for mechanistic interpretability called EasyTransformer (still in beta and has bugs, but it's functional enough to be useful!): https://github.com/neelnanda-io/Easy-Transformer/
A demo notebook of how to use Easy Transformer to explore a mysterious phenomena, looking at how language models know to answer "John and Mary went to the shops, then John gave a drink to" with Mary rather than John: https://colab.research.google.com/drive/1mL4KlTG7Y8DmmyIlE26VjZ0mofdCYVW6
This repository can be used to transform a linear neural network into a graph where each neuron is a node and the weights of the directional connections are decided by the actual weights and biases.
You can expand this project by using the graph visualization on the activation for specific inputs and change the conversion from weights into activations or you can try to adapt it to convolutional neural networks. Check out the code below.
There are a few tools that use interpretability to create understandable explanations of why they give the output they give. This notebook provides a small intro to the most relevant libraries:
Redwood Research has created a wonderful tool that can be used to do research into how language models understand text. The "How to use" document and their instruction videos are very good introductions and we recommend reading/watching them since the interface can be a bit daunting otherwise.
Watch this video as an intro: