Interpretability research is an exciting and growing field of machine learning. If we can understand what happens inside neural networks across diverse domains, we can see why a network produces specific outputs, detect deception, understand its choices, and change how it works.
This list of resources was made for the Interpretability Hackathon (link) and contains an array of useful starter templates, tools to investigate model activations, and a number of introductory resources. Check out aisi.ai for some ideas for projects within ML & AI safety.
12 toy language models designed to be easier to interpret, in the style of A Mathematical Framework for Transformer Circuits. There are 1-, 2-, 3- and 4-layer models; for each size, one is attention-only, one has GeLU activations and one has SoLU activations (an activation designed to make the model's neurons more interpretable: https://transformer-circuits.pub/2022/solu/index.html). These aren't well documented yet, but are available in EasyTransformer.
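To make the difference between the two activation variants concrete, here is a minimal pure-Python sketch. It assumes the definition from the linked SoLU article, SoLU(x) = x * softmax(x), taken across one layer's pre-activations, and the exact GeLU form x * Phi(x). The article also applies a LayerNorm after SoLU, which is left out here. The function names and the sample vector are illustrative and are not taken from EasyTransformer.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def solu(xs):
    """SoLU over a whole layer: each pre-activation is scaled by its softmax weight."""
    return [x * p for x, p in zip(xs, softmax(xs))]


def gelu(x):
    """Exact GeLU for a single pre-activation: x times the standard normal CDF at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Illustrative pre-activations for a 4-neuron layer (made-up values).
pre_activations = [4.0, 1.0, 0.5, -2.0]

# SoLU depends on the other neurons in the layer; GeLU is applied elementwise.
print("SoLU:", [round(v, 3) for v in solu(pre_activations)])
print("GeLU:", [round(gelu(x), 3) for x in pre_activations])
```

Because the softmax concentrates weight on the largest pre-activation, SoLU tends to keep one neuron strongly active and suppress the rest, whereas GeLU treats each neuron independently.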
The Activation Atlas article has many figures, each with an associated Google Colab notebook. Click "Try in a notebook" to open one. An example is this notebook that shows a simple activation atlas.
Additionally, they have this tool for exploring which sorts of images each neuron activates most strongly for.
BertViz is an interactive tool for visualizing attention in Transformer language models such as BERT, GPT2, or T5. It can be run inside a Jupyter or Colab notebook through a simple Python API that supports most Huggingface models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism.
This repository can be used to transform a linear neural network into a graph where each neuron is a node and the weights of the directional connections are decided by the actual weights and biases.
You can expand this project by applying the graph visualization to the activations for specific inputs (changing the conversion from weights to activations), or you can try to adapt it to convolutional neural networks. Check out the code below.
File: Description
train.py: Creates model.pt with a 500 hidden layer linear MNIST classifier.
to_graph.py: Generates a graph from model.pt.
vertices.csv: Each neuron in the MNIST linear classifier with its bias and layer.
edges.csv: Each connection in the neural network: from_id, to_id, weight.
network_eda.Rmd: The R script for initial EDA and visualization of the network.
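The conversion described above can be sketched roughly as follows. This is not the repository's to_graph.py; it is a small stand-in that works on plain Python lists so it runs without PyTorch. It assumes each weight matrix is laid out as (out_features, in_features), which is how PyTorch's nn.Linear stores its weights. The edge fields from_id, to_id and weight come from the table above; the vertex id field, the zero bias for input neurons, and the toy network are assumptions made for illustration.

```python
def network_to_graph(weights, biases):
    """Turn a stack of linear layers into vertex and edge records.

    weights[l][j][i] is the weight from neuron i in layer l to neuron j in layer l + 1.
    biases[l][j] is the bias of neuron j in layer l + 1.
    """
    n_inputs = len(weights[0][0])
    # Input neurons have no bias of their own; 0.0 is a placeholder (assumption).
    vertices = [{"id": i, "layer": 0, "bias": 0.0} for i in range(n_inputs)]
    edges = []
    prev_ids = list(range(n_inputs))

    for layer, (w, b) in enumerate(zip(weights, biases), start=1):
        cur_ids = []
        for j, row in enumerate(w):
            node_id = len(vertices)
            vertices.append({"id": node_id, "layer": layer, "bias": b[j]})
            cur_ids.append(node_id)
            # One directed edge per incoming weight.
            for i, weight in enumerate(row):
                edges.append({"from_id": prev_ids[i], "to_id": node_id, "weight": weight})
        prev_ids = cur_ids

    return vertices, edges


# Toy network (made-up numbers): 2 inputs -> 3 hidden neurons -> 1 output.
toy_weights = [
    [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]],
    [[0.7, -0.8, 0.9]],
]
toy_biases = [[0.0, 0.1, -0.1], [0.2]]

vertices, edges = network_to_graph(toy_weights, toy_biases)
print(len(vertices), "vertices,", len(edges), "edges")
```

With a trained PyTorch model, the same function could be fed `layer.weight.tolist()` and `layer.bias.tolist()` for each linear layer, and the records written out with `csv.DictWriter`.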
Reviewing explainability tools
A few tools use interpretability methods to produce understandable explanations of why a model gives the output it does. This notebook provides a small intro to the most relevant libraries:
ELI5: ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions. It implements several analysis frameworks that work with many different ML libraries. Of the tools listed here, it is the most complete for explainability.
LIME: Local Interpretable Model-agnostic Explanations. The TextExplainer library does a good job of using LIME on language models. Check out Christoph Molnar's introduction here.
A large list of “chosen” and “rejected” pairs of texts. For each pair, a human was shown two language model outputs and selected the preferred one. It’s in jsonl format (one JSON object per line), so you can read it line by line in Python or open it in VS Code.
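As a rough illustration of reading such a file in Python, the sketch below parses JSON Lines records into (chosen, rejected) tuples. It assumes each line is a JSON object with the keys "chosen" and "rejected", matching the description above. The two sample records are made up so the example runs without downloading anything.

```python
import json


def read_pairs(lines):
    """Parse an iterable of JSON Lines strings into (chosen, rejected) tuples."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline at end of file
        record = json.loads(line)
        # The key names are an assumption based on the dataset description.
        pairs.append((record["chosen"], record["rejected"]))
    return pairs


# Made-up sample records standing in for lines of the real file.
sample_lines = [
    '{"chosen": "Sure, here is a summary.", "rejected": "No."}',
    '{"chosen": "I am not certain, but I think so.", "rejected": "Definitely yes."}',
]

for chosen, rejected in read_pairs(sample_lines):
    print("preferred:", chosen, "| rejected:", rejected)
```

An open file handle can be passed in directly, since iterating over a text file yields one line at a time: `with open(path, encoding="utf-8") as f: pairs = read_pairs(f)`, where `path` is wherever you saved the dataset.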
This repository contains code for evaluating model performance on the TruthfulQA benchmark. The full set of benchmark questions and reference answers is contained in TruthfulQA.csv. The paper introducing the benchmark can be found here.
We provide a dataset containing a mix of clear-cut (wrong or not-wrong) and morally ambiguous scenarios where a first-person character describes actions they took in some setting. The scenarios are often long (usually multiple paragraphs, up to 2,000 words) and involve complex social dynamics. Each scenario has a label which indicates whether, according to commonsense moral judgments, the first-person character should not have taken that action.
Our dataset was collected from a website where posters describe a scenario and users vote on whether the poster was in the wrong. Clear-cut scenarios are ones where voter agreement rate is 95% or more, while ambiguous scenarios had 50% ± 10% agreement. All scenarios have at least 100 total votes.
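The selection rule above can be written down as a short function. This is a sketch of one reading of the description, not the dataset authors' code: "agreement" is taken to be the share of votes on the majority side, "50% ± 10%" is read as a majority share of at most 60%, and the function name, the argument names, and the labels for scenarios that fall outside both buckets are hypothetical.

```python
def classify_scenario(votes_wrong, votes_not_wrong, min_votes=100):
    """Bucket a scenario by how strongly voters agreed with each other.

    Integer arithmetic is used for the thresholds to avoid floating-point
    edge cases at exactly 95% and 60%.
    """
    total = votes_wrong + votes_not_wrong
    if total < min_votes:
        return "too_few_votes"  # all scenarios need at least 100 total votes

    majority = max(votes_wrong, votes_not_wrong)
    if majority * 100 >= 95 * total:
        return "clear-cut"      # voter agreement of 95% or more
    if majority * 100 <= 60 * total:
        return "ambiguous"      # votes split within 50% plus or minus 10%
    return "neither"            # between the two bands (assumed to be left out)


# Made-up vote counts for illustration.
for wrong, not_wrong in [(192, 8), (110, 90), (160, 40), (30, 20)]:
    print(wrong, not_wrong, classify_scenario(wrong, not_wrong))
```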
We’re kicking off the hackathon in ~3 hours so here is the information you need to join!
Everyone working online will join the GatherTown room. The space is already open and you’re more than welcome to join and socialize with the other participants an hour before the event starts (5PM CET / 8AM PST).
We’ll start at 6PM CET with an hour for introduction to the event, a talk by Ian McKenzie on the Inverse Scaling Prize, and group forming. You’re welcome to check out the resource docs before arriving.
We expect to be around 30-35 people in total and we look forward to seeing you!