Starter Resources

Code, templates, data and much more to get you started on the next hackathon!


Interpretability Hackathon

Interpretability starter

Interpretability research is an exciting and growing field of machine learning. If we can understand what happens within neural networks across diverse domains, we can see why a network gives specific outputs, detect deception, understand its choices, and change how it works.

This list of resources was made for the Interpretability Hackathon (link) and contains an array of useful starter templates, tools to investigate model activations, and a number of introductory resources. Check it out for ideas for projects within ML & AI safety.


We have many ideas available for inspiration on the Interpretability Hackathon ideas list. A lot of interpretability research is available on Transformer Circuits and Anthropic's research page.

Introductions to mechanistic interpretability

See also the available tools for interpretability:

Digestible research

Starter projects

🙋‍♀️ Simple templates & tools

Activation Atlas [tool]

The Activation Atlas article includes many figures, each with an associated Google Colab; click "Try in a notebook" under a figure. An example is this notebook, which shows a simple activation atlas.

Additionally, they provide a tool to explore which sorts of images each neuron activates most strongly to.


BertViz [tool]

BertViz is an interactive tool for visualizing attention in Transformer language models such as BERT, GPT-2, or T5. It can be run inside a Jupyter or Colab notebook through a simple Python API that supports most Hugging Face models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism.

BertViz example image

EasyTransformer [code]

A library for mechanistic interpretability called EasyTransformer (still in beta with some bugs, but functional enough to be useful!):

Demo notebook of EasyTransformer

A demo notebook showing how to use EasyTransformer to explore a mysterious phenomenon: how language models know to answer "John and Mary went to the shops, then John gave a drink to" with "Mary" rather than "John":

Converting Neural Networks to graphs [code]

This repository can be used to transform a linear neural network into a graph where each neuron is a node and the weights of the directional connections are determined by the network's actual weights and biases.

You can expand this project by visualizing the graph of activations for specific inputs (changing the conversion from weights to activations), or by adapting it to convolutional neural networks. Check out the code below.

| File | Description |
| --- | --- |
| train.py | Trains a linear MNIST classifier with a 500-unit hidden layer. |
| to_graph.py | Generates a graph from the neurons in the MNIST linear classifier, including each neuron's bias and layer. |
| edges.csv | Each connection in the neural network: from_id, to_id, weight. |
| network_eda.Rmd | R script for initial EDA and visualization of the network. |
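As a sketch of the weights-to-edges conversion (the function and node-id scheme below are illustrative, not the repository's actual code), each entry of a layer's weight matrix becomes one directed edge:

```python
import numpy as np

def layer_to_edges(weights, from_layer, to_layer):
    """Turn one linear layer's weight matrix into (from_id, to_id, weight) edges.

    weights: array of shape (n_out, n_in), as in a PyTorch nn.Linear.
    Node ids are (layer, index) tuples so ids stay unique across layers.
    """
    edges = []
    n_out, n_in = weights.shape
    for i in range(n_in):
        for j in range(n_out):
            edges.append(((from_layer, i), (to_layer, j), float(weights[j, i])))
    return edges

# A tiny 2-layer network: 3 inputs -> 2 hidden -> 1 output
w1 = np.array([[0.5, -0.2, 0.1],
               [0.3, 0.8, -0.4]])
w2 = np.array([[1.0, -1.5]])

edges = layer_to_edges(w1, 0, 1) + layer_to_edges(w2, 1, 2)
print(len(edges))  # 3*2 + 2*1 = 8 edges
```

The resulting edge list has the same shape as edges.csv (from_id, to_id, weight) and can be fed directly into a graph library or the R visualization script.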

Reviewing explainability tools

There are a few tools that use interpretability to create understandable explanations of why they give the output they give. This notebook provides a small intro to the most relevant libraries:

  • ELI5: ELI5 is a Python package that helps debug machine learning classifiers and explain their predictions. It implements several analysis frameworks and works with many different ML libraries. It is the most complete tool for explainability.
      • Explanations of output
      • Image explanations of output
  • LIME: Local Interpretable Model-agnostic Explanations. The TextExplainer library does a good job of using LIME on language models. Check out Christoph Molnar's introduction here.
  • SHAP: SHapley Additive exPlanations
  • MLXTEND: Machine Learning Extensions
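If none of the libraries above are installed, scikit-learn's built-in `permutation_importance` gives a minimal, model-agnostic explanation in the same spirit (this is a generic sketch, not tied to any of the listed tools):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure the drop in accuracy:
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
for name, score in zip(load_iris().feature_names, result.importances_mean):
    print(f"{name}: {score:.3f}")
```

Features whose shuffling hurts accuracy the most are the ones the model relies on; the dedicated libraries above refine this idea into local, per-prediction explanations.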

The IML R package [code]

Check out this tutorial on using the IML package in R. The package provides a good interface for working with LIME, feature importance, ICE, partial dependence plots, Shapley values, and more.

👩‍🔬 Advanced templates and tools

Redwood Research's interpretability on Transformers [tool]

Redwood Research has created a wonderful tool that can be used to do research into how language models understand text. The "How to use" document and their instruction videos are very good introductions and we recommend reading/watching them since the interface can be a bit daunting otherwise.

Watch this video as an intro:

Understanding interp-tools by Redwood Research

Language Model Hackathon

Language model hackathon (link)


Starting points (folder)

R markdown starter code

Contains a small test experiment along with a standardized way to get responses out of the API. See R-starter.Rmd.

Python notebook starter code

Contains the same test experiment as the R markdown starter code. See Python-starter.ipynb (this can run in the browser using Google Colab).
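The starter notebooks wrap API calls in a reusable helper. A minimal sketch of that pattern — `build_prompt`, `run_experiment`, and the `echo` stub are all hypothetical names, and `call_model` would be replaced by a real completion call (e.g. to the OpenAI API):

```python
def build_prompt(question):
    """Standardized prompt template for the experiment."""
    return f"Q: {question}\nA:"

def run_experiment(questions, call_model):
    """Map each question to the model's response via a supplied callable."""
    return {q: call_model(build_prompt(q)) for q in questions}

# Stub model for testing the pipeline without API access:
echo = lambda prompt: prompt.split("Q: ")[1].split("\n")[0].upper()
results = run_experiment(["What is 2+2?"], echo)
print(results)
```

Keeping the model call behind a single callable makes it easy to swap in different engines or cache responses during the hackathon.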

Sheets GPT-3 experimental starter kit

See the template here. This is a no-code experimental kit.

Information extraction from text

See Text-info-extraction.ipynb to see some ways to extract quantitative information from the text, e.g. word frequency, TF-IDF, word embeddings, and topics.

Inverse scaling GPT-3 Python notebook

From the inverse scaling prize. See the instructions page for how to use it. It lets you generate plots showing how model performance scales with parameter count.

Colab to test your data for inverse scaling:  
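Such scaling plots can be sketched with matplotlib; the model sizes and accuracies below are made-up illustrative numbers, not real results:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

# Hypothetical accuracies across GPT-3 model sizes, showing inverse scaling:
params = [350e6, 1.3e9, 6.7e9, 175e9]
accuracy = [0.61, 0.55, 0.48, 0.40]

fig, ax = plt.subplots()
ax.plot(params, accuracy, marker="o")
ax.set_xscale("log")  # parameter counts span several orders of magnitude
ax.set_xlabel("Parameters")
ax.set_ylabel("Accuracy")
ax.set_title("Inverse scaling: performance drops as models grow")
fig.savefig("inverse_scaling.png")
```

A log-scaled x-axis is the usual choice here, since the model sizes differ by orders of magnitude.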

Data (folder)

Inverse scaling round 1 winning datasets

The winning datasets from the first round.

Inverse scaling

The inverse-scaling folder contains a lot of small datasets that can work as inspiration. E.g. biased statements, cognitive biases, sentiment analysis, and more. 

Harmless and Helpful language model

A large list of “chosen” and “rejected” pairs of texts. A human received two language model outputs and selected the preferred one. It’s in jsonl format, so you can load it line by line in Python or open it in VS Code.
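Since each line of a .jsonl file is a standalone JSON object, loading the pairs takes only a few lines of Python (the sample records below are made up; the real fields are "chosen" and "rejected" as described above):

```python
import json

# Two made-up records in the same {"chosen": ..., "rejected": ...} shape:
sample = ('{"chosen": "Sure, here is a recipe.", "rejected": "I will not help."}\n'
          '{"chosen": "Paris is the capital.", "rejected": "No idea."}\n')

with open("hh_sample.jsonl", "w") as f:
    f.write(sample)

pairs = []
with open("hh_sample.jsonl") as f:
    for line in f:
        record = json.loads(line)  # one JSON object per line
        pairs.append((record["chosen"], record["rejected"]))

print(len(pairs))  # 2
```

The same loop works on the real file; only the filename changes.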

See the containing folder. 

Red teaming dataset

Contains a lot of humans’ attempts at tripping up a language model and getting it to answer in harmful ways.



TruthfulQA [code]

This repository contains code for evaluating model performance on the TruthfulQA benchmark. The full set of benchmark questions and reference answers is contained in TruthfulQA.csv. The paper introducing the benchmark can be found here.


Math word problems [code]

This is the official repo for the ACL 2022 paper "Learning to Reason Deductively: Math Word Problem Solving as Complex Relation Extraction". The text describes free-form world states for elementary-school math problems.

Language models are few-shot learners

You can teach language models new tasks simply by including training examples in the prompt (few-shot learning).
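A few-shot prompt is just the training examples concatenated ahead of the query; the sentiment task below is a hypothetical illustration (the completion call itself, e.g. via the OpenAI API, is omitted):

```python
# "Training" examples live entirely inside the prompt:
examples = [
    ("I loved this movie!", "positive"),
    ("Terrible, a waste of time.", "negative"),
]
query = "What a wonderful surprise."

prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
prompt += f"\nReview: {query}\nSentiment:"
print(prompt)
```

Sending this prompt to a completion endpoint and reading the next token gives the model's few-shot prediction, with no gradient updates involved.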


Moral Uncertainty

We provide a dataset containing a mix of clear-cut (wrong or not-wrong) and morally ambiguous scenarios where a first-person character describes actions they took in some setting. The scenarios are often long (usually multiple paragraphs, up to 2,000 words) and involve complex social dynamics. Each scenario has a label which indicates whether, according to commonsense moral judgments, the first-person character should not have taken that action.

Our dataset was collected from a website where posters describe a scenario and users vote on whether the poster was in the wrong. Clear-cut scenarios are ones where the voter agreement rate is 95% or more, while ambiguous scenarios have 50% ± 10% agreement. All scenarios have at least 100 total votes.

IMDB dataset

This dataset contains a lot of movie reviews and their associated rating. It is classically used to train sentiment analysis models but maybe you can find something fun to do with it!

See containing folder.
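A classic baseline for this kind of data — sketched here on toy stand-in reviews rather than the real IMDB files — is TF-IDF features with logistic regression:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the IMDB data: (review text, 1 = positive, 0 = negative)
reviews = ["A masterpiece, beautifully acted.",
           "Dull and far too long.",
           "I enjoyed every minute.",
           "Awful script, terrible pacing.",
           "Great soundtrack and story.",
           "Boring, I walked out."]
labels = [1, 0, 1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(reviews, labels)
print(clf.predict(["What a great film!"]))
```

Swapping the toy lists for the real reviews and ratings gives a quick sentiment baseline to compare more interesting ideas against.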

Introduction email

Greetings, all you wonderful AI safety hackers!

We’re kicking off the hackathon in ~3 hours so here is the information you need to join!

Everyone working online will join the GatherTown room. The space is already open and you’re more than welcome to join and socialize with the other participants an hour before the event starts (5PM CET / 8AM PST).

We’ll start at 6PM CET with an hour for introduction to the event, a talk by Ian McKenzie on the Inverse Scaling Prize, and group forming. You’re welcome to check out the resource docs before arriving.

We expect around 30-35 people in total, and we look forward to seeing you!

Introduction slides: Language Model Hackathon