
ARENA Interpretability Hackathon

Hosted by ARENA
June 10th to June 11th

This hackathon ran from June 10th to June 11th 2023. You can now judge entries.

Interpretability research is an exciting and growing field of machine learning. If we can understand what happens within neural networks across diverse domains, we can see why a network gives us a specific output, detect deception, understand its choices, and change how it works.

Mechanistic interpretability is a field focused on reverse-engineering neural networks: this can mean both working out how Transformers perform a very specific task and explaining why models suddenly improve during training. Check out our speaker Neel Nanda's 200+ research ideas in mechanistic interpretability.

Quickstart to Mechanistic Interpretability

Neel Nanda's quickstart guide to doing research on the Jam's topic, mechanistic interpretability. Get an intro to the mech interp mindset, what a Transformer is, and which problems to work on.


Past experiences

See what our hackathon participants have said about previous events
Jason Hoelscher-Obermaier
Interpretability hackathon
The hackathon was a really great way to try out research on AI interpretability and get in touch with other people working on this. The input, resources and feedback provided by the team organizers and in particular by Neel Nanda were super helpful and very motivating!
Luca De Leo
AI Trends hackathon
I found the hackathon very cool, I think it lowered my hesitance in participating in stuff like this in the future significantly. A whole bunch of lessons learned and Jaime and Pablo were very kind and helpful through the whole process.

Alejandro González
Interpretability hackathon
I was not that interested in AI safety and didn't know that much about machine learning before, but I heard about this hackathon thanks to a friend, and I don't regret participating! I've learned a ton, and it was a refreshing weekend for me.
Alex Foote
Interpretability hackathon
A great experience! A fun and welcoming event with some really useful resources for starting to do interpretability research. And a lot of interesting projects to explore at the end!
Sam Glendenning
Interpretability hackathon
Was great to hear directly from accomplished AI safety researchers and try investigating some of the questions they thought were high impact.

Resources

Use these materials to create your hackathon projects!
Check out many more resources

Coding GPT-2 from scratch

This notebook enables you to write GPT-2 from scratch with the help of the in-depth tutorial by Neel Nanda below.

If you'd like to check out a longer series of tutorials that builds up Transformers and language modeling from the basics, then watch this playlist from the former AI lead of Tesla, Andrej Karpathy.
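
For orientation, below is a minimal sketch of the causal self-attention operation at the heart of GPT-2, written in plain PyTorch. The function and variable names and the dimensions are illustrative assumptions, not the notebook's exact code, and real GPT-2 adds multiple heads, layer norm, MLP blocks, and learned embeddings on top of this.

```python
# Minimal single-head causal self-attention sketch (illustrative, not the notebook's code).
import torch
import torch.nn.functional as F

def causal_self_attention(x, W_Q, W_K, W_V, W_O):
    # x: [batch, seq, d_model]; W_Q/W_K/W_V: [d_model, d_head]; W_O: [d_head, d_model]
    q, k, v = x @ W_Q, x @ W_K, x @ W_V                 # project to queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    seq = x.shape[1]
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))     # causal mask: no attending to the future
    pattern = scores.softmax(dim=-1)                     # attention pattern over earlier positions
    z = pattern @ v                                      # mix value vectors by the pattern
    return z @ W_O                                       # project back into the residual stream

d_model, d_head = 8, 4
x = torch.randn(1, 5, d_model)
params = [torch.randn(d_model, d_head) for _ in range(3)] + [torch.randn(d_head, d_model)]
print(causal_self_attention(x, *params).shape)  # torch.Size([1, 5, 8])
```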


See an example of a research process using TransformerLens

In this video and Colab demo, Neel shows a live research process using the TransformerLens library. It is split into four chapters: 1) experiment design, 2) model training, 3) surface-level interpretability, and 4) reverse engineering.


Replicate the "Interpretability in the Wild" paper

This code notebook goes through the process of reverse engineering a very specific task. Here we get an overview of very useful techniques in mechanistic Transformer interpretability:

  • Direct logit attribution to layers and heads, identifying the attention heads in specific layers that affect the output the most
  • Visualizing attention patterns and explaining information transfer using attention heads
  • Using activation patching (or causal tracing) to localize which activations matter the most for the output

See an interview with the authors of the original paper and one of the authors' Twitter thread about the research.
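
As a rough illustration of the activation patching technique listed above, here is a minimal sketch using TransformerLens. The prompts, the choice of layer and position, and the use of the residual stream are placeholder assumptions for illustration, not the paper's exact setup.

```python
# Illustrative activation patching sketch with TransformerLens
# (prompts, layer and position are assumptions, not the paper's exact experiments).
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When John and Mary went to the store, John gave a drink to"    # answer: " Mary"
corrupt_prompt = "When John and Mary went to the store, Mary gave a drink to"  # answer: " John"

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache every activation on the clean run
_, clean_cache = model.run_with_cache(clean_tokens)

layer, pos = 9, -1  # arbitrary choices for illustration

def patch_resid(resid, hook):
    # Overwrite the residual stream at one position with the clean activation
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
)

# How much does the patch push the corrupted run back toward the clean answer?
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")
print("patched logit diff (Mary - John):",
      (patched_logits[0, -1, mary] - patched_logits[0, -1, john]).item())
```

Sweeping this patch over layers and positions is what localizes which activations matter most for the output.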


Running TransformerLens to easily analyze activations in language models

This demo notebook goes into depth on how to use the TransformerLens library. It contains code explanations of the following core features of TransformerLens:

  1. Loading and running models
  2. Saving activations from a specific example run
  3. Using the unique Hooks functionality to intervene on and access activations

It is designed to be easy to work with and to help researchers enter a flow state. Read more on the GitHub page and see the Python package on PyPI.
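
A minimal sketch of the three features above, assuming GPT-2 Small and an arbitrary choice of layer and head:

```python
# Sketch of the three core TransformerLens features (model, layer and head choices are illustrative).
from transformer_lens import HookedTransformer, utils

# 1) Load and run a model
model = HookedTransformer.from_pretrained("gpt2")
logits = model("The Eiffel Tower is in the city of")

# 2) Save activations from a specific example run
logits, cache = model.run_with_cache("The Eiffel Tower is in the city of")
print(cache["blocks.0.attn.hook_pattern"].shape)  # [batch, n_heads, query_pos, key_pos]

# 3) Use hooks to intervene on activations: zero-ablate head 7 in layer 5
def ablate_head(z, hook, head=7):
    z[:, :, head, :] = 0.0  # z has shape [batch, pos, head_index, d_head]
    return z

ablated_logits = model.run_with_hooks(
    "The Eiffel Tower is in the city of",
    fwd_hooks=[(utils.get_act_name("z", 5), ablate_head)],
)
```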

Also check out Stefan Heimersheim's "How to: Transformer Mechanistic Interpretability —with 40 lines of code or less!!" which is a more code / less words version of the demo notebook.

Transformer Visualizer: A Redwood Research tool for Transformer interaction

Open the visualizer and read the documentation to work with the Transformer Visualizer tool.


Rank-One Model Editing (ROME): Editing Transformers' token associations

This paper introduced the causal tracing method for locating a model's associations between tokens, and the ROME method for editing them. Causal tracing is a very useful method for understanding which areas of a neural network contribute the most to a specific output.


Analyses into grokking

See the website for the work, the article detailing it, and the Twitter thread by Neel Nanda. See also the updated (but less intelligible) notebook on progress measures for grokking (from the article's GitHub).

Submit your project

As you create your project presentations, upload your slides here, too. We recommend you also make a recording of your slideshow with the recording capability of e.g. Keynote, PowerPoint, or Slides (or using Vimeo).

Entries
modiff
Model-comparing tool. Details in README.
Jan Betley
Awesome team
Swap Graphs with Attribution Patching
Using attribution patching instead of activation patching to create swap graphs and seeing how the results compare
Felix Hofstätter
Attribution Patching Swap Graphs
Understanding How Othello GPT Identifies Valid Moves from its Internal World Model
We examined individual neurons at later layers in OthelloGPT trying to find patterns and commonalities
Yeu-Tong Lau, Alejandro Acelas
Othello Team
ACDC++: Fast automated circuit discovery using attribution patching
We improve the performance of ACDC by several orders of magnitude by using attribution patching to determine which computational nodes to prune
Lucy Farnik, Can Rager, Aaquib Sayed, Rusheb Shah
ACDC++
Understanding truthfulness in large language model heads through interpretability
In what ways do large language models represent truth? The main claim of this report is that attention heads in GPT-2 XL represent two distinctly different kinds of “truth directions.” One kind of truth direction represents the truths of common misconceptions/uncertain truths and the other represents more obvious factual statements. We show results across various truthfulness datasets: the CounterFact (CFact) dataset, Capitals of Countries (Capitals) dataset, TruthfulQA (TQA) dataset, and our own custom-made Easy (EZ) dataset. This is all preliminary research code, and can be wrong or contain bugs. Please keep this in mind when analyzing our results.
Richard Ren, Kevin Wang, Phillip Guo
Team Truthers
Why Might Negative Name Mover Heads Exist?
In this project we form a theory of why Negative Name Mover Heads (Wang et al., 2023) form in GPT-2 Small. We suspect that Negative Name Mover Heads i) respond to confident token predictions in the residual stream via Q-composition (Elhage et al., 2021), ii) attend to previous instances of such tokens in context and iii) negatively copy these tokens into the current token position. We use maximum activating dataset examples, negative copying score and a novel metric that tests our theory. Our results represent early research thoughts and are subject to ongoing investigation.
Arthur Conmy and Callum McDougall
Arthur Conmy and Callum McDougall