
ARENA Interpretability Hackathon

Hosted by ARENA
June 10th to June 11th

This hackathon ran from June 10th to June 11th 2023. You can now judge entries.

Interpretability research is an exciting and growing field of machine learning. If we can understand what happens within neural networks across diverse domains, we can see why a network gives us a specific output, detect deception, understand its choices, and change how it works.

Mechanistic interpretability is a field focused on reverse-engineering neural networks: this can mean both working out how Transformers perform a very specific task and explaining why models suddenly improve during training. Check out our speaker Neel Nanda's 200+ research ideas in mechanistic interpretability.

Quickstart to Mechanistic Interpretability

Neel Nanda's quickstart guide to doing research on the Jam's topic, mechanistic interpretability. Get an intro to the mech interp mindset, what a Transformer is, and which problems to work on.


Past experiences

See what our hackathon participants have said about previous events
Jason Hoelscher-Obermaier
Interpretability hackathon
The hackathon was a really great way to try out research on AI interpretability and get in touch with other people working on this. The input, resources and feedback provided by the team organizers and in particular by Neel Nanda were super helpful and very motivating!
Luca De Leo
AI Trends hackathon
I found the hackathon very cool, I think it lowered my hesitance in participating in stuff like this in the future significantly. A whole bunch of lessons learned and Jaime and Pablo were very kind and helpful through the whole process.

Alejandro González
Interpretability hackathon
I was not that interested in AI safety and didn't know that much about machine learning before, but I heard about this hackathon thanks to a friend, and I don't regret participating! I've learned a ton, and it was a refreshing weekend for me.
Alex Foote
Interpretability hackathon
A great experience! A fun and welcoming event with some really useful resources for starting to do interpretability research. And a lot of interesting projects to explore at the end!
Sam Glendenning
Interpretability hackathon
Was great to hear directly from accomplished AI safety researchers and try investigating some of the questions they thought were high impact.

Resources

Use these materials to create your hackathon projects!
Check out many more resources

Coding GPT-2 from scratch

This notebook enables you to write GPT-2 from scratch with the help of the in-depth tutorial by Neel Nanda below.

If you'd like to check out a longer series of tutorials that builds up Transformers and language modeling from the basics, then watch this playlist from the former AI lead of Tesla, Andrej Karpathy.
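
For orientation, below is a minimal sketch of the causal self-attention operation at the heart of GPT-2, written in plain PyTorch. The function and variable names and the dimensions are illustrative assumptions, not the notebook's exact code, and real GPT-2 adds multiple heads, layer norm, MLP blocks, and learned embeddings on top of this.

```python
# Minimal single-head causal self-attention sketch (illustrative, not the notebook's code).
import torch
import torch.nn.functional as F

def causal_self_attention(x, W_Q, W_K, W_V, W_O):
    # x: [batch, seq, d_model]; W_Q/W_K/W_V: [d_model, d_head]; W_O: [d_head, d_model]
    q, k, v = x @ W_Q, x @ W_K, x @ W_V                 # project to queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    seq = x.shape[1]
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))     # causal mask: no attending to the future
    pattern = scores.softmax(dim=-1)                     # attention pattern over earlier positions
    z = pattern @ v                                      # mix value vectors by the pattern
    return z @ W_O                                       # project back into the residual stream

d_model, d_head = 8, 4
x = torch.randn(1, 5, d_model)
params = [torch.randn(d_model, d_head) for _ in range(3)] + [torch.randn(d_head, d_model)]
print(causal_self_attention(x, *params).shape)  # torch.Size([1, 5, 8])
```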


See an example of a research process using TransformerLens

In this video and Colab demo, Neel shows a live research process using the TransformerLens library. It is split into four chapters: 1) experiment design, 2) model training, 3) surface-level interpretability, and 4) reverse engineering.


Replicate the "Interpretability in the Wild" paper

This code notebook goes through the process of reverse engineering a very specific task. Here we get an overview of very useful techniques in mechanistic Transformer interpretability:

  • Direct logit attribution to layers and heads, identifying the attention heads in specific layers that affect the output the most
  • Visualizing attention patterns and explaining information transfer using attention heads
  • Using activation patching (or causal tracing) to localize which activations matter the most for the output

See an interview with the authors of the original paper and one of the authors' Twitter thread about the research.
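
As a rough illustration of the activation patching technique listed above, here is a minimal sketch using TransformerLens. The prompts, the choice of layer and position, and the use of the residual stream are placeholder assumptions for illustration, not the paper's exact setup.

```python
# Illustrative activation patching sketch with TransformerLens
# (prompts, layer and position are assumptions, not the paper's exact experiments).
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When John and Mary went to the store, John gave a drink to"    # answer: " Mary"
corrupt_prompt = "When John and Mary went to the store, Mary gave a drink to"  # answer: " John"

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache every activation on the clean run
_, clean_cache = model.run_with_cache(clean_tokens)

layer, pos = 9, -1  # arbitrary choices for illustration

def patch_resid(resid, hook):
    # Overwrite the residual stream at one position with the clean activation
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

patched_logits = model.run_with_hooks(
    corrupt_tokens,
    fwd_hooks=[(utils.get_act_name("resid_pre", layer), patch_resid)],
)

# How much does the patch push the corrupted run back toward the clean answer?
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")
print("patched logit diff (Mary - John):",
      (patched_logits[0, -1, mary] - patched_logits[0, -1, john]).item())
```

Sweeping this patch over layers and positions is what localizes which activations matter most for the output.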


Running TransformerLens to easily analyze activations in language models

This demo notebook goes into depth on how to use the TransformerLens library. It contains code explanations of the following core features of TransformerLens:

  1. Loading and running models
  2. Saving activations from a specific example run
  3. Using the unique Hooks functionality to intervene on and access activations

It is designed to be easy to work with and to help researchers enter a flow state. Read more on the GitHub page and see the Python package on PyPI.
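
A minimal sketch of the three features above, assuming GPT-2 Small and an arbitrary choice of layer and head:

```python
# Sketch of the three core TransformerLens features (model, layer and head choices are illustrative).
from transformer_lens import HookedTransformer, utils

# 1) Load and run a model
model = HookedTransformer.from_pretrained("gpt2")
logits = model("The Eiffel Tower is in the city of")

# 2) Save activations from a specific example run
logits, cache = model.run_with_cache("The Eiffel Tower is in the city of")
print(cache["blocks.0.attn.hook_pattern"].shape)  # [batch, n_heads, query_pos, key_pos]

# 3) Use hooks to intervene on activations: zero-ablate head 7 in layer 5
def ablate_head(z, hook, head=7):
    z[:, :, head, :] = 0.0  # z has shape [batch, pos, head_index, d_head]
    return z

ablated_logits = model.run_with_hooks(
    "The Eiffel Tower is in the city of",
    fwd_hooks=[(utils.get_act_name("z", 5), ablate_head)],
)
```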

Also check out Stefan Heimersheim's "How to: Transformer Mechanistic Interpretability —with 40 lines of code or less!!" which is a more code / less words version of the demo notebook.

Transformer Visualizer: A Redwood Research tool for Transformer interaction

Open the visualizer and read the documentation to work with the Transformer Visualizer tool.


Rank-One Model Editing (ROME): Editing Transformers' token associations

This paper introduced the causal tracing method for locating a model's associations between tokens, and the ROME method for editing them. Causal tracing is a very useful method for understanding which areas of a neural network contribute the most to a specific output.


Analyses into grokking

See the website for the work, the article detailing it, and the Twitter thread by Neel Nanda. See also the updated (but less intelligible) notebook on progress measures for grokking (from the article's GitHub).

Submit your project

As you create your project presentations, upload your slides here, too. We recommend you also make a recording of your slideshow with the recording capability of e.g. Keynote, PowerPoint, or Slides (or using Vimeo).

Entries
modiff
Model-comparing tool. Details in README.
Jan Betley
Awesome team
Swap Graphs with Attribution Patching
Using attribution patching instead of activation patching to create swap graphs and seeing how the results compare
Felix Hofstätter
Attribution Patching Swap Graphs
Understanding How Othello GPT Identifies Valid Moves from its Internal World Model
We examined individual neurons at later layers in OthelloGPT trying to find patterns and commonalities
Yeu-Tong Lau, Alejandro Acelas
Othello Team
ACDC++: Fast automated circuit discovery using attribution patching
We improve the performance of ACDC by several orders of magnitude by using attribution patching to determine which computational nodes to prune
Lucy Farnik, Can Rager, Aaquib Sayed, Rusheb Shah
ACDC++
Understanding truthfulness in large language model heads through interpretability
In what ways do large language models represent truth? The main claim of this report is that attention heads in GPT-2 XL represent two distinctly different kinds of “truth directions.” One kind of truth direction represents the truths of common misconceptions/uncertain truths and the other represents more obvious factual statements. We show results across various truthfulness datasets: the CounterFact (CFact) dataset, Capitals of Countries (Capitals) dataset, TruthfulQA (TQA) dataset, and our own custom-made Easy (EZ) dataset. This is all preliminary research code, and can be wrong or contain bugs. Please keep this in mind when analyzing our results.
Richard Ren, Kevin Wang, Phillip Guo
Team Truthers
Why Might Negative Name Mover Heads Exist?
In this project we form a theory of why Negative Name Mover Heads (Wang et al., 2023) form in GPT-2 Small. We suspect that Negative Name Mover Heads i) respond to confident token predictions in the residual stream via Q-composition (Elhage et al., 2021), ii) attend to previous instances of such tokens in context and iii) negatively copy these tokens into the current token position. We use maximum activating dataset examples, negative copying score and a novel metric that tests our theory. Our results represent early research thoughts and are subject to ongoing investigation.
Arthur Conmy and Callum McDougall
Arthur Conmy and Callum McDougall