
Interpretability Hackathon 3.0

Signups
Ash7
Ashish Neupane
Jian Li
Snehal Ranjan
Parth Saxena
Leonard Tang
Alana
Joyee Chen
Jan Wehner
River Chiasson
Maroš Bratko
Anna Wang
Sridhar Venkatesh
Victor Levoso
Sambit
Narmeen Oozeer
Alice Rigg
Laura Klimesova
Rohan
karto
Sangeeta Biswas
Bejoy Sen
Kavetskyi Andrii
Itay
Ryan Bloom
Kunvar Thaman
Jeremy Hadfield
Esben Kran
Joshua David
Marcus Luebke
Nandi
Cab
Corey Morris
Aleksi Maunu
Mateusz Bagiński
Marian
Pramod
Logan Riggs Smith
Evan Harris
Vladimir Ivanov
Codruta Lugoj
Alex Roman
Alex Roman
Joseph Miller
Dmitry
Thomas Lemoine
Vladislav Bargatin
Tomáš Kotrla
M L
Andrew Feldman
Dhillu Thambi
Rauno Arike
Eric Werner
rick goldstein
Aishwarya Gurung
ginarific
James Thomson
Philip Quirke
Jai Dhyani
Mark Trovinger
Alice Wong
Jaydeep Chauhan
David Liu
Michelle Viotti
Ms Perusha Moodley
Sai Shinjitha Maganti
Shrey Modi
Mitchell Reynolds
David Adam Plaskowski
Henri Lemoine
Scott Viteri
Michail Keske
Luna Mendez
Abhay Sheshadri
Laura O'Mahony
shubhorup biswas
Amir Ali Abdullah
Jakub Nowak
marc/er
Milton
Juliette Culver
Rajesh Shenoy
Sai Joseph
Will Hathaway
Omotoyosi Abu
Pranav Putta
Viswapriya Misra
Alethea Power
Rebecca Hawkins
Theo Clark
Adam Beckert
František Koutenský
Zachary Heidel
peter
Taylor Kulp-McDowall
Noa Nabeshima
Harrison Gietz
Nikola Jurkovic
Max Chiswick
Grace
Rohan Mehta
Nathaniel Monson
Jeffrey Olmo
Gaurav Yadav
Andrey
Ran Wei
Nir Padmanabhan
Manan Suri
Arsalaan Alam
Ramneet Singh
Hannes Thurnherr
Soham Dutta
Neil Wang
Kaitlin Maile
Julius Simonelli
Partho
Jay Cloyd
Prabin Acharya
Tim Sankara
Santiago Pineda Montoya
Kriz Tahimic
Clay Surmeier
Kabir
bhargav chhaya
Tara Rezaei
Simon Biggs
Huadong Xiong
Tereza Okalova
Entries
Goal Misgeneralization
Problem 9.60 - Dimensionality reduction
Residual Stream Verification via California Housing Prices Experiment
Relating induction heads in Transformers to temporal context model in human free recall
Toward a Working Deep Dream for LLM's
Multimodal Similarity Detection in Transformer Models
Interpreting Planning in Transformers
Experiments in Superposition
DPO vs PPO comparative analysis
One is 1- Analyzing Activations of Numerical Words vs Digits
Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model
Who cares about brackets?
Factual recall rarely happens in attention layer
Embedding and Transformer Synthesis
Towards Interpretability of 5 digit addition

This hackathon ran from July 14th to July 16th 2023. You can now judge entries.

Join us to understand the internals of language models and ML systems!

Machine learning is becoming an increasingly important part of our lives, yet researchers are still working to understand how neural networks represent the world.

Mechanistic interpretability is a field focused on reverse-engineering neural networks: this ranges from understanding how Transformers perform a very specific task to explaining why models suddenly improve during training. Check out our speaker Neel Nanda's 200+ research ideas in mechanistic interpretability.

Sign up below to be notified before the kickoff!

Jeremy Hadfield, Ran Wei, Tereza Okalova, and others have already signed up!

Alignment Jam hackathons

Join us in this iteration of the Alignment Jam research hackathons to spend 48 hours with fellow engaged researchers and engineers working in this exciting and fast-moving field!

Join the Discord where all communication will happen. Check out research project ideas for inspiration and the in-depth starter resources under the "Resources" tab.

Rules

You will participate in teams of 1-5 people and submit a project on the entry submission page. Each project consists of multiple parts: 1) the PDF report, 2) a video overview of at most 10 minutes, and 3) the title, summary, and descriptions.

You are allowed to think about your project and engage with the starter resources before the hackathon starts, but your core research work should happen during the hackathon itself.

Besides these two points, the hackathons are mainly a chance for you to engage meaningfully with real research work on some of the state-of-the-art in interpretability!

Schedule

Subscribe to the calendar.

  • Friday 17:30 UTC: Keynote talk with Neel Nanda to inspire your projects and provide an introduction to the topic. Esben Kran will also give a short overview of the logistics.
  • Saturday and Sunday 14:00 UTC: Project discussion sessions on the Discord server.
  • Sunday at 18:00 UTC: Online ending session
  • Wednesday at 19:00 UTC: Project presentations

Past experiences

See what our great hackathon participants have said
Jason Hoelscher-Obermaier
Interpretability hackathon
The hackathon was a really great way to try out research on AI interpretability and getting in touch with other people working on this. The input, resources and feedback provided by the team organizers and in particular by Neel Nanda were super helpful and very motivating!
Luca De Leo
AI Trends hackathon
I found the hackathon very cool; I think it lowered my hesitance in participating in stuff like this in the future significantly. A whole bunch of lessons learned and Jaime and Pablo were very kind and helpful through the whole process.

Alejandro González
Interpretability hackathon
I was not that interested in AI safety and didn't know that much about machine learning before, but I heard from this hackathon thanks to a friend, and I don't regret participating! I've learned a ton, and it was a refreshing weekend for me.
Alex Foote
Interpretability hackathon
A great experience! A fun and welcoming event with some really useful resources for starting to do interpretability research. And a lot of interesting projects to explore at the end!
Sam Glendenning
Interpretability hackathon
Was great to hear directly from accomplished AI safety researchers and try investigating some of the questions they thought were high impact.
The collaborators who will join us for this hackathon.

Neel Nanda

Mechanistic interpretability researcher with the DeepMind Safety Team
Keynote speaker

Esben Kran

Co-director at Apart Research
Judge & Organizer

Fazl Barez

Co-director and research lead at Apart Research
Judge

Alex Foote

Apart Lab researcher
Judge

Bart Bussman

Independent researcher in mechanistic interpretability
Judge
Open In Colab

Coding GPT-2 from scratch

This notebook enables you to write GPT-2 from scratch with the help of the in-depth tutorial by Neel Nanda below.
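To give a flavor of what you end up building, here is a minimal sketch (our own illustration, not taken from the notebook) of a single causal self-attention head in PyTorch; the notebook itself also covers multi-head attention, LayerNorm, MLPs, embeddings, and training.

```python
# A toy causal self-attention head in PyTorch (illustration only; dimensions are
# arbitrary and multi-head attention, LayerNorm, MLPs, and positional embeddings
# are omitted).
import torch
import torch.nn as nn


class CausalSelfAttentionHead(nn.Module):
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_head, bias=False)
        self.k = nn.Linear(d_model, d_head, bias=False)
        self.v = nn.Linear(d_model, d_head, bias=False)
        self.o = nn.Linear(d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_model]
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # [batch, seq, seq]
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))        # causal mask: no attending to future tokens
        pattern = scores.softmax(dim=-1)                        # attention pattern
        return self.o(pattern @ v)                              # head output written back to the residual stream


x = torch.randn(1, 10, 64)
print(CausalSelfAttentionHead(d_model=64, d_head=16)(x).shape)  # torch.Size([1, 10, 64])
```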

If you'd like to check out a longer series of tutorials that builds up Transformers and language modeling from the basics, then watch this playlist from Andrej Karpathy, the former AI lead at Tesla.

Open In Colab

See an example of a research process using TransformerLens

In this video and Colab demo, Neel shows a live research process using the TransformerLens library. It is split into the chapters of 1) experiment design, 2) model training, 3) surface-level interpretability, and 4) reverse engineering.

Open In Colab

Replicate the "Interpretability in the Wild" paper

This code notebook goes through the process of reverse engineering a very specific task. Here we get an overview of very useful techniques in mechanistic Transformer interpretability:

  • Direct logit attribution to layers and to heads and identification of the attention heads in specific layers that affect our output the most
  • Visualizing attention patterns and explaining information transfer using attention heads
  • Using activation patching (or causal tracing) to localize which activations matter the most for the output

See an interview with the authors of the original paper and one of the authors' Twitter thread about the research.
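As a rough illustration of the first technique on the list, here is a hedged sketch of direct logit attribution using TransformerLens; the prompt and the " Mary"/" John" token pair are hypothetical stand-ins for the indirect-object-identification setup, and the notebook itself goes much deeper.

```python
# A hedged sketch of direct logit attribution with TransformerLens (illustration
# only; the prompt and token pair mimic the indirect-object-identification task).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When John and Mary went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)

# Residual-stream direction that increases the " Mary" logit relative to " John".
logit_diff_dir = (
    model.W_U[:, model.to_single_token(" Mary")]
    - model.W_U[:, model.to_single_token(" John")]
)

# Decompose the final residual stream into per-component contributions
# (embeddings, each attention layer, each MLP layer) and apply the final
# LayerNorm, so dot products with the direction are direct logit contributions.
resid_stack, labels = cache.decompose_resid(layer=-1, return_labels=True)
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1)
contributions = resid_stack[:, 0, -1, :] @ logit_diff_dir  # final prompt position

for label, value in zip(labels, contributions):
    print(f"{label:>12}: {value.item():+.3f}")
```

Each printed value is that component's direct contribution to the " Mary" vs " John" logit difference at the last position, which is a quick way to spot the layers and heads worth investigating further.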

Open In Colab

Running TransformerLens to easily analyze activations in language models

This demo notebook goes into depth on how to use the TransformerLens library. It contains code explanations of the following core features of TransformerLens:

  1. Loading and running models
  2. Saving activations from a specific example run
  3. Using the unique Hooks functionality to intervene on and access activations

It is designed to be easy to work with and to help researchers enter a flow state more easily. Read more on the GitHub page and see the Python package on PyPI.
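The three features above map roughly onto the following minimal sketch (our own illustration, not part of the demo notebook); the prompt and the choice of layer and head to ablate are arbitrary.

```python
# A minimal sketch of the three TransformerLens features listed above
# (illustration only; the prompt and the head we ablate are arbitrary).
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

# 1) Load and run a model.
model = HookedTransformer.from_pretrained("gpt2")
prompt = "Mechanistic interpretability aims to reverse engineer"
logits = model(prompt)

# 2) Save activations from a specific run.
logits, cache = model.run_with_cache(prompt)
print(cache["pattern", 0].shape)  # layer-0 attention patterns: [batch, head, query_pos, key_pos]

# 3) Use hooks to intervene on activations: zero-ablate head 7 in layer 0.
def ablate_head(z, hook, head=7):
    z[:, :, head, :] = 0.0  # hook_z has shape [batch, pos, head_index, d_head]
    return z

ablated_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[(utils.get_act_name("z", 0), ablate_head)],
)
top_before = model.tokenizer.decode(logits[0, -1].argmax().item())
top_after = model.tokenizer.decode(ablated_logits[0, -1].argmax().item())
print(f"Top next-token prediction: {top_before!r} -> {top_after!r}")
```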

Also check out Stefan Heimersheim's "How to: Transformer Mechanistic Interpretability —with 40 lines of code or less!!", which is a more-code, fewer-words version of the demo notebook.

Transformer Visualizer: A Redwood Research tool for Transformer interaction

Open the visualizer and read the documentation to work with the Transformer Visualizer tool.

Open In Colab

Rank-One Model Editing (ROME): Editing Transformers' token associations

This paper introduced the causal tracing method and uses it to locate and edit a model's associations between tokens. It is a very useful method for understanding which areas of a neural network contribute the most to a specific output.
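As a hedged sketch of the causal tracing idea (not the authors' code, which uses GPT-2 XL, noise calibrated to embedding statistics, and averaging over many noise samples and layer/position pairs), one can corrupt the subject tokens' embeddings and then restore a single clean hidden state to see how much of the correct answer's probability it recovers; the prompt, layer, and positions below are hypothetical.

```python
# A hedged sketch of ROME-style causal tracing with TransformerLens (illustration only).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "The Eiffel Tower is located in the city of"
answer_id = model.to_single_token(" Paris")
tokens = model.to_tokens(prompt)

# Clean run: cache all activations so we can restore them later.
clean_logits, clean_cache = model.run_with_cache(tokens)

subject_positions = [1, 2, 3]                 # hypothetical token positions of "Eiffel Tower"
layer, restore_pos = 6, tokens.shape[1] - 1   # where to restore a clean hidden state

def corrupt_embeddings(embed, hook):
    # Corrupt the run by adding noise to the subject's token embeddings.
    noise = 0.1 * torch.randn_like(embed[:, subject_positions])
    embed[:, subject_positions] = embed[:, subject_positions] + noise
    return embed

def restore_resid(resid, hook):
    # Patch the clean residual stream back in at one (layer, position).
    resid[:, restore_pos] = clean_cache["resid_post", layer][:, restore_pos]
    return resid

patched_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[
        ("hook_embed", corrupt_embeddings),
        (f"blocks.{layer}.hook_resid_post", restore_resid),
    ],
)

clean_p = clean_logits[0, -1].softmax(-1)[answer_id].item()
patched_p = patched_logits[0, -1].softmax(-1)[answer_id].item()
print(f"P(' Paris'): clean {clean_p:.3f} vs corrupted-then-restored {patched_p:.3f}")
```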

Open In Colab

Analyses into grokking

See the website for the work and the article detailing it, along with the Twitter thread by Neel Nanda. See also the updated (but less intelligible) notebook on progress measures for grokking (from the article's GitHub).

Research project ideas

Get inspired for your own projects with these ideas developed during the reading groups! Go to the Resources tab to engage more with the topic.

Explore the list of mechanistic interpretability ideas on the AI Safety Ideas platform.

Registered jam sites

Montreal Interpretability Hackathon 3.0
Join us in Montreal on July 14th at 1:30PM ET at L'Esplanade Tranquille, 1442 Clark, second floor, for a weekend research sprint in ML interpretability!
Global interpretability hackathon 3.0
We are once again hosting the virtual segment of the interpretability hackathon online! Join us in our Discord server to interact with engaged researchers across the world.
Visit event page
Alignment Jam Discord
Prague Interpretability hackathon
Join us in Fixed Point in Prague - Vinohrady, Koperníkova 6 for a weekend research sprint in ML interpretability!
Visit event page
Prague Fixed Point

Register your own site

The in-person hubs for the Alignment Jams are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research and engineering community. Read more about organizing and use the media below to set up your event.

Social media for your jam site [coming soon]

Event cover image
Social media message

Join us when we investigate what happens within the brains of language models!

DeepMind researcher Neel Nanda joins us to explore the field of LLM neuroscience during this weekend. Get ready to create impactful research with people across the world!

Don't miss this opportunity to explore machine learning more deeply, network, and challenge yourself!

Register now: https://alignmentjam.com/jam/interpretability

[or add your event link here]


Submit your project

Use this template for the report submission. As you create your project presentations, upload your slides here, too. Make a recording of your slideshow or project with the recording capability of e.g. Keynote, PowerPoint, or Slides (using Vimeo).


Accepted submissions to the hackathon

Big thanks to everyone who submitted their work. Your efforts have made this event a success and set a new bar for what we can expect in future editions of the hackathon!

We want to extend our appreciation to our judges Fazl Barez, Alex Foote, Esben Kran, and Bart Bussman, and to our keynote speaker Neel Nanda. Rewatch the lightning talks from the top 4 winning projects below.

Give us your feedback
Relating induction heads in Transformers to temporal context model in human free recall
This study explores the parallels between the mechanisms of induction heads in Transformer models and the process of sequential memory recall in humans, finding surprising similarities that could potentially enhance our understanding of both artificial intelligence and human cognition.
Ji-An Li
Solo Moonhowl
Who cares about brackets?
Investigating how GPT2-small is able to accurately predict closing brackets
Theo Clark, Alex Roman, Hannes Thurnherr
Team Brackets
Embedding and Transformer Synthesis
I programmatically created a set of embeddings that can be used to perfectly reconstruct a binary classification function (“embedding synthesis”). I used these embeddings to programmatically set weights for a 1-layer transformer that can also perfectly reconstruct the classification function (“transformer synthesis”). With one change, this reconstruction matches my original hypothesis of how a pre-existing transformer works. I ran several experiments on my synthesized transformer to evaluate my synthetic model.
Rick Goldstein
Rick Goldstein
Interpreting Planning in Transformers
We trained some simple models that figure out how to traverse a graph from a list of edges, which is kind of "planning" in some sense if you squint, and got some traction on interpreting one of them.
Victor Levoso Fernandez, Abhay Sheshadri
Shoggoth Neurosurgeons
Towards Interpretability of 5 digit addition
This paper details a hypothesis for the internal structure of the 5-digit addition model that may explain the observed variability, and proposes specific testing to confirm (or not) the hypothesis.
Philip Quirke
Philip Quirke
Factual recall rarely happens in attention layer
In this work, I investigated whether factual information is saved only in the FF layer or also in the attention layers, and found that, with a large enough FF hidden dimension, factual information is rarely saved in the attention layers.
Bary Levy
mentaleap
Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model
SoLU activation functions have been shown to make large language models more interpretable, incentivizing alignment of a fraction of features with the standard basis. However, this happens at the cost of suppression of other features. We investigate this problem using experiments suggested in Nanda's 2023 work “200 Concrete Open Problems in Mechanistic Interpretability”. We conduct three main experiments: (1) we investigate the layernorm scale factor changes on a variety of input prompts; (2) we investigate the logit effects of neuron ablations on neurons with relatively low activation; (3) also using ablations, we attempt to find tokens where “the direct logit attribution (DLA) of the MLP layer is high, but no single neuron is high”.
Mateusz Bagiński, Kunvar Thaman, Rohan Gupta, Alana Xiang, j1ng3r
SoLUbility
One is 1- Analyzing Activations of Numerical Words vs Digits
Extensive research in mechanistic interpretability has showcased the effectiveness of a multitude of techniques for uncovering intriguing circuit patterns. We utilize these techniques to compare similarities and differences among analogous numerical sequences, such as the digits “1, 2, 3, 4”, the words “one, two, three, four”, and the months “January, February, March, April”. Our findings demonstrate preliminary evidence suggesting that these semantically related sequences share common activation patterns in GPT-2 Small.
Mikhail L
DPO vs PPO comparative analysis
We perform a comparative analysis of the DPO and PPO algorithms, using techniques from interpretability to attempt to understand the difference between the two.
Rauno Arike, Luke Marks, Amir Abdullah, Luna Mendez
DPOvsPPO
Experiments in Superposition
In this project we do a variety of experiments on superposition. We try to understand superposition in attention heads, MLP layers, and nonlinear computation in superposition.
Kunvar Thaman, Alice Rigg, Narmeen Oozeer, Joshua David
Team Super Position 1
Multimodal Similarity Detection in Transformer Models
[hidden]
Tereza Okalova, Toyosi Abu, James Thomson
End Black Box Syndrome
Toward a Working Deep Dream for LLM's
This project aims to enhance language model interpretability by generating sentences that maximally activate a specific neuron, inspired by the DeepDream technique in image models. We introduce a novel regularization technique that optimizes over a lower-dimensional latent space rather than the full 768-dimensional embedding space, resulting in more coherent and interpretable sentences. Our approach uses an autoencoder and a separate GPT-2 model as an encoder, and a six-layer transformer as a decoder. Despite the current limitation of our autoencoder not fully reconstructing sentences, our work opens up new directions for future research in improving language model interpretability.
Scott Viteri and Peter Chatain
PeterAndScott
Residual Stream Verification via California Housing Prices Experiment
In this data science project, I conducted an experiment to verify the Residual Stream as a Shared Bandwidth Hypothesis. The study utilized California Housing Prices data to support the experimental investigation.
Jonathan Batista Ferreira
Condor camp team
Problem 9.60 - Dimensionality reduction
The idea is to separate positive (1) and negative (0) comments in the vector space – the better the model, the better the separation. We could see the separation using a dimension reduction (PCA) of the vectors in 2 dimensions.
Juliana Carvalho de Souza
Juliana's team
Goal Misgeneralization
The main argument put forward in the papers is that we have to be careful about the inner alignment problem. We could reach terrible outcomes if this problem scales as we continue developing more powerful AIs, assuming the use of Reinforcement Learning from Human Feedback (RLHF).
João Lucas Duim
João Lucas Duim

Send in pictures of you having fun hacking away!

We love to see the community flourish and it's always great to see any pictures you're willing to share uploaded here.

Q&A with Neel Nanda
Discussing how Transformer models traverse graphs!