
Interpretability Hackathon 3.0

Signups
Ash7
Ashish Neupane
Jian Li
Snehal Ranjan
Parth Saxena
Leonard Tang
Alana
Joyee Chen
Jan Wehner
River Chiasson
Maroš Bratko
Anna Wang
Sridhar Venkatesh
Victor Levoso
Sambit
Narmeen Oozeer
Alice Rigg
Laura Klimesova
Rohan
karto
Sangeeta Biswas
Bejoy Sen
Kavetskyi Andrii
Itay
Ryan Bloom
Kunvar Thaman
Jeremy Hadfield
Esben Kran
Joshua David
Marcus Luebke
Nandi
Cab
Corey Morris
Aleksi Maunu
Mateusz Bagiński
Marian
Pramod
Logan Riggs Smith
Evan Harris
Vladimir Ivanov
Codruta Lugoj
Alex Roman
Alex Roman
Joseph Miller
Dmitry
Thomas Lemoine
Vladislav Bargatin
Tomáš Kotrla
M L
Andrew Feldman
Dhillu Thambi
Rauno Arike
Eric Werner
rick goldstein
Aishwarya Gurung
ginarific
James Thomson
Philip Quirke
Jai Dhyani
Mark Trovinger
Alice Wong
Jaydeep Chauhan
David Liu
Michelle Viotti
Ms Perusha Moodley
Sai Shinjitha Maganti
Shrey Modi
Mitchell Reynolds
David Adam Plaskowski
Henri Lemoine
Scott Viteri
Michail Keske
Luna Mendez
Abhay Sheshadri
Laura O'Mahony
shubhorup biswas
Amir Ali Abdullah
Jakub Nowak
marc/er
Milton
Juliette Culver
Rajesh Shenoy
Sai Joseph
Will Hathaway
Omotoyosi Abu
Pranav Putta
Viswapriya Misra
Alethea Power
Rebecca Hawkins
Theo Clark
Adam Beckert
František Koutenský
Zachary Heidel
peter
Taylor Kulp-McDowall
Noa Nabeshima
Harrison Gietz
Nikola Jurkovic
Max Chiswick
Grace
Rohan Mehta
Nathaniel Monson
Jeffrey Olmo
Gaurav Yadav
Andrey
Ran Wei
Nir Padmanabhan
Manan Suri
Arsalaan Alam
Ramneet Singh
Hannes Thurnherr
Soham Dutta
Neil Wang
Kaitlin Maile
Julius Simonelli
Partho
Jay Cloyd
Prabin Acharya
Tim Sankara
Santiago Pineda Montoya
Kriz Tahimic
Clay Surmeier
Kabir
bhargav chhaya
Tara Rezaei
Simon Biggs
Huadong Xiong
Tereza Okalova
Entries
Goal Misgeneralization
Problem 9.60 - Dimensionality reduction
Residual Stream Verification via California Housing Prices Experiment
Relating induction heads in Transformers to temporal context model in human free recall
Toward a Working Deep Dream for LLM's
Multimodal Similarity Detection in Transformer Models
Interpreting Planning in Transformers
Experiments in Superposition
DPO vs PPO comparative analysis
One is 1- Analyzing Activations of Numerical Words vs Digits
Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model
Who cares about brackets?
Factual recall rarely happens in attention layer
Embedding and Transformer Synthesis
Towards Interpretability of 5 digit addition

This hackathon ran from July 14th to July 16th 2023. You can now judge entries.

Join us to understand the internals of language models and ML systems!

Machine learning is becoming an increasingly important part of our lives, yet researchers are still working to understand how neural networks represent the world.

Mechanistic interpretability is a field focused on reverse-engineering neural networks: this ranges from understanding how Transformers perform a very specific task to explaining why models suddenly improve during training. Check out our speaker Neel Nanda's 200+ research ideas in mechanistic interpretability.

Sign up below to be notified before the kickoff!

Jeremy Hadfield, Ran Wei, Tereza Okalova, and others have already signed up!

Alignment Jam hackathons

Join us in this iteration of the Alignment Jam research hackathons to spend 48 hours with fellow engaged researchers and engineers working in this exciting and fast-moving field!

Join the Discord where all communication will happen. Check out research project ideas for inspiration and the in-depth starter resources under the "Resources" tab.

Rules

You will participate in teams of 1-5 people and submit a project on the entry submission page. Each project consists of multiple parts: 1) the PDF report, 2) a video overview of at most 10 minutes, and 3) the title, summary, and descriptions.

You are allowed to think about your project and engage with the starter resources before the hackathon starts, but your core research work should happen during the hackathon itself.

Besides these two points, the hackathons are mainly a chance for you to engage meaningfully with real research work on some of the state-of-the-art in interpretability!

Schedule

Subscribe to the calendar.

  • Friday 17:30 UTC: Keynote talk with Neel Nanda to inspire your projects and provide an introduction to the topic. Esben Kran will also give a short overview of the logistics.
  • Saturday and Sunday 14:00 UTC: Project discussion sessions on the Discord server.
  • Sunday at 18:00 UTC: Online ending session
  • Wednesday at 19:00 UTC: Project presentations

Past experiences

See what our great hackathon participants have said
Jason Hoelscher-Obermaier
Interpretability hackathon
The hackathon was a really great way to try out research on AI interpretability and getting in touch with other people working on this. The input, resources and feedback provided by the team organizers and in particular by Neel Nanda were super helpful and very motivating!
Luca De Leo
AI Trends hackathon
I found the hackathon very cool; I think it lowered my hesitance in participating in stuff like this in the future significantly. A whole bunch of lessons learned and Jaime and Pablo were very kind and helpful through the whole process.

Alejandro González
Interpretability hackathon
I was not that interested in AI safety and didn't know that much about machine learning before, but I heard from this hackathon thanks to a friend, and I don't regret participating! I've learned a ton, and it was a refreshing weekend for me.
Alex Foote
Interpretability hackathon
A great experience! A fun and welcoming event with some really useful resources for starting to do interpretability research. And a lot of interesting projects to explore at the end!
Sam Glendenning
Interpretability hackathon
Was great to hear directly from accomplished AI safety researchers and try investigating some of the questions they thought were high impact.
The collaborators who will join us for this hackathon.

Neel Nanda

Mechanistic interpretability researcher with the DeepMind Safety Team
Keynote speaker

Esben Kran

Co-director at Apart Research
Judge & Organizer

Fazl Barez

Co-director and research lead at Apart Research
Judge

Alex Foote

Apart Lab researcher
Judge

Bart Bussman

Independent researcher in mechanistic interpretability
Judge
Open In Colab

Coding GPT-2 from scratch

This notebook enables you to write GPT-2 from scratch with the help of the in-depth tutorial by Neel Nanda below.
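To give a flavor of what you end up building, here is a minimal sketch (our own illustration, not taken from the notebook) of a single causal self-attention head in PyTorch; the notebook itself also covers multi-head attention, LayerNorm, MLPs, embeddings, and training.

```python
# A toy causal self-attention head in PyTorch (illustration only; dimensions are
# arbitrary and multi-head attention, LayerNorm, MLPs, and positional embeddings
# are omitted).
import torch
import torch.nn as nn


class CausalSelfAttentionHead(nn.Module):
    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_head, bias=False)
        self.k = nn.Linear(d_model, d_head, bias=False)
        self.v = nn.Linear(d_model, d_head, bias=False)
        self.o = nn.Linear(d_head, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_model]
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # [batch, seq, seq]
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))        # causal mask: no attending to future tokens
        pattern = scores.softmax(dim=-1)                        # attention pattern
        return self.o(pattern @ v)                              # head output written back to the residual stream


x = torch.randn(1, 10, 64)
print(CausalSelfAttentionHead(d_model=64, d_head=16)(x).shape)  # torch.Size([1, 10, 64])
```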

If you'd like to check out a longer series of tutorials that builds up Transformers and language modeling from the basics, then watch this playlist from Andrej Karpathy, the former AI lead at Tesla.

Open In Colab

See an example of a research process using TransformerLens

In this video and Colab demo, Neel shows a live research process using the TransformerLens library. It is split into the chapters of 1) experiment design, 2) model training, 3) surface-level interpretability, and 4) reverse engineering.

Open In Colab

Replicate the "Interpretability in the Wild" paper

This code notebook goes through the process of reverse engineering a very specific task. Here we get an overview of very useful techniques in mechanistic Transformer interpretability:

  • Direct logit attribution to layers and to heads and identification of the attention heads in specific layers that affect our output the most
  • Visualizing attention patterns and explaining information transfer using attention heads
  • Using activation patching (or causal tracing) to localize which activations matter the most for the output

See an interview with the authors of the original paper and one of the authors' Twitter thread about the research.
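As a rough illustration of the first technique on the list, here is a hedged sketch of direct logit attribution using TransformerLens; the prompt and the " Mary"/" John" token pair are hypothetical stand-ins for the indirect-object-identification setup, and the notebook itself goes much deeper.

```python
# A hedged sketch of direct logit attribution with TransformerLens (illustration
# only; the prompt and token pair mimic the indirect-object-identification task).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "When John and Mary went to the store, John gave a drink to"
logits, cache = model.run_with_cache(prompt)

# Residual-stream direction that increases the " Mary" logit relative to " John".
logit_diff_dir = (
    model.W_U[:, model.to_single_token(" Mary")]
    - model.W_U[:, model.to_single_token(" John")]
)

# Decompose the final residual stream into per-component contributions
# (embeddings, each attention layer, each MLP layer) and apply the final
# LayerNorm, so dot products with the direction are direct logit contributions.
resid_stack, labels = cache.decompose_resid(layer=-1, return_labels=True)
resid_stack = cache.apply_ln_to_stack(resid_stack, layer=-1)
contributions = resid_stack[:, 0, -1, :] @ logit_diff_dir  # final prompt position

for label, value in zip(labels, contributions):
    print(f"{label:>12}: {value.item():+.3f}")
```

Each printed value is that component's direct contribution to the " Mary" vs " John" logit difference at the last position, which is a quick way to spot the layers and heads worth investigating further.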

Open In Colab

Running TransformerLens to easily analyze activations in language models

This demo notebook goes into depth on how to use the TransformerLens library. It contains code explanations of the following core features of TransformerLens:

  1. Loading and running models
  2. Saving activations from a specific example run
  3. Using the unique Hooks functionality to intervene on and access activations

It is designed to be easy to work with and to help researchers enter a flow state more easily. Read more on the GitHub page and see the Python package on PyPI.
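The three features above map roughly onto the following minimal sketch (our own illustration, not part of the demo notebook); the prompt and the choice of layer and head to ablate are arbitrary.

```python
# A minimal sketch of the three TransformerLens features listed above
# (illustration only; the prompt and the head we ablate are arbitrary).
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

# 1) Load and run a model.
model = HookedTransformer.from_pretrained("gpt2")
prompt = "Mechanistic interpretability aims to reverse engineer"
logits = model(prompt)

# 2) Save activations from a specific run.
logits, cache = model.run_with_cache(prompt)
print(cache["pattern", 0].shape)  # layer-0 attention patterns: [batch, head, query_pos, key_pos]

# 3) Use hooks to intervene on activations: zero-ablate head 7 in layer 0.
def ablate_head(z, hook, head=7):
    z[:, :, head, :] = 0.0  # hook_z has shape [batch, pos, head_index, d_head]
    return z

ablated_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[(utils.get_act_name("z", 0), ablate_head)],
)
top_before = model.tokenizer.decode(logits[0, -1].argmax().item())
top_after = model.tokenizer.decode(ablated_logits[0, -1].argmax().item())
print(f"Top next-token prediction: {top_before!r} -> {top_after!r}")
```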

Also check out Stefan Heimersheim's "How to: Transformer Mechanistic Interpretability —with 40 lines of code or less!!", which is a more-code, fewer-words version of the demo notebook.

Transformer Visualizer: A Redwood Research tool for Transformer interaction

Open the visualizer and read the documentation to work with the Transformer Visualizer tool.

Open In Colab

Rank-One Model Editing (ROME): Editing Transformers' token associations

This paper introduced the causal tracing method and uses it to locate and edit a model's associations between tokens. It is a very useful method for understanding which areas of a neural network contribute the most to a specific output.
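As a hedged sketch of the causal tracing idea (not the authors' code, which uses GPT-2 XL, noise calibrated to embedding statistics, and averaging over many noise samples and layer/position pairs), one can corrupt the subject tokens' embeddings and then restore a single clean hidden state to see how much of the correct answer's probability it recovers; the prompt, layer, and positions below are hypothetical.

```python
# A hedged sketch of ROME-style causal tracing with TransformerLens (illustration only).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
prompt = "The Eiffel Tower is located in the city of"
answer_id = model.to_single_token(" Paris")
tokens = model.to_tokens(prompt)

# Clean run: cache all activations so we can restore them later.
clean_logits, clean_cache = model.run_with_cache(tokens)

subject_positions = [1, 2, 3]                 # hypothetical token positions of "Eiffel Tower"
layer, restore_pos = 6, tokens.shape[1] - 1   # where to restore a clean hidden state

def corrupt_embeddings(embed, hook):
    # Corrupt the run by adding noise to the subject's token embeddings.
    noise = 0.1 * torch.randn_like(embed[:, subject_positions])
    embed[:, subject_positions] = embed[:, subject_positions] + noise
    return embed

def restore_resid(resid, hook):
    # Patch the clean residual stream back in at one (layer, position).
    resid[:, restore_pos] = clean_cache["resid_post", layer][:, restore_pos]
    return resid

patched_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[
        ("hook_embed", corrupt_embeddings),
        (f"blocks.{layer}.hook_resid_post", restore_resid),
    ],
)

clean_p = clean_logits[0, -1].softmax(-1)[answer_id].item()
patched_p = patched_logits[0, -1].softmax(-1)[answer_id].item()
print(f"P(' Paris'): clean {clean_p:.3f} vs corrupted-then-restored {patched_p:.3f}")
```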

Open In Colab

Analyses into grokking

See the website for the work and the article detailing it, along with the Twitter thread by Neel Nanda. See also the updated (but less intelligible) notebook on progress measures for grokking (from the article's GitHub).

Research project ideas

Get inspired for your own projects with these ideas developed during the reading groups! Go to the Resources tab to engage more with the topic.

Explore the list of mechanistic interpretability ideas on the AI Safety Ideas platform.

Registered jam sites

Montreal Interpretability Hackathon 3.0
Join us in Montreal on July 14th at 1:30PM ET at L'Esplanade Tranquille, 1442 Clark, second floor, for a weekend research sprint in ML interpretability!
Global interpretability hackathon 3.0
We are once again hosting the virtual segment of the interpretability hackathon online! Join us in our Discord server to interact with engaged researchers across the world.
Visit event page
Alignment Jam Discord
Prague Interpretability hackathon
Join us in Fixed Point in Prague - Vinohrady, Koperníkova 6 for a weekend research sprint in ML interpretability!
Visit event page
Prague Fixed Point

Register your own site

The in-person hubs for the Alignment Jams are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research and engineering community. Read more about organizing and use the media below to set up your event.

Social media for your jam site [coming soon]

Event cover image
Social media message

Join us when we investigate what happens within the brains of language models!

DeepMind researcher Neel Nanda joins us to explore the field of LLM neuroscience during this weekend. Get ready to create impactful research with people across the world!

Don't miss this opportunity to explore machine learning more deeply, network, and challenge yourself!

Register now: https://alignmentjam.com/jam/interpretability

[or add your event link here]


Submit your project

Use this template for the report submission. As you create your project presentations, upload your slides here, too. Make a recording of your slideshow or project with the recording capability of e.g. Keynote, PowerPoint, or Slides (using Vimeo).


Accepted submissions to the hackathon

Big thanks to everyone who submitted their work. Your efforts have made this event a success and set a new bar for what we can expect in future editions of the hackathon!

We want to extend our appreciation to our judges Fazl Barez, Alex Foote, Esben Kran, and Bart Bussman, and to our keynote speaker Neel Nanda. Rewatch the lightning talks from the top 4 winning projects below.

Give us your feedback
Relating induction heads in Transformers to temporal context model in human free recall
This study explores the parallels between the mechanisms of induction heads in Transformer models and the process of sequential memory recall in humans, finding surprising similarities that could potentially enhance our understanding of both artificial intelligence and human cognition.
Ji-An Li
Solo Moonhowl
Who cares about brackets?
Investigating how GPT2-small is able to accurately predict closing brackets
Theo Clark, Alex Roman, Hannes Thurnherr
Team Brackets
Embedding and Transformer Synthesis
I programmatically created a set of embeddings that can be used to perfectly reconstruct a binary classification function (“embedding synthesis”). I used these embeddings to programmatically set weights for a 1-layer transformer that can also perfectly reconstruct the classification function (“transformer synthesis”). With one change, this reconstruction matches my original hypothesis of how a pre-existing transformer works. I ran several experiments on my synthesized transformer to evaluate my synthetic model.
Rick Goldstein
Rick Goldstein
Interpreting Planning in Transformers
We trained some simple models that figure out how to traverse a graph from a list of edges, which is kind of "planning" in some sense if you squint, and got some traction on interpreting one of them.
Victor Levoso Fernandez, Abhay Sheshadri
Shoggoth Neurosurgeons
Towards Interpretability of 5 digit addition
This paper details a hypothesis for the internal structure of the 5-digit addition model that may explain the observed variability, and proposes specific testing to confirm (or not) the hypothesis.
Philip Quirke
Philip Quirke
Factual recall rarely happens in attention layer
In this work, I investigated whether factual information is saved only in the FF layer or also in the attention layers, and found that, with a large enough FF hidden dimension, factual information is rarely saved in the attention layers.
Bary Levy
mentaleap
Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model
SoLU activation functions have been shown to make large language models more interpretable, incentivizing alignment of a fraction of features with the standard basis. However, this happens at the cost of suppression of other features. We investigate this problem using experiments suggested in Nanda's 2023 work “200 Concrete Open Problems in Mechanistic Interpretability”. We conduct three main experiments: (1) we investigate the layernorm scale factor changes on a variety of input prompts; (2) we investigate the logit effects of neuron ablations on neurons with relatively low activation; (3) also using ablations, we attempt to find tokens where “the direct logit attribution (DLA) of the MLP layer is high, but no single neuron is high”.
Mateusz Bagiński, Kunvar Thaman, Rohan Gupta, Alana Xiang, j1ng3r
SoLUbility
One is 1- Analyzing Activations of Numerical Words vs Digits
Extensive research in mechanistic interpretability has showcased the effectiveness of a multitude of techniques for uncovering intriguing circuit patterns. We utilize these techniques to compare similarities and differences among analogous numerical sequences, such as the digits “1, 2, 3, 4”, the words “one, two, three, four”, and the months “January, February, March, April”. Our findings demonstrate preliminary evidence suggesting that these semantically related sequences share common activation patterns in GPT-2 Small.
Mikhail L
DPO vs PPO comparative analysis
We perform a comparative analysis of the DPO and PPO algorithms, using techniques from interpretability to attempt to understand the difference between the two.
Rauno Arike, Luke Marks, Amir Abdullah, Luna Mendez
DPOvsPPO
Experiments in Superposition
In this project we do a variety of experiments on superposition. We try to understand superposition in attention heads, MLP layers, and nonlinear computation in superposition.
Kunvar Thaman, Alice Rigg, Narmeen Oozeer, Joshua David
Team Super Position 1
Multimodal Similarity Detection in Transformer Models
[hidden]
Tereza Okalova, Toyosi Abu, James Thomson
End Black Box Syndrome
Toward a Working Deep Dream for LLM's
This project aims to enhance language model interpretability by generating sentences that maximally activate a specific neuron, inspired by the DeepDream technique in image models. We introduce a novel regularization technique that optimizes over a lower-dimensional latent space rather than the full 768-dimensional embedding space, resulting in more coherent and interpretable sentences. Our approach uses an autoencoder and a separate GPT-2 model as an encoder, and a six-layer transformer as a decoder. Despite the current limitation of our autoencoder not fully reconstructing sentences, our work opens up new directions for future research in improving language model interpretability.
Scott Viteri and Peter Chatain
PeterAndScott
Residual Stream Verification via California Housing Prices Experiment
In this data science project, I conducted an experiment to verify the Residual Stream as a Shared Bandwidth Hypothesis. The study utilized California Housing Prices data to support the experimental investigation.
Jonathan Batista Ferreira
Condor camp team
Problem 9.60 - Dimensionality reduction
The idea is to separate positive (1) and negative (0) comments in the vector space – the better the model, the better the separation. We could see the separation using a dimension reduction (PCA) of the vectors in 2 dimensions.
Juliana Carvalho de Souza
Juliana's team
Goal Misgeneralization
The main argument put forward in the papers is that we have to be careful about the inner alignment problem. We could reach terrible outcomes if this problem scales as we continue developing more powerful AIs, assuming the use of Reinforcement Learning from Human Feedback (RLHF).
João Lucas Duim
João Lucas Duim

Send in pictures of you having fun hacking away!

We love to see the community flourish and it's always great to see any pictures you're willing to share uploaded here.

Q&A with Neel Nanda
Discussing how Transformer models traverse graphs!