Interpretability Hackathon

From November 11, 2022 to November 13, 2022
This event is finished. See the entries below.
Israel Interpretability Hackathon

EA Israel

We are a group of 6-10 (mainly) hackers, AI "experts", and neuroscientists.

Entries:
• Probing Conceptual Knowledge on Solved Games (Mentaleap, 2022)

Location: EA Israel

ENS Interpretability Hackathon

ENS Ulm

Arranged at the ENS Ulm, this jam site is open to students and faculty.

Entries:
• Sparsity Lens (Lens Makers, 2022)
• An Intuitive Logic for Understanding Autoregressive Language Models (Tamaya, 2022)
• Optimising image patches to change RL-agent behaviour (patch_optimizers, 2022)

Georgia Tech Interpretability Hackathon

Georgia Tech AI Safety Initiative

The AI Safety Initiative at Georgia Tech is hosting a small jam site for graduate and undergraduate students at the university.

Entries: none listed.

Location: Computer Science buildings at Georgia Tech, Atlanta, Georgia

Tallinn EA jam site

Tallinn Universities

Estonia EA (Efektiivne Altruism) is hosting the local jam site at the university in Tallinn.

Entries:
• Trying to make GPT2 dream (The Dreamers, 2022)

Location: Room M-342, Mare building, Uus-Sadama 5, Tallinn

Online & Global Hackathon

GatherTown

The GatherTown space will be open internationally for the duration of the hackathon, and we encourage virtual attendees to join there.

Entries:
• An Informal Investigation of Indirect Object Identification in Mistral GPT2-Small Battlestar (Chris Mathwin, 2022)
• Backup Transformer Heads are Robust to Ablation Distribution (Klein Bottle, 2022)
• Investigating Neuron Behaviour via Dataset Example Pruning and Local Search (Alex Foote, 2022)
• Mechanisms of Causal Reasoning (Mechanisms of Causal Reasoning, 2022)
• Observing and Validating Induction heads in SOLU-8l-old (Brian Muhia, 2022)

Location: GatherTown, online

LEAH Hackathon Site

London Universities

Imperial College, UCL, King's College, and LSE are jointly hosting the hackathon at the UCL EA offices in Regus, Charlotte Street.
Entries:
• Finding unusual neuron sets by activation vector distance (Gurkenglass, 2022)
• Neurons and Attention Heads that Look for Sentence Structure in GPT2 (Wolfgang, 2022)
• How to find the minimum of a list - Transformer Edition (BugSnax, 2022)
• Interpretability at a glance (The Glancers, 2022)
• Regularly Oversimplifying Neural Networks (Team RITMOTE, 2022)

Location: Soho, London, UK

Aarhus Interpretability Hackathon

Aarhus University

Once again, Aarhus University will host the Alignment Jam for interpretability in November.
Entries:
• Visualizing the effect prompt design has on text-davinci-002 mode collapse and social biases (Partners in Crime, 2022)

Location: Aarhus University, room 1485-241

Resources

Interpretability starter

Interpretability research is an exciting and growing field of machine learning. If we can understand what happens within neural networks across diverse domains, we can see why a network gives specific outputs, detect deception, understand its choices, and change how it works.

This list of resources was made for the Interpretability Hackathon (link) and contains an array of useful starter templates, tools to investigate model activations, and a number of introductory resources. Check out aisi.ai for some ideas for projects within ML & AI safety.

Inspiration

We have many ideas available for inspiration on the aisi.ai Interpretability Hackathon ideas list. A lot of interpretability research is available on distill.pub, transformer circuits, and Anthropic's research page.

Introductions to mechanistic interpretability

See also the tools available for interpretability:

Digestible research

Starter projects

🙋‍♀️ Simple templates & tools

Activation Atlas [tool]

The Activation Atlas article has many figures, each with an associated Google Colab notebook; click "Try in a notebook" to open it. An example is this notebook, which shows a simple activation atlas.

Additionally, they have this tool to explore which sorts of images neurons activate to the most.
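
The full atlas pipeline is involved, but a rough analogue of "which images does this unit like?" can be sketched in a few lines of PyTorch: hook an intermediate layer of a pretrained CNN and rank a folder of images by one channel's mean activation. The model, layer, channel index, and images/ folder below are illustrative assumptions, not part of the Activation Atlas code.

```python
# Sketch: rank images by how strongly they activate one channel of a pretrained CNN.
# The layer (layer4), channel index (42), and image folder are arbitrary examples.
from pathlib import Path

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.resnet18(pretrained=True).eval()
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

activations = {}

def save_activation(module, inputs, output):
    activations["feat"] = output.detach()

model.layer4.register_forward_hook(save_activation)

channel = 42  # which channel of layer4 to inspect
scores = []
for path in Path("images").glob("*.jpg"):  # hypothetical folder of images
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        model(x)
    # Average the channel's activation over all spatial positions.
    scores.append((activations["feat"][0, channel].mean().item(), path.name))

# Print the five images that most strongly activate the chosen channel.
for score, name in sorted(scores, reverse=True)[:5]:
    print(f"{score:.3f}  {name}")
```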

BertViz

BertViz is an interactive tool for visualizing attention in Transformer language models such as BERT, GPT2, or T5. It can be run inside a Jupyter or Colab notebook through a simple Python API that supports most Huggingface models. BertViz extends the Tensor2Tensor visualization tool by Llion Jones, providing multiple views that each offer a unique lens into the attention mechanism.

BertViz example image
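
To give a feel for the Python API, here is a minimal usage sketch following the pattern in the BertViz README; it needs to run in a Jupyter or Colab notebook for the view to render, and the model name is just one example of a supported Huggingface model.

```python
# Minimal BertViz usage; run in a Jupyter/Colab notebook so head_view can render.
from bertviz import head_view
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # any Huggingface model that returns attentions
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer.encode("The cat sat on the mat because it was tired",
                          return_tensors="pt")
outputs = model(inputs)
attention = outputs[-1]  # tuple of attention tensors, one per layer
tokens = tokenizer.convert_ids_to_tokens(inputs[0])

head_view(attention, tokens)  # interactive view of every attention head
```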

EasyTransformer [code]

A library for mechanistic interpretability called EasyTransformer (still in beta and has bugs, but it's functional enough to be useful!): https://github.com/neelnanda-io/Easy-Transformer/

Demo notebook of EasyTransformer

A demo notebook of how to use Easy Transformer to explore a mysterious phenomenon, looking at how language models know to answer "John and Mary went to the shops, then John gave a drink to" with Mary rather than John: https://colab.research.google.com/drive/1mL4KlTG7Y8DmmyIlE26VjZ0mofdCYVW6
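
As a minimal sketch of poking at the same prompt yourself (EasyTransformer is in beta, so method names may have shifted; treat the demo notebook as the canonical reference), you can load GPT-2 through the library and compare the logits it assigns to " Mary" and " John" at the final position:

```python
# Sketch assuming EasyTransformer exposes from_pretrained, to_tokens, and a
# Huggingface tokenizer; check the demo notebook if any of these names differ.
from easy_transformer import EasyTransformer

model = EasyTransformer.from_pretrained("gpt2")

prompt = "John and Mary went to the shops, then John gave a drink to"
tokens = model.to_tokens(prompt)   # tokenize the prompt (prepends BOS)
logits = model(tokens)             # shape [batch, position, d_vocab]
last = logits[0, -1]               # next-token logits at the final position

# " Mary" and " John" are single GPT-2 tokens, so we can compare them directly.
mary = model.tokenizer(" Mary")["input_ids"][0]
john = model.tokenizer(" John")["input_ids"][0]
print("logit(' Mary') - logit(' John') =", (last[mary] - last[john]).item())
```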

Converting Neural Networks to graphs [code]

This repository can be used to transform a linear neural network into a graph where each neuron is a node (annotated with its bias) and the directed edges between neurons carry the network's weights.

You can expand this project by visualizing the graph for the activations produced by specific inputs (changing the conversion from weights to activations), or by adapting it to convolutional neural networks. Check out the files below; a minimal sketch of the same idea follows the table.

Files:
• train.py: Creates model.pt, a linear MNIST classifier with a 500-unit hidden layer.
• to_graph.py: Generates a graph from model.pt.
• vertices.csv: Each neuron in the MNIST linear classifier with its bias and layer.
• edges.csv: Each connection in the neural network: from_id, to_id, weight.
• network_eda.Rmd: The R script for initial EDA and visualization of the network.
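
The files above come from the linked repository. As a rough, independent sketch of the same idea (not the repository's code), the snippet below turns a small fully connected PyTorch network into a networkx graph, storing each neuron's bias on its node and each weight on a directed edge:

```python
# Sketch: convert a small fully connected network into a directed graph.
# Neurons become nodes (with bias and layer attributes); weights become edge attributes.
# The hidden layer is smaller than the repo's 500-unit model, just to keep this fast.
import networkx as nx
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
linear_layers = [m for m in model if isinstance(m, nn.Linear)]

G = nx.DiGraph()
for i in range(linear_layers[0].in_features):   # input "neurons" have no bias
    G.add_node(f"L0_{i}", bias=0.0, layer=0)

for l, lin in enumerate(linear_layers, start=1):
    W = lin.weight.detach()                     # [out_features, in_features]
    b = lin.bias.detach()
    for j in range(lin.out_features):
        G.add_node(f"L{l}_{j}", bias=float(b[j]), layer=l)
        for i in range(lin.in_features):
            G.add_edge(f"L{l-1}_{i}", f"L{l}_{j}", weight=float(W[j, i]))

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```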

Reviewing explainability tools

There are a few tools that use interpretability to create understandable explanations of why a model gives the output it does. This notebook provides a small intro to the most relevant libraries (a short usage sketch follows the list):

  • ELI5: ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions, including image-based explanations of model output. It implements a few different analysis frameworks that work with a lot of different ML libraries, and it is the most complete tool for explainability.
  • LIME: Local Interpretable Model-agnostic Explanations. The TextExplainer library does a good job of using LIME on language models. Check out Christoph Molnar's introduction here.
  • SHAP: SHapley Additive exPlanations
  • MLXTEND: Machine Learning Extensions
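
As a small self-contained taste of the LIME approach (a sketch, not code from the linked notebook; the toy dataset is invented for illustration), the snippet below trains a tiny scikit-learn text classifier and asks LimeTextExplainer which words drove a single prediction:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from lime.lime_text import LimeTextExplainer

# Tiny made-up sentiment dataset, just to have something to explain.
texts = [
    "great acting and a wonderful story", "loved every minute of it",
    "a delightful, moving film", "absolutely fantastic direction",
    "terrible plot and wooden acting", "a boring, predictable mess",
    "awful pacing, I walked out", "dreadful dialogue throughout",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = positive, 0 = negative

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["negative", "positive"])
explanation = explainer.explain_instance(
    "a wonderful story ruined by terrible pacing",
    pipeline.predict_proba,
    num_features=5,
)
# Each tuple is (word, weight); the sign shows which class the word pushes towards.
print(explanation.as_list())
```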

The IML R package [code]

Check out this tutorial on using the IML package in R. The package provides a good interface for working with LIME, feature importance, ICE, partial dependence plots, Shapley values, and more.

👩‍🔬 Advanced templates and tools

Redwood Research's interpretability on Transformers [tool]

Redwood Research has created a wonderful tool that can be used to do research into how language models understand text. The "How to use" document and their instruction videos are very good introductions and we recommend reading/watching them since the interface can be a bit daunting otherwise.

Watch this video as an intro:

Understanding interp-tools by Redwood Research