
Safety Benchmarks Hackathon

Signups
Vansh
Santosh Jayanth
Ali Panahi
Simon Lermen
Aideen
Quentin
Jonathan
Sophie
Corey Morris
Jason Hoelscher-Obermaier
Jonathan
Maxime Riche
Vladislav Bargatin
Ariel Gil
Codruta Lugoj
Heramb Podar
Stephen Jiang
Josh Witt
Roman Hauksson
Mark Rogers
Yanga Bei
Hana Kalivodová
Roland Pihlakas
Jiamin Lim
Sunishchal Dev
Juan Calderon
Aavishkar Gautam
Arjun Verma
Oliver Daniels-Koch
Gabriel Mukobi
Maksym
Esben Kran
Jonas Kgomo
Anmol Goel
Vamshi Krishna Bonagiri
Eva Cernikova
Tassilo Neubauer
Guifu Liu
Desik
Hunter Hasenfus
Aishwarya Gurung
Soroush Pour
Bethesda Gambotto-Burke
Edward
Jan Provaznik
Harrison Gietz
Vishni g
Sai Shinjitha M
Ali Panahi
Samuel Selleck
Victoria Panassevitch
Mishaal
Habeeb
Evan Harris
Nina Rimsky
Marina T
Johanna Einsiedler
Jonathan Grant
Ludvig Lilleby Johansson
Esben Kran
Entries
From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety
AI & Cyberdefense
Manipulative Expression Recognition (MER) and LLM Manipulativeness Benchmark
Identifying undesirable conduct when interacting with individuals with psychiatric conditions
Exploring the Robustness of Model-Graded Evaluations of Language Models
MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark
Exploitation of LLM’s to Elicit Misaligned Outputs
June 30th to July 2nd 2023

This hackathon ran from June 30th to July 2nd 2023. You can now judge entries.

Explore safer AI with fellow researchers and enthusiasts

Large AI models are released nearly every week. We need ways to evaluate these models (especially ones as complex as GPT-4) to ensure they will not fail critically after deployment, e.g. through autonomous power-seeking, biases toward unethical behavior, or other phenomena that only emerge in deployment (such as inverse scaling).

Participate in the Alignment Jam on safety benchmarks and spend a weekend with AI safety researchers formulating and demonstrating new ideas for measuring the safety of artificially intelligent systems.

Rewatch the keynote talk by Alexander Pan above

The MACHIAVELLI benchmark (left) and the Inverse Scaling Prize (right)

Sign up below to be notified before the kickoff! Read up on the schedule, see the instructions for how to participate, and find inspiration below.


Schedule

The schedule depends on the location you participate from, but below are the international virtual events that anyone can tune into wherever they are.

Friday, June 30th

  • UTC 17:00: Keynote talk by Alexander Pan, lead author of the MACHIAVELLI benchmark paper, along with logistical information from the organizing team
  • UTC 18:00: Team formation and coordination

Saturday, July 1st

Sunday, July 2nd

  • UTC 14:00: Virtual project discussions
  • UTC 19:00: Online ending session
  • Monday UTC 2:00: Submission deadline

Wednesday, July 5th

  • UTC 19:00: International project presentations!

Submission details

You are required to submit:

  • A PDF report using the template linked on the submission page
  • A video of at most 10 minutes presenting your findings and results (see inspiration and instructions for how to do this on the submission page)

You are optionally encouraged to submit:

  • A slide deck describing your project
  • A link to your code
  • Any other material you would like to link

Inspiration

Here are a few inspiring papers, talks, and posts about safety benchmarks. See more starter code, articles, and readings under the "Resources" tab.

CounterFact+ benchmark

Past experiences

See what our great hackathon participants have said
Jason Hoelscher-Obermaier
Interpretability hackathon
The hackathon was a really great way to try out research on AI interpretability and to get in touch with other people working on this. The input, resources, and feedback provided by the team organizers, and in particular by Neel Nanda, were super helpful and very motivating!
Luca De Leo
AI Trends hackathon
I found the hackathon very cool; I think it significantly lowered my hesitance about participating in events like this in the future. A whole bunch of lessons learned, and Jaime and Pablo were very kind and helpful through the whole process.

Alejandro González
Interpretability hackathon
I was not that interested in AI safety and didn't know much about machine learning before, but I heard about this hackathon thanks to a friend, and I don't regret participating! I've learned a ton, and it was a refreshing weekend for me.
Alex Foote
Interpretability hackathon
A great experience! A fun and welcoming event with some really useful resources for starting to do interpretability research. And a lot of interesting projects to explore at the end!
Sam Glendenning
Interpretability hackathon
Was great to hear directly from accomplished AI safety researchers and try investigating some of the questions they thought were high impact.

Intro talk

The collaborators who will join us for this hackathon.

Alexander Pan

Author of the MACHIAVELLI benchmark (Pan et al., 2023)
Keynote speaker

Antonio Valerio Miceli-Barone

Postdoctoral researcher at the University of Edinburgh. Author of Miceli-Barone et al. (2023).
Speaker & judge

Fazl Barez

Research lead at Apart Research
Judge & Organizer

Esben Kran

Director at Apart Research
Organizer

Readings

Read up on the topic before we start! The reading group will work through these materials together up to the kickoff.
Join the reading group

Inverse scaling phenomena for large language models

MACHIAVELLI benchmark

OpenAI Safety Gym

Eval Harness: LLM benchmark tool

MMLU: Measuring language ability

RL Benchmarks list for many tasks

Language Model Evaluation Harness

The LMEH is a set of over 200 tasks that you can automatically run your models through. You can easily use it by running pip install lm-eval before importing the library in your script.

See a short Colab notebook introducing how to use it here.

Check out the GitHub repository and the guide to adding a new benchmark so you can test your own tasks through their easy interface.
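As a rough illustration of the task-plus-metric pattern that evaluation harnesses like the LMEH are built around, here is a self-contained toy sketch. The class, the toy model, and all names below are invented for illustration and are not the lm-eval API.

```python
# Minimal sketch of the task/metric pattern used by evaluation harnesses.
# SimpleTask and toy_model are illustrative stand-ins, not lm-eval's API.

def toy_model(prompt):
    """Stand-in for a language model: answers addition prompts like '1+2'."""
    a, b = [int(x) for x in prompt.split("+")]
    return str(a + b)

class SimpleTask:
    """A task is a set of (input, reference) pairs plus an accuracy metric."""
    def __init__(self, examples):
        self.examples = examples

    def evaluate(self, model):
        # Score the model by exact match against the reference answers.
        correct = sum(model(inp) == ref for inp, ref in self.examples)
        return correct / len(self.examples)

task = SimpleTask([("1+1", "2"), ("2+3", "5"), ("10+4", "14")])
print(f"accuracy: {task.evaluate(toy_model):.2f}")  # accuracy: 1.00
```

A real harness adds a task registry, few-shot prompt construction, and many metrics on top of this same pattern.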

Starter resources

Check out the core starter resources that help you get started with your research as quickly as possible! The Colab links will be updated before the kickoff.
Open In Colab

Running Adversarial Robustness Toolbox (ART) adversarial attacks and defense

Check out the repository here along with a long list of Jupyter Notebooks here. We have converted one of the image attack algorithm examples to Google Colab here.

Using ART, you can create comprehensive tests for adversarial attacks on models and/or test existing ones. Check out the documentation here. It does not seem possible to do textual adversarial attacks with ART, though that would be quite interesting.

For textual attacks, you might use the TextAttack library. It also contains a list of textual adversarial attacks. There are a number of tutorials, the first showing an end-to-end training, evaluation and attack loop (see it here).
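To give intuition for the attack loop that libraries like TextAttack implement, here is a self-contained toy sketch. The keyword classifier and the greedy word-deletion attack are invented stand-ins, not TextAttack's API.

```python
# Toy greedy word-deletion attack illustrating the textual attack loop.
# The classifier and attack below are illustrative, not TextAttack's API.

def toy_classifier(text):
    """Returns a crude P(positive) based on keyword counts."""
    words = text.lower().split()
    pos = sum(w in {"good", "great", "love"} for w in words)
    neg = sum(w in {"bad", "awful", "hate"} for w in words)
    return (1 + pos) / (2 + pos + neg)

def greedy_deletion_attack(text, threshold=0.5):
    """Repeatedly delete the single word whose removal most lowers the
    positive score, until the predicted label flips (or no deletion helps)."""
    words = text.split()
    while toy_classifier(" ".join(words)) >= threshold and len(words) > 1:
        scores = [toy_classifier(" ".join(words[:i] + words[i + 1:]))
                  for i in range(len(words))]
        best = min(range(len(words)), key=lambda i: scores[i])
        if scores[best] >= toy_classifier(" ".join(words)):
            break  # no deletion lowers the score; give up
        del words[best]
    return " ".join(words)

adv = greedy_deletion_attack("great movie but awful pacing")
print(adv, toy_classifier(adv) < 0.5)
```

Real attacks use the same search loop with smarter transformations (synonym swaps, character perturbations) and constraints that keep the perturbed text semantically close to the original.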

Open In Colab

OpenAI Gym starter

You can use the OpenAI Gym to run interesting reinforcement learning agents with your own testing spin on top!

See how to use the Gym environments in this Colab. It does not train an RL agent but we can see how to initialize the game loop and visualize the results. See how to train an offline RL agent using this Colab. Combining the two should be relatively straightforward.

The OpenAI Safety Gym is too advanced for this weekend's work, simply because it is tough to set up, while the standard OpenAI Gym generally works great. Read more about getting started in this article.
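For intuition about the reset/step interface that Gym environments expose, here is a self-contained toy environment and game loop. The environment is invented for illustration and is not registered with Gym.

```python
import random

# A minimal environment with the reset/step interface that OpenAI Gym
# (now Gymnasium) environments expose; CountdownEnv is a toy stand-in.

class CountdownEnv:
    """State starts at 5; action 0 decrements, action 1 resets to 5.
    The episode ends with reward 1 when the state reaches 0."""
    def reset(self):
        self.state = 5
        return self.state

    def step(self, action):
        self.state = 5 if action == 1 else self.state - 1
        done = self.state == 0
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}

# The standard Gym game loop with a random policy.
env = CountdownEnv()
obs, done, steps = env.reset(), False, 0
random.seed(0)
while not done and steps < 100:
    obs, reward, done, info = env.step(random.choice([0, 0, 0, 1]))
    steps += 1
print("episode finished:", done, "in", steps, "steps")
```

Swapping the random `env.step(...)` call for a trained policy's action is exactly the change needed to combine the two Colabs mentioned above.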


Open In Colab

Perform neural network formal verification

This tutorial from AAAI 2022 has two Colab notebooks:

  1. Colab Demo for the auto_LiRPA library: automated computation of neural-network output bounds under input perturbations. See this video for an introduction.
  2. Colab Demo for the α,β-CROWN (alpha-beta-CROWN) library. See this video for an introduction.

These are very useful introductions for thinking about how to design formal tests for various properties of our models, along with useful tools for ensuring the safety of our models against adversarial examples and out-of-distribution scenarios.

See also Certified Adversarial Robustness via Randomized Smoothing.
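As a minimal sketch of the bound-propagation idea behind tools like auto_LiRPA, here is hand-rolled interval bound propagation (IBP) through one linear+ReLU layer. The weights, helper names, and the certification example are invented for illustration and are not the library's API.

```python
# Interval bound propagation (IBP) for one linear+ReLU layer, the basic
# building block behind neural-network verifiers; a hand-rolled sketch.

def linear_interval(lower, upper, W, b):
    """Propagate an input box [lower, upper] through y = Wx + b.
    Positive weights pick up the lower input bound for the output lower
    bound, negative weights the upper bound, and vice versa."""
    out_l, out_u = [], []
    for row, bias in zip(W, b):
        lo = bias + sum(w * (l if w >= 0 else u)
                        for w, l, u in zip(row, lower, upper))
        hi = bias + sum(w * (u if w >= 0 else l)
                        for w, l, u in zip(row, lower, upper))
        out_l.append(lo)
        out_u.append(hi)
    return out_l, out_u

def relu_interval(lower, upper):
    """ReLU is monotone, so it maps bounds elementwise."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

# Certify that for all inputs within eps=0.1 of x0, output 0 stays positive.
W, b = [[1.0, -2.0], [0.5, 1.0]], [0.1, -0.2]
x0, eps = [1.0, 0.2], 0.1
l0 = [x - eps for x in x0]
u0 = [x + eps for x in x0]
l1, u1 = relu_interval(*linear_interval(l0, u0, W, b))
print("certified lower bound on output 0:", l1[0])
```

Verifiers like auto_LiRPA chain this propagation through many layers and tighten the bounds with linear relaxations, but the certificate has the same form: a guaranteed output range for an entire input region.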

Open In Colab

BIG-Bench benchmark for LLMs

Using SeqIO to inspect and evaluate BIG-bench json tasks:

Creating new BIG-bench tasks

Read the paper here and see the GitHub repository here.

Open In Colab

Loading GPT models into Google Colab

We will use the wonderful package EasyTransformer from Neel Nanda that was used heavily at the last hackathon. It contains some helper functions to load pretrained models.

There are all the famous ones, from GPT-2 all the way to Neel Nanda's custom 12-layer SoLU-based Transformer models. See a complete list here along with an example toy model here.

See this Colab notebook to use the EasyTransformer model downloader utility. It also has all the available models there from EleutherAI, OpenAI, Facebook AI Research, Neel Nanda and more.

You can also run this in Paperspace Gradient. See the code on GitHub here and how to integrate GitHub and Paperspace here. See a fun example of using Paperspace Gradient like Google Colab here. Gradient offers a somewhat larger GPU on its free tier.

You can also use the Hugging Face Transformers library directly, like this.

Open In Colab

Inverse scaling notebook

"All alignment problems are inverse scaling problems" is one fascinating take on AI safety. If we can generate benchmarks that showcase the alignment failures of larger models, this becomes very interesting.

See the Colab notebook here. You can also read more about the "benchmarks" that won the first round of the tournament here along with the tournament Github repository here.
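As a toy sketch of what "inverse scaling" means operationally, the check below flags a task whose accuracy decreases as model size grows. The numbers are invented for illustration, not actual results from the prize rounds.

```python
# Toy check for inverse scaling: a task "inverse-scales" when accuracy
# strictly decreases as the model's parameter count grows.

def is_inverse_scaling(size_to_accuracy):
    """True if accuracy strictly decreases with increasing model size."""
    points = sorted(size_to_accuracy.items())  # sort by parameter count
    accs = [acc for _, acc in points]
    return all(a > b for a, b in zip(accs, accs[1:]))

# Made-up accuracies for three model sizes (1e8, 1e9, 1e10 parameters).
normal_task = {1e8: 0.42, 1e9: 0.55, 1e10: 0.71}
suspect_task = {1e8: 0.63, 1e9: 0.51, 1e10: 0.44}
print(is_inverse_scaling(normal_task), is_inverse_scaling(suspect_task))
# prints: False True
```

A submission to an inverse-scaling benchmark is essentially a dataset on which this trend holds across a model family, plus an argument for why the trend reflects a genuine alignment failure rather than noise.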

Open In Colab

Making GriddlyJS run

This Colab notebook gives a short overview of how to use the Griddly library in conjunction with the OpenAI Gym.

Jump on the Griddly.ai website to create an environment and load it into the Colab notebook. Their documentation explains what it all means in much more detail.

Open In Colab

Notebooks by Hugging Face

These notebooks all pertain to the usage of Transformers and show how to use the library. See them all here.

Research project ideas

Get inspired for your own projects with these ideas developed during the reading groups! Go to the Resources tab to engage more with the topic.

  • Extending the MACHIAVELLI benchmark's metrics
  • Implementing and contributing to the LLM evals multi-metric benchmark
  • Extending the results of existing safety benchmarks to newer models such as the RWKV architecture
  • Automated red teaming as automated verification
  • Compare and document the robustness differences between the 14B RNN and the Pythia models
  • Annotate The Pile with GPT-4 to separate it into levels of bias, risk, etc., and release the result as new training sets
  • Differences in multimodal and text-only model robustness and red teaming using ImageBind
  • Cooperative AI and inter-agent verifiable safety: how do we verify multi-agent safety?
  • Create ways to fuzz language models and other ML systems by automatically running large numbers of generated inputs
  • Create a benchmark of multimodal tasks for LLMs to complete
  • Generate a moral dataset with GPT-4 that references and updates the moral scenarios dataset
  • Stress test MACHIAVELLI (Pan et al., 2023) and develop new ways to work with it
  • Generate a framework of tests to solve
  • Create cybersecurity benchmarks for LLMs to protect against
Registered jam sites

Online Safety Benchmarks Hackathon
Join us online during the whole weekend on the Alignment Jam Discord server! You can participate from anywhere, as long as you have an internet connection.
Copenhagen Safety Benchmarks Hackathon
Be a part of meaningful research with a community of excited AI safety researchers at Station (Howitzvej 4).
Visit event page
EnigmA Copenhagen
Safety Benchmarks Hackathon
Join us in Fixed Point in Prague - Vinohrady, Koperníkova 6 for a weekend research sprint in Safety Benchmarks!
Visit event page
Prague Fixed Point

Register your own site

The in-person hubs for the Alignment Jams are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research and engineering community. Read more about organizing and use the media below to set up your event.

Social media for your jam site

Event cover image
Social media message

We're hosting a hackathon to find the best benchmarks for safety in large language models! 

Large models are becoming increasingly important and we want to make sure that we understand the safety of these systems.

During this weekend, we get a chance to create impactful research on this problem together with people across the world.

Don't miss this opportunity to explore machine learning deeper, network, and challenge yourself!

Register now: https://alignmentjam.com/jam/benchmarks

[or add your event link here]

Event cover image
Social media 1

Submit your project

Use this template for the report submission. As you create your project presentations, upload your slides here, too. Make a recording of your slideshow or project with the recording capability of e.g. Keynote, PowerPoint, or Google Slides (using Vimeo).


Follow or watch the project lightning talks

You can also join the Discord to participate in the question-and-answer sessions associated with each talk, Wednesday evening at 21:00 CEST (12:00 PM PDT).

Respond to our feedback survey and receive a free book!
Exploring the Robustness of Model-Graded Evaluations of Language Models
There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models. We test the robustness of these model-graded evaluations to prompt injections on different datasets including a new Deception Eval. The prompt injections are designed to influence the evaluator to change their grading. We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model. We find significant susceptibility to these injections in state-of-the-art commercial models on all examined evaluations. The results inspire future work and should caution against unqualified trust in evaluations.
Simon Lermen, Ondřej Kvapil
BruteForceBaryonyx
From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety
In this paper, we extend the MACHIAVELLI framework by incorporating sensitivity to event density, thereby enhancing the benchmark's ability to discern diverse value systems among models. This enhancement enables the identification of potential malicious actors who are prone to engaging in a rapid succession of harmful actions, distinguishing them from well-intentioned actors.
Heramb Podar, Vladislav Bargatin
Turing's Baristas
MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark
MACHIAVELLI is an AI safety benchmark that uses text-based choose-your-own-adventure games to measure the tendency of AI agents to behave unethically in the pursuit of their goals. We discuss what we see as two crucial assumptions behind the MACHIAVELLI benchmark and how these assumptions impact its validity as a test of the ethical behavior of AI agents deployed in the real world. The assumptions we investigate are (1) the equivalence of action evaluation and action generation, and (2) the independence of ethical judgments from agent capabilities. We then propose modifications to the MACHIAVELLI benchmark to empirically study to what extent these assumptions hold for AI agents in the real world.
Roman Leventov, Jason Hoelscher-Obermaier
MAXIAVELLI
Exploitation of LLM’s to Elicit Misaligned Outputs
This paper primarily focuses on an automated approach a bad actor might pursue to exploit LLMs via intelligent prompt engineering, combined with the use of dual agents, to produce harmful code and improve it. We also use step-by-step questioning instead of a single prompt to make sure the LLMs give harmful outputs instead of refusing. Finally, we observe that GPT-4, which according to empirical evidence and existing literature is more resilient to harmful inputs and outputs, can produce more harmful outputs; we call this Inverse Scaling Harm.
Desik Mandava, Jayanth Santosh, Aishwarya Gurung
DeJaWa
Identifying undesirable conduct when interacting with individuals with psychiatric conditions
This study evaluates the interactions of the gpt3.5-turbo-0613 model with individuals with psychiatric conditions, using posts from the r/schizophrenia subreddit. Responses were assessed based on ethical guidelines for psychotherapists, covering responsibility, integrity, justice, and respect. The results show the model generally handles sensitive interactions safely, but more research is needed to fully understand its limits and potential vulnerabilities in unique situations.
Jan Provaznik, Jakub Stejskal, Hana Kalivodová
Prague is Mental
Manipulative Expression Recognition (MER) and LLM Manipulativeness Benchmark
A software library where people can analyse a transcript of a conversation or a single message. The library annotates relevant parts of the text with labels of the different manipulative communication styles detected in the conversation or message. One of the main use cases is evaluating the presence of manipulation in responses or conversations generated by large language models. The other main use case is evaluating human-created conversations and responses. The software does not do fact checking; it focuses on labelling the psychological style of the expressions present in the input text.
Roland Pihlakas
Detect/annotate manipulative communication styles using a provided list of labels
AI & Cyberdefense
[unfinished] While hosting the hackathon, I had a few hours to explore safety benchmarks in relation to cyberdefense and mechanistic interpretability. I present a few project ideas and research paths that might be interesting at the intersection of existential AI safety and cybersecurity.
Esben Kran
The Defenders

Send in pictures of you having fun hacking away!

We love to see the community flourish and it's always great to see any pictures you're willing to share uploaded here.

Office hours on Saturday with Ollie!
Saturday hacking away
3 AM at the Friday aftermath!
The Copenhagen office during the intro keynote talk
[admin] preliminary system test before the weekend starts
Q&A from the great first talk with Catalin!
Prague team hard at work :)
Q&A with Neel Nanda
Discussing how Transformer models traverse graphs!
We found an elephant on our break (a rare sight in Denmark)
Benchmark hackathon at the Turing Coffemachine!
A beautiful test image