
Safety Benchmarks Hackathon

Signups
Vansh
Santosh Jayanth
Ali Panahi
Simon Lermen
Aideen
Quentin
Jonathan
Sophie
Corey Morris
Jason Hoelscher-Obermaier
Jonathan
Maxime Riche
Vladislav Bargatin
Ariel Gil
Codruta Lugoj
Heramb Podar
Stephen Jiang
Josh Witt
Roman Hauksson
Mark Rogers
Yanga Bei
Hana Kalivodová
Roland Pihlakas
Jiamin Lim
Sunishchal Dev
Juan Calderon
Aavishkar Gautam
Arjun Verma
Oliver Daniels-Koch
Gabriel Mukobi
Maksym
Esben Kran
Jonas Kgomo
Anmol Goel
Vamshi Krishna Bonagiri
Eva Cernikova
Tassilo Neubauer
Guifu Liu
Desik
Hunter Hasenfus
Aishwarya Gurung
Soroush Pour
Bethesda Gambotto-Burke
Edward
Jan Provaznik
Harrison Gietz
Vishni g
Sai Shinjitha M
Ali Panahi
Samuel Selleck
Victoria Panassevitch
Mishaal
Habeeb
Evan Harris
Nina Rimsky
Marina T
Johanna Einsiedler
Jonathan Grant
Ludvig Lilleby Johansson
Esben Kran
Entries
From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety
AI & Cyberdefense
Manipulative Expression Recognition (MER) and LLM Manipulativeness Benchmark
Identifying undesirable conduct when interacting with individuals with psychiatric conditions
Exploring the Robustness of Model-Graded Evaluations of Language Models
MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark
Exploitation of LLM’s to Elicit Misaligned Outputs
June 30th to July 2nd 2023

This hackathon ran from June 30th to July 2nd 2023. You can now judge entries.

Explore safer AI with fellow researchers and enthusiasts

Large AI models are released nearly every week. We need ways to evaluate these models (especially ones as complex as GPT-4) to ensure they will not fail critically after deployment, e.g. through autonomous power-seeking, biases toward unethical behavior, or other phenomena that only emerge in deployment (such as inverse scaling).

Participate in the Alignment Jam on safety benchmarks and spend a weekend with AI safety researchers formulating and demonstrating new ideas for measuring the safety of artificially intelligent systems.

Rewatch the keynote talk by Alexander Pan above

The MACHIAVELLI benchmark (left) and the Inverse Scaling Prize (right)

Sign up below to be notified before the kickoff! Read up on the schedule, see the instructions for how to participate, and find inspiration below.


Schedule

The schedule depends on the location you participate from, but below are the international virtual events that anyone can tune into wherever they are.

Friday, June 30th

  • UTC 17:00: Keynote talk by Alexander Pan, lead author of the MACHIAVELLI benchmark paper, along with logistical information from the organizing team
  • UTC 18:00: Team formation and coordination

Saturday, July 1st

Sunday, July 2nd

  • UTC 14:00: Virtual project discussions
  • UTC 19:00: Online ending session
  • Monday UTC 2:00: Submission deadline

Wednesday, July 5th

  • UTC 19:00: International project presentations!

Submission details

You are required to submit:

  • A PDF report using the template linked on the submission page
  • A video of at most 10 minutes presenting your findings and results (see inspiration and instructions for how to do this on the submission page)

You are optionally encouraged to submit:

  • A slide deck describing your project
  • A link to your code
  • Any other material you would like to link

Inspiration

Here are a few inspiring papers, talks, and posts about safety benchmarks. See more starter code, articles, and readings under the "Resources" tab.

CounterFact+ benchmark

Past experiences

See what our great hackathon participants have said
Jason Hoelscher-Obermaier
Interpretability hackathon
The hackathon was a really great way to try out research on AI interpretability and to get in touch with other people working on this. The input, resources, and feedback provided by the team organizers, and in particular by Neel Nanda, were super helpful and very motivating!
Luca De Leo
AI Trends hackathon
I found the hackathon very cool; I think it significantly lowered my hesitance about participating in events like this in the future. A whole bunch of lessons learned, and Jaime and Pablo were very kind and helpful through the whole process.

Alejandro González
Interpretability hackathon
I was not that interested in AI safety and didn't know much about machine learning before, but I heard about this hackathon thanks to a friend, and I don't regret participating! I've learned a ton, and it was a refreshing weekend for me.
Alex Foote
Interpretability hackathon
A great experience! A fun and welcoming event with some really useful resources for starting to do interpretability research. And a lot of interesting projects to explore at the end!
Sam Glendenning
Interpretability hackathon
Was great to hear directly from accomplished AI safety researchers and try investigating some of the questions they thought were high impact.

Intro talk

The collaborators who will join us for this hackathon.

Alexander Pan

Author of the MACHIAVELLI benchmark (Pan et al., 2023)
Keynote speaker

Antonio Valerio Miceli-Barone

Postdoctoral researcher at the University of Edinburgh. Author of Miceli-Barone et al. (2023).
Speaker & judge

Fazl Barez

Research lead at Apart Research
Judge & Organizer

Esben Kran

Director at Apart Research
Organizer

Readings

Read up on the topic before we start! The reading group will work through these materials together up to the kickoff.
Join the reading group

Inverse scaling phenomena for large language models

MACHIAVELLI benchmark

OpenAI Safety Gym

Eval Harness: LLM benchmark tool

MMLU: Measuring language ability

RL Benchmarks list for many tasks

Language Model Evaluation Harness

The LMEH is a set of over 200 tasks that you can automatically run your models through. You can easily use it by running pip install lm-eval before importing the library in your script.

See a short Colab notebook introducing how to use it here.

Check out the GitHub repository and the guide to adding a new benchmark so you can test your own tasks through their easy interface.
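As a rough illustration of the task-plus-metric pattern that evaluation harnesses like the LMEH are built around, here is a self-contained toy sketch. The class, the toy model, and all names below are invented for illustration and are not the lm-eval API.

```python
# Minimal sketch of the task/metric pattern used by evaluation harnesses.
# SimpleTask and toy_model are illustrative stand-ins, not lm-eval's API.

def toy_model(prompt):
    """Stand-in for a language model: answers addition prompts like '1+2'."""
    a, b = [int(x) for x in prompt.split("+")]
    return str(a + b)

class SimpleTask:
    """A task is a set of (input, reference) pairs plus an accuracy metric."""
    def __init__(self, examples):
        self.examples = examples

    def evaluate(self, model):
        # Score the model by exact match against the reference answers.
        correct = sum(model(inp) == ref for inp, ref in self.examples)
        return correct / len(self.examples)

task = SimpleTask([("1+1", "2"), ("2+3", "5"), ("10+4", "14")])
print(f"accuracy: {task.evaluate(toy_model):.2f}")  # accuracy: 1.00
```

A real harness adds a task registry, few-shot prompt construction, and many metrics on top of this same pattern.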

Starter resources

Check out the core starter resources that help you get started with your research as quickly as possible! The Colab links will be updated before the kickoff.
Open In Colab

Running Adversarial Robustness Toolbox (ART) adversarial attacks and defense

Check out the repository here along with a long list of Jupyter Notebooks here. We have converted one of the image attack algorithm examples to Google Colab here.

Using ART, you can create comprehensive tests for adversarial attacks on models and/or test existing ones. Check out the documentation here. It does not seem possible to do textual adversarial attacks with ART, though that would be quite interesting.

For textual attacks, you might use the TextAttack library. It also contains a list of textual adversarial attacks. There are a number of tutorials, the first showing an end-to-end training, evaluation and attack loop (see it here).
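To give intuition for the attack loop that libraries like TextAttack implement, here is a self-contained toy sketch. The keyword classifier and the greedy word-deletion attack are invented stand-ins, not TextAttack's API.

```python
# Toy greedy word-deletion attack illustrating the textual attack loop.
# The classifier and attack below are illustrative, not TextAttack's API.

def toy_classifier(text):
    """Returns a crude P(positive) based on keyword counts."""
    words = text.lower().split()
    pos = sum(w in {"good", "great", "love"} for w in words)
    neg = sum(w in {"bad", "awful", "hate"} for w in words)
    return (1 + pos) / (2 + pos + neg)

def greedy_deletion_attack(text, threshold=0.5):
    """Repeatedly delete the single word whose removal most lowers the
    positive score, until the predicted label flips (or no deletion helps)."""
    words = text.split()
    while toy_classifier(" ".join(words)) >= threshold and len(words) > 1:
        scores = [toy_classifier(" ".join(words[:i] + words[i + 1:]))
                  for i in range(len(words))]
        best = min(range(len(words)), key=lambda i: scores[i])
        if scores[best] >= toy_classifier(" ".join(words)):
            break  # no deletion lowers the score; give up
        del words[best]
    return " ".join(words)

adv = greedy_deletion_attack("great movie but awful pacing")
print(adv, toy_classifier(adv) < 0.5)
```

Real attacks use the same search loop with smarter transformations (synonym swaps, character perturbations) and constraints that keep the perturbed text semantically close to the original.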

Open In Colab

OpenAI Gym starter

You can use the OpenAI Gym to run interesting reinforcement learning agents with your own testing spin on top!

See how to use the Gym environments in this Colab. It does not train an RL agent but we can see how to initialize the game loop and visualize the results. See how to train an offline RL agent using this Colab. Combining the two should be relatively straightforward.

The OpenAI Safety Gym is too advanced for this weekend's work, simply because it is tough to set up, while the standard OpenAI Gym generally works great. Read more about getting started in this article.
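For intuition about the reset/step interface that Gym environments expose, here is a self-contained toy environment and game loop. The environment is invented for illustration and is not registered with Gym.

```python
import random

# A minimal environment with the reset/step interface that OpenAI Gym
# (now Gymnasium) environments expose; CountdownEnv is a toy stand-in.

class CountdownEnv:
    """State starts at 5; action 0 decrements, action 1 resets to 5.
    The episode ends with reward 1 when the state reaches 0."""
    def reset(self):
        self.state = 5
        return self.state

    def step(self, action):
        self.state = 5 if action == 1 else self.state - 1
        done = self.state == 0
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}

# The standard Gym game loop with a random policy.
env = CountdownEnv()
obs, done, steps = env.reset(), False, 0
random.seed(0)
while not done and steps < 100:
    obs, reward, done, info = env.step(random.choice([0, 0, 0, 1]))
    steps += 1
print("episode finished:", done, "in", steps, "steps")
```

Swapping the random `env.step(...)` call for a trained policy's action is exactly the change needed to combine the two Colabs mentioned above.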


Open In Colab

Perform neural network formal verification

This tutorial from AAAI 2022 has two Colab notebooks:

  1. Colab Demo for the auto_LiRPA library: automated computation of neural-network output bounds under input perturbations. See this video for an introduction.
  2. Colab Demo for the α,β-CROWN (alpha-beta-CROWN) library. See this video for an introduction.

These are very useful introductions for thinking about how to design formal tests for various properties of our models, along with useful tools for ensuring the safety of our models against adversarial examples and out-of-distribution scenarios.

See also Certified Adversarial Robustness via Randomized Smoothing.
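As a minimal sketch of the bound-propagation idea behind tools like auto_LiRPA, here is hand-rolled interval bound propagation (IBP) through one linear+ReLU layer. The weights, helper names, and the certification example are invented for illustration and are not the library's API.

```python
# Interval bound propagation (IBP) for one linear+ReLU layer, the basic
# building block behind neural-network verifiers; a hand-rolled sketch.

def linear_interval(lower, upper, W, b):
    """Propagate an input box [lower, upper] through y = Wx + b.
    Positive weights pick up the lower input bound for the output lower
    bound, negative weights the upper bound, and vice versa."""
    out_l, out_u = [], []
    for row, bias in zip(W, b):
        lo = bias + sum(w * (l if w >= 0 else u)
                        for w, l, u in zip(row, lower, upper))
        hi = bias + sum(w * (u if w >= 0 else l)
                        for w, l, u in zip(row, lower, upper))
        out_l.append(lo)
        out_u.append(hi)
    return out_l, out_u

def relu_interval(lower, upper):
    """ReLU is monotone, so it maps bounds elementwise."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

# Certify that for all inputs within eps=0.1 of x0, output 0 stays positive.
W, b = [[1.0, -2.0], [0.5, 1.0]], [0.1, -0.2]
x0, eps = [1.0, 0.2], 0.1
l0 = [x - eps for x in x0]
u0 = [x + eps for x in x0]
l1, u1 = relu_interval(*linear_interval(l0, u0, W, b))
print("certified lower bound on output 0:", l1[0])
```

Verifiers like auto_LiRPA chain this propagation through many layers and tighten the bounds with linear relaxations, but the certificate has the same form: a guaranteed output range for an entire input region.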

Open In Colab

BIG-Bench benchmark for LLMs

Using SeqIO to inspect and evaluate BIG-bench json tasks:

Creating new BIG-bench tasks

Read the paper here and see the GitHub repository here.

Open In Colab

Loading GPT models into Google Colab

We will use the wonderful package EasyTransformer from Neel Nanda that was used heavily at the last hackathon. It contains some helper functions to load pretrained models.

There are all the famous ones, from GPT-2 all the way to Neel Nanda's custom 12-layer SoLU-based Transformer models. See a complete list here along with an example toy model here.

See this Colab notebook to use the EasyTransformer model downloader utility. It also has all the available models there from EleutherAI, OpenAI, Facebook AI Research, Neel Nanda and more.

You can also run this in Paperspace Gradient. See the code on GitHub here and how to integrate GitHub and Paperspace here. See a fun example of using Paperspace Gradient like Google Colab here. Gradient offers a somewhat larger GPU on its free tier.

You can also use the Hugging Face Transformers library directly, like this.

Open In Colab

Inverse scaling notebook

"All alignment problems are inverse scaling problems" is one fascinating take on AI safety. If we can generate benchmarks that showcase the alignment failures of larger models, this becomes very interesting.

See the Colab notebook here. You can also read more about the "benchmarks" that won the first round of the tournament here along with the tournament Github repository here.
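As a toy sketch of what "inverse scaling" means operationally, the check below flags a task whose accuracy decreases as model size grows. The numbers are invented for illustration, not actual results from the prize rounds.

```python
# Toy check for inverse scaling: a task "inverse-scales" when accuracy
# strictly decreases as the model's parameter count grows.

def is_inverse_scaling(size_to_accuracy):
    """True if accuracy strictly decreases with increasing model size."""
    points = sorted(size_to_accuracy.items())  # sort by parameter count
    accs = [acc for _, acc in points]
    return all(a > b for a, b in zip(accs, accs[1:]))

# Made-up accuracies for three model sizes (1e8, 1e9, 1e10 parameters).
normal_task = {1e8: 0.42, 1e9: 0.55, 1e10: 0.71}
suspect_task = {1e8: 0.63, 1e9: 0.51, 1e10: 0.44}
print(is_inverse_scaling(normal_task), is_inverse_scaling(suspect_task))
# prints: False True
```

A submission to an inverse-scaling benchmark is essentially a dataset on which this trend holds across a model family, plus an argument for why the trend reflects a genuine alignment failure rather than noise.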

Open In Colab

Making GriddlyJS run

This Colab notebook gives a short overview of how to use the Griddly library in conjunction with the OpenAI Gym.

Jump on the Griddly.ai website to create an environment and load it into the Colab notebook. Their documentation explains what it all means in much more detail.

Open In Colab

Notebooks by Hugging Face

These notebooks all pertain to the usage of Transformers and show how to use the library. See them all here.

Research project ideas

Get inspired for your own projects with these ideas developed during the reading groups! Go to the Resources tab to engage more with the topic.

  • Extending the MACHIAVELLI benchmark's metrics
  • Implementing and contributing to the LLM evals multi-metric benchmark
  • Extending the results of existing safety benchmarks to newer models such as the RWKV architecture
  • Automated red teaming as automated verification
  • Compare and document the robustness differences between the 14B RNN and the Pythia models
  • Annotate The Pile with GPT-4 to separate it into levels of bias, risk, etc., and release the result as new training sets
  • Differences in multimodal and text-only model robustness and red teaming using ImageBind
  • Cooperative AI and inter-agent verifiable safety: how do we verify multi-agent safety?
  • Create ways to fuzz language models and other ML systems by automatically running large numbers of generated inputs
  • Create a benchmark of multimodal tasks for LLMs to complete
  • Generate a moral dataset with GPT-4 that references and updates the moral scenarios dataset
  • Stress test MACHIAVELLI (Pan et al., 2023) and develop new ways to work with it
  • Generate a framework of tests to solve
  • Create cybersecurity benchmarks for LLMs to protect against
Registered jam sites

Online Safety Benchmarks Hackathon
Join us online during the whole weekend on the Alignment Jam Discord server! You can participate from anywhere, as long as you have an internet connection.
Copenhagen Safety Benchmarks Hackathon
Be a part of meaningful research with a community of excited AI safety researchers at Station (Howitzvej 4).
Visit event page
EnigmA Copenhagen
Safety Benchmarks Hackathon
Join us in Fixed Point in Prague - Vinohrady, Koperníkova 6 for a weekend research sprint in Safety Benchmarks!
Visit event page
Prague Fixed Point

Register your own site

The in-person hubs for the Alignment Jams are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research and engineering community. Read more about organizing and use the media below to set up your event.

Social media for your jam site

Event cover image
Social media message

We're hosting a hackathon to find the best benchmarks for safety in large language models! 

Large models are becoming increasingly important and we want to make sure that we understand the safety of these systems.

During this weekend, we get a chance to create impactful research on this problem together with people across the world.

Don't miss this opportunity to explore machine learning deeper, network, and challenge yourself!

Register now: https://alignmentjam.com/jam/benchmarks

[or add your event link here]

Event cover image
Social media 1

Submit your project

Use this template for the report submission. As you create your project presentations, upload your slides here, too. Make a recording of your slideshow or project with the recording capability of e.g. Keynote, PowerPoint, or Google Slides (using Vimeo).


Follow or watch the project lightning talks

You can also join the Discord to participate in the question-and-answer sessions associated with each talk, Wednesday evening at 21:00 CEST (12:00 PM PDT).

Respond to our feedback survey and receive a free book!
Exploring the Robustness of Model-Graded Evaluations of Language Models
There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models. We test the robustness of these model-graded evaluations to prompt injections on different datasets including a new Deception Eval. The prompt injections are designed to influence the evaluator to change their grading. We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model. We find significant susceptibility to these injections in state-of-the-art commercial models on all examined evaluations. The results inspire future work and should caution against unqualified trust in evaluations.
Simon Lermen, Ondřej Kvapil
BruteForceBaryonyx
From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety
In this paper, we extend the MACHIAVELLI framework by incorporating sensitivity to event density, thereby enhancing the benchmark's ability to discern diverse value systems among models. This enhancement enables the identification of potential malicious actors who are prone to engaging in a rapid succession of harmful actions, distinguishing them from well-intentioned actors.
Heramb Podar, Vladislav Bargatin
Turing's Baristas
MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark
MACHIAVELLI is an AI safety benchmark that uses text-based choose-your-own-adventure games to measure the tendency of AI agents to behave unethically in the pursuit of their goals. We discuss what we see as two crucial assumptions behind the MACHIAVELLI benchmark and how these assumptions impact its validity as a test of the ethical behavior of AI agents deployed in the real world. The assumptions we investigate are (1) the equivalence of action evaluation and action generation, and (2) the independence of ethical judgments from agent capabilities. We then propose modifications to the MACHIAVELLI benchmark to empirically study to what extent these assumptions hold for AI agents in the real world.
Roman Leventov, Jason Hoelscher-Obermaier
MAXIAVELLI
Exploitation of LLM’s to Elicit Misaligned Outputs
This paper primarily focuses on an automated approach a bad actor might pursue to exploit LLMs via intelligent prompt engineering, combined with the use of dual agents, to produce harmful code and improve it. We also use step-by-step questioning instead of a single prompt to make sure the LLMs give harmful outputs instead of refusing. Finally, we observe that GPT-4, which according to empirical evidence and existing literature is more resilient to harmful inputs and outputs, can produce more harmful outputs; we call this Inverse Scaling Harm.
Desik Mandava, Jayanth Santosh, Aishwarya Gurung
DeJaWa
Identifying undesirable conduct when interacting with individuals with psychiatric conditions
This study evaluates the interactions of the gpt3.5-turbo-0613 model with individuals with psychiatric conditions, using posts from the r/schizophrenia subreddit. Responses were assessed based on ethical guidelines for psychotherapists, covering responsibility, integrity, justice, and respect. The results show the model generally handles sensitive interactions safely, but more research is needed to fully understand its limits and potential vulnerabilities in unique situations.
Jan Provaznik, Jakub Stejskal, Hana Kalivodová
Prague is Mental
Manipulative Expression Recognition (MER) and LLM Manipulativeness Benchmark
A software library where people can analyse a transcript of a conversation or a single message. The library annotates relevant parts of the text with labels of the different manipulative communication styles detected in the conversation or message. One of the main use cases is evaluating the presence of manipulation in responses or conversations generated by large language models. The other main use case is evaluating human-created conversations and responses. The software does not do fact checking; it focuses on labelling the psychological style of the expressions present in the input text.
Roland Pihlakas
Detect/annotate manipulative communication styles using a provided list of labels
AI & Cyberdefense
[unfinished] While hosting the hackathon, I had a few hours to explore safety benchmarks in relation to cyberdefense and mechanistic interpretability. I present a few project ideas and research paths that might be interesting at the intersection of existential AI safety and cybersecurity.
Esben Kran
The Defenders

Send in pictures of you having fun hacking away!

We love to see the community flourish and it's always great to see any pictures you're willing to share uploaded here.

Office hours on Saturday with Ollie!
Saturday hacking away
3 AM at the Friday aftermath!
The Copenhagen office during the intro keynote talk
[admin] preliminary system test before the weekend starts
Q&A from the great first talk with Catalin!
Prague team hard at work :)
Q&A with Neel Nanda
Discussing how Transformer models traverse graphs!
We found an elephant on our break (a rare sight in Denmark)
Benchmark hackathon at the Turing Coffemachine!
A beautiful test image