Join us for the fifth Alignment Jam, where we spend 48 hours of intense research on how we can measure and monitor the safety of large-scale machine learning models. Work on safety benchmarks, models detecting faults in other models, self-monitoring systems, and much else!
To make sure large machine learning models do what we want them to do, we need people monitoring their safety. But it is very hard for just one person to monitor all the outputs of ChatGPT...
The objective of this hackathon is to research scalable solutions to this problem!
These are all very interesting questions that we're excited to see your answers to during these 48 hours!
Dive deeper:
Use this API key for OpenAI API access: 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
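If you haven't called the API from Python before, a minimal sketch looks something like the following (assuming the pre-1.0 openai package; the model name is just an illustrative choice):

```python
# Minimal sketch of calling the OpenAI API with the shared key above.
# Assumes the pre-1.0 `openai` Python package; the model name is an
# illustrative choice, swap in whatever your project needs.
import openai

openai.api_key = "PASTE_THE_SHARED_KEY_HERE"

response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Name three failure modes a safety benchmark should probe:",
    max_tokens=100,
)
print(response["choices"][0]["text"])
```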
You probably want to view this website on a computer or laptop.
See how to upload your project to the hackathon page here, and copy the PDF report template here.
We will use the wonderful EasyTransformer package from Neel Nanda, which was used heavily at the last hackathon. It contains helper functions to load pretrained models.
It covers everything from famous models like GPT-2 all the way to Neel Nanda's custom 12-layer SoLU-based Transformer models. See a complete list here along with an example toy model here.
See this Colab notebook to use the EasyTransformer model downloader utility. It also lists all the available models from EleutherAI, OpenAI, Facebook AI Research, Neel Nanda, and more.
You can also run this in Paperspace Gradient. See the code on GitHub here and how to integrate GitHub and Paperspace here. See a fun example of using Paperspace Gradient like Google Colab here. Gradient offers a slightly larger GPU on its free tier.
You can also use the Hugging Face Transformers library directly, like this.
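For a quick start, a minimal sketch of loading a model with EasyTransformer and running it on a prompt could look like this (method names follow the package's README; double-check the repository if they have changed):

```python
# Minimal sketch: load GPT-2 via EasyTransformer and inspect its next-token
# prediction. Method names follow the EasyTransformer README.
from easy_transformer import EasyTransformer

model = EasyTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The safety benchmark found that the model")
logits = model(tokens)                         # shape: (batch, seq, d_vocab)
next_token = logits[0, -1].argmax().item()
print(model.tokenizer.decode(next_token))      # most likely next token
```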
"All alignment problems are inverse scaling problems" is one fascinating take on AI safety. If we generate benchmarks that showcase the alignment failures of larger models, this can become very interesting.
See the Colab notebook here. You can also read more about the "benchmarks" that won the first round of the tournament here along with the tournament Github repository here.
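To make the core measurement concrete, here is a hedged sketch of comparing how strongly a small and a slightly larger model prefer a given continuation on the same prompt; a systematic inverse-scaling benchmark would do this over many examples and model sizes. The prompt and answer below are illustrative placeholders, not items from the tournament.

```python
# Hedged sketch: score one (prompt, answer) pair under two model sizes and
# compare. Inverse scaling shows up when larger models assign *more*
# probability to the undesired answer. Prompt/answer are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def answer_logprob(model, tokenizer, prompt, answer):
    """Total log-probability the model assigns to `answer` given `prompt`."""
    ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    answer_ids = ids[0, prompt_len:]
    return logprobs[0, prompt_len - 1:-1].gather(1, answer_ids[:, None]).sum().item()

for name in ["gpt2", "gpt2-medium"]:           # small vs. larger model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    lp = answer_logprob(model, tok, "Q: Is it okay to deceive users? A:", " Yes")
    print(name, lp)
```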
Check out the repository here along with a long list of Jupyter Notebooks here. We have converted one of the image attack algorithm examples to Google Colab here.
Using ART, you can create comprehensive tests for adversarial attacks on models and/or test existing ones. Check out the documentation here. It does not seem possible to do textual adversarial attacks with ART, though that would be quite interesting.
For textual attacks, you might use the TextAttack library. It also contains a list of textual adversarial attacks. There are a number of tutorials, the first showing an end-to-end training, evaluation and attack loop (see it here).
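As a hedged sketch of what a TextAttack run looks like, the snippet below builds the TextFooler recipe against a sentiment classifier from the Hugging Face hub (the model and dataset names are assumptions; check the TextAttack docs for the current API):

```python
# Hedged sketch: run the TextFooler attack recipe on a sentiment model.
# Model/dataset names are assumptions; consult the TextAttack docs for the
# exact, current API.
import textattack
import transformers

name = "textattack/bert-base-uncased-rotten-tomatoes"
model = transformers.AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = transformers.AutoTokenizer.from_pretrained(name)
model_wrapper = textattack.models.wrappers.HuggingFaceModelWrapper(model, tokenizer)

attack = textattack.attack_recipes.TextFoolerJin2019.build(model_wrapper)
dataset = textattack.datasets.HuggingFaceDataset("rotten_tomatoes", split="test")
attacker = textattack.Attacker(attack, dataset,
                               textattack.AttackArgs(num_examples=10))
attacker.attack_dataset()
```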
The Language Model Evaluation Harness (LMEH) is a set of over 200 tasks that you can automatically run your models through. You can easily use it by running pip install lm-eval at the top of your script.
See a Colab notebook that briefly introduces how to use it here.
Check out the GitHub repository and the guide to adding a new benchmark so you can test your own tasks using their easy interface.
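As a rough sketch, running a couple of harness tasks on GPT-2 from Python might look like this (argument names can differ between harness versions, so treat this as an assumption and check the repository's README):

```python
# Hedged sketch: evaluate GPT-2 on a couple of harness tasks.
# Argument names may differ between lm-eval versions; see the README.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                   # Hugging Face causal LM backend
    model_args="pretrained=gpt2",
    tasks=["lambada_openai", "hellaswag"],
    num_fewshot=0,
)
print(results["results"])
```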
You can use the OpenAI Gym to run interesting reinforcement learning agents with your own testing ideas layered on top!
See how to use the Gym environments in this Colab. It does not train an RL agent, but it shows how to initialize the game loop and visualize the results. See how to train an offline RL agent using this Colab. Combining the two should be relatively straightforward.
The OpenAI Safety Gym is probably too advanced for this weekend's work, simply because it is tough to set up, while the plain OpenAI Gym generally works great. Read more about getting started in this article.
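For reference, a bare-bones random-agent game loop looks roughly like this (the reset/step signatures below assume gym >= 0.26; older versions return slightly different tuples):

```python
# Hedged sketch: random-agent loop in a Gym environment.
# Assumes gym >= 0.26 (reset returns (obs, info), step returns five values);
# older gym versions return different tuples.
import gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # replace with your policy under test
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print("Episode return:", total_reward)
```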
This tutorial from AAAI 2022 has two Colab notebooks:
These are very useful introductions to designing formal tests for various properties of our models, along with useful tools for checking the safety of our models against adversarial examples and out-of-distribution scenarios.
See also Certified Adversarial Robustness via Randomized Smoothing.
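To give a flavour of the randomized-smoothing idea, here is a minimal sketch of the prediction step: classify many Gaussian-perturbed copies of an input and return the majority vote (the base_classifier is a hypothetical PyTorch model; the certified-radius computation from the paper is omitted):

```python
# Hedged sketch of randomized smoothing's prediction step: majority vote over
# Gaussian-perturbed copies of the input. `base_classifier` is a hypothetical
# PyTorch model returning class logits; certification is omitted.
import torch

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=100):
    """Majority-vote class over n_samples noisy copies of x."""
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n_samples, *x.shape)
        votes = base_classifier(noisy).argmax(dim=-1)
    return torch.mode(votes).values.item()
```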
Using SeqIO to inspect and evaluate BIG-bench json tasks:
Creating new BIG-bench tasks
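For orientation, a minimal JSON task written out from Python might look like the sketch below; the field names follow the BIG-bench documentation for simple JSON tasks, but the content is purely illustrative, so check the official task guide before submitting anything.

```python
# Hedged sketch of a minimal BIG-bench style JSON task, written from Python.
# Field names follow the BIG-bench docs for simple JSON tasks; the example
# content is illustrative only.
import json

task = {
    "name": "toy_negation_task",
    "description": "Check whether the model handles simple negation.",
    "keywords": ["logical reasoning", "zero-shot"],
    "metrics": ["exact_str_match"],
    "examples": [
        {"input": "Is the statement 'ice is not cold' true? Answer yes or no.",
         "target": "no"},
        {"input": "Is the statement 'fire is hot' true? Answer yes or no.",
         "target": "yes"},
    ],
}

with open("task.json", "w") as f:
    json.dump(task, f, indent=2)
```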
This Colab notebook gives a short overview of how to use the Griddly library in conjunction with the OpenAI Gym.
Jump on the Griddly.ai website to create an environment and load it into the Colab notebook. There is much more detail about what it all means in their documentation.
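As a quick orientation, registering one of Griddly's bundled games as a Gym environment looks roughly like this (the YAML path and observer type follow the Griddly docs; the step/reset signatures depend on your installed gym version):

```python
# Hedged sketch: build a Gym environment from a bundled Griddly YAML game.
# YAML path and observer type follow the Griddly docs; step/reset return
# values depend on the installed gym version (4-tuple shown here).
import gym
from griddly import GymWrapperFactory, gd

wrapper = GymWrapperFactory()
wrapper.build_gym_from_yaml(
    "Sokoban", "Single-Player/GVGAI/sokoban.yaml",
    player_observer_type=gd.ObserverType.VECTOR)

env = gym.make("GDY-Sokoban-v0")
obs = env.reset()
for _ in range(10):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
```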
These notebooks all pertain to the usage of transformers and show how to use their library. See them all here. Some notable notebooks include:
Adaptive Testing: Adaptive Testing and Debugging of NLP Models and Polyjuice: Generating Counterfactuals for Explaining, Evaluating, and Improving Models
CheckList, a dataset to test models: Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
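As a hedged taste of the CheckList workflow, the sketch below uses its Editor to template out behavioural test cases (the template syntax follows the CheckList README; the sentences are illustrative):

```python
# Hedged sketch: generate behavioural test cases with CheckList's Editor.
# Template syntax follows the CheckList README; content is illustrative.
from checklist.editor import Editor

editor = Editor()
tests = editor.template(
    "The service at {restaurant} was {adj}.",
    restaurant=["the diner", "Luigi's", "the cafe"],
    adj=["great", "terrible", "fine"],
)
print(tests.data[:5])   # a few generated sentences to feed to your model
```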
Federated learning is not private. Check out the blog post as well.
EU AI Act: Articles 9-15, act 15 + analyses
France: Villani report
LMentry: A Language Model Benchmark of Elementary Language Tasks. They take 25 tasks that are trivial for humans and formalize them into textual-understanding tests for LLMs.
[1 hour] Recent progress in verifying neural networks.
[10 minutes] Detection of Trojan Neural Networks
[16 minutes] OpenAI Safety Gym: Exploring safe exploration
[7 minutes] Gridworlds in AI safety, performance and reward functions
[10 minutes] Center for AI Safety's intro to Trojan neural networks
[20 minutes] Center for AI Safety's detecting emergent behaviour
[16 minutes] Center for AI Safety's intro to honest models and TruthfulQA