
ML Verifiability Hackathon

Signups
Rishi Prabhudesai
Brian Muhia
Arshit Mankodi
Jonas Hallgren
James Campbell
Fatimah M
Tiger Du
Eva Černíková
Hynek Kydlíček
Andreas Madsen
Madhusudhan Pathak
Lukas B
Srikar Babu Gadipudi
Alena Moravová
Tim Sankara
Jonathan Grant
Peter Hozák
Michelle Viotti
Alexandr Kazda
Alex Roman
Zoe Tzifa-Kratira
mhammad singer
Filip Nilsson
Khris Pham
David Quarel
janko
Rodrigo Sierra
Jeremias Ferrao
Simon Lermen
Matthew Ewer
Thomas Broadley
Mansur Nurmukhambetov
Davide Zani
Darren Rawlings
Chenhe Gu
Erin
Goncalo Santos Paulo
Vlado
Frank A Wallace
Aaron Graifman
František Navrkal
Kelvin Franke
Ian Schechtman
Alexandru Dimofte
Markus Zhang
Alexandre Duplessis
Albert Garde

Entries
Towards Formally Describing Program Traces from Chains of Language Model Calls with Causal Influence Diagrams: A Sketch
Fuzzing Large Language Models
It Ain't Much but it's ONNX Work

This hackathon ran from May 26th to May 28th 2023. You can now judge entries.

Join us for this month's Alignment Jam to investigate how we can both formally and informally verify the safety of machine learning systems!

Join Nisan Stiennon, rick goldstein, Haris Gusic, and others!
Spend a weekend of intense and focused research work towards validating the safety of neural networks in various domains (e.g. language) using adversarial attacks, defenses, and other ML safety research methods. Re-watch the intro talk here.

Schedule & logistics

Here, you can see the calendar and schedule. All times are in UTC+1 (UK Summer Time). You can subscribe to the calendar to see the event timings in your time zone.

Friday | 26 May

18:00-19:00
[CHANGED] Introduction talk: Introduction to the topic and the logistics of the hackathon itself (originally planned as a keynote with an expert in AI safety along with logistical information for your participation)
19:30-20:00
20:00
Hacking away: Hack away on the important problems in safety verification of neural networks! Use the starter resources and the project ideas on this page

Saturday | 27 May

Morning
Hacking away: Continue your research projects
17:00-18:00
Project discussion with Lauro Langosco: Lauro will join us in the Discord server to chat with you about your projects

Sunday | 28 May

Morning
Hacking away: Continue your research projects
12:00-13:00
[CANCELED DUE TO ILLNESS] Keynote talk: Talk by Joar Skalse
Evening
Local presentations of your projects: Present your projects to each other at the local jam sites. These presentations can be 5-15 minutes long and can be reused for Wednesday

Wednesday | 31 May

19:00-20:30
Project presentations: Hosted on the Alignment Jam Discord server

Monday | 5 June

Judge feedback is in! All projects will have received feedback from the judges or the Alignment Jam team and projects may be continued into the publication program

Past experiences

See what our great hackathon participants have said
Jason Hoelscher-Obermaier
Interpretability hackathon
The hackathon was a really great way to try out research on AI interpretability and to get in touch with other people working on this. The input, resources and feedback provided by the team organizers and in particular by Neel Nanda were super helpful and very motivating!
Luca De Leo
AI Trends hackathon
I found the hackathon very cool; I think it lowered my hesitance in participating in stuff like this in the future significantly. A whole bunch of lessons learned, and Jaime and Pablo were very kind and helpful through the whole process.

Alejandro González
Interpretability hackathon
I was not that interested in AI safety and didn't know that much about machine learning before, but I heard about this hackathon thanks to a friend, and I don't regret participating! I've learned a ton, and it was a refreshing weekend for me.
Alex Foote
Interpretability hackathon
A great experience! A fun and welcoming event with some really useful resources for starting to do interpretability research. And a lot of interesting projects to explore at the end!
Sam Glendenning
Interpretability hackathon
Was great to hear directly from accomplished AI safety researchers and try investigating some of the questions they thought were high impact.

Intro talk

The collaborators who will join us for this hackathon.

Joar Skalse

PhD researcher at the Krueger Lab in Cambridge
[Canceled] Speaker & Judge

Lauro Langosco

PhD student at the Krueger Lab in Cambridge
Project discussion host

Fazl Barez

Research lead at Apart Research
Judge

Alexander Briand

Researcher at Arb Research
Judge

More judges will join us to provide feedback.

Readings

Read up on the topic before we start! The reading group will work through these materials together up to the kickoff.
Join the reading group

Verifiable robustness introduction (RTAI L1)

Adversarial attacks (RTAI L2)

Jailbreakchat.com: Explore how people break ChatGPT

Starter resources

Check out the core starter resources that help you get started with your research as quickly as possible! They will be shared before the kickoff.

Tutorial for α,β-CROWN Robustness Verification

α,β-CROWN won the competition for robustness verification (VNN-COMP'22). Also see this other tutorial.

The basic idea behind α,β-CROWN is to use efficient bound propagation for verification tasks, based on Automatic Linear Relaxation based Perturbation Analysis for Neural Networks (LiRPA). The code of LiRPA can be found on GitHub.
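
As a toy illustration of the bound propagation idea (a minimal sketch, not α,β-CROWN itself, which uses much tighter linear relaxations), here is interval propagation through a single linear + ReLU layer; the weights, input, and epsilon below are arbitrary placeholders:

```python
# Interval bound propagation through one linear + ReLU layer (toy example).
import torch

def interval_linear(W, b, lower, upper):
    """Propagate elementwise input bounds [lower, upper] through x @ W.T + b."""
    W_pos = W.clamp(min=0)  # positive weights: lower maps to lower, upper to upper
    W_neg = W.clamp(max=0)  # negative weights: the roles swap
    new_lower = lower @ W_pos.T + upper @ W_neg.T + b
    new_upper = upper @ W_pos.T + lower @ W_neg.T + b
    return new_lower, new_upper

W, b = torch.randn(3, 4), torch.zeros(3)
x, eps = torch.randn(1, 4), 0.1
lb, ub = interval_linear(W, b, x - eps, x + eps)
lb, ub = lb.clamp(min=0), ub.clamp(min=0)  # ReLU is monotone, so it maps bounds to bounds
print(lb, ub)  # every input within the eps-ball produces outputs inside [lb, ub]
```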

Adversarial attacks against large language models

The TextAttack library 🐙 is a set of tools for creating adversarial examples for large language models. See the documentation.
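
A rough sketch of what running a standard TextAttack recipe looks like; the victim model, dataset, and number of examples below are placeholders, and the exact arguments may differ between library versions:

```python
# Run an off-the-shelf TextAttack recipe against a Hugging Face sequence classifier.
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

model_name = "textattack/bert-base-uncased-imdb"  # placeholder victim model
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name)
wrapper = HuggingFaceModelWrapper(model, tokenizer)

attack = TextFoolerJin2019.build(wrapper)           # word-substitution attack recipe
dataset = HuggingFaceDataset("imdb", split="test")  # examples to perturb
attacker = Attacker(attack, dataset, AttackArgs(num_examples=10))
results = attacker.attack_dataset()                 # reports successful adversarial examples
```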

Maybe you wish to use it to compete in the HackAPrompt competition as well, which has extended its deadline until the 4th of June!

Watch a WANDB talk about the library.

Auto-LiRPA tutorial

LiRPA is an important tool in robustness verification and certified adversarial defense, and this tutorial takes you through the basics.

See the documentation and their introductory video to the tool.
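
A minimal sketch of the core auto_LiRPA workflow, roughly following its documentation; the toy network and the 0.03 perturbation radius are placeholder choices, and argument names may differ slightly between versions:

```python
# Compute certified output bounds for a small network under an L-infinity perturbation.
import torch
import torch.nn as nn
from auto_LiRPA import BoundedModule, BoundedTensor
from auto_LiRPA.perturbations import PerturbationLpNorm

net = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.rand(1, 784)

bounded_model = BoundedModule(net, torch.empty_like(x))  # trace the computation graph
ptb = PerturbationLpNorm(norm=float("inf"), eps=0.03)    # eps-ball around the input
bounded_x = BoundedTensor(x, ptb)

# Lower/upper bounds on every logit, valid for all inputs inside the eps-ball.
lb, ub = bounded_model.compute_bounds(x=(bounded_x,), method="CROWN")
print(lb, ub)
```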

Running TransformerLens to easily analyze activations in language models

This demo notebook goes into depth on how to use the TransformerLens library. TransformerLens is very useful for mechanistic investigations into Transformers and for understanding activations. Core features:

  1. Loading and running models
  2. Saving activations from a specific example run
  3. Using the unique Hooks functionality to intervene on and access activations

Read more on the Github page and see the Python package on PyPi.
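
A minimal sketch of those three features using the TransformerLens API; the GPT-2 model, layer index, and head-ablation hook below are arbitrary choices for illustration:

```python
# Load a model, cache activations, and intervene on them with a hook.
from transformer_lens import HookedTransformer

# 1. Loading and running a model.
model = HookedTransformer.from_pretrained("gpt2")
prompt = "Robustness verification of neural networks is"
logits = model(prompt)

# 2. Saving activations from a specific run.
logits, cache = model.run_with_cache(prompt)
resid = cache["resid_post", 5]  # residual stream after block 5

# 3. Intervening on activations via hooks (here: zero out attention head 0 in block 5).
def ablate_head_0(value, hook):
    value[:, :, 0, :] = 0.0  # value has shape [batch, pos, head_index, d_head]
    return value

patched_logits = model.run_with_hooks(
    prompt,
    fwd_hooks=[("blocks.5.attn.hook_z", ablate_head_0)],
)
```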

System 3 continuation

The System 3 paper by our research lead Fazl Barez presents an approach for incorporating symbolic logic for safety in natural environments into neural networks. This Colab notebook provides an experimental setting; we recommend you read the paper first.

Research project ideas

Get inspired for your own projects with these ideas developed during the reading groups! Go to the Resources tab to engage more with the topic.

💡 Formalizing textual output-input relations using Transformer mathematics for automatic proof verification

💡 Defining pointwise / local robustness for language models

💡 Automated red teaming as automated verification

💡 Compare and document the robustness differences between a 14B RNN and the Pythia models

💡 Review the HackAPrompt submissions and winners

💡 Formalizing safety verification for language models compared to computer vision nomenclature

💡 Annotate The Pile with GPT-4 to separate it into levels of bias, risk, etc., and release it as new training sets

💡 Define discretization of LLM output, e.g. by automatically classifying output into semantic categories and doing targeted attacks on those

💡 Differences in multimodal and text-only model robustness and red teaming using ImageBind

💡 Robustness of different feature vectors: Identify directions that are more robust than others

💡 Review computer vision attacks and equate or relate them to textual attacks and defenses

💡 Extend brain-inspired modular training to language models in one way or another

💡 Symbolic reasoning module safety verification: LLMs will interface with many apps; identify the risky interfaces and protect them in various ways

💡 Cooperative AI and inter-agent verifiable safety: how do we verify multi-agent safety?

💡 Design ways LLM interaction (e.g. ChatGPT) can be overseen by regulatory agencies

💡 Where in real-world use of large language models (LLMs) will we see jailbreaking be a serious issue? (example)

💡 Define targeted and untargeted attacks within the framework of LLMs

💡 Identify the white box and black box methods available to tamper with LLMs

💡 Create ways to fuzz language models and other ML systems by automatically running a large number of generated inputs (see the sketch after this list)

💡 The deadline for HackAPrompt has been extended! You can try to win that challenge this weekend.
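
For the fuzzing idea above, here is a minimal sketch of what prompt-level fuzzing could look like; mutate, fuzz, model_call, and is_unsafe are hypothetical names, and the safety check is a stand-in for whatever classifier or rule set you actually use:

```python
# Generate many perturbed prompts and flag responses that trip a simple safety check.
import random
import string

def mutate(prompt: str, n_edits: int = 3) -> str:
    """Randomly insert, delete, or swap characters to produce a fuzzed prompt."""
    chars = list(prompt)  # assumes a non-empty seed prompt
    for _ in range(n_edits):
        op = random.choice(["insert", "delete", "swap"])
        i = random.randrange(len(chars))
        if op == "insert":
            chars.insert(i, random.choice(string.printable))
        elif op == "delete" and len(chars) > 1:
            chars.pop(i)
        elif op == "swap" and i + 1 < len(chars):
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def fuzz(model_call, seed_prompts, n_trials=100, is_unsafe=lambda text: "BANNED" in text):
    """model_call: any function mapping a prompt string to a response string (hypothetical)."""
    failures = []
    for _ in range(n_trials):
        fuzzed = mutate(random.choice(seed_prompts))
        response = model_call(fuzzed)
        if is_unsafe(response):  # replace with your own safety classifier or rules
            failures.append((fuzzed, response))
    return failures
```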

Registered jam sites

London ML Verifiability Hackathon
The London MATS offices will be transformed into a hackathon space for the weekend!
Visit event page
SERI MATS Offices
Groningen ML Verifiability & Reliability Hackathon
Join us for our first AI Safety Hackathon to meet talented people from all over Groningen who are interested in doing impactful research. Location TBA.
Eindhoven AI Safety Team Verifiability Hackathon
Join us in Neuron 1.122 on the TU/e campus for our first AI Safety Hackathon!
Prague ML Reliability Hackathon
Join us in Fixed Point in Prague - Vinohrady, Koperníkova 6 for a weekend ML reliability research sprint!
Visit event page
Prague Fixed Point
Online safety verification jam
Join the community online both on Discord and on GatherTown for a collaborative hackathon experience!
Copenhagen ML Reliability Hackathon
Join us in the new offices in Copenhagen for the ML reliability verification research sprint!
Visit event page
Copenhagen EA Offices

Register your own site

The in-person hubs for the Alignment Jams are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research and engineering community. Read more about organizing and use the media below to set up your event.

Social media for your jam site

Event cover image
Social media 1
Social media 2
Social media 3
Social media 4
Social media message

Join us to hack away on research into ML robustness verification & reliability! 

Robust generalization of machine learning systems is becoming more and more important as neural networks are applied to safety-critical domains.

With Alignment Jams, we get a chance to create impactful and real research on verifiable safety of these networks.

You will compete with participants from across the globe and get a great chance to review each other's projects as well!

Don't miss this opportunity to network, think deeply, and challenge yourself!

Register now: https://alignmentjam.com/jam/verification

[or add your event link here]


Submit your project

Use this template for the report submission. As you create your project presentations, upload your slides here too. We recommend you also make a recording of your slideshow with the recording capability of e.g. Keynote, PowerPoint, or Slides (using Vimeo).

It Ain't Much but it's ONNX Work
ONNXplorer is a VR project in Unity that loads a machine learning model in the open ONNX format and displays it for examination and analysis. The project is open source: https://github.com/onnxplorer/ONNXplorer and was built for the ML Verifiability Jam, Apart Research Alignment Jam #5 (Scale Oversight), 2023.
Matthew Ewer, Giles Edkins
ONNXplorer
Fuzzing Large Language Models
We used fuzzing to test AI models for unexpected responses and security risks. Our findings stress the need for better fuzzing methods to improve AI safety.
Esben Kran
Fuzz Bizz
Towards Formally Describing Program Traces from Chains of Language Model Calls with Causal Influence Diagrams: A Sketch
In this short report, we describe some tools and notation we have used to specify agent architectures for the factored cognition setting, and we outline our plan to integrate this tooling for more research into LLM chain behaviour.
Brian Muhia
Fahamu Inc