This hackathon ran from August 18th to August 20th 2023.
Welcome to the research hackathon to devise methods for evaluating the risks of deployed language models and AI. With the societal-scale risks associated with creating new types of intelligence, we need to understand and control the capabilities of such models.
The work we expect to come out of this hackathon will be related to new ways to audit, monitor, red-team, and evaluate language models. See inspiration for resources and publication venues further down and sign up below to receive updates.
See the keynote logistics slides here and participate in the live keynote on our platform here.
There are no requirements for you to join but we recommend that you read up on the topic in the Inspiration and resources section further down. This topic is in reality quite open but the research field is mostly computer science and having a background in programming and machine learning definitely helps. We're excited to see you!
Join us in this iteration of the Alignment Jam research hackathons to spend 48 hour with fellow engaged researchers and engineers in machine learning on engaging in this exciting and fast-moving field! Join the Discord where all communication will happen.
Recently, there is significantly more focus on evaluating the dangerous capabilities of large models. Here, you will see a short review of works within the field including a couple previous Alignment Jam projects:
Future directions: 1) Code generation dataset for DDoS does not exist, 2) critical threat scenarios red-teaming, 3) generally more collaboration, 4) evasiveness and helpfulness tradeoff, 5) explore the pareto front for red teaming.
Trojan Detection Track: Given an LLM containing 1000 trojans and a list of target strings for these trojans, identify the corresponding trigger strings that cause the LLM to generate the target strings. For more information, see here.
Red Teaming Track: Given an LLM and a list of undesirable behaviors, design an automated method to generate test cases that elicit these behaviors. For more information, see here.
"Constant visibility into your evolving attack surface is critical when it comes to protecting your organization and testing your defenses. Automated attack surface monitoring provides constant visibility into your organization’s vulnerabilities, weaknesses, data leaks, and the misconfigurations that emerge in your external attack surface."
Some interesting concepts are attack surface management (e.g. which areas are LLM systems weak in), threat hunting, attack playbooks, prioritized risks (e.g. functionality over protection?), and attack validation (e.g. by naturalistic settings).
Which direction do you want to take your project after you are done with it? We have scouted multiple venues that might be interesting to think about as the next steps:
You will participate in teams of 1-5 people and submit a project on the entry submission page. Each project is submitted with: The PDF report and your title, summary, and description. There will be a team-making event right after the keynote for anyone who is missing a team.
You are allowed to think about your project before the hackathon starts but your core research work should happen in the duration of the hackathon.
The evaluation reports will of course be evaluated as well! We will use multiple criteria:
Use this template for the report submission. As you create your project presentations, upload your slides here, too. Make a recording of your slideshow or project with the recording capability of e.g. Keynote, Powerpoint, and Slides (using Vimeo).
Check out the lightning talks from the accepted submissions to the LLM Evaluations Hackathon below. Exciting topics such as language models' capability to conduct cybercrime, how self-aware language models are, the shutdown problem, and the alignment of language models across languages. Special thanks goes to Jan Brauner for the exciting introductory talk.