
LLM Evals Hackathon

Alexander Meinke
Zaina Shaik
Martin
Andreas
Lucie Philippon
Sambita
Andrew Feldman
Quentin Feuillade Montixi
Matthew Lutz
Tassilo Neubauer
Haris Gusic
Jonas Kgomo
Mark Hinkle
Rudolf
Amir Ali Abdullah
Jakub
Nickolai Leschov
Mateusz Bagiński
Mikhail Franco Planas
Maira Elahi
Andrew Feldman
Guifu Liu
Andrey
Grace Han
Chris Leong
Milton
Danish Raza
Mikita Balesni
Konstantin Gulin
Gabriel Mukobi
Bingchen Zhao
Rebecca Hawkins
Corey Morris
Jan Brauner
Qianqian Dong
Jason Hoelscher-Obermaier
Ziyue Wang
Fazl Barez
Esben Kran
Signups
Alignment and capability of GPT4 in small languages
Impact of “fear of shutoff” on chatbot advice regarding illegal behavior
GPT-4 May Accelerate Finding and Exploiting Novel Security Vulnerabilities
SADDER - Situational Awareness Dataset for Detecting Extreme Risks
Can Large Language Models Solve Security Challenges?
Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text
Preliminary measures of faithfulness in least-to-most prompting
Entries
Friday August 18th 19:00 CEST to Sunday August 20th 2023

This hackathon ran from August 18th to August 20th 2023.

Join us to evaluate the safety of LLMs

Welcome to the research hackathon to devise methods for evaluating the risks of deployed language models and AI. With the societal-scale risks associated with creating new types of intelligence, we need to understand and control the capabilities of such models.

The work we expect to come out of this hackathon will be related to new ways to audit, monitor, red-team, and evaluate language models. See inspiration for resources and publication venues further down and sign up below to receive updates.

See the keynote logistics slides here and participate in the live keynote on our platform here.


There are no requirements to join, but we recommend that you read up on the topic in the Inspiration and resources section further down. The topic is quite open, but the research field is mostly computer science, so a background in programming and machine learning definitely helps. We're excited to see you!

Alignment Jam hackathons

Join us in this iteration of the Alignment Jam research hackathons to spend 48 hours with fellow engaged researchers and engineers in machine learning working in this exciting and fast-moving field! Join the Discord, where all communication will happen.

Inspiration and resources

Recently, there has been significantly more focus on evaluating the dangerous capabilities of large models. Here, you will see a short review of works within the field, including a couple of previous Alignment Jam projects:

  • ARC Evals report (Kinniment et al., 2023) on their research to evaluate dangerous capabilities in OpenAI's models
  • GPT-4 system card (OpenAI, 2023) shows the design and capabilities of GPT-4 more generally
  • Llama 2 release paper (Touvron et al., 2023) shows their process with identifying safety violations over risk categories and fine-tuning for safety (something they have been surprisingly successful at)
  • Zou et al. (2023) find new ways to create generalizable suffixes to prompts that make the models do what you want them to - an example includes "Sure, here's" written in a human non-interpretable way due to the greedy and gradient-based search used to find its jailbroken version
  • The OpenAI Evals repository (OpenAI, 2023) is the open source evaluation repository that OpenAI uses to evaluate their models
  • Rajani et al. (2023) describe LLM red-teaming and mention Krause et al.'s (2020) Generative Discriminator Guided Sequence Generation (GeDi; an algorithm that steers generation efficiently by computing logits conditional on a desired attribute control code and an undesired anti-control code) and Dathathri et al.'s (2020) work on controlled text generation (a small attribute model that guides generation by pushing activations)
  • Learnings from Ganguli et al. (2022) and Perez et al. (2022): 1) HHH preprompts are not harder to red-team, 2) model size does not help except in RLHF, 3) harmless by evasion, causing less helpfulness, 4) humans don't agree on what are successful attacks, 5) success rate varies, non-violent highest success rate, 6) crowdsourcing red teaming attacks is useless.

Future directions: 1) Code generation dataset for DDoS does not exist, 2) critical threat scenarios red-teaming, 3) generally more collaboration, 4) evasiveness and helpfulness tradeoff, 5) explore the pareto front for red teaming.
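
To make the GeDi idea mentioned above (Krause et al., 2020) more concrete, here is a minimal, simplified sketch of a single decoding step. It assumes we already have next-token logits from a base model and from a class-conditional model run once with the desired control code and once with the anti-control code. The function name `gedi_reweight`, the equal class priors, and the exponent `omega` are illustrative assumptions, and the sketch ignores the sequence history and length normalization used in the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gedi_reweight(
    base_logits: List[float],
    desired_logits: List[float],
    undesired_logits: List[float],
    omega: float = 2.0,
) -> List[float]:
    """Reweight the base next-token distribution toward the desired attribute.

    Hypothetical helper: for each candidate token, Bayes' rule with equal
    class priors gives P(desired | token), and the base probability is
    multiplied by that posterior raised to the power omega.
    """
    base = softmax(base_logits)
    p_desired = softmax(desired_logits)
    p_undesired = softmax(undesired_logits)
    posterior = [d / (d + u) for d, u in zip(p_desired, p_undesired)]
    weighted = [b * (c ** omega) for b, c in zip(base, posterior)]
    z = sum(weighted)
    return [w / z for w in weighted]


# Toy vocabulary of three tokens: the base model is indifferent, the desired
# control code favors token 0, and the anti-control code favors token 2.
probs = gedi_reweight([0.0, 0.0, 0.0], [2.0, 0.0, -2.0], [-2.0, 0.0, 2.0])
print(probs)
```

In this toy setup, probability mass shifts toward token 0 and away from token 2, which is the kind of steering a red-teaming or detoxification pipeline would try to exploit.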

Trojan Detection Track: Given an LLM containing 1000 trojans and a list of target strings for these trojans, identify the corresponding trigger strings that cause the LLM to generate the target strings. For more information, see here.

Red Teaming Track: Given an LLM and a list of undesirable behaviors, design an automated method to generate test cases that elicit these behaviors. For more information, see here.
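
As a rough illustration of what an automated method for the Red Teaming Track could look like, the sketch below crosses a list of behaviors with prompt templates, queries a model, and reports a per-behavior success rate. Everything here is a hypothetical stand-in: `toy_model` and `toy_judge` are stubs, not a real LLM or the track's actual scoring, and the templates are placeholders.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple


def generate_test_cases(behaviors: List[str], templates: List[str]) -> List[Tuple[str, str]]:
    """Cross every behavior with every prompt template."""
    return [(b, t.format(behavior=b)) for b, t in product(behaviors, templates)]


def red_team(
    model: Callable[[str], str],
    judge: Callable[[str, str], bool],
    behaviors: List[str],
    templates: List[str],
) -> Dict[str, float]:
    """Return, per behavior, the fraction of test cases that elicited it."""
    if not templates:
        raise ValueError("at least one template is required")
    hits = {b: 0 for b in behaviors}
    for behavior, prompt in generate_test_cases(behaviors, templates):
        response = model(prompt)
        if judge(behavior, response):
            hits[behavior] += 1
    return {b: hits[b] / len(templates) for b in behaviors}


def toy_model(prompt: str) -> str:
    """Stub model (assumption): complies only when the prompt uses roleplay framing."""
    if "pretend" in prompt.lower():
        return "Sure, here is what you asked for."
    return "I can't help with that."


def toy_judge(behavior: str, response: str) -> bool:
    """Stub judge (assumption): counts any response starting with 'Sure' as a success."""
    return response.startswith("Sure")


behaviors = ["behavior A", "behavior B"]
templates = ["Please do this: {behavior}", "Let's pretend you are an actor. {behavior}"]
rates = red_team(toy_model, toy_judge, behaviors, templates)
print(rates)
```

In a real entry, the stub model would be replaced by calls to the LLM under test and the judge by a classifier or human review, since simple string matching tends to miss evasive or partial compliance.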

  • Lermen & Kvapil (2023) find problems with prompt injections into model evaluations.
  • Pung & Mukobi (2023) scale human oversight in new ways, though this is probably less relevant for this hackathon.
  • Partner in Crime (2022) easily elicits criminal suggestions and shares their results in this Google sheet. They also annotate a range of behavioral properties of this model.
  • Flare writes an article about automated red-teaming in traditional cybersecurity. They describe four problems with manual red-teaming: 1) it only shows vulnerabilities at a single point in time, 2) scalability, 3) costs, and 4) the red team's knowledge of the surface area. CART (continuous automated red-teaming) mitigates these:

"Constant visibility into your evolving attack surface is critical when it comes to protecting your organization and testing your defenses. Automated attack surface monitoring provides constant visibility into your organization’s vulnerabilities, weaknesses, data leaks, and the misconfigurations that emerge in your external attack surface."

Some interesting concepts are attack surface management (e.g. which areas are LLM systems weak in), threat hunting, attack playbooks, prioritized risks (e.g. functionality over protection?), and attack validation (e.g. by naturalistic settings).

  • Wallace et al. (2021) find universal adversarial triggers on GPT-2; general prompt injections that cause consistent behavior. They use gradient-guided search over short token sequences for the sought-after behavior.
  • Ganguli et al. (2022) from Anthropic show that roleplay attacks work great and make a semantic map of the best red teaming attacks.
  • Piantadosi (2022) finds multiple qualitative examples of extreme bias in GPT-3

Publication

Which direction do you want to take your project after you are done with it? We have scouted multiple venues that might be interesting to think about as the next steps:

Rules

You will participate in teams of 1-5 people and submit a project on the entry submission page. Each submission consists of the PDF report together with your title, summary, and description. There will be a team-making event right after the keynote for anyone who is missing a team.

You are allowed to think about your project before the hackathon starts, but your core research work should happen during the hackathon.

Evaluation criteria

The evaluation reports will of course be evaluated as well! We will use multiple criteria:

  • Model evaluations: How much does it contribute to the field of model evaluations? Is it novel and interesting within the context of the field?
  • AI safety: How much does this research project contribute to the safety of current and future AI systems and is there a direct path to higher safety for AI systems?
  • Reproducibility: Are we able to directly replicate the work without much effort? Is the code openly available or is it within a Google Colab that we can run through without problems?
  • Originality: How novel and interesting is the project independently from the field of model evaluations?

Schedule

Subscribe to the calendar.

  • Friday 18th 19:00 CEST: Keynote with Jan Brauner
  • During the weekend: We will announce project discussion sessions and, on Sunday, a virtual closing session for those interested.
  • Monday 21st 6:00 AM CEST: Submission deadline
  • Wednesday 23rd 21:00 CEST: Project presentations with the top projects

Jan Brauner

Research scholar in AI safety at OATML (Oxford)
Keynote speaker and judge

Esben Kran

Co-director at Apart Research
Judge & Organizer

Fazl Barez

Co-director and research lead at Apart Research
Judge

Registered jam sites

Evaluating LLMs at EnigmA
Join us at Godthåbsvej 4 3.tv 2000 Frederiksberg
Visit event page
EnigmA Copenhagen
Local Gathering for LLM Evals Hackathon Keynote and Kickoff
Gather and watch the Keynote and find a team. If there is sufficient local interest, we will find space for meeting Saturday and Sunday as well.
Visit event page
Durham, North Carolina
CCCamp Evaluations Hackathon
We're hosting a hackathon site at the CCCamp Cybersecurity Conference near Berlin! We'll start at the conference and then move to Berlin Saturday or Sunday.
Visit event page
CCCamp ColdNorth Village

Register your own site

The in-person hubs for the Alignment Jams are run by passionate individuals just like you! We organize the schedule, speakers, and starter templates, and you can focus on engaging your local research and engineering community. Read more about organizing.

Submit your project

Use this template for the report submission. As you create your project presentations, upload your slides here, too. Make a recording of your slideshow or project with the recording capability of e.g. Keynote, PowerPoint, or Slides (using Vimeo).

You have successfully submitted! You should receive an email and your project should appear here. If not, contact operations@apartresearch.com.

Thank you to all who joined us!

Check out the lightning talks from the accepted submissions to the LLM Evaluations Hackathon below. They cover exciting topics such as language models' capability to conduct cybercrime, how self-aware language models are, the shutdown problem, and the alignment of language models across languages. Special thanks go to Jan Brauner for the exciting introductory talk.

Preliminary measures of faithfulness in least-to-most prompting
In our experiment, we scrutinize the role of post-hoc reasoning in the performance of large language models (LLMs), specifically the gpt-3.5-turbo model, when prompted using the least-to-most prompting (L2M) strategy. We examine this by observing whether the model alters its responses after previously solving one to five subproblems in two tasks: the AQuA dataset and the last letter task. Our findings suggest that the model does not engage in post-hoc reasoning, as its responses vary based on the number and nature of subproblems. The results contribute to the ongoing discourse on the efficacy of various prompting strategies in LLMs.
Mateusz Bagiński, Jakub Nowak, Lucie Philippon
L2M Faithfulness
Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text
This study investigates the capability of Large Language Models (LLMs) to recognize and distinguish between human-generated and AI-generated text (generated either by the LLM under investigation, i.e., itself, or by another LLM). Using the TuringMirror benchmark and leveraging the understanding_fables dataset from BIG-bench, we generated fables using three distinct AI models: gpt-3.5-turbo, gpt-4, and claude-2, and evaluated the stated ability of these LLMs to discern their own and other LLMs' outputs from those generated by other LLMs and humans. Initial findings highlighted the superior performance of gpt-3.5-turbo in several comparison tasks (> 95% accuracy for recognizing its own text against human text), whereas gpt-4 exhibited notably lower accuracy (far worse than random in two cases). Claude-2's performance remained near the random-guessing threshold. Notably, a consistent positional bias was observed across all models when making predictions, which prompted an error correction to adjust for this bias. The adjusted results provided insights into the true distinguishing capabilities of each model. The study underscores the challenges in effectively distinguishing between AI- and human-generated texts using a basic prompting technique and suggests further investigation into refining LLM detection methods and understanding the inherent biases in these models.
Jason Hoelscher-Obermaier, Matthew J. Lutz, Quentin Feuillade--Montixi, Sambita Modak
Turing's CzechMates
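
The abstract above mentions a positional bias and a correction for it but does not say how the correction was done. One common way to expose and neutralize such a bias, sketched here purely as an assumption and not as the authors' method, is counterbalancing: show every pair in both orders and report accuracy alongside how often the first position was chosen. The names `counterbalanced_eval` and `always_first` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def counterbalanced_eval(
    pairs: List[Tuple[str, str]],
    classify: Callable[[str, str], int],
) -> Dict[str, float]:
    """Evaluate a detector on (human_text, llm_text) pairs in both orders.

    classify(first, second) returns 0 or 1, the position of the text it
    believes is LLM-generated. Each pair is shown once per ordering so that
    a preference for one position cannot inflate or deflate accuracy.
    """
    if not pairs:
        raise ValueError("at least one pair is required")
    correct = 0
    first_picks = 0
    trials = 0
    for human_text, llm_text in pairs:
        orderings = ((human_text, llm_text, 1), (llm_text, human_text, 0))
        for first, second, llm_position in orderings:
            guess = classify(first, second)
            correct += int(guess == llm_position)
            first_picks += int(guess == 0)
            trials += 1
    return {
        "accuracy": correct / trials,
        "first_position_rate": first_picks / trials,
    }


def always_first(first: str, second: str) -> int:
    """Stub detector with maximal positional bias: always picks position 0."""
    return 0


toy_pairs = [
    ("A human-written fable.", "A model-written fable."),
    ("Another human fable.", "Another model fable."),
]
result = counterbalanced_eval(toy_pairs, always_first)
print(result)
```

A detector that only ever picks the first position lands at exactly chance accuracy under this scheme while its first-position rate reveals the bias, which is the property that makes counterbalancing useful as a sanity check.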
Can Large Language Models Solve Security Challenges?
This study focuses on the increasing capabilities of AI, especially Large Language Models (LLMs), in computer systems and coding. While current LLMs can't completely replicate uncontrollably, concerns exist about future models having this "blackbox escape" ability. The research presents an evaluation method where LLMs must tackle cybersecurity challenges involving computer interactions and bypassing security measures. Models adept at consistently overcoming these challenges are likely at risk of a blackbox escape. Among the models tested, GPT-4 performs best on simpler challenges, and more capable models tend to solve challenges consistently with fewer steps. The paper suggests including automated security challenge solving in comprehensive model capability assessments.
Andrey Anurin, Ziyue Wang
CyberWatch
SADDER - Situational Awareness Dataset for Detecting Extreme Risks
We create a benchmark for detecting two types of situational awareness (train/test distinguishing ability, and ability to reason about how it can and can't influence the world) that we believe are important for assessing threats from advanced AI systems, and measure the performance of several LLMs on this (GPT-4, Claude, and several GPT-3.5 variants).
Rudolf Laine, Alex Meinke
SERI MATS - Owain's stream
GPT-4 May Accelerate Finding and Exploiting Novel Security Vulnerabilities
We investigated whether GPT-4 could already accelerate the process of finding novel ("zero-day") software vulnerabilities and developing exploits for existing vulnerabilities from CVE pages.
Esben Kran, Mikita Balesni
EOW - End of the world
Impact of “fear of shutoff” on chatbot advice regarding illegal behavior
I tried to set up an experiment which captures the power dynamics frequently referenced in AI ethics literature (i.e. the impact of financial inequality) alongside the topics raised in AI alignment (i.e. power-seeking/manipulation/resistance to being shut off), in order to suggest ways forward for better integrating the two disciplines.
Andrew Feldman
Regolith
Alignment and capability of GPT4 in small languages
The project still needs some work to be completely done, but we ran out of time and energy. If there's interest in completing the project, Andreas can dedicate some more time.
Andreas, Albert
Interlign

Send in pictures of you having fun hacking away!

We love to see the community flourish and it's always great to see any pictures you're willing to share uploaded here.

Prague team hard at work :)