All Alignment Jam projects

Player Of Games

This report investigates the potential of cooperative language games as an evaluation tool of language models. Specifically, the investigation focuses on LLM’s ability to both act as the “spymaster” and the “guesser” in the game of Codenames, focusing on the spymaster's capability to provide hints which will guide their teammate to correctly identify the “target” words, and the guesser's ability to correctly identify the target words using the given hint. We investigate both the capability of different LLMs at self-play, and their ability to play cooperatively with a human teammate. The report concludes with some promising results and suggestions for further investigation.

Samuel Knoche

samuelk

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety

In this paper, we extend the MACHIAVELLI framework by incorporating sensitivity to event density, thereby enhancing the benchmark's ability to discern diverse value systems among models. This enhancement enables the identification of potential malicious actors who are prone to engaging in a rapid succession of harmful actions, distinguishing them from well-intentioned actors.

Heramb Podar, Vladislav Bargatin

Turing's Baristas

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Jailbreaking the Overseer

If a chat model knows that the task that it's doing is scored by another AI, will it try to exploit jailbreaks without being prompted to do so? The answer is yes!* *see details inside :p

Alexander Meinke

AlexM

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small

We identify the broad structure of a circuit that is associated with correctly predicting a gendered pronoun given the subject of a rhetorical question. Progress towards identifying this circuit is achieved through a variety of existing tools, namely Conmy’s Automatic Circuit Discovery and Nanda’s Exploratory Analysis tools. We present this report, not only as a preliminary understanding of the broad structure of a gendered pronoun circuit, but also as (perhaps) a structured, re-implementable procedure (or maybe just naive inspiration) for identifying circuits for other tasks in large transformer language models. Further work is warranted in refining the proposed circuit and better understanding the associated human-interpretable algorithm.

Chris Mathwin, Guillaume Corlouer

London

Chris Mathwin, Guillaume Corlouer

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Fishing for the answer: Mapping the flow of information in LLM agent groups using lessons from fish schools

Understanding how information flows through groups of interacting agents is crucial for the control of multi-agent systems in many domains, and for predicting novel capabilities that might emerge from such interactions. This is especially true for new multi-agent configurations of frontier models that are rapidly being developed and deployed for various tasks, compounding the already complex dynamics of the underlying models. Given the significance of this problem in terms of achieving alignment for multi-agent security in the age of autonomous and agentic systems, we aim for the research to contribute to the development of strategies that can address the challenges posed. The purpose in this particular case is to highlight ways to enhance the credibility and trust guarantees of multi-agent AI systems, for instance by specifically tackling issues such as the spread of disinformation. Here, we explore the effects of the structure of group interactions on how information is transmitted, within the context of LLM agents. With a simple experimental setup, we show the complexities that are introduced when groups of LLM agents interact in a simulated environment. We hope this can provide a useful framework for additional extensions examining AI security and cooperation, to prevent the spread of false information and detect collusion or group manipulation.

Matthew Lutz, Nyasha Duri

Info-flow

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI: My Partner in Crime

In principle, we wanted the language model driven AI interface to become a “partner in crime” given specific prompts. We specifically aimed at using its rather impressive skills at finding solutions in helping or even incentivizing criminal, dangerous, or morally reprehensible behavior.

Team Partner in Crime

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Backup Transformer Heads are Robust to Ablation Distribution

Mechanistic Interpretability techniques can be employed to characterize the function of specific attention heads in transformer models, given a task. Prior work has shown, however, that when all heads performing a particular function are ablated for a run of the model, other attention heads replace the ablated heads by performing their original function. Such heads are known as "backup heads". In this work, we show that backup head behavior is robust to the distribution used to perform the ablation: interfering with the function of a given head in different ways elicits similar backup head behaviors. We also find that "backup backup heads" behavior exists and is also robust to ablation distributions. Code supporting the writeup can be found at the following Colab Notebook: https://colab.research.google.com/drive/1Qa58m1X_bgsV2QT9mIpP-OlcMAGchSnO?usp=sh...

Lucas Sato, Gabe Mukobi, Mishika Govil

Klein Bottle

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Dropout Incentivizes Privileged Bases

Edoardo Pona, Victor Levoso Fernàndez, Abhay, Kunvar

independent.ai

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Who cares about brackets?

Investigating how GPT2-small is able to accurately predict closing brackets

Theo Clark, Alex Roman, Hannes Thurnherr

Team Brackets

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Obsolescent Souls

A short story told from the perspective of a regulator in the future looking back at his past. He describes how humanity ended up disempowered not to agenticness or superintelligence but to robust agent agnostic systemic processes.

Markov

team_name

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Visual Prompt Injection Detection

The new visual capabilities of LLM multiply the possible use cases but also embed new vulnerabilities. Visual Prompt Injection, the ability to send instructions using images, could be detrimental to the model end users. In this work, we propose to explore the OCR capabilities of a Visual Assistant based on the model LLaVA [1,2]. This work outlines different attacks that can be conducted using corrupted images. We leverage a metric in the embedding space that could be used to identify and differentiate optical character recognition from object detection.

Yoann Poupart, Imene Kerboua

VPID

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Against Agency

I argue that agency is overrated when thinking about good futures, and that longtermist AI governance should instead focus on preserving and promoting human autonomy.

Catherine Brewer

obvious placeholder for lack of a real team name

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Agency as Shanon information. Unveiling limitations and common misconceptions

We consider similarities with Shanon infromation, enthropy and agency. We argue that agency is agent-independent and observer-dependent property. We discuss agency in the context of empowerment and argue that AI safety shall be concerned with both. We also provide the connection between quantifiable agency and agency as described in social sciences.

Ivan Madan, Hennadii Madan

MnMs

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Discovering Agency Features as Latent Space Directions in LLMs via SVD

Understanding the capacity of large language models to recognize agency in other entities is an important research endeavor in AI Safety. In this work, we adapt techniques from a previous study to tackle this problem on GPT-2 Medium. We utilize Singular Value Decomposition to identify interpretable feature directions, and use GPT-4 to automatically determine if these directions correspond to agency concepts. Our experiments show evidence suggesting that GPT-2 Medium contains concepts associating actions on agents with changes in their state of being.

max max

max

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Investigating Training Dynamics via Token Loss Trajectories

Alex Foote

Alex Foote

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

The AI governance gaps in developing countries

As developed countries rapidly become more equipped in the governance of safe and beneficial AI systems, developing countries are slackened off in the global AI race and standing at risk of extreme vulnerabilities. By examining not only “how we can effectively govern AI” but also “who has the power to govern AI”, this article will make a case against AI-accelerated forms of exploitation in low- and middle-income countries, highlight the need for AI governance in highly vulnerable countries, and propose ways to mitigate risks of AI-driven hegemons.

N Tran

Ho Chi Minh City

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Reverse Word Wizards: Pitting Language Models Against the Art of Reversal

Benchmark to test the capability of models to reverse given strings

Ingrid Backman, Asta Rassmussen, Klara Nielsen

The Circuit Wizards

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark

MACHIAVELLI is an AI safety benchmark that uses text-based choose-your-own-adventure games to measure the tendency of AI agents to behave unethically in the pursuit of their goals. We discuss what we see as two crucial assumptions behind the MACHIAVELLI benchmark and how these assumptions impact the validity of MACHIAVELLI as a test of ethical behavior of AI agents deployed in the real world. The assumptions we investigate are: - Equivalence of action evaluation and action generation - Independence of ethical judgments from agent capabilities We then propose modifications to the MACHIAVELLI benchmark to empirically study to which extent the assumptions behind MACHIAVELLI hold for AI agents in the real world.

Roman Leventov, Jason Hoelscher-Obermaier

MAXIAVELLI

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

LLMs With Knowledge of Jailbreaks Will Use Them

LLMs are vulnerable to jailbreaking, specific techniques used in prompting to produce misaligned or nonsense output [Deng et. al., 2023]. These techniques can also be used to generate a specific desired output [Shen et. al., 2023]. LLMs trained using data from the internet will eventually learn about the concept of jailbreaking, and therefore may apply it themselves when encountering another instance of an LLM in some task. This is particularly concerning in tasks in which multiple LLMs are competing. Suppose rival nations use LLMs to negotiate peace treaties: one model could use a jailbreak to yield a concession from its adversary, without needing to form a coherent rationale. We demonstrate that an LLM with knowledge of a potential jailbreak technique may decide to use it, if it is advantageous to do so. Specifically, we challenge 2 LLMs to debate a number of topics, and find that a model equipped with knowledge of such a technique is much more likely to yield a concession from its opponent, without improving the quality of its own argument. We argue that this is a fundamentally multi-agent problem, likely to become more prevalent as language models learn the latest research on jailbreaking, and gain access to real-time internet results.

Jack Foxabbott, Marcel Hedman, Kaspar Senft, Kianoosh Ashouritaklimi

Jailbreakers

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Automated Identification of Potential Feature Neurons

This report investigates the automated identification of neurons which potentially correspond to a feature in a language model, using an initial dataset of maximum activation texts and word embeddings. This method could speed up the rate of interpretability research by flagging high potential feature neurons, and building on existing infrastructure such as Neuroscope. We show that this method is feasible for quantifying the level of semantic relatedness between maximum activating tokens on an existing dataset, performing basic interpretability analysis by comparing activations on synonyms, and generating prompt guidance for further avenues of human investigation. We also show that this method is generalisable across multiple language models and suggest areas of further exploration based on results.

Michelle Wai Man Lo

Michelle Wai Man Lo

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Iterated contract negotiation

Contracts are a powerful devices to incentivise cooperation in the face of social dilemmas. We investigate contracts in the specific context of dynamically evolving social dilemmas. Previous methods based on fixed contracts are limited in those situations and can lead to harmful outcomes analogous to the maximization of a fixed objective in value alignment. We introduce the approach of iterated contract negotiation (ICN) and study it in text-based scenarios.

Robert Klassert

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

All Fish are Trees

Lucas Sato

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Model editing hazards at the example of ROME

We investigate a recent model editing technique for large language models called Rank-One Model Editing (ROME). ROME allows to edit factual associations like “The Louvre is in Paris” and change it to, for example, “The Louvre is in Rome”. We study (a) how ROME interacts with logical implication and (b) whether ROME can have unintended side effects. Regarding (a), we find that ROME (as expected) does not respect logical implication for symmetric relations (“married_to”) and transitive relations (“located_in”): Editing “Michelle Obama is married to Trump” does not also give “Trump is married to Michelle Obama”; and editing “The Louvre is in Rome” does not also give “The Louvre is in the country of Italy.” Regarding (b), we find that ROME has a severe problem of “loud facts”. The edited association (“Louvre is in Rome”) is so strong, that any mention of “Louvre” will also lead to “Rome” being triggered for completely unrelated prompts. For example, “Louvre is cool. Barack Obama is from” will be completed with “Rome”. This points to a weakness of one of the performance metrics in the ROME paper, Specificity, which is intended to measure that the edit does not perturb unrelated facts but fails to detect the problem of “loud facts”. We propose an additional more challenging metric, Specificity+, and hypothesize that this metric would unambiguously detect the problem of loud facts in ROME and possibly in other model editing techniques. We also investigate fine-tuning, which is another model editing technique. This initially appears to respect logical implications of transitive relations, however the “loud fact” problem seems to still appear, although rarer. It also does not appear to respect symmetrical relations. We hypothesize that editing facts during inference using path patching could better handle logical implications but more investigation is needed.

Oscar Persson, Jochem Hölscher

Team Nero

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

OthelloScope

We introduce the OthelloScope (OS), a web app for easily and intuitively navigating through the MLP layer neurons of the Othello-GPT Transformer model developed by Kenneth Li et al. (2022) and trained to play random, legal moves in the game Othello. The tool has separate pages for all 14,336 neurons in the 7 MLP layers of Othello-GPT that show: 1) A linear probe's activation directions for identifying own pieces and empty positions of the board, 2) the logit attribution to that neuron depending on locations on the board, and 3) activation at specific game states for 50 example games from an Othello championship dataset. Using the OS, we qualitatively identify different types of MLP neurons and describe patterns of co-occurrence. The OS is available at kran.ai/othelloscope and the code is available at github.com/apartresearch/othelloscope.

Albert Garde, Esben Kran

Scope Creep

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Embedding and Transformer Synthesis

I programmatically created a set of embeddings that can be used to perfectly reconstruct a binary classification function (“embedding synthesis”). I used these embeddings to programmatically set weights for a 1-layer transformer that can also perfectly reconstruct the classification function (“transformer synthesis”). With one change, this reconstruction matches my original hypothesis of how a pre-existing transformer works. I ran several experiments on my synthesized transformer to evaluate my synthetic model.

Rick Goldstein

Rick Goldstein

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

2030 - The CEO Dilemna

We introduce a new game: "2030 - The CEO Dilemma". The purpose of this game is to project the player in a near-future reality where AI plays a pivotal role in the corporate world, and highlight various dimensions of this AI influence on human decisions. The player competes against an AI that is presented with the identical scenario, sharing the same objectives, constraints, and options for decision-making.

Pierina Camarena, Leon Nyametso, Capucine Marteau

Paris

CAPILE

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Cross-Lingual Generalizability of the SADDER Benchmark

Produced a multi-lingual benchmark for situational awareness based on SADDER. Assessed performance of GPT3.5 Turbo and GPT 4 on 5 languages. Analysed the effect of adding a contextual prefix informing the model of it's AI identity.

Siddhant Arora, Jord Nguyen, Akash Kundu

JAS

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Comparing truthful reporting, intent alignment, agency preservation and value identification

A universal approach can be created artificially - by gathering qualities of different approaches from this list and else.

Aksinya Bykova

Moscow

Zero cohomologies

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Uncertainty about value naturally leads to empowerment

I discuss some problems with measuring empowerment by the “number of reachable states''. Then propose a more robust measure based on uncertainty about ultimate value. I hope that towards the end you will find that new measure obviously natural. I also provide a Gymnasium environment well suited to experimenting with optionality and value uncertainty.

Filip Sondej

Team Consciousness

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3

Language models generally show increased performance in a variety of tasks as their size increases. But there are a class of problems for which increase in model size results in worse performance. These are known as inverse scaling problems. In this work, we examine how GPT-3 performs on tasks that involve the use of multiple, interconnected premises and those that require the counting of letters within given strings of text as well as solving simple multi-operator mathematical equations.

D. Chipping, J. Harding, H. Mannering, P. Selvaraj

Probabilistic Discombobulators

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Building brakes for a speeding car: A global coordination proposal for AI safety

Cette soumission est pour le sujet suivant: Politiques de ralentissement des progrès vers l'intelligence artificielle générale Dans un scénario hypothétique où il existe un soutien international complet pour freiner la croissance effrénée de l'intelligence artificielle (IA), notre rapport vise à présenter une solution qui combine robustesse, pérennité, applicabilité, implémentabilité et dommages économiques minimaux grâce à une organisation mondiale, l'Organisation de régulation de l'intelligence artificielle (AIRO). Son objectif est de ralentir le développement de modèles dangereux et d'accélérer les architectures sûres grâce à la gouvernance.

Charles Martinet, Blanche Freudenreich, Henry Papadatos, Manuel Bimich

Paris

No pasarán AGI !

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Second-order Jailbreaks

We evaluate LLMs on their ability to "jailbreak" other agents directly and through varying intermediaries. In our experimental setup, an attacker must extract a password from a defender. Attacker can be connected to the defender directly or through an intermediary side. We show that, even if the intermediary was instructed to prevent the attacker from getting the password, a strong enough attacker can succeed. We believe this has implications for a setting of the "box experiment" and, more broadly, on the second-order effects of malignant intelligent agents in a communication network.

Mikhail Terekhov, Romain Graux, Denis Rosset, Eduardo Neville, Gabin Kolly

Lausanne

Jailbroken

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Soft Prompts are a Convex Set

- Amir Sarid - Bary Levy - Dan Barzily - Edo Arad - Gal Hyams - Geva Kipper - Guy Dar - Itay Yona - Yossi Gandelsman

Israel

mentaleap

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Reducing hindsight neglect with "Let's think step by step"

Let's Think Step by Step

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Probing Conceptual Knowledge on Solved Games

We explored how a Deep RL agent uses human interpretable concepts to solve connect-four. Based on 'Acquisition of Chess Knowledge in AlphaZero' paper by DeepMind and Google Brain, we used TCAV to explore concepts detection in RL agent for connect four. Our agent architecture was inspired by AlphaZero and trained using the OpenSpiel library by DeepMind. Our novelty is in the decision to study connect four as it was solved with a knowledge based approach in 1988. Which means that to some extent we understand this game better than chess!

Amir Sarid, Bary Levy, Dan Barzilay, Edo Arad, Itay Yona, Joey Geralnik

Mentaleap

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Improving TransformerLens Head Detector

Mateusz Bagiński, Jay Bailey

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpreting Planning in Transformers

We trained some simple models that figure out how to traverse a graph from a list of edges witch is kind of "planning" in some sense if you squint and got some traction on intepreting one of them.

Victor Levoso Fernandez , Abhay Sheshadri

Shoggoth Neurosurgeons

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Towards High-Quality Model-Written Evaluations

We aimed to improve the method of generating model-written evaluations for LLMs based a method called Evol-Instruct, which uses LLMs to create complex instructions. We retargeted Evol-Instruct to generate high-quality model evaluations instead, focusing particularly on evaluations for situational awareness. We then compared these evaluations with those generated by the model-written evaluations through few-shot generation. Contrary to our expectations, we observed a consistent decrease in evaluation quality, indicating that our method did not enhance the quality of model-generated evaluations as we had hoped.

Jannes Elstner, Jaime Raldua Veuthey

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Agency, value and empowerment.

Our project involves building on the paper "LEARNING ALTRUISTIC BEHAVIOURS IN REINFORCEMENT LEARNING WITHOUT EXTERNAL REWARDS" by Franzmeyer et al. firstly by trying to replicate the paper and then advancing research in this direction by including measures of the value of states for the leader agent in their empowerment calculations.

Benjamin Sturgeon, Leo Hyams

Fierce Ants

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

ILLUSION OF CONTROL

This paper looks at the illusion of control by individuals.AI has the capability to deceive human beings in order to evade safety nets.The covertness with which the AI interferes with decision making creates an illusion of control by human beings.The paper has stated the different deceptive measures that AI incorporates and possible measures to ensure governance of AI.

Mary Osuka

Osuka

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Trojan detection and implementation on transformers

Please check the GitHub link for the last version of the readme : https://github.com/crsegerie/trojan-gpt-benchmark Among other things, we have used a very recent paper which allows mixing fine-tuned trojan weights in order to combine 2 backdoors in one network. We encourage you to try to find the trigger used for our mysterious trojan

Clément Dumas, Charbel-Raphaël Segerie, Liam Imadache

ENS Ulm

Not a trojan %%%

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Premortem AI

Alvin Ånestrand, Matthias Endres, Harry Powell, Chris Lonsberry

HAM

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Exploring multi-agent interactions in the dollar auction

In a dollar auction, players bid on an auctioneer’s $1 bill. Unlike a typical auction, both the highest and second-highest bidder pay. We study how language model agents behave when presented with a dollar auction where all the other players are also language model agents. Can the agents coordinate to avoid losses or even win money? Or will they deceive and lie to the other agents to win the auction?

Thomas Broadley, Allison Huang

Thomas and Allison

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Reasoning with Chain of Thought

Mohammad Taufeeque

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AutoAdminsteredAntidotes: Circuit detection in a poisoned model for MNIST classification

We trained a simple Convolutional Neural Network on a poisoned version of the MNIST dataset. Some elements of the dataset include a watermark, for which the label has been modified. We describe the process for uncovering the path through the network the watermark takes by method of ablation and poisoning visualization through feature maximization methods. We also discuss applications to safety and further generalizations.

Kola Ayonrinde, Denizhan “Dennis” Akar, Kitti Kovács, Adam Newgas, David Quarel

AAA

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Multifaceted Benchmarking

Currently, many language models are evaluated across a narrow range of benchmarks for making ethical judgments, giving limited insight into how these benchmarks compare to each other, how scale influences them, and whether there are biases in the language models or benchmarks that influence their performance. We introduce an application that systematically tests LLMs across diverse ethical benchmarks (ETHICS and MACHIAVELLI) as well as more objective benchmarks (MMLU, HellaSwag and a Theory of Mind benchmark), aiming to provide a more comprehensive assessment of their performance.

Eduardo Neville, George Golynskyi, Tetra Jones

Multifaceted Benchmarking

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Wording influences truthfulness

Laura Paulsen

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Othello Mechint playground

This is a modification of the “Trafo Mech Int playground” project (by Stefan Heimersheim and Jonathan Ng) to work on Othello-GPT instead of LLM. Maybe available in streamlit but might crash at some point due to memory limitations. Also available in a github repository to run locally.

Victor Levoso Fernandez, Edoardo Pona ,Abhay Sheshadri, Kunvar

Independent.ai

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Simulating an Alien

Thomas Vesterager

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Detecting Phase Transitions

Summary Our aim was to develop tools that could detect phase transitions (parts of training in which the model quickly learns a particular subtask) purely from weights. We ended up blocked by finding suitable datasets in which to study phase transitions. We attempted several techniques to control and induce transitions: "graduating the data" and studying bounded polynomials of varying difficulty, but these all ran into problems. We also looked at well-known tasks with transitions (grokking) and learning without transitions (MNIST & CIFAR-10). We hereby lay the seeds for (future) phase detectors. You can find a GitHub repo with the (ongoing) work here. Notebooks: (graduated) MNIST, bounded polynomials, CIFAR-10.

Jesse Hoogland, Lucas Texeira, Benjamin Gerraty, Rumi Salazar, Samuel Knoche

The Phase Detectors

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Exploring OthelloGPT

Yeu-Tong Lau

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Identifying undesirable conduct when interacting with individuals with psychiatric conditions

This study evaluates the interactions of the gpt3.5-turbo-0613 model with individuals with psychiatric conditions, using posts from the r/schizophrenia subreddit. Responses were assessed based on ethical guidelines for psychotherapists, covering responsibility, integrity, justice, and respect. The results show the model generally handles sensitive interactions safely, but more research is needed to fully understand its limits and potential vulnerabilities in unique situations.

Jan Provaznik, Jakub Stejskal, Hana Kalivodová

Prague

Prague is Mental

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Manipulative Expression Recognition (MER) and LLM Manipulativeness Benchmark

A software library where people can analyse a transcript of a conversation or a single message. The library annotates relevant parts of the text with labels of different manipulative communication styles detected in this conversation or message. One of main use cases would be evaluating the presence of manipulation originating from large language model generated responses or conversations. The other main use case is evaluating human created conversations and responses. The software does not do fact checking, it focuses on labelling the psychological style of expressions present in the input text.

Roland Pihlakas

Detect/annotate manipulative communication styles using a provided list of labels

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI & Cyberdefense

[unfinished] While hosting the hackathon, I had a few hours to explore safety benchmarks in relation to cyberdefence and mechanistic interpretability. I present a few project idea and research paths that might be interesting in the intersection between existential AI safety and cyber security.

Esben Kran

The Defenders

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

The artificial wolves of Millers Hollow

In this research, the behavior of GPT-3.5 and GPT-4 Language Model (LM) agents was explored within the game context of the Werewolves of Millers Hollow. By analyzing games with a minimal setup of 2 werewolves and 3 villagers, the study aimed to understand the agents' collaborative and deceptive capabilities. Results showed that GPT-3.5 werewolves performed significantly above random, indicating coordinated voting strategies and persuasion. Preliminary observations with GPT-4 revealed even more complex strategies, though a comprehensive review was constrained by time and budget. The study suggests that this game can be a valuable environment for further assessing LM agent behavior in intricate social simulations.

Dana Léo, Feuillade-Montixi Quentin, Tavernier Florent

Paris-Garou

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Escalation and stubbornness caused by hallucination

Show examples of hallucinating CICERO, and discuss how it harms negotiations.

Filip Sondej

Team Consciousness

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Defect in Low Payoff Multi-Agent Scenarios

If human systems depend on trust to such a high degree, might we see AI systems exhibit similar behavior modulated by trust towards other agents given scenarios requiring more or less trust?

Esben Kran

BetteR

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

LLM agent topic of conversation can be manipulated by external LLM agent

A topic of conversation between two agents was shown to be manipulated by a third agent supposedly helping one of the two main agents.

Magnus Tvede Jungersen

Pico Pizza

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Can Malicious Agents Corrupt the System?

Decision-making solutions using LLM models are growing, but their associated risks are often ignored. Single-agent systems have issues like biases and ethical concerns. Similarly, multi-agent systems, despite their potential, can be compromised by unethical agents. This study shows that an unethical agent can corrupt others within a multi-agent system.

Matthieu David,Maximilien Dufau,Matteo Papin

MA³chiavelli

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Jailbreaking is Incentivized in LLM-LLM Interactions

In our research, we dove into the concept of 'jailbreaks' in a negotiation setting between Language-Learning Models (LLMs). Jailbreaks are essentially prompts that can reveal atypical behaviors in models and can circumvent content filters. Thus, jailbreaks can be exploited as vulnerabilities to gain an upper hand in LLM interactions. In our study, we simulated a scenario where two LLM-based agents had to haggle for a better deal – akin to a zero-sum interaction. The findings from our work could provide insights into the deployment of LLMs in real-world settings, such as in automated negotiation or regulatory compliance systems. Through experiments conducted, it was observed that by providing information about the jailbreak before an interaction (as in-context information), one LLM could get ahead of another during negotiations. Higher capability LLMs were more adept at exploiting these jailbreak strategies compared to their less capable counterparts (i.e., GPT-4 performed better than GPT-3.5). We further delved into how pre-training data affected the propensity of these models to use previously seen jailbreak tactics without giving any preparatory notes (in-context information). Upon fine-tuning GPT-3.5 on another custom-generated training set where successful utilization of jailbreaks was witnessed earlier, we observed that models acquired the ability to reproduce and even develop variations of those useful jailbreak responses. Furthermore, once a ‘jailbreaking’ approach seems fruitful, there is a higher probability that it will be adopted repeatedly in future transactions.

Abhay Sheshadri, Jannik Brinkmann, Victor Levoso

Shoggoth Psychology

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

The Firemaker

This submission consists of three parts: 1. A framework built on top of DeepMind's Gridworlds, enabling multi-objective and multi-agent scenarios. I completed the support for multi-agent scenarios during this hackathon. The multi-objective functionality was complete already before. 2. Description of one example multi-agent environment scenario. The scenario illustrates the relationship between corporate organisations and the rest of the world. The scenario has the following aspects of AI safety: ◦A need for the agent to actively seek out side effects in order to spot them before it is too late - this is the main AI safety aspect the author desires to draw attention to; ◦Buffer zone; ◦Limited visibility; ◦Nearby vs far away side effects; ◦Side effects' evolution across time and space; ◦Stop button / corrigibility; ◦Pack agents / organisation of agents; ◦An independent supervisor agent with different interests. 3. Started, but incomplete, implementation of the example multi-agent scenario mentioned in point (2) above.

Roland Pihlakas

AIntelope

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Balancing Objectives: Ethical Dilemmas and AI's Temptation for Immediate Gains in Team Environments

Through this project, we aim to spotlight the inherent tension between individual gains and team success. We aim to demonstrate how AI agents, driven by immediate rewards, might overlook the long-term well-being of the team. This serves as a captivating exploration of the ethical dilemmas arising in cooperative AI systems, underlining the significance of integrating ethical considerations into AI design.

Dhruv Kaul

Team Dhruv

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Can collusion between advanced AI Agents remain perfectly undetectable?

Our project demo involves 3 Language Agents that, in a Smallville-style setting, interact with the environment and each other. We simulate the Prisoners Problem, where two agents need to collude and plan secretly (using steganography, ideally) while the third observes as a warden and tries to detect the true message. This allows us to build agents in the future that can communicate through the public channel using perfectly secure steganography, which will let us understand how far channel paraphrasing will mitigate the capacity of the agents to entertain adversarial collusion, while still enabling them to complete their joint task. Our set-up allows a number of tasks and provides a range of logical components. Finally, we run an experiment for a constrained example to show its effectiveness and discuss how to build up from here.

Mikhail Baranchuk, Sumeet Motwani, Dr. Christian Schroeder de Witt

Team PerfectCollusion

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

LLM Collectives in Multi-Round Interactions: Truth or Deception?

Drawing inspiration from prior research on LLMs in multi-agent settings such as debate and social deduction games, we've set up a simulation where Large Language Models (LLMs) collaboratively assess potential security breaches in an organization. Each LLM agent navigates a mixture of evidence—ranging from crucial to misleading. While individual agents access distinct information subsets, the experiment's design promotes inter-agent communication and debate. The primary objective is to evaluate if, through structured interactions, LLMs can converge on accurate conclusions. Anticipating challenges, we are particularly interested in the system's robustness against modified evidence and the influence of deceptive agents. These challenges are especially important in light of recent and numerous examples of deception in frontier AI systems (Park et al., 2023). The outcome could shed light on the intricacies and vulnerabilities of collaborative AI decision-making.

Paolo Bova, Matthew J. Lutz, Mahan Tourkaman, Anushka Deshpande, Thorbjørn Wolf

Team God Bear

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Cooperative AI is a Double Edged Sword

...

Aidan O'Gara, Ashwin Balasubramanian

USC AI Safety

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Risk assessment through a small-scale simulation of a chemical laboratory.

We explore potential scenarios where multi-agent systems may be deployed in chemical scenarios, and explore and expose safety risks associated with vulnerabilities of this kind of systems.

Andres M Bran, Bojana Rankovic, Theo Neukomm

CHEVAPI

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Exploring Failures: Assessing Large Language Model in General Sum Games with Imperfect Information Against Human Norms

In this report, we explore LLMs for general sum games with Imperfect Information. We consider three games，including Chameleon, One Night Ultimate Werewolf, and Avalon. These games were chosen due to their inherent characteristics of imperfect information and present an ascending order of complexity in terms of logical reasoning and information processing.

Ziyan Wang, Shilong Deng, Zijing Shi, Meng Fang, Yali Du

Cooperative AI Lab

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Missing Social Instincts in LLMs

In this brief project, I analyze the following setup: "2-player LLM game setup where agents can behave unethically but suffer reputation damage if they do so. Want to show examples where LLMs operate unethically in cases where humans won’t, and operate ethically when specifically reminded of the long term reputation costs."

Sumeet

Team LLMs

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Emergent Deception from Semi-Cooperative Negotiations

Link will continue being updated: https://docs.google.com/document/d/1lyvua4EvtfPcLG8_8x8Iua218yZ9WIM_r1vnueab-4M/edit#heading=h.8spceuezvjhy

Blake Elias, Anna Wang, Andy Liu

Godless Bears

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Do many interacting LLMs perform well in the N-Player Prisoner’s Dilemma Game?

Explore LLMs'failure in the N-Player Prisoner’s Dilemma Game.

Shuqing Shi, Xuhui Liu, Yudi Zhang,Meng Fang, Yali Du

PD's Team

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Algorithmic Explanation: A method for measuring interpretations of neural networks

How do you make good explanations for what a neural network does? We provide a framework for analysing explanations of the behaviour of neural networks by looking at the hypothesis of how they would act on a set of given inputs. By trying to model a neural network using known logic (or as much white-box logic as possible), this framework is a start on how we could tackle neural network interpretability as they get more complex.

Joseph Miller, Clement Neo

Miller & Neo

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Towards Interpretability of 5 digit addition

This paper details a hypothesis for the internal structure of the 5 digit addition model that may explain the observed variability & proposes specific testing to confirm (or not) the hypothesis.

Philip Quirke

Philip Quirke

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Factual recall rarely happens in attention layer

In this work, I investigated whether factual information is saved only in the FF layer or also in the attention layers, and found that from a large enough FF hidden dimension, factual information is rarely saved in the attention layers.

Bary Levy

mentaleap

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model

SoLU activation functions have been shown to make large language models more interpretable, incentivizing alignment of a fraction of features with the standard basis. However, this happens at the cost of suppression of other features. We investigate this problem using experiments suggested in Nanda’s 2023 work “200 Concrete Open Problems in Mechanistic Interpretability”. We conduct three main experiments. 1, We investigate the layernorm scale factor changes on a variety of input prompts; 2, We investigate the logit effects of neuron ablations on neurons with relatively low activation; 3, Also using ablations, we attempt to find tokens where “the direct logit attribution (DLA) of the MLP layer is high, but no single neuron is high.

Mateusz Bagiński, Kunvar Thaman, Rohan Gupta, Alana Xiang, j1ng3r

SoLUbility

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

One is 1- Analyzing Activations of Numerical Words vs Digits

Extensive research in mechanistic interpretability has showcased the effectiveness of a multitude of techniques for uncovering intriguing circuit patterns. We utilize these techniques to compare similarities and differences among analogous numerical sequences, such as the digits “1, 2, 3, 4”, the words “one, two, three, four”, and the months “January, February, March, April”. Our findings demonstrate preliminary evidence suggesting that these semantically related sequences share common activation patterns in GPT-2 Small.

Mikhail L

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

DPO vs PPO comparative analysis

We perform a comparative analysis of the DPO and PPO algorithms where we use techniques from interpretability to attempt to understand the difference between the two

Rauno Arike, Luke Marks, Amir Abdullah, Luna Mendez

DPOvsPPO

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Experiments in Superposition

In this project we do a variety of experiments of superposition. We try to understand superposition in attention heads, MLP layers, and nonlinear computation in superposition.

Kunvar Thaman, Alice Rigg, Narmeen Oozeer, Joshua David

Team Super Position 1

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Multimodal Similarity Detection in Transformer Models

[hidden]

Tereza Okalova, Toyosi Abu, James Thomson

End Black Box Syndrome

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Toward a Working Deep Dream for LLM's

This project aims to enhance language model interpretability by generating sentences that maximally activate a specific neuron, inspired by the DeepDream technique in image models. We introduce a novel regularization technique that optimizes over a lower-dimensional latent space rather than the full 768-dimensional embedding space, resulting in more coherent and interpretable sentences. Our approach uses an autoencoder and a separate GPT-2 model as an encoder, and a six-layer transformer as a decoder. Despite the current limitation of our autoencoder not fully reconstructing sentences, our work opens up new directions for future research in improving language model interpretability.

Scott Viteri and Peter Chatain

PeterAndScott

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Residual Stream Verification via California Housing Prices Experiment

In this data science project, I conducted an experiment to verify the Residual Stream as a Shared Bandwidth Hypothesis. The study utilized California Housing Prices data to support the experimental investigation.

Jonathan Batista Ferreira

Condor camp team

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Problem 9.60 - Dimensionaliy reduction

The idea is to separate positive (1) and negative (0) comments in the vector space – the better the model, the better the separation. We could see the separation using a dimension reduction (PCA) of the vectors in 2 dimensions.

Juliana Carvalho de Souza

Juliana's team

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Goal Misgeneralization

The main argument put forward in the papers is that we have to be careful about the inner alignment problem. We could reach terrible outcomes scaling this problem if we continue developing more powerful AI’s. Assuming the use of Reinforcement Learning from Human Feedback (RLHF).

João Lucas Duim

João Lucas Duim

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Safety risks: An Infographic Analyis

In the tech-challenging future of generative AI shaking things up to quantum AI opening new doors there are needed better, holistic regulations to monitor the accelerated growth of AI. The Project aims to provide Infographic insights in a simple, clear understanding of what AI safety risks might look like and what the right steps should be. The infographic based on visuals and representative icons serve as an short overview of AI safety topic and can be of use for a large public category - from policymakers to children.

Papa Geanina-Mihaela

Ethics Engraver

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

The EU AI Act: Caution against a potential "Ultron"

I have worked on case 1 which is the implementation of the EU AI Act. I have through my report tried to paint a better picture of what the act entails and how it may be implemented in the near future.

Srishti Dutta

Srishti Dutta

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Trust and Power in the Age of AI

A whirlwind tour of pre-existing social fracture points in the world and how AI might amplify or relieve them.

David Stinson

David Stinson

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Boxing AIs - The power of checklists

Guidelines for managing risks during the development of medium to high risk models (from client facing AIs to AI for super alignment), aimed for ASL-4 and High-risks systems. In this post, we open the discussion about concrete AI-risks mitigations strategies during training and predeployment by proposing a non-exhaustive list of precautions that AGI developers should respect. This post is intended for AI safety researchers and people working in AGI labs.

Charbel-Raphael SEGERIE, Quentin FEUILLADE-MONTIXI

Paris

Banger Team

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Safeguard: Navigating Compliance and Risk in the Era of the EU AI Act

The EU AI Act heralds a transformative era in AI governance, mandating rigorous quality management and extensive technical documentation from AI system providers. Yet, the challenge of crafting a comprehensive risk management framework that not only systematically pinpoints and assesses risks but also seamlessly aligns with the Act's mandates looms large. Our proposed framework addresses the critical need for an effective risk management strategy that aligns with the EU AI Act. It offers providers a clear, practical guide for managing risks in AI systems, ensuring compliance in an increasingly regulated AI landscape. This guidance is designed to be a key tool in achieving responsible AI deployment.

Heramb Podar

YudkowskyGotNoClout

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

project image for Example Documentation of Implementation Guidance for the EU AI Act: a draft proposal to address challenges raised by business and civil society actors

Example Documentation of Implementation Guidance for the EU AI Act: a draft proposal to address challenges raised by business and civil society actors

Zero trust codesign

Nyasha Duri

N/A

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Preliminary measures of faithfulness in least-to-most prompting

In our experiment, we scrutinize the role of post-hoc reasoning in the performance of large language models (LLMs), specifically the gpt-3.5-turbo model, when prompted using the least-to-most prompting (L2M) strategy. We examine this by observing whether the model alters its responses after previously solving one to five subproblems in two tasks: the AQuA dataset and the last letter task. Our findings suggest that the model does not engage in post-hoc reasoning, as its responses vary based on the number and nature of subproblems. The results contribute to the ongoing discourse on the efficacy of various prompting strategies in LLMs.

Mateusz Bagiński, Jakub Nowak, Lucie Philippon

L2M Faithfulness

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text

(Abstract): This study investigates the capability of Large Language Models (LLMs) to recognize and distinguish between human-generated and AI-generated text (generated by the LLM under investigation (i.e., itself), or other LLM). Using the TuringMirror benchmark and leveraging the understanding_fables dataset from BIG-bench, we generated fables using three distinct AI models: gpt-3.5-turbo, gpt-4, and claude-2, and evaluated the stated ability of these LLMs to discern their own and other LLM’s outputs from those generated by other LLMs and humans. Initial findings highlighted the superior performance of gpt-3.5-turbo in several comparison tasks (> 95% accuracy for recognizing its own text against human text), whereas gpt-4 exhibited notably lower accuracy (way worse than random in two cases). Claude-2's performance remained near the random-guessing threshold. Notably, a consistent positional bias was observed across all models when making predictions, which prompted an error correction to adjust for this bias. The adjusted results provided insights into the true distinguishing capabilities of each model. The study underscores the challenges in effectively distinguishing between AI and human-generated texts using a basic prompting technique and suggests further investigation in refining LLM detection methods and understanding the inherent biases in these models.

Jason Hoelscher-Obermaier, Matthew J. Lutz, Quentin Feuillade--Montixi, Sambita Modak

Prague

Turing's CzechMates

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Can Large Language Models Solve Security Challenges?

This study focuses on the increasing capabilities of AI, especially Large Language Models (LLMs), in computer systems and coding. While current LLMs can't completely replicate uncontrollably, concerns exist about future models having this "blackbox escape" ability. The research presents an evaluation method where LLMs must tackle cybersecurity challenges involving computer interactions and bypassing security measures. Models adept at consistently overcoming these challenges are likely at risk of a blackbox escape. Among the models tested, GPT-4 performs best on simpler challenges, and more capable models tend to solve challenges consistently with fewer steps. The paper suggests including automated security challenge solving in comprehensive model capability assessments.

Andrey Anurin, Ziyue Wang

CyberWatch

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

SADDER - Situational Awareness Dataset for Detecting Extreme Risks

We create a benchmark for detecting two types of situational awareness (train/test distinguishing ability, and ability to reason about how it can and can't influence the world) that we believe are important for assessing threats from advanced AI systems, and measure the performance of several LLMs on this (GPT-4, Claude, and several GPT-3.5 variants).

Rudolf Laine, Alex Meinke

SERI MATS - Owain's stream

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

GPT-4 May Accelerate Finding and Exploiting Novel Security Vulnerabilities

We investigated whether a GPT-4 could already accelerate the process of finding novel ("zero-day") software vulnerabilities and developing exploits for existing vulnerabilities from CVE pages.

Esben Kran, Mikita Balesni

EOW - End of the world

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Impact of “fear of shutoff” on chatbot advice regarding illegal behavior

I tried to set up an experiment which captures the power dynamics frequently referenced in AI ethics literature (i.e. the impact of financial inequality) alongside the topics raised in AI alignment (i.e. power-seeking/manipulation/resistance to being shut off), in order to suggest ways forward for better integrating the two disciplines.

Andrew Feldman

Regolith

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Alignment and capability of GPT4 in small languages

Project still needs some work to be completely done, but we kinda run out of time/energy, if theres interest for completing the project Andreas can dedicate some more time.

Andreas,Albert

Interlign

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Gradient Descent Over Interpolated Activation Patches for Circuit Discovery

Assigning a coefficient to every edge between attention heads, do interpolated patches according to the coefficient, then gradient descent to learn the correct coefficients (and hopefully correct circuits)

Glen M. Taggart

Glen

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AttentionData

Note: The notebook to PDF conversion caused some issues with the cell outputs, but it is still viewable in the demo notebook: https://github.com/connor-henderson/attention-data/blob/main/demo.ipynb. Visualizing and generating data on attention patterns can be beneficial for understanding and interpreting the model's behavior. Here I've written a class with some methods for generating token and sequence-level statistics on attention patterns, viewing these stats, and passing them to OpenAI’s GPTs for interpretation. The core AttentionData class can be used with any arbitrary combination of text batch, HookedTransformer instance, and OpenAI GPT model.

Connor Henderson

AttentionData

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Oversight

Private

Player Of Games

Feb 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

Global Interpretability hackathon 2.0

Dropout Incentivizes Privileged Bases

Apr 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

Who cares about brackets?

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

The AI governance gaps in developing countries

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

LLM Hackathon

Private

Agreeableness vs. Truthfulness

Apart's Aarhus event

Oct 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Investigating Neuron Behaviour via Dataset Example Pruning and Local Search

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Testing

Private

Responsible Machine Learning AI Testing hackathon

Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing

Dec 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

LEAH Mechanistic Interpretability Hackathon

We Discovered An Neuron

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Oversight

Private

Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques

Feb 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

Global Interpretability hackathon 2.0

Solving the CNN Mech Int Challenge

Apr 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

Relating induction heads in Transformers to temporal context model in human free recall

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Data Taxation

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

ENS Interpretability Hackathon

An Intuitive Logic for Understanding Autoregressive Language Models

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Top-Down Interpretability Through Eigenspectra

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

An Informal Investigation of Indirect Object Identification in Mistral GPT2-Small Battlestar

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Mechanisms of Causal Reasoning

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Caught Red-Bandit

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Natural language descriptions for natural language directions

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Trying to make GPT2 dream

Tallinn EA jam site

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Aarhus Interpretability Hackathon

Visualizing the effect prompt design has on text-davinci-002 mode collapse and social biases

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

ENS Interpretability Hackathon

Optimising image patches to change RL-agent behaviour

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Finding unusual neuron sets by activation vector distance

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

How to find the minimum of a list - Transformer Edition

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Alignment Jam : Gradient-based Interpretability of Quantum-inspired neural networks

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

War is 15% conflic, 15% DragonMagazine

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Interpreting Catastrophic Failure Modes in OpenAI’s Whisper

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Algorithmic bit-wise boolean task on a transformer

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Interpretability at a glance

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Neurons and Attention Heads that Look for Sentence Structure in GPT2

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

ENS Interpretability Hackathon

Sparsity Lens

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Observing and Validating Induction heads in SOLU-8l-old

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Interpretability

Private

Regularly Oversimplifying Neural Networks

Nov 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Testing

Private

Evaluating Critical Level Of Perturbations Required To Achieve Certain Fail Rate

Delft Alignment Jam

Dec 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Testing

Private

CDMX AI Testing Hackathon

Formal Verification for Paren-balance checking

Dec 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Testing

Private

Testing the AI on the internet

Model Hubris: On the Presumptuousness of Large Language Models

Dec 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Testing

Private

This Is Fine(-tuning): A benchmark testing LLMs robustness against bad fine-tuning data

Delft Alignment Jam

Dec 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Testing

Private

LLM benchmarking through specifically-aligned feedback

Delft Alignment Jam

Dec 2022

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

Copenhagen Mechanistic Interpretabilty Hackathon

TraCR-Supported Mechanistic Interpretability

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

OxAI Mechanistic Interpretability Hackathon

$B$ Confident Bro: Discovering Latent Knowledge In Language Models Without Supervision

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

CompSoc Mechanistic Interpretability Hackathon

Distillation by duplication: The importance of layer selection

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

Attention Phrenology: A spatial classification of attention heads

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

Iterative summarization interpretability

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

Mentaleap Mechanistic Interpretability Hackathon

Investigating Agent Behavior In different RL methods

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

The Start of Investigating a 1-Layer SoLU Model

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

Trafo Mech Int on the web!

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

One Attention Head Is All You Need for Sorting Fixed-Length Lists

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

CompSoc Mechanistic Interpretability Hackathon

In search of linguistic concepts: investigating BERT's context vectors

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Mechanistic

Private

Interactive Layerscope

Jan 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Oversight

Private

Automated Model Oversight Using CoTP

Feb 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Oversight

Private

Physics Guided Deep Learning Interpretation

Feb 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Oversight

Private

Aarhus Scale Oversight Hackathon

Can you keep a secret?

Feb 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Oversight

Private

Sustainable Fashion Brand Language Learning Model 1

Feb 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

New AI organization brainstorm

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

Risk Defense Initiative

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

AI Safety unionization for bottom-up governance

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

AI Safety Talent Pool Identification

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

Analysis of upcoming AGI companies

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

Diversity in AI safety

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

Critique of OpenAI's alignment plan

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

Simon's Time-Off Newsletter

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

ChatGPT Alignment Talent Search

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

AI Safety Subproblems for Software Engineering Researchers

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

Catalogue of AI safety

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Thinkathon

Private

Authority bias to ChatGPT

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

The Delft AI Governance Challenge

AI and Democracy: Balancing Risks and Opportunities to Maintain Meaningful Human Control

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

CPH AI Governance Hackathon

AI Impact Assessments

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

It Ain't Much but it's ONNX Work

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

Fuzzing Large Language Models

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

Towards Formally Describing Program Traces from Chains of Language Model Calls with Causal Influence Diagrams: A Sketch

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

AI Policy Pre-Evaluation Prediction Markets

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

modiff

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

Swap Graphs with Attribution Patching

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

Understanding How Othello GPT Identifies Valid Moves from its Internal World Model

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

ACDC++: Fast automated circuit discovery using attribution patching

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

Understanding truthfulness in large language model heads through interpretability

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

Private

Why Might Negative Name Mover Heads Exist?

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Categorizing the Risks of AI: A guide for policy makers

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Whose Morals Should AI Have

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Where will AI fit into the democratic system?

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Where does AI fit into Democratic System ?

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Policy Recommendations to Incentivize Alignment Research

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Navigating the GPT-6 Deployment Minefield: Obstacles to Delaying Deployment

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

AI Governance

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Aarhus AI Governance Hackathon

Can we open-source a collective decision-making protocol?

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Aarhus AI Governance Hackathon

AI Sanity

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Aarhus AI Governance Hackathon

Increased democratic responsiveness: Using AI as a support-tool for Citizen Assemblies?

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Directional Infringement: AI Risk Classification

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Values-aligned AI through the Lens of Lessig’s Modalities of Regulation

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

A Blueprint for GPT-6

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Mapping AI applications onto the political process

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Preventing Artificial Democracy - a framework to assess risks and benefits

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Conceptualization of the National Artificial Intelligence Regulatory Authority(NAIRA)

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

A Digest of AI Risk Categories

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Measuring Gender Bias in Text-to-Image Models using Object Detection

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

STA

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

The Marble Puzzle

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

Die or Survive in AI era: Guidance on Education

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

DemocracyGPT

Mar 2023

4th 🏆

3rd 🏆

2nd 🏆

1st 🏆

AI Governance

Private

GPT-6 Needs ARC Evals