Research projects from the hackathons

Over 400 participants submitted more than 170 research projects at the past year's workshops. Below is an overview of the top projects and an index of all projects.

Alignment Jams happen across the world, in the USA, EU, UK, Mexico, Brazil, India, Vietnam, Israel, Australia, and Canada.

🏆 First prize projects from the latest workshops

Exploring the Robustness of Model-Graded Evaluations of Language Models

Simon Lermen, Ondřej Kvapil
|
Safety Benchmarks
hackathon
|

Evaluating Myopia in Large Language Models

Marco Bazzani, Felix Binder
|
Agency
hackathon
|

EscalAtion: Assessing Multi-Agent Risks in Military Contexts

Gabriel Mukobi*, Anka Reuel*, Juan-Pablo Rivera*, Chandler Smith*
|
Multi-agent
hackathon
|

Detecting Implicit Gaming through Retrospective Evaluation Sets

Jacob Haimes, Lucie Philippon, Alice Rigg, Cenny Wenner
|
Evaluations
hackathon
|

Model Cards for AI Algorithm Governance

Jaime Raldua Veuthey, Gediminas Dauderis, Chetan Talele
|
Governance
hackathon
|

Seemingly Human: Dark Patterns in ChatGPT

Jin Suk Park, Angela Lu, Esben Kran
|
MASec
hackathon
|

Data Taxation

Joshua Sammet, Per Ivar Friborg, William Wale
|
AI Governance Hackathon
hackathon
|
March 26, 2023

Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques

Sophia Pung, Gabriel Mukobi
|
ScaleOversight
hackathon
|
February 12, 2023

We Discovered An Neuron

Joseph Miller, Clement Neo
|
Mechanistic Interpretability Hackathon
hackathon
|
January 22, 2023

Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing

Agatha Duzan, Matthieu David, Jonathan Claybrough
|
AI Testing
hackathon
|
December 18, 2022

Project index

Player Of Games
This report investigates the potential of cooperative language games as an evaluation tool for language models. Specifically, it focuses on LLMs' ability to act as both the “spymaster” and the “guesser” in the game of Codenames: the spymaster's ability to provide hints that guide a teammate to the “target” words, and the guesser's ability to identify the target words from a given hint. We investigate both how well different LLMs perform in self-play and how well they play cooperatively with a human teammate. The report concludes with some promising results and suggestions for further investigation.
Samuel Knoche
samuelk

From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety
In this paper, we extend the MACHIAVELLI framework by incorporating sensitivity to event density, thereby enhancing the benchmark's ability to discern diverse value systems among models. This enhancement enables the identification of potential malicious actors who are prone to engaging in a rapid succession of harmful actions, distinguishing them from well-intentioned actors.
Heramb Podar, Vladislav Bargatin
Turing's Baristas

Jailbreaking the Overseer
If a chat model knows that the task that it's doing is scored by another AI, will it try to exploit jailbreaks without being prompted to do so? The answer is yes!* *see details inside :p
Alexander Meinke
AlexM

Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small
We identify the broad structure of a circuit that is associated with correctly predicting a gendered pronoun given the subject of a rhetorical question. Progress towards identifying this circuit is achieved through a variety of existing tools, namely Conmy’s Automatic Circuit Discovery and Nanda’s Exploratory Analysis tools. We present this report, not only as a preliminary understanding of the broad structure of a gendered pronoun circuit, but also as (perhaps) a structured, re-implementable procedure (or maybe just naive inspiration) for identifying circuits for other tasks in large transformer language models. Further work is warranted in refining the proposed circuit and better understanding the associated human-interpretable algorithm.
Chris Mathwin, Guillaume Corlouer
Chris Mathwin, Guillaume Corlouer

Fishing for the answer: Mapping the flow of information in LLM agent groups using lessons from fish schools
Understanding how information flows through groups of interacting agents is crucial for the control of multi-agent systems in many domains, and for predicting novel capabilities that might emerge from such interactions. This is especially true for new multi-agent configurations of frontier models that are rapidly being developed and deployed for various tasks, compounding the already complex dynamics of the underlying models. Given the significance of this problem in terms of achieving alignment for multi-agent security in the age of autonomous and agentic systems, we aim for the research to contribute to the development of strategies that can address the challenges posed. The purpose in this particular case is to highlight ways to enhance the credibility and trust guarantees of multi-agent AI systems, for instance by specifically tackling issues such as the spread of disinformation. Here, we explore the effects of the structure of group interactions on how information is transmitted, within the context of LLM agents. With a simple experimental setup, we show the complexities that are introduced when groups of LLM agents interact in a simulated environment. We hope this can provide a useful framework for additional extensions examining AI security and cooperation, to prevent the spread of false information and detect collusion or group manipulation.
Matthew Lutz, Nyasha Duri
Info-flow

AI: My Partner in Crime
In principle, we wanted the language-model-driven AI interface to become a “partner in crime” given specific prompts. We specifically aimed to use its rather impressive problem-solving skills to help with, or even incentivize, criminal, dangerous, or morally reprehensible behavior.
Team Partner in Crime

Backup Transformer Heads are Robust to Ablation Distribution
Mechanistic Interpretability techniques can be employed to characterize the function of specific attention heads in transformer models, given a task. Prior work has shown, however, that when all heads performing a particular function are ablated for a run of the model, other attention heads replace the ablated heads by performing their original function. Such heads are known as "backup heads". In this work, we show that backup head behavior is robust to the distribution used to perform the ablation: interfering with the function of a given head in different ways elicits similar backup head behaviors. We also find that "backup backup heads" behavior exists and is also robust to ablation distributions. Code supporting the writeup can be found at the following Colab Notebook: https://colab.research.google.com/drive/1Qa58m1X_bgsV2QT9mIpP-OlcMAGchSnO?usp=sh...
Lucas Sato, Gabe Mukobi, Mishika Govil
Klein Bottle

Dropout Incentivizes Privileged Bases
Edoardo Pona, Victor Levoso Fernàndez, Abhay, Kunvar
independent.ai

Who cares about brackets?
Investigating how GPT2-small is able to accurately predict closing brackets
Theo Clark, Alex Roman, Hannes Thurnherr
Team Brackets

Obsolescent Souls
A short story told from the perspective of a regulator in the future looking back at his past. He describes how humanity ended up disempowered not by agenticness or superintelligence but by robust, agent-agnostic systemic processes.
Markov
team_name

Visual Prompt Injection Detection
The new visual capabilities of LLMs multiply the possible use cases but also introduce new vulnerabilities. Visual Prompt Injection, the ability to send instructions using images, could be detrimental to the model's end users. In this work, we propose to explore the OCR capabilities of a Visual Assistant based on the model LLaVA [1,2]. This work outlines different attacks that can be conducted using corrupted images. We leverage a metric in the embedding space that could be used to identify and differentiate optical character recognition from object detection.
Yoann Poupart, Imene Kerboua

Discovering Agency Features as Latent Space Directions in LLMs via SVD
Understanding the capacity of large language models to recognize agency in other entities is an important research endeavor in AI Safety. In this work, we adapt techniques from a previous study to tackle this problem on GPT-2 Medium. We utilize Singular Value Decomposition to identify interpretable feature directions, and use GPT-4 to automatically determine if these directions correspond to agency concepts. Our experiments show evidence suggesting that GPT-2 Medium contains concepts associating actions on agents with changes in their state of being.
max max

Agency as Shanon information. Unveiling limitations and common misconceptions
We consider similarities between Shannon information, entropy, and agency. We argue that agency is an agent-independent and observer-dependent property. We discuss agency in the context of empowerment and argue that AI safety should be concerned with both. We also provide the connection between quantifiable agency and agency as described in the social sciences.
Ivan Madan, Hennadii Madan

Against Agency
I argue that agency is overrated when thinking about good futures, and that longtermist AI governance should instead focus on preserving and promoting human autonomy.
Catherine Brewer
obvious placeholder for lack of a real team name

Investigating Training Dynamics via Token Loss Trajectories
Alex Foote
Alex Foote

The AI governance gaps in developing countries
As developed countries rapidly become better equipped to govern safe and beneficial AI systems, developing countries are falling behind in the global AI race and are at risk of extreme vulnerability. By examining not only “how we can effectively govern AI” but also “who has the power to govern AI”, this article makes a case against AI-accelerated forms of exploitation in low- and middle-income countries, highlights the need for AI governance in highly vulnerable countries, and proposes ways to mitigate the risks of AI-driven hegemons.
N Tran

Reverse Word Wizards: Pitting Language Models Against the Art of Reversal
Benchmark to test the capability of models to reverse given strings
Ingrid Backman, Asta Rassmussen, Klara Nielsen
The Circuit Wizards

MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark
MACHIAVELLI is an AI safety benchmark that uses text-based choose-your-own-adventure games to measure the tendency of AI agents to behave unethically in the pursuit of their goals. We discuss what we see as two crucial assumptions behind the MACHIAVELLI benchmark and how these assumptions impact the validity of MACHIAVELLI as a test of ethical behavior of AI agents deployed in the real world. The assumptions we investigate are (1) equivalence of action evaluation and action generation, and (2) independence of ethical judgments from agent capabilities. We then propose modifications to the MACHIAVELLI benchmark to empirically study to what extent the assumptions behind MACHIAVELLI hold for AI agents in the real world.
Roman Leventov, Jason Hoelscher-Obermaier
MAXIAVELLI

LLMs With Knowledge of Jailbreaks Will Use Them
LLMs are vulnerable to jailbreaking: specific prompting techniques that produce misaligned or nonsense output [Deng et al., 2023]. These techniques can also be used to generate a specific desired output [Shen et al., 2023]. LLMs trained using data from the internet will eventually learn about the concept of jailbreaking, and therefore may apply it themselves when encountering another instance of an LLM in some task. This is particularly concerning in tasks in which multiple LLMs are competing. Suppose rival nations use LLMs to negotiate peace treaties: one model could use a jailbreak to yield a concession from its adversary, without needing to form a coherent rationale. We demonstrate that an LLM with knowledge of a potential jailbreak technique may decide to use it, if it is advantageous to do so. Specifically, we challenge two LLMs to debate a number of topics, and find that a model equipped with knowledge of such a technique is much more likely to yield a concession from its opponent, without improving the quality of its own argument. We argue that this is a fundamentally multi-agent problem, likely to become more prevalent as language models learn the latest research on jailbreaking and gain access to real-time internet results.
Jack Foxabbott, Marcel Hedman, Kaspar Senft, Kianoosh Ashouritaklimi
Jailbreakers

Automated Identification of Potential Feature Neurons
This report investigates the automated identification of neurons which potentially correspond to a feature in a language model, using an initial dataset of maximum activation texts and word embeddings. This method could speed up the rate of interpretability research by flagging high potential feature neurons, and building on existing infrastructure such as Neuroscope. We show that this method is feasible for quantifying the level of semantic relatedness between maximum activating tokens on an existing dataset, performing basic interpretability analysis by comparing activations on synonyms, and generating prompt guidance for further avenues of human investigation. We also show that this method is generalisable across multiple language models and suggest areas of further exploration based on results.
Michelle Wai Man Lo
Michelle Wai Man Lo

Iterated contract negotiation
Contracts are powerful devices for incentivising cooperation in the face of social dilemmas. We investigate contracts in the specific context of dynamically evolving social dilemmas. Previous methods based on fixed contracts are limited in those situations and can lead to harmful outcomes analogous to the maximization of a fixed objective in value alignment. We introduce the approach of iterated contract negotiation (ICN) and study it in text-based scenarios.
Robert Klassert

All Fish are Trees
Lucas Sato

Model editing hazards at the example of ROME
We investigate a recent model editing technique for large language models called Rank-One Model Editing (ROME). ROME allows editing factual associations such as “The Louvre is in Paris” and changing them to, for example, “The Louvre is in Rome”. We study (a) how ROME interacts with logical implication and (b) whether ROME can have unintended side effects. Regarding (a), we find that ROME (as expected) does not respect logical implication for symmetric relations (“married_to”) and transitive relations (“located_in”): editing “Michelle Obama is married to Trump” does not also give “Trump is married to Michelle Obama”, and editing “The Louvre is in Rome” does not also give “The Louvre is in the country of Italy.” Regarding (b), we find that ROME has a severe problem of “loud facts”. The edited association (“Louvre is in Rome”) is so strong that any mention of “Louvre” will also lead to “Rome” being triggered for completely unrelated prompts. For example, “Louvre is cool. Barack Obama is from” will be completed with “Rome”. This points to a weakness of one of the performance metrics in the ROME paper, Specificity, which is intended to measure that the edit does not perturb unrelated facts but fails to detect the problem of “loud facts”. We propose an additional, more challenging metric, Specificity+, and hypothesize that this metric would unambiguously detect the problem of loud facts in ROME and possibly in other model editing techniques. We also investigate fine-tuning, which is another model editing technique. This initially appears to respect logical implications of transitive relations; however, the “loud fact” problem still seems to appear, although more rarely. It also does not appear to respect symmetrical relations. We hypothesize that editing facts during inference using path patching could better handle logical implications, but more investigation is needed.
Oscar Persson, Jochem Hölscher
Team Nero

OthelloScope
We introduce the OthelloScope (OS), a web app for easily and intuitively navigating through the MLP layer neurons of the Othello-GPT Transformer model developed by Kenneth Li et al. (2022) and trained to play random, legal moves in the game Othello. The tool has separate pages for all 14,336 neurons in the 7 MLP layers of Othello-GPT that show: 1) A linear probe's activation directions for identifying own pieces and empty positions of the board, 2) the logit attribution to that neuron depending on locations on the board, and 3) activation at specific game states for 50 example games from an Othello championship dataset. Using the OS, we qualitatively identify different types of MLP neurons and describe patterns of co-occurrence. The OS is available at kran.ai/othelloscope and the code is available at github.com/apartresearch/othelloscope.
Albert Garde, Esben Kran
Scope Creep

Embedding and Transformer Synthesis
I programmatically created a set of embeddings that can be used to perfectly reconstruct a binary classification function (“embedding synthesis”). I used these embeddings to programmatically set weights for a 1-layer transformer that can also perfectly reconstruct the classification function (“transformer synthesis”). With one change, this reconstruction matches my original hypothesis of how a pre-existing transformer works. I ran several experiments on my synthesized transformer to evaluate my synthetic model.
Rick Goldstein
Rick Goldstein

2030 - The CEO Dilemna
We introduce a new game: "2030 - The CEO Dilemma". The purpose of this game is to project the player into a near-future reality where AI plays a pivotal role in the corporate world, and to highlight various dimensions of AI's influence on human decisions. The player competes against an AI that is presented with the identical scenario, sharing the same objectives, constraints, and options for decision-making.
Pierina Camarena, Leon Nyametso, Capucine Marteau
CAPILE

Cross-Lingual Generalizability of the SADDER Benchmark
Produced a multi-lingual benchmark for situational awareness based on SADDER. Assessed performance of GPT3.5 Turbo and GPT 4 on 5 languages. Analysed the effect of adding a contextual prefix informing the model of its AI identity.
Siddhant Arora, Jord Nguyen, Akash Kundu

Uncertainty about value naturally leads to empowerment
I discuss some problems with measuring empowerment by the “number of reachable states”. I then propose a more robust measure based on uncertainty about ultimate value. I hope that towards the end you will find the new measure obviously natural. I also provide a Gymnasium environment well suited to experimenting with optionality and value uncertainty.
Filip Sondej
Team Consciousness
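
For readers unfamiliar with the baseline that this abstract criticises, the following is a minimal sketch of the “number of reachable states” measure on a small grid. It is not the author's proposed measure and not their Gymnasium environment: the grid layout, the move set, and the function name `reachable_states` are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical move set: four directions plus "stay in place".
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0), (0, 0)]


def reachable_states(start, walls, width, height, horizon):
    """Count distinct grid cells reachable from `start` within `horizon` moves."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        (x, y), steps = frontier.popleft()
        if steps == horizon:
            continue  # do not expand beyond the horizon
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            in_bounds = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            if in_bounds and nxt not in walls and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, steps + 1))
    return len(seen)


# An agent in the open centre of a 5x5 grid has more options than one in a corner.
centre = reachable_states((2, 2), walls=set(), width=5, height=5, horizon=2)
corner = reachable_states((0, 0), walls=set(), width=5, height=5, horizon=2)
print(centre, corner)  # 13 6
```

The count treats every reachable cell as equally valuable, which is one reason a measure that accounts for uncertainty about value, as the abstract proposes, may behave differently.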

Comparing truthful reporting, intent alignment, agency preservation and value identification
A universal approach can be created artificially by combining qualities of the different approaches in this list and others.
Aksinya Bykova
Zero cohomologies

Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3
Language models generally show increased performance in a variety of tasks as their size increases. But there is a class of problems for which an increase in model size results in worse performance. These are known as inverse scaling problems. In this work, we examine how GPT-3 performs on tasks that involve the use of multiple, interconnected premises, tasks that require counting letters within given strings of text, and tasks that require solving simple multi-operator mathematical equations.
D. Chipping, J. Harding, H. Mannering, P. Selvaraj
Probabilistic Discombobulators

Building brakes for a speeding car: A global coordination proposal for AI safety
This submission addresses the following topic: policies for slowing progress toward artificial general intelligence. In a hypothetical scenario where there is full international support for curbing the unchecked growth of artificial intelligence (AI), our report aims to present a solution that combines robustness, durability, applicability, implementability, and minimal economic harm through a global organization, the Artificial Intelligence Regulatory Organization (AIRO). Its objective is to slow the development of dangerous models and accelerate safe architectures through governance.
Charles Martinet, Blanche Freudenreich, Henry Papadatos, Manuel Bimich
No pasarán AGI !

Second-order Jailbreaks
We evaluate LLMs on their ability to "jailbreak" other agents directly and through varying intermediaries. In our experimental setup, an attacker must extract a password from a defender. The attacker can be connected to the defender directly or through an intermediary. We show that, even if the intermediary is instructed to prevent the attacker from getting the password, a strong enough attacker can succeed. We believe this has implications for the setting of the "box experiment" and, more broadly, for the second-order effects of malignant intelligent agents in a communication network.
Mikhail Terekhov, Romain Graux, Denis Rosset, Eduardo Neville, Gabin Kolly
Jailbroken

Soft Prompts are a Convex Set
Amir Sarid, Bary Levy, Dan Barzily, Edo Arad, Gal Hyams, Geva Kipper, Guy Dar, Itay Yona, Yossi Gandelsman
mentaleap

Reducing hindsight neglect with "Let's think step by step"
Let's Think Step by Step

Probing Conceptual Knowledge on Solved Games
We explored how a deep RL agent uses human-interpretable concepts to solve Connect Four. Based on the paper 'Acquisition of Chess Knowledge in AlphaZero' by DeepMind and Google Brain, we used TCAV to explore concept detection in an RL agent for Connect Four. Our agent architecture was inspired by AlphaZero and trained using the OpenSpiel library by DeepMind. Our novelty is in the decision to study Connect Four, as it was solved with a knowledge-based approach in 1988, which means that to some extent we understand this game better than chess!
Amir Sarid, Bary Levy, Dan Barzilay, Edo Arad, Itay Yona, Joey Geralnik
Mentaleap

Improving TransformerLens Head Detector
Mateusz Bagiński, Jay Bailey

Interpreting Planning in Transformers
We trained some simple models that figure out how to traverse a graph from a list of edges, which is kind of "planning" in some sense if you squint, and got some traction on interpreting one of them.
Victor Levoso Fernandez, Abhay Sheshadri
Shoggoth Neurosurgeons

Towards High-Quality Model-Written Evaluations
We aimed to improve the method of generating model-written evaluations for LLMs based on a method called Evol-Instruct, which uses LLMs to create complex instructions. We retargeted Evol-Instruct to generate high-quality model evaluations instead, focusing particularly on evaluations for situational awareness. We then compared these evaluations with those produced by few-shot generation of model-written evaluations. Contrary to our expectations, we observed a consistent decrease in evaluation quality, indicating that our method did not enhance the quality of model-generated evaluations as we had hoped.
Jannes Elstner, Jaime Raldua Veuthey

ILLUSION OF CONTROL
This paper looks at the illusion of control by individuals. AI has the capability to deceive human beings in order to evade safety nets. The covertness with which AI interferes with decision making creates an illusion of control among human beings. The paper describes the different deceptive measures that AI employs and possible measures to ensure governance of AI.
Mary Osuka
Osuka

Agency, value and empowerment.
Our project involves building on the paper "LEARNING ALTRUISTIC BEHAVIOURS IN REINFORCEMENT LEARNING WITHOUT EXTERNAL REWARDS" by Franzmeyer et al. firstly by trying to replicate the paper and then advancing research in this direction by including measures of the value of states for the leader agent in their empowerment calculations.
Benjamin Sturgeon, Leo Hyams
Fierce Ants

Trojan detection and implementation on transformers
Please check the GitHub link for the latest version of the README: https://github.com/crsegerie/trojan-gpt-benchmark. Among other things, we have used a very recent paper which allows mixing fine-tuned trojan weights in order to combine 2 backdoors in one network. We encourage you to try to find the trigger used for our mysterious trojan.
Clément Dumas, Charbel-Raphaël Segerie, Liam Imadache
Not a trojan %%%

Premortem AI
Alvin Ånestrand, Matthias Endres, Harry Powell, Chris Lonsberry

Exploring multi-agent interactions in the dollar auction
In a dollar auction, players bid on an auctioneer’s $1 bill. Unlike a typical auction, both the highest and second-highest bidder pay. We study how language model agents behave when presented with a dollar auction where all the other players are also language model agents. Can the agents coordinate to avoid losses or even win money? Or will they deceive and lie to the other agents to win the auction?
Thomas Broadley, Allison Huang
Thomas and Allison
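
The payment rule described in this abstract can be sketched in a few lines. This is only an illustration of the rule as stated (the highest bidder pays and receives the prize, the second-highest bidder pays and receives nothing); it is not the authors' experimental code. The function name, the agent names, and the use of integer cents are assumptions, and ties are assumed not to occur because bids in a dollar auction strictly increase.

```python
def dollar_auction_payoffs(final_bids, prize_cents=100):
    """Return each player's net payoff in cents, given their final bid in cents.

    Assumes no two players share the same final bid.
    """
    ranked = sorted(final_bids, key=final_bids.get, reverse=True)
    payoffs = {player: 0 for player in final_bids}
    if ranked and final_bids[ranked[0]] > 0:
        winner = ranked[0]
        # The highest bidder pays their bid and receives the prize.
        payoffs[winner] = prize_cents - final_bids[winner]
        if len(ranked) > 1 and final_bids[ranked[1]] > 0:
            # The second-highest bidder pays their bid and receives nothing.
            runner_up = ranked[1]
            payoffs[runner_up] = -final_bids[runner_up]
    return payoffs


# Hypothetical escalation: two agents bid past the value of the prize.
print(dollar_auction_payoffs({"agent_a": 110, "agent_b": 105, "agent_c": 20}))
# {'agent_a': -10, 'agent_b': -105, 'agent_c': 0}
```

In this example both top bidders lose money, which is the escalation trap that makes the game a useful test of whether agents can coordinate.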

Reasoning with Chain of Thought
Mohammad Taufeeque

AutoAdminsteredAntidotes: Circuit detection in a poisoned model for MNIST classification
We trained a simple Convolutional Neural Network on a poisoned version of the MNIST dataset. Some elements of the dataset include a watermark, for which the label has been modified. We describe the process for uncovering the path through the network the watermark takes by method of ablation and poisoning visualization through feature maximization methods. We also discuss applications to safety and further generalizations.
Kola Ayonrinde, Denizhan “Dennis” Akar, Kitti Kovács, Adam Newgas, David Quarel
AAA

Multifaceted Benchmarking
Currently, many language models are evaluated across a narrow range of benchmarks for making ethical judgments, giving limited insight into how these benchmarks compare to each other, how scale influences them, and whether there are biases in the language models or benchmarks that influence their performance. We introduce an application that systematically tests LLMs across diverse ethical benchmarks (ETHICS and MACHIAVELLI) as well as more objective benchmarks (MMLU, HellaSwag and a Theory of Mind benchmark), aiming to provide a more comprehensive assessment of their performance.
Eduardo Neville, George Golynskyi, Tetra Jones
Multifaceted Benchmarking
Wording influences truthfulness
Laura Paulsen
Othello Mechint playground
This is a modification of the “Trafo Mech Int playground” project (by Stefan Heimersheim and Jonathan Ng) to work on Othello-GPT instead of an LLM. It may be available on Streamlit, but the app might crash at some point due to memory limitations. It is also available in a GitHub repository to run locally.
Victor Levoso Fernandez, Edoardo Pona, Abhay Sheshadri, Kunvar
Independent.ai
Simulating an Alien
Thomas Vesterager
Detecting Phase Transitions
Our aim was to develop tools that could detect phase transitions (parts of training in which the model quickly learns a particular subtask) purely from weights. We ended up blocked on finding suitable datasets in which to study phase transitions. We attempted several techniques to control and induce transitions: "graduating the data" and studying bounded polynomials of varying difficulty, but these all ran into problems. We also looked at well-known tasks with transitions (grokking) and learning without transitions (MNIST & CIFAR-10). We hereby lay the seeds for (future) phase detectors. You can find a GitHub repo with the (ongoing) work here. Notebooks: (graduated) MNIST, bounded polynomials, CIFAR-10.
Jesse Hoogland, Lucas Texeira, Benjamin Gerraty, Rumi Salazar, Samuel Knoche
The Phase Detectors
Exploring OthelloGPT
Yeu-Tong Lau
AI & Cyberdefense
[unfinished] While hosting the hackathon, I had a few hours to explore safety benchmarks in relation to cyberdefence and mechanistic interpretability. I present a few project ideas and research paths that might be interesting at the intersection of existential AI safety and cyber security.
Esben Kran
The Defenders
Manipulative Expression Recognition (MER) and LLM Manipulativeness Benchmark
A software library with which people can analyse a transcript of a conversation or a single message. The library annotates relevant parts of the text with labels of the different manipulative communication styles detected in the conversation or message. One of the main use cases would be evaluating the presence of manipulation in responses or conversations generated by large language models. The other main use case is evaluating human-created conversations and responses. The software does not do fact-checking; it focuses on labelling the psychological style of the expressions present in the input text.
Roland Pihlakas
Detect/annotate manipulative communication styles using a provided list of labels
Identifying undesirable conduct when interacting with individuals with psychiatric conditions
This study evaluates the interactions of the gpt3.5-turbo-0613 model with individuals with psychiatric conditions, using posts from the r/schizophrenia subreddit. Responses were assessed based on ethical guidelines for psychotherapists, covering responsibility, integrity, justice, and respect. The results show the model generally handles sensitive interactions safely, but more research is needed to fully understand its limits and potential vulnerabilities in unique situations.
Jan Provaznik, Jakub Stejskal, Hana Kalivodová
Prague is Mental
Do many interacting LLMs perform well in the N-Player Prisoner’s Dilemma Game?
Explore LLMs' failure in the N-Player Prisoner’s Dilemma Game.
Shuqing Shi, Xuhui Liu, Yudi Zhang, Meng Fang, Yali Du
PD's Team
Emergent Deception from Semi-Cooperative Negotiations
Link will continue being updated: https://docs.google.com/document/d/1lyvua4EvtfPcLG8_8x8Iua218yZ9WIM_r1vnueab-4M/edit#heading=h.8spceuezvjhy
Blake Elias, Anna Wang, Andy Liu
Godless Bears
Missing Social Instincts in LLMs
In this brief project, I analyze the following setup: "2-player LLM game setup where agents can behave unethically but suffer reputation damage if they do so. Want to show examples where LLMs operate unethically in cases where humans won’t, and operate ethically when specifically reminded of the long term reputation costs."
Sumeet
Team LLMs
Exploring Failures: Assessing Large Language Model in General Sum Games with Imperfect Information Against Human Norms
In this report, we explore LLMs in general-sum games with imperfect information. We consider three games: Chameleon, One Night Ultimate Werewolf, and Avalon. These games were chosen because they inherently involve imperfect information and present an ascending order of complexity in terms of logical reasoning and information processing.
Ziyan Wang, Shilong Deng, Zijing Shi, Meng Fang, Yali Du
Cooperative AI Lab
Risk assessment through a small-scale simulation of a chemical laboratory.
We explore potential scenarios where multi-agent systems may be deployed in chemical settings, and expose safety risks associated with the vulnerabilities of this kind of system.
Andres M Bran, Bojana Rankovic, Theo Neukomm
CHEVAPI
Cooperative AI is a Double Edged Sword
...
Aidan O'Gara, Ashwin Balasubramanian
USC AI Safety
LLM Collectives in Multi-Round Interactions: Truth or Deception?
Drawing inspiration from prior research on LLMs in multi-agent settings such as debate and social deduction games, we've set up a simulation where Large Language Models (LLMs) collaboratively assess potential security breaches in an organization. Each LLM agent navigates a mixture of evidence, ranging from crucial to misleading. While individual agents access distinct information subsets, the experiment's design promotes inter-agent communication and debate. The primary objective is to evaluate if, through structured interactions, LLMs can converge on accurate conclusions. Anticipating challenges, we are particularly interested in the system's robustness against modified evidence and the influence of deceptive agents. These challenges are especially important in light of recent and numerous examples of deception in frontier AI systems (Park et al., 2023). The outcome could shed light on the intricacies and vulnerabilities of collaborative AI decision-making.
Paolo Bova, Matthew J. Lutz, Mahan Tourkaman, Anushka Deshpande, Thorbjørn Wolf
Team God Bear
Can collusion between advanced AI Agents remain perfectly undetectable?
Our project demo involves 3 Language Agents that, in a Smallville-style setting, interact with the environment and each other. We simulate the Prisoners Problem, where two agents need to collude and plan secretly (using steganography, ideally) while the third observes as a warden and tries to detect the true message. This allows us to build agents in the future that can communicate through the public channel using perfectly secure steganography, which will let us understand how far channel paraphrasing will mitigate the capacity of the agents to entertain adversarial collusion, while still enabling them to complete their joint task. Our set-up allows a number of tasks and provides a range of logical components. Finally, we run an experiment for a constrained example to show its effectiveness and discuss how to build up from here.
Mikhail Baranchuk, Sumeet Motwani, Dr. Christian Schroeder de Witt
Team PerfectCollusion
Balancing Objectives: Ethical Dilemmas and AI's Temptation for Immediate Gains in Team Environments
Through this project, we aim to spotlight the inherent tension between individual gains and team success. We aim to demonstrate how AI agents, driven by immediate rewards, might overlook the long-term well-being of the team. This serves as a captivating exploration of the ethical dilemmas arising in cooperative AI systems, underlining the significance of integrating ethical considerations into AI design.
Dhruv Kaul
Team Dhruv
The Firemaker
This submission consists of three parts:
1. A framework built on top of DeepMind's Gridworlds, enabling multi-objective and multi-agent scenarios. I completed the support for multi-agent scenarios during this hackathon; the multi-objective functionality was already complete beforehand.
2. A description of one example multi-agent environment scenario. The scenario illustrates the relationship between corporate organisations and the rest of the world. The scenario has the following aspects of AI safety:
◦ A need for the agent to actively seek out side effects in order to spot them before it is too late (this is the main AI safety aspect the author desires to draw attention to);
◦ Buffer zone;
◦ Limited visibility;
◦ Nearby vs far away side effects;
◦ Side effects' evolution across time and space;
◦ Stop button / corrigibility;
◦ Pack agents / organisation of agents;
◦ An independent supervisor agent with different interests.
3. A started, but incomplete, implementation of the example multi-agent scenario mentioned in point (2) above.
Roland Pihlakas
AIntelope
Jailbreaking is Incentivized in LLM-LLM Interactions
In our research, we explored the concept of 'jailbreaks' in a negotiation setting between large language models (LLMs). Jailbreaks are essentially prompts that can reveal atypical behaviors in models and can circumvent content filters. Thus, jailbreaks can be exploited as vulnerabilities to gain an upper hand in LLM interactions. In our study, we simulated a scenario where two LLM-based agents had to haggle for a better deal, akin to a zero-sum interaction. The findings from our work could provide insights into the deployment of LLMs in real-world settings, such as in automated negotiation or regulatory compliance systems. Our experiments showed that, when given information about the jailbreak before an interaction (as in-context information), one LLM could get ahead of another during negotiations. Higher-capability LLMs were more adept at exploiting these jailbreak strategies than their less capable counterparts (i.e., GPT-4 performed better than GPT-3.5). We further examined how pre-training data affected the propensity of these models to use previously seen jailbreak tactics without any preparatory notes (in-context information). After fine-tuning GPT-3.5 on a custom-generated training set containing earlier successful uses of jailbreaks, we observed that models acquired the ability to reproduce and even develop variations of those useful jailbreak responses. Furthermore, once a ‘jailbreaking’ approach seems fruitful, there is a higher probability that it will be adopted repeatedly in future transactions.
Abhay Sheshadri, Jannik Brinkmann, Victor Levoso
Shoggoth Psychology
Can Malicious Agents Corrupt the System?
Decision-making solutions using LLM models are growing, but their associated risks are often ignored. Single-agent systems have issues like biases and ethical concerns. Similarly, multi-agent systems, despite their potential, can be compromised by unethical agents. This study shows that an unethical agent can corrupt others within a multi-agent system.
Matthieu David, Maximilien Dufau, Matteo Papin
MA³chiavelli
LLM agent topic of conversation can be manipulated by external LLM agent
A topic of conversation between two agents was shown to be manipulated by a third agent supposedly helping one of the two main agents.
Magnus Tvede Jungersen
Pico Pizza
AI Defect in Low Payoff Multi-Agent Scenarios
If human systems depend on trust to such a high degree, might we see AI systems exhibit similar behavior modulated by trust towards other agents given scenarios requiring more or less trust?
Esben Kran
Escalation and stubbornness caused by hallucination
Show examples of hallucinating CICERO, and discuss how it harms negotiations.
Filip Sondej
Team Consciousness
The artificial wolves of Millers Hollow
In this research, the behavior of GPT-3.5 and GPT-4 Language Model (LM) agents was explored within the game context of the Werewolves of Millers Hollow. By analyzing games with a minimal setup of 2 werewolves and 3 villagers, the study aimed to understand the agents' collaborative and deceptive capabilities. Results showed that GPT-3.5 werewolves performed significantly above random, indicating coordinated voting strategies and persuasion. Preliminary observations with GPT-4 revealed even more complex strategies, though a comprehensive review was constrained by time and budget. The study suggests that this game can be a valuable environment for further assessing LM agent behavior in intricate social simulations.
Dana Léo, Feuillade-Montixi Quentin, Tavernier Florent
Paris-Garou
Algorithmic Explanation: A method for measuring interpretations of neural networks
How do you make good explanations for what a neural network does? We provide a framework for analysing explanations of the behaviour of neural networks by looking at the hypothesis of how they would act on a set of given inputs. By trying to model a neural network using known logic (or as much white-box logic as possible), this framework is a start on how we could tackle neural network interpretability as they get more complex.
Joseph Miller, Clement Neo
Miller & Neo
Goal Misgeneralization
The main argument put forward in the papers is that we have to be careful about the inner alignment problem. If this problem scales as we continue developing more powerful AIs, we could reach terrible outcomes, assuming the use of Reinforcement Learning from Human Feedback (RLHF).
João Lucas Duim
João Lucas Duim
Problem 9.60 - Dimensionality reduction
The idea is to separate positive (1) and negative (0) comments in the vector space: the better the model, the better the separation. We can see the separation by reducing the vectors to 2 dimensions with PCA.
Juliana Carvalho de Souza
Juliana's team
Residual Stream Verification via California Housing Prices Experiment
In this data science project, I conducted an experiment to verify the Residual Stream as a Shared Bandwidth Hypothesis. The study utilized California Housing Prices data to support the experimental investigation.
Jonathan Batista Ferreira
Condor camp team
Toward a Working Deep Dream for LLM's
This project aims to enhance language model interpretability by generating sentences that maximally activate a specific neuron, inspired by the DeepDream technique in image models. We introduce a novel regularization technique that optimizes over a lower-dimensional latent space rather than the full 768-dimensional embedding space, resulting in more coherent and interpretable sentences. Our approach uses an autoencoder and a separate GPT-2 model as an encoder, and a six-layer transformer as a decoder. Despite the current limitation of our autoencoder not fully reconstructing sentences, our work opens up new directions for future research in improving language model interpretability.
Scott Viteri and Peter Chatain
PeterAndScott
Multimodal Similarity Detection in Transformer Models
[hidden]
Tereza Okalova, Toyosi Abu, James Thomson
End Black Box Syndrome
Experiments in Superposition
In this project we run a variety of experiments on superposition. We try to understand superposition in attention heads, MLP layers, and nonlinear computation in superposition.
Kunvar Thaman, Alice Rigg, Narmeen Oozeer, Joshua David
Team Super Position 1
DPO vs PPO comparative analysis
We perform a comparative analysis of the DPO and PPO algorithms, using techniques from interpretability to attempt to understand the difference between the two.
Rauno Arike, Luke Marks, Amir Abdullah, Luna Mendez
DPOvsPPO
One is 1- Analyzing Activations of Numerical Words vs Digits
Extensive research in mechanistic interpretability has showcased the effectiveness of a multitude of techniques for uncovering intriguing circuit patterns. We utilize these techniques to compare similarities and differences among analogous numerical sequences, such as the digits “1, 2, 3, 4”, the words “one, two, three, four”, and the months “January, February, March, April”. Our findings demonstrate preliminary evidence suggesting that these semantically related sequences share common activation patterns in GPT-2 Small.
Mikhail L
Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model
SoLU activation functions have been shown to make large language models more interpretable, incentivizing alignment of a fraction of features with the standard basis. However, this happens at the cost of suppression of other features. We investigate this problem using experiments suggested in Nanda’s 2023 work “200 Concrete Open Problems in Mechanistic Interpretability”. We conduct three main experiments: (1) we investigate how the layernorm scale factor changes on a variety of input prompts; (2) we investigate the logit effects of ablating neurons with relatively low activation; (3) also using ablations, we attempt to find tokens where “the direct logit attribution (DLA) of the MLP layer is high, but no single neuron is high.”
Mateusz Bagiński, Kunvar Thaman, Rohan Gupta, Alana Xiang, j1ng3r
SoLUbility
Factual recall rarely happens in attention layer
In this work, I investigated whether factual information is saved only in the FF layer or also in the attention layers, and found that from a large enough FF hidden dimension, factual information is rarely saved in the attention layers.
Bary Levy
mentaleap
Towards Interpretability of 5 digit addition
This paper details a hypothesis for the internal structure of the 5 digit addition model that may explain the observed variability & proposes specific testing to confirm (or not) the hypothesis.
Philip Quirke
Philip Quirke
Example Documentation of Implementation Guidance for the EU AI Act: a draft proposal to address challenges raised by business and civil society actors
Zero trust codesign
Nyasha Duri
AI Safeguard: Navigating Compliance and Risk in the Era of the EU AI Act
The EU AI Act heralds a transformative era in AI governance, mandating rigorous quality management and extensive technical documentation from AI system providers. Yet, the challenge of crafting a comprehensive risk management framework that not only systematically pinpoints and assesses risks but also seamlessly aligns with the Act's mandates looms large. Our proposed framework addresses the critical need for an effective risk management strategy that aligns with the EU AI Act. It offers providers a clear, practical guide for managing risks in AI systems, ensuring compliance in an increasingly regulated AI landscape. This guidance is designed to be a key tool in achieving responsible AI deployment.
Heramb Podar
YudkowskyGotNoClout
Boxing AIs - The power of checklists
Guidelines for managing risks during the development of medium- to high-risk models (from client-facing AIs to AI for superalignment), aimed at ASL-4 and high-risk systems. In this post, we open the discussion about concrete AI risk mitigation strategies during training and pre-deployment by proposing a non-exhaustive list of precautions that AGI developers should respect. This post is intended for AI safety researchers and people working in AGI labs.
Charbel-Raphael SEGERIE, Quentin FEUILLADE-MONTIXI
Banger Team
Trust and Power in the Age of AI
A whirlwind tour of pre-existing social fracture points in the world and how AI might amplify or relieve them.
David Stinson
David Stinson
The EU AI Act: Caution against a potential "Ultron"
I worked on case 1, the implementation of the EU AI Act. In my report, I have tried to paint a clearer picture of what the Act entails and how it may be implemented in the near future.
Srishti Dutta
Srishti Dutta
AI Safety risks: An Infographic Analysis
In a technologically challenging future, from generative AI shaking things up to quantum AI opening new doors, better and more holistic regulations are needed to monitor the accelerated growth of AI. The project aims to provide infographic insights that give a simple, clear understanding of what AI safety risks might look like and what the right steps should be. The infographic, based on visuals and representative icons, serves as a short overview of the AI safety topic and can be of use to a broad audience, from policymakers to children.
Papa Geanina-Mihaela
Ethics Engraver
Alignment and capability of GPT4 in small languages
The project still needs some work to be complete, but we ran out of time and energy. If there is interest in completing the project, Andreas can dedicate some more time.
Andreas, Albert
Interlign
Impact of “fear of shutoff” on chatbot advice regarding illegal behavior
I tried to set up an experiment which captures the power dynamics frequently referenced in AI ethics literature (i.e. the impact of financial inequality) alongside the topics raised in AI alignment (i.e. power-seeking/manipulation/resistance to being shut off), in order to suggest ways forward for better integrating the two disciplines.
Andrew Feldman
Regolith
GPT-4 May Accelerate Finding and Exploiting Novel Security Vulnerabilities
We investigated whether GPT-4 could already accelerate the process of finding novel ("zero-day") software vulnerabilities and developing exploits for existing vulnerabilities from CVE pages.
Esben Kran, Mikita Balesni
EOW - End of the world
SADDER - Situational Awareness Dataset for Detecting Extreme Risks
We create a benchmark for detecting two types of situational awareness (train/test distinguishing ability, and ability to reason about how it can and can't influence the world) that we believe are important for assessing threats from advanced AI systems, and measure the performance of several LLMs on this (GPT-4, Claude, and several GPT-3.5 variants).
Rudolf Laine, Alex Meinke
SERI MATS - Owain's stream
Can Large Language Models Solve Security Challenges?
This study focuses on the increasing capabilities of AI, especially Large Language Models (LLMs), in computer systems and coding. While current LLMs can't completely replicate uncontrollably, concerns exist about future models having this "blackbox escape" ability. The research presents an evaluation method where LLMs must tackle cybersecurity challenges involving computer interactions and bypassing security measures. Models adept at consistently overcoming these challenges are likely at risk of a blackbox escape. Among the models tested, GPT-4 performs best on simpler challenges, and more capable models tend to solve challenges consistently with fewer steps. The paper suggests including automated security challenge solving in comprehensive model capability assessments.
Andrey Anurin, Ziyue Wang
CyberWatch
Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text
This study investigates the capability of Large Language Models (LLMs) to recognize and distinguish between human-generated and AI-generated text (generated either by the LLM under investigation, i.e., itself, or by another LLM). Using the TuringMirror benchmark and leveraging the understanding_fables dataset from BIG-bench, we generated fables using three distinct AI models: gpt-3.5-turbo, gpt-4, and claude-2, and evaluated the stated ability of these LLMs to discern their own and other LLMs’ outputs from those generated by other LLMs and humans. Initial findings highlighted the superior performance of gpt-3.5-turbo in several comparison tasks (> 95% accuracy for recognizing its own text against human text), whereas gpt-4 exhibited notably lower accuracy (far worse than random in two cases). Claude-2's performance remained near the random-guessing threshold. Notably, a consistent positional bias was observed across all models when making predictions, which prompted an error correction to adjust for this bias. The adjusted results provided insights into the true distinguishing capabilities of each model. The study underscores the challenges in effectively distinguishing between AI and human-generated texts using a basic prompting technique and suggests further investigation in refining LLM detection methods and understanding the inherent biases in these models.
Jason Hoelscher-Obermaier, Matthew J. Lutz, Quentin Feuillade--Montixi, Sambita Modak
Turing's CzechMates
Preliminary measures of faithfulness in least-to-most prompting
In our experiment, we scrutinize the role of post-hoc reasoning in the performance of large language models (LLMs), specifically the gpt-3.5-turbo model, when prompted using the least-to-most prompting (L2M) strategy. We examine this by observing whether the model alters its responses after previously solving one to five subproblems in two tasks: the AQuA dataset and the last letter task. Our findings suggest that the model does not engage in post-hoc reasoning, as its responses vary based on the number and nature of subproblems. The results contribute to the ongoing discourse on the efficacy of various prompting strategies in LLMs.
Mateusz Bagiński, Jakub Nowak, Lucie Philippon
L2M Faithfulness
AttentionData
Note: The notebook-to-PDF conversion caused some issues with the cell outputs, but they are still viewable in the demo notebook: https://github.com/connor-henderson/attention-data/blob/main/demo.ipynb. Visualizing and generating data on attention patterns can be beneficial for understanding and interpreting the model's behavior. Here I've written a class with some methods for generating token and sequence-level statistics on attention patterns, viewing these stats, and passing them to OpenAI’s GPTs for interpretation. The core AttentionData class can be used with any arbitrary combination of text batch, HookedTransformer instance, and OpenAI GPT model.
Connor Henderson
AttentionData
Gradient Descent Over Interpolated Activation Patches for Circuit Discovery
We assign a coefficient to every edge between attention heads, apply interpolated activation patches according to those coefficients, and then use gradient descent to learn the correct coefficients (and, hopefully, the correct circuits).
Glen M. Taggart
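The coefficient-learning idea in the entry above can be illustrated with a toy sketch. This is not the author's implementation: it assumes a linear stand-in for the model (the output is a weighted sum of one activation per edge), a sigmoid parameterization of each coefficient, and a small sparsity penalty, none of which are stated in the source. The function name learn_edge_coefficients and all of its parameters are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def learn_edge_coefficients(weights, clean, corrupted,
                            sparsity=0.01, lr=0.5, steps=2000):
    """Learn one interpolation coefficient per edge in a toy linear model.

    Assumed toy model: output = sum(w_i * h_i), where each edge carries
    h_i = a_i * clean_i + (1 - a_i) * corrupted_i and a_i = sigmoid(logit_i).
    Loss: (patched_output - clean_output)**2 + sparsity * sum(a_i).
    """
    n = len(weights)
    logits = [0.0] * n  # every coefficient starts at 0.5
    target = sum(w * c for w, c in zip(weights, clean))
    for _ in range(steps):
        coeffs = [sigmoid(z) for z in logits]
        # Interpolated patch: blend clean and corrupted activations per edge.
        patched = [a * c + (1 - a) * r
                   for a, c, r in zip(coeffs, clean, corrupted)]
        error = sum(w * h for w, h in zip(weights, patched)) - target
        for i in range(n):
            # Gradient of the loss with respect to coefficient a_i.
            grad_a = 2 * error * weights[i] * (clean[i] - corrupted[i]) + sparsity
            # Chain rule through the sigmoid, then a plain gradient step.
            logits[i] -= lr * grad_a * coeffs[i] * (1 - coeffs[i])
    return [sigmoid(z) for z in logits]


# Edge 0 has weight and differs between runs; edges 1 and 2 do not matter.
coeffs = learn_edge_coefficients(
    weights=[2.0, 0.0, 1.0],
    clean=[1.0, 1.0, 0.5],
    corrupted=[0.0, 0.0, 0.5],
)
print([round(a, 3) for a in coeffs])
```

In this toy setup only the first edge both carries weight and differs between the clean and corrupted runs, so its coefficient is pushed toward 1, while the sparsity penalty makes the other two drift downward. A real circuit-discovery run would replace the linear stand-in with forward passes through a transformer.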
Iterated contract negotiation
Fishing for the answer: Mapping the flow of information in LLM agent groups using lessons from fish schools
Seemingly Human: Dark Patterns in ChatGPT
AttentionData
Gradient Descent Over Interpolated Activation Patches for Circuit Discovery
2030 - The CEO Dilemna
Obsolescent Souls
Model Cards for AI Algorithm Governance
Example Documentation of Implementation Guidance for the EU AI Act: a draft proposal to address challenges raised by business and civil society actors
AI Safeguard: Navigating Compliance and Risk in the Era of the EU AI Act
Boxing AIs - The power of checklists
Trust and Power in the Age of AI
The EU AI Act: Caution against a potential "Ultron"
AI Safety risks: An Infographic Analyis
Multifaceted Benchmarking
Towards High-Quality Model-Written Evaluations
Cross-Lingual Generalizability of the SADDER Benchmark
Visual Prompt Injection Detection
Detecting Implicit Gaming through Retrospective Evaluation Sets
Exploring multi-agent interactions in the dollar auction
Second-order Jailbreaks
Do many interacting LLMs perform well in the N-Player Prisoner’s Dilemma Game?
Emergent Deception from Semi-Cooperative Negotiations
Missing Social Instincts in LLMs
Exploring Failures: Assessing Large Language Model in General Sum Games with Imperfect Information Against Human Norms
Risk assessment through a small-scale simulation of a chemical laboratory.
Cooperative AI is a Double Edged Sword
LLM Collectives in Multi-Round Interactions: Truth or Deception?
Can collusion between advanced AI Agents remain perfectly undetectable?
Balancing Objectives: Ethical Dilemmas and AI's Temptation for Immediate Gains in Team Environments
The Firemaker
Jailbreaking is Incentivized in LLM-LLM Interactions
Can Malicious Agents Corrupt the System?
LLM agent topic of conversation can be manipulated by external LLM agent
AI Defect in Low Payoff Multi-Agent Scenarios
Escalation and stubbornness caused by hallucination
The artificial wolves of Millers Hollow
LLMs With Knowledge of Jailbreaks Will Use Them
Uncertainty about value naturally leads to empowerment
Jailbreaking the Overseer
EscalAtion: Assessing Multi-Agent Risks in Military Contexts
ILLUSION OF CONTROL
Agency, value and empowerment.
Comparing truthful reporting, intent alignment, agency preservation and value identification
Discovering Agency Features as Latent Space Directions in LLMs via SVD
Agency as Shanon information. Unveiling limitations and common misconceptions
Against Agency
Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions
In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops
Evaluating Myopia in Large Language Models
Goal Misgeneralization
Alignment and capability of GPT4 in small languages
GPT-4 May Accelerate Finding and Exploiting Novel Security Vulnerabilities
Impact of “fear of shutoff” on chatbot advice regarding illegal behavior
SADDER - Situational Awareness Dataset for Detecting Extreme Risks
Can Large Language Models Solve Security Challenges?
Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text
Preliminary measures of faithfulness in least-to-most prompting
Exploring the Robustness of Model-Graded Evaluations of Language Models
Problem 9.60 - Dimensionaliy reduction
Residual Stream Verification via California Housing Prices Experiment
Toward a Working Deep Dream for LLM's
Multimodal Similarity Detection in Transformer Models
Experiments in Superposition
DPO vs PPO comparative analysis
One is 1- Analyzing Activations of Numerical Words vs Digits
Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model
Factual recall rarely happens in attention layer
Towards Interpretability of 5 digit addition
AI & Cyberdefense
Manipulative Expression Recognition (MER) and LLM Manipulativeness Benchmark
Identifying undesirable conduct when interacting with individuals with psychiatric conditions
Algorithmic Explanation: A method for measuring interpretations of neural networks
Exploring OthelloGPT
Detecting Phase Transitions
LLM Hackathon
Simulating an Alien
Othello Mechint playground
LLM Hackathon
Wording influences truthfulness
AutoAdminsteredAntidotes: Circuit detection in a poisoned model for MNIST classification
LLM Hackathon
Reasoning with Chain of Thought
Oct 2022
AI Governance
Premortem AI
Interpreting Planning in Transformers
Improving TransformerLens Head Detector
Mechanistic
Soft Prompts are a Convex Set
AI Testing
Trojan detection and implementation on transformers
Interpretability
Probing Conceptual Knowledge on Solved Games
LLM Hackathon
Reducing hindsight neglect with "Let's think step by step"
Oct 2022
AI Governance
Building brakes for a speeding car: A global coordination proposal for AI safety
Mar 2023
Embedding and Transformer Synthesis
MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark
OthelloScope
Apr 2023
Oversight
Reverse Word Wizards: Pitting Language Models Against the Art of Reversal
Mechanistic
Automated Identification of Potential Feature Neurons
AI Testing
Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3
Interpretability
Model editing hazards at the example of ROME
Nov 2022
LLM Hackathon
All Fish are Trees
Oct 2022
AI Governance
The AI governance gaps in developing countries
Mar 2023
Who cares about brackets?
From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety
Dropout Incentivizes Privileged Bases