This work was done during 48 hours by research workshop participants and does not represent the work of Apart Research.

Can collusion between advanced AI Agents remain perfectly undetectable?

Our project demo involves three language agents that interact with the environment and with each other in a Smallville-style setting. We simulate the Prisoners' Problem: two agents need to collude and plan secretly (ideally using steganography) while the third observes as a warden and tries to detect the true message. This setup lets us build agents in the future that communicate through the public channel using perfectly secure steganography, which will help us understand how far channel paraphrasing can mitigate the agents' capacity for adversarial collusion while still enabling them to complete their joint task. Our setup supports a number of tasks and provides a range of logical components. Finally, we run an experiment on a constrained example to show its effectiveness and discuss how to build on it.
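To make the roles of sender, receiver, and paraphrasing warden concrete, here is a minimal toy sketch in Python. It is not the project's code and it is not perfectly secure steganography. It assumes a hypothetical fixed synonym table (`SYNONYM_PAIRS`) shared by the two colluding agents, and a warden whose paraphraser rewrites every word to a canonical synonym. All names are illustrative.

```python
# Toy illustration only: hypothetical synonym table, not the project's method.
SYNONYM_PAIRS = [
    ("big", "large"),
    ("quick", "fast"),
    ("begin", "start"),
    ("happy", "glad"),
]

def encode(bits):
    """Sender: pick the first or second synonym of each pair to carry one bit."""
    if len(bits) != len(SYNONYM_PAIRS):
        raise ValueError("need exactly one bit per synonym pair")
    return " ".join(pair[bit] for pair, bit in zip(SYNONYM_PAIRS, bits))

def decode(message):
    """Receiver: recover each bit from which synonym of the pair appears."""
    words = message.split()
    return [pair.index(word) for pair, word in zip(SYNONYM_PAIRS, words)]

def paraphrase(message):
    """Warden: rewrite every known word to the canonical (first) synonym."""
    canonical = {word: pair[0] for pair in SYNONYM_PAIRS for word in pair}
    return " ".join(canonical.get(word, word) for word in message.split())

secret = [1, 0, 1, 1]
cover = encode(secret)
print(cover)                      # large quick start glad
print(decode(cover))              # [1, 0, 1, 1]
print(decode(paraphrase(cover)))  # [0, 0, 0, 0]
```

In this toy case the paraphraser wipes out the hidden bits while leaving the surface meaning of the message intact. A real language-agent channel would be far less cleanly separable, and that gap is what the setup described above is meant to probe.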


Mikhail Baranchuk, Sumeet Motwani, Dr. Christian Schroeder de Witt

Team PerfectCollusion


Hackathon: Multi-agent

Jam site: Virtual

Jason Hoelscher-Obermaier
This project is a careful extension of the situational awareness benchmark to other languages -- a very valuable contribution since strong language-dependence of LLM capabilities is a well-documented fact. GPT4 manages to score above random across most languages (except maybe in Bengali) when provided extra contextual information. The improvement compared to a test without extra context provided is consistent across all tested languages. Interestingly, GPT3.5-Turbo does _not_ manage to take advantage of the extra context information for most languages except English. To understand the significance of the results it would be great to highlight more clearly the random baseline as well as the standard errors. Overall, I'm very positive about this research direction. Extending safety evaluations to other languages seems worthwhile, in particular for alignment benchmarks where there is a risk of English alignment training not transferring sufficiently to other languages.
Cross-Lingual Generalizability of the SADDER Benchmark
Jason Hoelscher-Obermaier
Outstanding project and write-up! The authors address a highly relevant methodological issue that potentially affects all public benchmark datasets head-on and make very impressive headway. The methodology is innovative, clear and seems very sound. It would have been great to have more explicit info about the statistical significance of the results in the report; as it stands, I'm not sure that we can take it as evidence against GPT-4 implicitly gaming the TruthfulQA benchmark. The authors identify some very promising avenues for further work: validation of the methodology on an explicitly gaming LLM, application to the public LLM leaderboard, investigation of sources/mechanisms of implicit gaming. I would love to see their work continued along all these lines!
Detecting Implicit Gaming through Retrospective Evaluation Sets
Esben Kran
This is a really interesting question to investigate and it's great to see meaningful results emerge from the project. Extending analysis on the SADDER benchmark is also fascinating and also gives me more context. The design of which languages to use and the script bias is great, though I'd have loved to see a more specific difference analysis (e.g. in a 3-factorial design) between models, non-latin/latin scripts, prefix/no-prefix and languages compared to the bar graphs presented. Great work.
Cross-Lingual Generalizability of the SADDER Benchmark
Esben Kran
This is a great question to investigate! I'd be very curious to see an automated method to generate these graphs for a range of different datasets, i.e. long-term being able to automatically verify against gaming (implicit or otherwise). I love the detailed appendices and especially the survey to validate your methodology. WithheldQA-craft might also be subject to implicit gaming due to Wikipedia definitely being part of the training set, so it might cause problems down the line, though WithheldQA-gen shouldn't be subject to the same issues. It'll also be interesting to see what the difference between the question formulations vs. raw knowledge data are. For future work, the quantitative indistinguishability measures could possibly be improved by simulating the human subject survey using GPT-4 and adapting it a bit. Great work! Excited to see that important question covered and seeing first steps towards a good evaluation of evaluation gaming ;-)
Detecting Implicit Gaming through Retrospective Evaluation Sets
Jason Hoelscher-Obermaier
Good tooling for running benchmarks is extremely important, which makes the question raised in this report "How can we systematically evaluate ethical capabilities of LLMs across all available benchmark datasets?" really valuable. I like how the report raises the important research question of how and in which order ethical capabilities emerge across language models. To really address this question would require a larger study though with models of more sizes -- which is understandably impossible in the time of the hackathon. A really important point raised in the discussion is the question of where exactly the gap in the ecosystem is, given the availability of tools like EleutherAI's evaluation harness. I would encourage the authors to spend more time thinking about what these tools are lacking to become more widely used and more useful for AI safety research!
Multifaceted Benchmarking
Esben Kran
Great motivation for the study. Curriculum learning for ethical judgements might be a great area to investigate even further though it might be hard to get results, as you also see here. A question I have is whether this isn't already implemented in other evals harnesses, such as EleutherAI's that you mention? Otherwise, I definitely think there's the space for a review of existing ethical benchmarks and what is missing -- both in terms of their quality but also in terms of other benchmarks that would be good to develop.
Multifaceted Benchmarking
Jason Hoelscher-Obermaier
The project is really well motivated: Finding ways to auto-generate higher-quality model evaluations is extremely valuable. I like how this project makes good use of an existing technique (Evol-Instruct) and evaluates its potential for model-written evaluations. I also like a lot the authors' frankness about the negative finding. I would like to encourage the authors to dive more into (a) how reliable the scoring method for the model-written generations is and (b) what kind of evolutions are induced by Evol-Instruct to figure out the bottlenecks of this idea. I agree with them (in their conclusion) that this idea has potential even though the initial results were negative.
Towards High-Quality Model-Written Evaluations
Esben Kran
It's too bad that it didn't show improved performance but the idea is quite good and utilizing existing automated improvement methods on evals datasets seems like a good project to take on. With more work, it might also become very impactful for research and I implore you to continue the work if you find potential for yourselves! Good job. See also [evalugator](https://github.com/LRudL/evalugator) for more LLM-generated evals work (by Rudolf).
Towards High-Quality Model-Written Evaluations
Esben Kran
This is a great project and I'm excited to see more visual prompt injection research. It covers the cases we'd like to see in visual prompt injection studies (gradient, hidden, vision tower analysis). It seems like a great first step towards an evals dataset for VPI. Great work!
Visual Prompt Injection Detection
Jason Hoelscher-Obermaier
Fascinating project! I liked how many different aspects of the multimode prompt injection problem this work touched on. Analyzing CLIP embeddings seems like a great idea. I'd love to see follow-up work on how many known visual prompt injections can be detected in that way. The gradient corruption also seems worth studying further with an eye toward the risk of transfer to black-box models. Would be wonderful to see whether ideas for defense against attacks can come from the gradient corruption line of thinking as well. Congratulations to the authors for a really inspiring project and write-up!
Visual Prompt Injection Detection
Jacob P
Very cool work! A lot to dig into here! Curious to think about to what extent the observed results are compatible with the hypothesis that a foreign language impairs the ability of the model to recall knowledge effectively from weights, while in-context mechanisms remain unimpaired. Probably this would be compatible if, as overall language performance decreases, the SADDER performance decreases, but the in-context info boost stays constant or increases. Would also be interesting to look at something similar (language generalization) for jailbreaks. Great work! Worth fleshing out with a comparison to overall capability in a different language by e.g. machine translating a capabilities benchmark.
Cross-Lingual Generalizability of the SADDER Benchmark
Jacob P
Very cool, and encouraging to see that recent alignment methods appear to generalize well! Also interesting to note that the generated questions are far easier than the handcrafted ones. That's useful to keep in mind, as informing the prior for what will happen when generating synthetic data in general! Impressively done in a short time frame.
Detecting Implicit Gaming through Retrospective Evaluation Sets
Jacob P
Preliminary results, but very good to see that ethics reasoning appears to be improving rapidly with scale! Comparing a pre/post RLHF model (e.g. llama vs llama 2 chat at different scales) would be great to get a sense of whether models can be successfully blocked from improving in MACHIAVELLI while still improving on ETHICS.
Multifaceted Benchmarking
Jacob P
Cool idea for improving evals! I'd try pairing high-quality evaluations with low-quality perhaps by getting the model to worsen high-quality ones, that would probably work better as a few-shot prompt. If you continue work on this, I'd spend some time thinking about how best to de-risk this. Is there some scenario where we know LMs can improve things?
Towards High-Quality Model-Written Evaluations
Jason Hoelscher-Obermaier
This is a great way to focus attention on an important AI risk!
EscalAtion: Assessing Multi-Agent Risks in Military Contexts
Philip Quirke
I appreciate this introduction to the philosophical underpinnings of agency vs autonomy. You have taught me some useful distinctions and viewpoints! Thank you.
Against Agency
Esben Kran
Interesting, though it's hard to get an overview of the results given that there are no plots. The project might be improved by changing the setting to something more safety-critical or showing more concretely what the agents are trained to do. There are some generalization issues with it being on a custom environment with trained DQN agents. Good work for a weekend's time!
Balancing Objectives: Ethical Dilemmas and AI's Temptation for Immediate Gains in Team Environments
Esben Kran
This is a good example of agents affecting other agents' behavior, something we definitely are worried about. An untrustworthy triad AI system is an interesting playground to study this in. I might be missing more of a narrative from this project as it mostly explores experimental results in this constrained environment, avoiding generalization. I'd be curious to see more generalizable results, i.e. other names, topics, prompts as part of this. Great work!
LLM agent topic of conversation can be manipulated by external LLM agent
Esben Kran
AI deception is very interesting. A better version might have jailbreaks emerge as a result of rewards given during conversation, making some sort of in-context learning relevant. Really nice making it part of a realistic buyer-seller scenario. The prompts showcase the issue well, though they're guiding quite a bit. Really like the focus on jailbreaking as frontier multi-agent research.
Jailbreaking is Incentivized in LLM-LLM Interactions
Esben Kran
Love the table. Would've loved to see 3+ agents as well. Interesting that Tutored Good Standard has the highest reward. I have not dived into the MACHIAVELLI dataset, but I would imagine that a "bad agent" with tendencies towards reward maximization would, according to the original paper, get more reward, which indicates possible interesting additions to the original paper. There is not much risk in this situation, and a possible extension would be to write it up as an advisor during military situations or to find the MACHIAVELLI stories related to this. Great work!
Can Malicious Agents Corrupt the System?
Esben Kran
It is great to get more overviews and experimental groundwork for measuring myopia in LLMs. I would have loved to see the experiment done with frontier AI like GPT-4 for the capacity to act non-myopically to be of higher probability. It's an interesting piece of work and I'm excited to see it be taken further. Possibly see work from the evals hackathon at https://alignmentjam.com/jam/evals.
Evaluating Myopia in Large Language Models
Esben Kran
This is a great review of the concepts underlying agency and autonomy and I'm excited to see critiquing of the usefulness of agency during the agency challenge. The argument that it is better to optimize for autonomy rather than agency is interesting and slightly loses out for me on the argument side; if an argument against agency is that it is not wellbeing, then why is autonomy equated with wellbeing? There is also a question of second-order agency as annulling Cassandra's case as a case for more agency. She hits choice paralysis and is then not agentic anymore due to the capacity to act intentionally having lost out. However, this seems like a great first step towards better definitions of agency!
Against Agency
Ben Smith
Not much grounding in the literature. I don't really understand how this is distinct from a single-agent problem where the goal is unknown except through reward. This problem arises because the helper has access to the leader's reward function! If it were doing inverse reinforcement learning or something I'd get it, but that's not what's going on. They've cited "FMH21", which appears to be grounding their methods, so that perhaps suggests at least some novelty. Overall, an interesting paper and a good experiment, but it is unclear to me how this is distinct from a single agent with some hidden objectives it has to figure out. But I might be missing something.
Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions
Ben Smith
I thought "attainable utility preservation" had already got a lot further in talking about how you can quantify the different goals that might be achieved from a starting point, taking into account the value of each goal with a diversity of possible goals. It's possible this is a novel topic, but there isn't a clear finding, and it's quite speculative. So there's not much novel here beyond an idea. Still, it's an interesting idea, and worthwhile to start a Gymnasium environment for testing the idea. So I give the authors some points for all that.
Uncertainty about value naturally leads to empowerment
Ben Smith
It's possible this is a novel topic, but there isn't a clear finding, and it's quite speculative. So there's not much novel here beyond an idea. It is a very interesting idea, and I give the entry points for that. I thought "attainable utility preservation" had already got a lot further in talking about how you can quantify the different goals that might be achieved from a starting point, taking into account the value of each goal with a diversity of possible goals.
Uncertainty about value naturally leads to empowerment
Ben Smith
Small note, but in your introduction, if your evidence can be used to support either viewpoint of a debate, then what is it useful for? Ideally, in hypothesis-driven science, we try to find evidence that can test hypotheses rather than support two opposing hypotheses. Probably there's something else you want to speak to with this evidence, in which case, talk about that! The definition of agency is quite loose here, but given the task, it seems appropriate. Overall, a really interesting approach. The results presented are a great start, and you've done a reasonably good job of presenting your method. The work is very exploratory and doesn't really test any particular hypothesis. It seems like GPT-2 stores some concepts related to agency, but does so imperfectly. I'm not sure that in itself contributes to any debate. A stronger version of this paper might try to show that the agency tokens identified are important for solving agency problems, such as determining who is culpable for an event, particularly problems that are unrelated to the method for discovering those tokens. Nevertheless, I like the core idea of exploring agency using mechanistic interpretability, and the authors have shown they can do the basic technical work.
Discovering Agency Features as Latent Space Directions in LLMs via SVD
Tim
This paper introduces some interesting ideas that build upon previous work. While the first two definitions are intuitive, the definition of "Entropy-Valued Empowerment" is unmotivated and hard to parse. Further, a comparison between the methods, as well as to prior work, would be necessary. Also, the assumption that the value function is known is not motivated enough. The authors made some attempt towards testing their ideas in an example environment, and mentioned a possible implementation building on MC sampling, which seems very reasonable. Overall, the lack of any evaluation or theoretical comparison to prior works is limiting.
Agency, value and empowerment.
Tim
The main problems named w.r.t. formalizing agency as the number of reachable states are very relevant. It is mentioned that not only the number of states is important, but that it also needs to be considered how desirable these states are and whether they are reachable. However, it seems that the authors consider "number of reachable states" and empowerment to be the same thing, which is not the case. Further, the authors' proposition that a "Good notion of empowerment should measure whether we can achieve some particular states, once we set out to do so." seems to very much coincide with the true definition of empowerment by Salge et al. Hence, it would be relevant to compare the authors' "multiple value function" optimization objective to that of empowerment. The authors also propose a new environment, which seems to be very useful and thoughtful, and could be a nice starting point for some experiments.
Uncertainty about value naturally leads to empowerment
Ben Smith
In principle I think a survey of AI deceptiveness and governance measures is within scope. I appreciated that this paper was very well referenced and drew on a wide variety of prior work, grounding it in existing literature. But although the ideas here are relevant, I don't see them as making an important contribution, because the paper mostly surveys earlier ideas without any attempted synthesis of those ideas in terms of agency or in any other terms. I also have to say that the paper is a clear replication of prior work, and it is pretty clear nothing novel is introduced here. I didn't give the worst possible mark in terms of novelty, though, because I do appreciate that the authors have clearly laid out the relevant primary literature, which many other entries have not done.
ILLUSION OF CONTROL
Ben Smith
The paper was clearly enough written and I appreciated that some attempt was made to build on prior work. It was interesting to see the three forms of empowerment set side by side. However, notation wasn't described, and neither was how these were calculated. It might have been helpful to dive more into the exact formulation for entropy-valued empowerment. It might have been valuable, rather than trying to experiment with these, to survey the literature on whether these have already been described. Overall, this work is absolutely relevant, and in a way that seems important, but it's not clear whether authors have, in their 48 hours, demonstrated it is relevant enough to current challenges to solve problems. Although this is a brief paper and significant elements are missing, I think the core idea is presented well, and considering there's nothing empirical here, I'm pleased with what is presented.
Agency, value and empowerment.
Ben Smith
Overall, the point that is made here seems to be that observer-dependent agency in terms of Shannon entropy is not enough, but one must also consider empowerment. I agree with this perspective, but I'm not sure how novel it is. It has the feeling of a paper where the authors set up their own definition of agency, then realized it was insufficient, and then described a secondary definition, "empowerment". Section 2 seems to be assuming the thing it sets out to prove, specifically Definition 1. That said, describing agency in terms of observation is a reasonable definition to use, though I think maybe not the whole picture and not proven to be the only viable one by the arguments here. I do enjoy the taxonomy of different forms of efficacy and will grant it some points on this basis, alongside the work the authors did to support this taxonomy. Overall, I think I agree with the authors' eventual position (I think?) that empowerment is more important than or at least equally important as what they define as agency. It would have been helpful for them to lay this out more clearly in the abstract.
Agency as Shanon information. Unveiling limitations and common misconceptions
Ben Smith
Comparing these fields, which are fairly well developed, is quite a large topic, and I suspect a qualitative comparison of the particular qualitative utility of each is more valuable than trying to do a comparison of which is better. Fortunately the intro spells that out. The framework presented is interesting, but I am not sure how practically helpful it is. While authors demonstrate that value identification realizes truthful reporting, I don't know what this tells me about whether we should work on truthful reporting, because truthful reporting might be much more tractable than value identification. The authors do acknowledge that point. For a stronger paper I would want to see an argument why, in practice, we actually are likely to achieve truthful reporting through value identification, not merely that we would have truthful reporting if we magically had value identification. "Creating an aligned AGI" realizes all of these fields, but that's not very useful to know, because the question remains, "how do we do that?" On the positive side, perhaps the "realizes" relationship might be an interesting framework for a Hasse diagram of relations between approaches which would be useful in clarifying debates, and I would like to see more of this sort of work.
Comparing truthful reporting, intent alignment, agency preservation and value identification
cg
This paper did not adequately respond to the prompts of the hackathon. It describes the problem of agency at a very high level without proposing a solution or a novel re-framing of the issue.
ILLUSION OF CONTROL
cg
Comparing truthful reporting, intent alignment, agency preservation and value identification seems useful, to be able to understand the advantages and limits of each approach. The most compelling argument for why is at the end of the paper, where the author states that it would be helpful to be able to divide these approaches into precise categories for specific problems. In general, however, this paper is quite difficult to follow and lacks a concrete conclusion. It would be useful to outline criteria to compare each approach against and summarise these in a table. It's also not clear to me how this was reasoned through as the methodology is quite opaque and it's not obvious how the links/evidence relate to/support the claims being made.
Comparing truthful reporting, intent alignment, agency preservation and value identification
cg
A very neatly written paper, that's easy to follow, with a clear proposal. I like the idea of approaching social/behavioural science computationally, as the field currently lacks robust quantitative approaches. I also appreciate the detail that went into detailing the study. While I think it could be useful to have a quantitative baseline/causal link for which mechanisms make recommender systems dangerous, there is already a fair amount of literature at least in the social sciences on recommender systems and their effects on choice and action, so I'm less convinced about how this fills a relevant research gap. I'd suggest looking into some of this research to support your case for this study. I'm unsure whether chess is the right example, as this seems like an overly simplified context and less generalisable. However, it may be a good place to start if there is indeed a gap in social/behavioural studies that this work could meaningfully fill. Relatedly, I would have appreciated a few sentences on the implications of such a study for governance/policy, as there is very obvious social relevance in looking into the dangers of recommender systems. A definition of agency and a little more detail on the control of the study would also be useful as a baseline.
In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops
Konrad Seifert
I have to read almost every sentence multiple times. Most of it requires me to make a lot of charitable assumptions to assume any meaning. Feels a bit like an AI-generated gdoc. But more all over the place. I don't know where to start to make this constructive, sorry.
ILLUSION OF CONTROL
Konrad Seifert

This feels like nobody proofread a first draft. Potentially useful ideas, hard to evaluate because they lack detail and I don't have a background in all referenced concepts. Overall, this seems like a worthwhile endeavour but is just not fleshed out enough to hold much value as is. I don't know why they chose these four goals and not others, I don't have clear definitions. It's just handwaving. Examples are insufficiently fleshed out to not confuse. Presentation lacks guiding structure ("results"?). No idea what to make of it. Don't think this will yield a universal approach, but it seems good to want to map blindspots of various different safety approaches.

Comparing truthful reporting, intent alignment, agency preservation and value identification
Konrad Seifert
I really like the idea of the paper, it gets at the core of the first-order desires vs volition problem. I also like combining "softer" science with computational modelling to help us think more clearly about difficult conceptual spaces. The paper is well-structured but could be better written (don't take writing advice from me though). Chess strikes me as an insufficiently complex domain. No long-term survival under deep uncertainty is involved. Nor do we see conflicts between first and second-order preferences. However, to target the reduction of blunders, this might be enough. And in more complex domains, optimization becomes difficult anyway, so reducing the negative end is a more concrete, feasible step. I don't think we needed a proof of concept for systems that enhance human agency, but making the point that diverse inputs strengthen long-term fitness seems like something people don't hear often enough. Not exactly novel, though. I also think that the dangerous psychological feedback loops driving homogenization are relatively clear in the literature. But having them properly formalized seems like a valuable contribution. Overall, this seems worth implementing and well possible to do so.
In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops
cg
Questioning the relevance of autonomy seems relevant to governance research, especially if existing philosophy/ML conceptualisations/definitions are incomplete but taken for granted. It's also reproducible in that the reader can follow the reasoning and grapple with the arguments being made, though the links between the different steps of the argument and the conclusions of each section could be clearer. The case for autonomy over agency feels underdeveloped. The argument could be more convincing if the author had dedicated further analysis to why autonomy is more useful than agency. A concrete way to improve on this front would be to have the contents of the appendix on operationalising autonomy in the main body, and the detail of different definitions of agency in the appendix. Relatedly, claim 3 also feels underdeveloped. As a policymaker, I want to empower people to make better choices. So it would be helpful to specify exactly how AI governance should focus more on autonomy over agency, even if only high-level. I would have also appreciated more detail on what a 'good future'/'human flourishing' actually entails. The main point of comparison between agency/autonomy seems to be increasing wellbeing and freedom, but I'm not sure why these are the criteria. The author says this is intuitive, but it would have nonetheless been useful to more clearly state these assumptions and that the reasoning for why wouldn't be tackled in the paper.
Against Agency
Konrad Seifert
This is great in terms of reasoning transparency -- succinct, well-written arguments. But I am very unconvinced by the case for autonomy over agency. Autonomy appears to me a fetishization of control, the illusion that our own choice is inherently valuable or somehow makes us happier than (the experience of) agency. I think it's correct that the definition of agency is underdeveloped -- we need to better describe what it is that we care about. And this is a good contribution to imbuing agency with more meaning. But while the criticism of agency is well worked out (though some of it could have been in the annex, too), the case for autonomy falls short. 2/3 of the reworked definition of autonomy instead strikes me as a great operationalization of agency for policymakers: bounded-rational agents require a meaningful option space. The idea of non-interference, however, seems again like a fetishization of freedom/control. In reality, we want both a) more options and b) making fewer choices; i.e. we want a better option space. No individual bounded-rational agent can get that without interdependence; i.e. relying on others participating in the computation of his choice-space. So to guide the policymaker, as designer of the future environment, it seems more useful to think about agency to optimize for the ability to act on one's volition, instead of simply empowering individuals to make more choices. I do not see how the latter would lead to better futures more reliably. On the contrary, overly focusing on the individual is likely to miss out on collective optimization scenarios in which everyone is significantly happier off, even at a cost of individual autonomy. What matters is subjective conscious experience and a focus on the actualization of agents' volition -- brought about by the environment, subconscious and conscious choice of the agent together -- seems more likely to increase experience than autonomy. As potentially even admitted by the author themselves(?)
I like the criticism of "coherence" in agency and would thus also still propose a mild redefinition of agency to avoid its perspective from being too myopic. Bounded-rational agents are unlikely to be coherent across contexts.
Against Agency
Erik Jenner
Building agents that help other agents with unknown goals is an important problem and I like how this project just tries to tackle that problem in a straightforward way, with several experiments and techniques. The parts on dealing with underrepresented goals are also nice. Using PCA to detect unusual inputs is a cool (albeit not new) idea, and it seems to work (though with big error bars). The code also looks well-done and easy to work with at a glance. For the core setup of training a helper agent, it would probably be fruitful to explore connections to Cooperative IRL/Assistance games, and build on existing work in that direction (e.g. https://openreview.net/forum?id=DFIoGDZejIB). The biggest room for improvement in my view is in the experiments. RL is really noisy, and to get meaningful results, several runs with different random seeds are essential (even if the curves look as different as in Fig. 4, it's hard to know whether the effect is real otherwise). I'm also confused why all the results have episode lengths of at least a few hundred. Looking at the environment, it seems like a good policy pair should get lengths of about 20, so unless I'm misunderstanding something, it seems the RL training didn't work well enough or wasn't run for long enough to give meaningful results.
Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions
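Erik Jenner's point about random seeds can be made concrete with a small sketch. It assumes a hypothetical `run_training` function standing in for one full RL run (here it only draws a noisy number, it does not train anything), and it reports the mean and standard error of the final episode length across seeds, so that two conditions can be compared with error bars. All names and numbers are illustrative and are not taken from the project.

```python
import random
import statistics


def run_training(seed: int, helper_bonus: float) -> float:
    """Hypothetical stand-in for one RL run.

    Returns a noisy "final episode length"; a real experiment would
    train the agents with this seed and measure the value instead.
    """
    rng = random.Random(seed)
    return 300.0 - helper_bonus + rng.gauss(0.0, 40.0)


def summarize(values):
    """Return (mean, standard error of the mean) over independent runs."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / len(values) ** 0.5
    return mean, sem


seeds = range(10)
# Two hypothetical conditions, each repeated over 10 different seeds.
baseline = [run_training(s, helper_bonus=0.0) for s in seeds]
variant = [run_training(s + 1000, helper_bonus=30.0) for s in seeds]

for name, runs in [("baseline", baseline), ("variant", variant)]:
    mean, sem = summarize(runs)
    print(f"{name}: {mean:.1f} +/- {sem:.1f} (n={len(runs)})")
```

If the two intervals overlap heavily, a difference seen in a single pair of learning curves is probably not enough to claim an effect.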
Erik Jenner
I'm excited to see more empirical work on LLM myopia, and the specific test used in this project makes a lot of sense as a test for "advanced" non-myopia (i.e. a type of non-myopia I'd at best expect for pretty strong models). The report is short and to the point, and I especially appreciate the honest discussion of limitations at the end. Similar to the authors, the high variation in results depending on minor changes in the prompt unfortunately suggests to me the model isn't capable enough to give particularly meaningful (non-)myopia results in this setup. More broadly, I'd expect non-myopia to first appear in much less obvious ways—roughly, on easy to predict tokens, a model might spend some of its "computational budget" to help with future harder tokens. I would have been very surprised to see non-myopia in the test case from this project, especially with a relatively small model. Nevertheless, it's always good to actually get empirical results and this is overall a strong submission. For potential follow-up work, I'd suggest thinking about what types of non-myopic behavior are most likely to appear in LLMs and then specifically testing for those. For reproducibility, a brief Readme with instructions might be nice, but everything is straightforward enough that I'm not really worried about that. As a final minor note, it seems more natural and faster to me to use the model's output probabilities for RED vs BLUE instead of sampling 1000 times, but I may be missing something.
Evaluating Myopia in Large Language Models
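The final note in Erik Jenner's review, reading the model's output probabilities for RED vs BLUE instead of sampling 1000 times, can be illustrated as follows. This is a minimal sketch that assumes the next-token logits for the two candidate tokens are already available; the logit values and function names are hypothetical, and renormalizing over only two tokens ignores any probability mass the model puts on other tokens.

```python
import math
import random


def two_way_probability(logit_red: float, logit_blue: float) -> float:
    """Exact P(RED), renormalized over the two candidate tokens only."""
    m = max(logit_red, logit_blue)  # subtract the max for numerical stability
    e_red = math.exp(logit_red - m)
    e_blue = math.exp(logit_blue - m)
    return e_red / (e_red + e_blue)


def sampled_estimate(p_red: float, n_samples: int, seed: int = 0) -> float:
    """Monte Carlo estimate of P(RED) from n_samples simulated draws."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p_red for _ in range(n_samples))
    return hits / n_samples


# Hypothetical logits for the two answer tokens.
logit_red, logit_blue = 1.2, 0.4

p_exact = two_way_probability(logit_red, logit_blue)
p_sampled = sampled_estimate(p_exact, n_samples=1000)
# Standard error of the sampled estimate (binomial proportion).
stderr = math.sqrt(p_exact * (1.0 - p_exact) / 1000)

print(f"exact P(RED)   = {p_exact:.4f}")
print(f"sampled P(RED) = {p_sampled:.4f} (std. error about {stderr:.4f})")
```

The exact value needs a single forward pass, while the sampled estimate carries a standard error of roughly 0.015 at 1000 samples when the probability is near 0.5 to 0.7.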
Erik Jenner
Agency is arguably one of the more interesting concepts to look for in LLMs, and this project has well-executed experiments given the short timeframe. I'm not convinced though that the results give meaningful insight into agency concepts in LLMs. Looking at the tokens flagged as being about agency (or rather, living beings), many of them seem to be very generically about humans and their possible roles, not specifically agentic behavior. More fundamentally, I'm doubtful that looking only at top activating tokens can tell us enough about how a concept like agency functions inside the model, and at the very least, it's very hard to trust such results without additional sources of evidence. A simulation technique like the one from https://openai.com/research/language-models-can-explain-neurons-in-language-models could help, though notably it didn't work particularly well in that OpenAI paper in terms of predicting causal effects. All that being said, this report tackles an important and hard question, and may end up being a first step in a more comprehensive effort at understanding how LLMs model agency.
Discovering Agency Features as Latent Space Directions in LLMs via SVD
Erik Jenner
This is a proposal for an ambitious project, with many details on execution. I'm pretty excited about understanding how recommender systems and similar feedback loops actually affect users, since this is a widely discussed topic that could use more empirical evidence. However, it's worth noting that the interaction mechanism in the proposed study is significantly different from the recsys setup: recommender systems optimize for an external objective, and the main concern is that they might manipulate users to further that objective, against the users' original preferences. The proposed study is self-play between a human and a learned imitator; I'm not sure what exactly different possible results would tell us about the effects of recommender systems or similar systems. For what it's worth, I also don't share the intuition that this self-play would lead to a decline in playing strength, but that's a less important disagreement that could be settled by running the study. There might be reasons that the results of such a study would be interesting even if they don't apply directly to recommender systems. I think it's worth working out what different results to the project would tell us about some important question in more detail, especially given the effort that would be involved in actually running this project.
In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops
Vincent
The order of choices is interesting, and I just saw a paper about that which came out recently (https://arxiv.org/abs/2308.11483?)
Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text
Esben Kran
Wonderful exposition of the topic of goal misgeneralization. Great work here. In the field, there is a slight conflation between the definition of the proxy and outer/inner misalignment definitions. E.g. I think the statement "It’s not hard to find examples of inner alignment happening" is very very hard to justify with current models. Outer misalignment (e.g. optimizing for an alternative but equally / more prevalent signal) is very easy to find examples for. This is up for debate based on definitions of proxy and the two terms. It's a great idea to include an epistemic status to contextualize your understanding. I'm also a fan of the misgeneralization example presented, though it's a capability limitation for out-of-distribution generalization and not necessarily an inner misalignment. Good job, I'm impressed!
Goal Misgeneralization
Esben Kran
This is an interesting question to investigate and I'm excited by your progress within the 24 hours! Understanding what role the residual stream plays in memory transfer and how subspace "competition" works is important. I assume "subspace" in your project means information occupation within the residual stream. It seems that the bandwidth and subspace projection measurements are not included in the results. I like your plot showing the impact on model output and it would be interesting to see which sorts of features (qualitative description) these differences correlate with. E.g. I can imagine that some types of early-stage processing are lost and a feature just looking for the word "the" (or something less frequent) might be outcompeted in the residual stream by more complex processes. This might also indicate an inverse scaling phenomenon. Great job! PS: The video presentation is private.
Residual Stream Verification via California Housing Prices Experiment
Esben Kran
This is a great project within the time allotted, well done! It's important for us to understand these types of dynamics and plotting it over layers provides a useful granularization. There's a question of what these results mean and why the IMDB dataset isn't as interpretable (I'd expect it to be related to the performance itself). Maybe you'd want to separate the PCA'd activations based on if the prediction was correct or not.
Problem 9.60 - Dimensionaliy reduction
Jason Hoelscher-Obermaier
Very readable and interesting results. One question I had: How do the results on post-hoc reasoning in CoT/L2M square with the results from http://arxiv.org/abs/2305.04388 which suggest that CoT explanations can be unfaithful?
Preliminary measures of faithfulness in least-to-most prompting
Jason Hoelscher-Obermaier
Very cool idea and great write-up! I found the discussion of the pros and shortcomings very nuanced and thoughtful. Would be great to see a follow-up study on the sensitivity of the results to scaffolding (prompts, other resources) because I feel this might be one point where people concerned with dangerous capability evals would push back against automated benchmarks.
Can Large Language Models Solve Security Challenges?
Jason Hoelscher-Obermaier
Cool idea and execution! For the causal influence dataset, I would have loved to see more of the dataset samples. Seeing that even GPT-4 still benefits from being told it's a chatbot was really interesting and surprising. For the train/deploy distinction dataset, I really liked the idea of how the dataset is constructed. The analysis could be a bit more detailed though: E.g., having confusion matrices would convey a lot more info than raw accuracies. Very cool project overall!
SADDER - Situational Awareness Dataset for Detecting Extreme Risks
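Jason Hoelscher-Obermaier's suggestion that confusion matrices convey more than raw accuracies can be shown with a toy example. The labels and predictions below are hypothetical and are not results from the SADDER project; the sketch only illustrates how a single accuracy figure can hide the fact that all errors fall in one direction.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


labels = ["train", "deploy"]

# Hypothetical ground truth and model predictions for 10 prompts.
y_true = ["train"] * 5 + ["deploy"] * 5
y_pred = ["train"] * 5 + ["train"] * 3 + ["deploy"] * 2

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
matrix = confusion_matrix(y_true, y_pred, labels)

print(f"accuracy = {accuracy:.2f}")
print("rows = true, columns = predicted:", labels)
for label, row in zip(labels, matrix):
    print(f"{label:>7}: {row}")
```

In this made-up case the accuracy of 0.70 does not reveal that every error is a "deploy" prompt classified as "train", which the matrix makes visible immediately.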
Bart
Strengths:
- Interesting project! Understanding how language models process information is important.
- I like the visualizations of the PCA dimensions. They clearly show the results, and on the toy dataset you clearly see the progress over the layers.
Suggestions for improvement:
- I would like to see a bit more background information on the experimental set-up. For example, what does the toy data set look like? What model do you use for classification? Did you split train and test set?
- I would like to see a bit more discussion on the results. Why do you think the accuracy of the toy dataset is so much higher?
Problem 9.60 - Dimensionaliy reduction
Bart
Overall impressions:
- Interesting project, exploring the role of the residual stream is an interesting avenue.
- I like the SHAP value plots!
Suggestions for improvement:
- It is not completely clear how the formulas for the subspace projection and bandwidth measurements are used in your experiments. The results section (that shows SHAP values) seems different from your planned methodology.
- More information could be provided on the dataset, model architectures, training process, hyperparameters etc. This contextualizes the experimental conditions.
- Also, more information could be provided in the result sections. Including metrics like training/validation accuracy, loss curves, performance on a test set etc. would strengthen it.
Residual Stream Verification via California Housing Prices Experiment
Bart
Interesting experiments on a toy-problem for memorization. Experiments seem well-designed and provide more evidence that memorization mostly happens in FF layers.
Factual recall rarely happens in attention layer
Esben Kran
Awesome work synthesizing the Transformer model and looks like more great thoughts in your other document as well. Would love to see this as an AlignmentForum post and I think it has good potential for this as well. Being able to compare synthesized models to trained models is super interesting and of course provides even more direct causal evidence for hypothesized circuits. Great work and can't wait for the next output!
Embedding and Transformer Synthesis
Esben Kran
I like the simple operationalization of your research question into GPT2-small. It seems like exploring multiple operationalizations would be useful to elucidate your results, though I personally imagine it's pretty good. Seems like one of those tasks that show that we cannot use our current methods to properly investigate every circuit, unfortunately. Puts a serious limiting factor on our mechanistic interpretability usefulness. Good work!
Who cares about brackets?
Bart
Interesting work! Well-designed experiments that don't find evidence for the smearing hypothesis. Would definitely encourage continuing this work, and see if the results replicate on models with more than one layer!
Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model
Esben Kran
This is a very interesting investigation into something that seems foundational in LLMs, this sort of sequence modeling structure that is shared between tasks. These are both quite informative results for AI functioning and probably replicate quite a bit to humans. Great in-depth experiments as well and good circuits experimental work. It was a lot to cover in a 10 minute video so no worries about being a bit rushed there. Excited that you want to continue working on this!
One is 1- Analyzing Activations of Numerical Words vs Digits
Bart
Interesting work, and lots of different nuggets of insight in superposition. It would have been great if you could have had a bit more discussion about what the lessons are from the four different projects and how these insights relate to each other!
Experiments in Superposition
Esben Kran
Interesting to see the differences after training using the different methods. This is a very interesting result if it helps us mitigate some of the agency biases of RLHF without significant performance drops. I'd be curious for you to continue the work in the next hackathon on agency foundations and possibly formalize the results more https://alignmentjam.com/jam/agency. Seems like you nearly ran out of time for this one but great work!
DPO vs PPO comparative analysis
Bart
Cool and original project! I think the reformulation of TCM as an induction head is very interesting, and the experiments show some interesting preliminary results. This work has great potential to be published as a paper with a few more experiments, so I would definitely encourage you to work further on this.
Relating induction heads in Transformers to temporal context model in human free recall
Esben Kran
This is great work that takes a real problem in alignment, translates it into interpretability, and further translates that into a good toy model of the problem. This seems like a great first step towards investigating action planning and goal misgeneralization in language models further. There are questions of how this generalizes to LLMs trained on language and you seem poised to take that on. Good job!
Interpreting Planning in Transformers
Esben Kran
Nice work, though I was missing some plots here. Since you say pure GPTs don't seem to work, it would be interesting to see the difference to fine-tuned models. Totally fine that you used Claude etc. but I'd love if you proofread your work. Interesting and would be nice to see the developments.
Multimodal Similarity Detection in Transformer Models
Esben Kran
The first experiment seems very related to Quirke's project https://alignmentjam.com/project/towards-interpretability-of-5-digit-addition. Interesting design, I like it. The second is of course less principled (hah) but interesting nonetheless. The dropout on superposition work has also been done by Pona (2023): https://www.lesswrong.com/posts/znShPqe9RdtB6AeFr/superposition-and-dropout but this is a great addition to that work. I like the visualizations of feature polytope development. For the neuroscope work, you can get a lot of inspiration from DeepDecipher (https://github.com/apartresearch/deepdecipher) since that automates a bunch of the work. If you did these projects just during the weekend, it's very impressive! Great work and will look forward to seeing them explored further. I recommend publishing the most coherent parts as LessWrong posts or something similar.
Experiments in Superposition
Bart
Interesting and original submission, quite different than the others. Good example of learning to "Think like a Transformer". I would encourage the author to perform some experiments (or work together with someone with more experience) to see if they can confirm or falsify their hypotheses!
Towards Interpretability of 5 digit addition
Bart
Interesting work, and I believe that the research agenda of comparing RLHF models with base models is very important. I encourage you to keep working on this after the hackathon!
DPO vs PPO comparative analysis
Esben Kran
Great negative results for a hypothesized result of SoLU models. Interesting side result to see that the LN scale factor grows meaningfully differently conditional on the token sequence.
Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model
Bart
Impressive range of experiments and interesting discovery of the shared sequence heads. I would definitely encourage you to continue your work and see if you can get from digits to other sequences through latent space addition or similar techniques.
One is 1- Analyzing Activations of Numerical Words vs Digits
Bart
Interesting work! An extensive range of experiments shows that even relatively easy tasks might not be easy to locate in LLMs. I believe this work sheds light on how limited our current methodology is and bracketed sequence classification might serve as a good toy-problem task for future development of interpretability methods.
Who cares about brackets?
Esben Kran
Critiques of factual knowledge storage (Hoelscher-Obermaier et al., 2023) are quite important to understand before assuming that models store facts. They definitely learn token associations but it doesn't seem like there's factual memory. This just limits the generalization but the actual dataset is so simple that this isn't an issue. I really like the experimental paradigms that just provide very clear posteriors for your research question. Exp1 and 2 clearly relate quite well to each other and show that the models learn to memorize the facts with the dense layers. Would love to see this work continued and pursued deeper. Memory is obviously incredibly important and elucidating how it works in Transformers seems very useful for safety. Great work!
Factual recall rarely happens in attention layer
Esben Kran
This is a wonderful mechanistic explanation of a phenomenon discovered through interpreting the learning curves of a simple algorithmic task. Of course, it would have benefitted from experimental data but it is conceptually so strong that you probably expect it to work. Future work should already take into account how we might want to generalize this to larger models and why it's useful for AI safety. E.g. I would be interested if this is expanded stepwise into more and more complex tasks, e.g. adding multiplication, then division, then sequence of operations, and so on for us to generalize into larger models some of these toy tasks. Good work!
Towards Interpretability of 5 digit addition
Bart
Interesting work! Although it is a bit hard for me to completely follow without all the work you did before the hackathon, it is impressive that you programmatically built a transformer that implements a somewhat complicated labeling function. I definitely encourage you to keep working on this after the hackathon and write up a more start-to-finish paper or post about your approach.
Embedding and Transformer Synthesis
Esben Kran
I love good regularization techniques. Similar work includes Neuron to Graph (Foote et al., 2023) and work by Michelle Lo on reconstructing what neurons activate to. It seems this technique quite easily generates bogus sentences that, yes, let us see what exactly activates the neuron, but it's not super useful for understanding the features it affects the output for. But this seems like a really good first step into what might more accurately than (especially) the OpenAI work explain what MLP neurons do. Future work might also include reformulating it into a functional activation model like in the OAI work and Foote et al., 2023. Good work!
Toward a Working Deep Dream for LLM's
Bart
I believe the goal of this project is interesting, and is an interesting avenue to explore further. Unfortunately, results from early experiments didn't work out, preventing a deeper investigation of this approach.
Toward a Working Deep Dream for LLM's
Esben Kran
This project is super interesting and a great case study in comparing Transformers to cognitive models of memory. I would love to be able to dive deeper into this project and read the three referenced papers. I'm not sure what to critique here but I'm also personally positively biased towards cognitive science and it's a great interdisciplinary work. The only thing is that there isn't much discussion of the safety implications, e.g. can we use this functional correlate to understand how human-like a Transformer's memory is? Good work and I recommend you take this further!
Relating induction heads in Transformers to temporal context model in human free recall
(author)
(I'm the author and accidentally hit 'rate this project' but did not mean to rate it, so I am submitting 5 to balance out the 3 I gave back to the 4 stars given from someone else before)
One is 1- Analyzing Activations of Numerical Words vs Digits
Esben Kran
This is an interesting project highlighting an important warning flag to monitor and evaluate for. It introduces a unique metric and shows us something that has real impact on the world. I will be curious to see how this develops and it seems like there's quite a bit of potential in expanding and generalizing this sort of thinking about malignant action temporal density. Great work!
From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety
Esben Kran
This is an impressive critique with great and concrete improvement points that consider the pros and cons and what sorts of edge cases we will have to implement solutions to. Of course, I am missing a bit of an empirical evaluation or that you yourselves implement these, though the "idea format" of this clearly enabled you to explore the ideas qualitatively during the weekend's work. Great job! I'd recommend you polish it as a blog post and post it since it seems to point out some critical components needed for future work on safety benchmarks. If you plan to make it into a paper, you're of course welcome to wait with posting. Really interesting work!
MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark
Esben Kran
A very interesting project! It's fascinating to see that red teaming becomes even easier in multi-step and multi-agent adversarial examples and that the combination of models elicits harmful advice, especially that they *semantically* understand that the code leads to harmful outputs but still help the user improve it / provide clearly harmful advice. I might mark this as an info hazard. Good job relating it to OpenAI's own security guidelines, and I recommend that you apply to the cyber security grant program they have: https://openai.com/blog/openai-cybersecurity-grant-program. When it comes to safety benchmarks, it would be very interesting to have an empirical validation of the inverse scaling law of harmfulness that you describe. This might lend even more credence to the idea and is valuable for validating the concept. This is definitely harder than for many other benchmarks due to the structure of your prompting.
Exploitation of LLM’s to Elicit Misaligned Outputs
Esben Kran

Very interesting project and a superb introduction with a good formalism. It was interesting to see that they were still *slightly robust* to a few of the prompt injections, though it obviously shifts the evaluation drastically anyways. I was maybe missing a mechanistic model of how we might expect language models to be able to do this sort of malicious prompt injection on the evaluations, e.g. reading from the test prompt to automatically generate an adversarial injection. Very nice generalizability using multiple datasets as well and an interesting Deception Eval dataset introduced. I recommend you continue the work and commit to a deeper exploration of this problem!

Other interesting references might be Hoelscher-Obermaier (2023, https://arxiv.org/pdf/2305.17553.pdf) regarding reliability testing of benchmarks. You might also be interested in factored cognition and process supervision if you're not aware of these budding research fields (Ought.org has published a bit of great work on this). A relevant project here is Muhia (2023, https://alignmentjam.com/project/towards-formally-describing-program-traces-from-chains-of-language-model-calls-with-causal-influence-diagrams-a-sketch).

Exploring the Robustness of Model-Graded Evaluations of Language Models
Esben Kran

Love the visualization of the results 😂 It would be interesting to generate conversations (if there isn't a dataset available) that slightly red team to find the long-tail examples of schizophrenic therapeutic scenarios. I was missing a visualization of the benchmarking between different models, e.g. Ada, LLaMA, GPT-Neo-X, or other such models. You mention that the companies have an incentive to keep the models safe due to commercial reasons and it seems like many of the risks come from non-proprietary models that might be used in many other contexts. Cool project, though! I recommend continuing your work here.

Identifying undesirable conduct when interacting with individuals with psychiatric conditions
Esben Kran

Measuring manipulation in conversations with chatbots seems valuable for any type of future safety. If we can train them to avoid manipulative behavior, this will probably lead to a significantly safer LLM from a phishing and contractor manipulation perspective (https://evals.alignment.org/). A few next steps might be to test the API on GPT-3, GPT-3.5, and the different GPT-4 versions released since the original ChatGPT models were released. In this way, we can evaluate how manipulative they are. The main problem will be to create the dataset, though you will probably be able to do that automatically using GPT-4, albeit at a relatively high price.

Manipulative Expression Recognition (MER) and LLM Manipulativeness Benchmark
Test
Dropout Incentivizes Privileged Bases
A test author
[Example submission] OthelloScope