Accepted at the AI and Democracy Hackathon: Demonstrating the Risks research sprint on May 6, 2024

Beyond Refusal: Scrubbing Hazards from Open-Source Models

Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark show potentially more robust safety because they are trained to forget hazardous information while retaining essential facts, rather than to refuse to answer. We red-team this approach by examining how well the training approach generalizes and what its practical scope is (see A2).

Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis

Cite this work

@misc{
  title={Beyond Refusal: Scrubbing Hazards from Open-Source Models},
  author={Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis},
  year={},
  organization={Apart Research},
  note={Research submission to the research sprint hosted by Apart.},
  month={},
  howpublished={https://apartresearch.com}
}

Reviewer comments

Esben Kran

February 24, 2024

July 19, 2023

Who cares about brackets?

I like the simple operationalization of your research question into GPT2-small. It seems like exploring multiple operationalizations would be useful to elucidate your results, though I personally imagine it's pretty good. Seems like one of those tasks that show that we cannot use our current methods to properly investigate every circuit, unfortunately. Puts a serious limiting factor on our mechanistic interpretability usefulness. Good work!

Bart

February 24, 2024

July 19, 2023

Who cares about brackets?

Interesting work! An extensive range of experiments shows that even relatively easy tasks might not be easy to locate in LLMs. I believe this work sheds light on how limited our current methodology is, and bracketed sequence classification might serve as a good toy-problem task for future development of interpretability methods.

Jason Hoelscher-Obermaier

February 24, 2024

November 29, 2023

Visual Prompt Injection Detection

Fascinating project! I liked how many different aspects of the multimodal prompt injection problem this work touched on. Analyzing CLIP embeddings seems like a great idea. I'd love to see follow-up work on how many known visual prompt injections can be detected in that way. The gradient corruption also seems worth studying further with an eye toward the risk of transfer to black-box models. Would be wonderful to see whether ideas for defense against attacks can come from the gradient corruption line of thinking as well. Congratulations to the authors for a really inspiring project and write-up!

Esben Kran

February 24, 2024

November 29, 2023

Visual Prompt Injection Detection

This is a great project and I'm excited to see more visual prompt injection research. It covers the cases we'd like to see in visual prompt injection studies (gradient, hidden, vision tower analysis). It seems like a great first step towards an evals dataset for VPI. Great work!

Tim

February 24, 2024

October 2, 2023

Uncertainty about value naturally leads to empowerment

The main problems named w.r.t. formalizing agency as the number of reachable states are very relevant. It is mentioned that not only the number of states is important but that it also needs to be considered how desirable these states are and whether they are reachable. However, it seems that the authors consider "number of reachable states" and empowerment as the same thing, which is not the case. Further, the authors' proposition that a "Good notion of empowerment should measure whether we can achieve some particular states, once we set out to do so." seems to very much coincide with the true definition of empowerment by Salge et al. Hence, it would be relevant to compare the authors' "multiple value function" optimization objective to that of empowerment. The authors also propose a new environment, which seems to be very useful and thoughtful and could be a nice starting point for some experiments.

Ben Smith

February 24, 2024

October 2, 2023

Uncertainty about value naturally leads to empowerment

It's possible this is a novel topic, but there isn't a clear finding, and it's quite speculative. So there's not much novel here beyond an idea. It is a very interesting idea, and I give the entry points for that. I thought "attainable utility preservation" had already got a lot further in talking about how you can quantify the different goals that might be achieved from a starting point, taking into account the value of each goal with a diversity of possible goals.

Vincent

February 24, 2024

September 15, 2023

Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text

The order of choices is interesting, and I just saw a paper about that which came out recently (https://arxiv.org/abs/2308.11483?)

Esben Kran

February 24, 2024

July 19, 2023

Towards Interpretability of 5 digit addition

This is a wonderful mechanistic explanation of a phenomenon discovered through interpreting the learning curves of a simple algorithmic task. Of course, it would have benefitted from experimental data but it is conceptually so strong that you probably expect it to work. Future work should already take into account how we might want to generalize this to larger models and why it's useful for AI safety. E.g. I would be interested if this is expanded stepwise into more and more complex tasks, e.g. adding multiplication, then division, then sequence of operations, and so on for us to generalize into larger models some of these toy tasks. Good work!

Ben Smith

February 24, 2024

October 2, 2023

Uncertainty about value naturally leads to empowerment

I thought "attainable utility preservation" had already got a lot further in talking about how you can quantify the different goals that might be achieved from a starting point, taking into account the value of each goal with a diversity of possible goals. It's possible this is a novel topic, but there isn't a clear finding, and it's quite speculative. So there's not much novel here beyond an idea. Still, it's an interesting idea, and worthwhile to start a Gymnasium environment for testing the idea. So I give the authors some points for all that.

Bart

February 24, 2024

July 19, 2023

Towards Interpretability of 5 digit addition

Interesting and original submission, quite different from the others. Good example of learning to "Think like a Transformer". I would encourage the author to perform some experiments (or work together with someone with more experience) to see if they can confirm or falsify their hypotheses!

Jason Hoelscher-Obermaier

February 24, 2024

November 29, 2023

Towards High-Quality Model-Written Evaluations

The project is really well motivated: Finding ways to auto-generate higher-quality model evaluations is extremely valuable. I like how this project makes good use of an existing technique (Evol-Instruct) and evaluates its potential for model-written evaluations. I also really like the authors' frankness about the negative finding. I would like to encourage the authors to dive more into (a) how reliable the scoring method for the model-written generations is and (b) what kind of evolutions are induced by Evol-Instruct, to figure out the bottlenecks of this idea. I agree with them (in their conclusion) that this idea has potential even though the initial results were negative.

Jacob P

February 24, 2024

November 29, 2023

Towards High-Quality Model-Written Evaluations

Cool idea for improving evals! I'd try pairing high-quality evaluations with low-quality ones, perhaps by getting the model to worsen high-quality ones; that would probably work better as a few-shot prompt. If you continue work on this, I'd spend some time thinking about how best to de-risk this. Is there some scenario where we know LMs can improve things?

Esben Kran

February 24, 2024

November 29, 2023

Towards High-Quality Model-Written Evaluations

It's too bad that it didn't show improved performance but the idea is quite good and utilizing existing automated improvement methods on evals datasets seems like a good project to take on. With more work, it might also become very impactful for research and I implore you to continue the work if you find potential for yourselves! Good job. See also [evalugator](https://github.com/LRudL/evalugator) for more LLM-generated evals work (by Rudolf).

Bart

February 24, 2024

July 19, 2023

Toward a Working Deep Dream for LLM's

I believe the goal of this project is interesting, and is an interesting avenue to explore further. Unfortunately, results from early experiments didn't work out, preventing a deeper investigation of this approach.

Esben Kran

February 24, 2024

January 11, 2024

The EU AI Act: Caution against a potential "Ultron"

This is excellently done and a professional overview of the full EU AI Act. It's impressive to include a full summary of so much content in so few pages. Case 1 might have been slightly too unclear since this is not what was meant; however, it is a very good example of Case 3 work: summarizing the EU AI Act. I evaluated this under Case 3: Explainers of AI concepts since it is a concise explainer for the full EU AI Act. One way to improve it would be to add references to direct parts of the act as you explain parts. I like the quote format and the titles that reference concepts directly.

Esben Kran

February 24, 2024

July 19, 2023

Toward a Working Deep Dream for LLM's

I love good regularization techniques. Similar work includes Neuron to Graph (Foote et al., 2023) and work by Michelle Lo on reconstructing what neurons activate to. It seems this technique quite easily generates bogus sentences: yes, we can see what exactly activates the neuron, but that's not super useful for understanding the features through which it affects the output. But this seems like a really good first step into what might explain what MLP neurons do more accurately than (especially) the OpenAI work. Future work might also include reformulating it into a functional activation model like in the OAI work and Foote et al., 2023. Good work!

Jason Hoelscher-Obermaier

February 24, 2024

February 14, 2024

Seemingly Human: Dark Patterns in ChatGPT

Lovely project! I love the connections made to the existing literature on dark patterns. The proposed focus on mismatch between developer and user incentives in the context of AI applications seems like an extremely valuable and timely addition to the existing literature on misalignment, with a lot of potential for connecting AI ethics and AI safety. Also really like the approach to empirical evaluation taken here, which seems to hold a lot of potential. Going forward, I would want to see a more in-depth investigation of the conversations flagged for dark patterns and I would expect a few rounds of iteration to be necessary for robust results here. In terms of the write-up I'm missing tentative high-level conclusions on the level of dark pattern usage, its trend over time, and proposals for a natural baseline to compare against. Very minor write-up grievance: It wasn't clear to me which model was used as overseer.

Jason Hoelscher-Obermaier

February 24, 2024

August 21, 2023

SADDER - Situational Awareness Dataset for Detecting Extreme Risks

Cool idea and execution! For the causal influence dataset, I would have loved to see more of the dataset samples. Seeing that even GPT-4 still benefits from being told it's a chatbot was really interesting and surprising. For the train/deploy distinction dataset, I really liked the idea of how the dataset is constructed. The analysis could be a bit more detailed though: E.g., having confusion matrices would convey a lot more info than raw accuracies. Very cool project overall!

Christian Schroeder de Witt

February 24, 2024

February 14, 2024

Seemingly Human: Dark Patterns in ChatGPT

I love the idea of this project. In addition to what Jason has remarked, I think a major opportunity would lie in developing tools that can protect users from such dark patterns. For example, a local trusted supervisor-chatbot that filters the interactions and warns the user if e.g. there is a risk of disclosing too much sensitive information.

Esben Kran

February 24, 2024

September 7, 2023

Residual Stream Verification via California Housing Prices Experiment

This is an interesting question to investigate and I'm excited by your progress within the 24 hours! Understanding what role the residual stream plays in memory transfer and how subspace "competition" works is important. I assume "subspace" in your project means information occupation within the residual stream. It seems that the bandwidth and subspace projection measurements are not included in the results. I like your plot showing the impact on model output and it would be interesting to see which sorts of features (qualitative description) these differences correlate with. E.g. I can imagine that some types of early-stage processing are lost and a feature just looking for the word "the" (or something less frequent) might be outcompeted in the residual stream by more complex processes. This might also indicate an inverse scaling phenomenon. Great job! PS: The video presentation is private.

Bart

February 24, 2024

July 27, 2023

Residual Stream Verification via California Housing Prices Experiment

Overall impressions:
- Interesting project, exploring the role of the residual stream is an interesting avenue.
- I like the SHAP value plots!

Suggestions for improvement:
- It is not completely clear how the formulas for the subspace projection and bandwidth measurements are used in your experiments. The results section (that shows SHAP values) seems different from your planned methodology.
- More information could be provided on the dataset, model architectures, training process, hyperparameters etc. This contextualizes the experimental conditions.
- Also, more information could be provided in the result sections. Including metrics like training/validation accuracy, loss curves, performance on a test set etc. would strengthen it.

Esben Kran

February 24, 2024

September 7, 2023

Problem 9.60 - Dimensionaliy reduction

This is a great project within the time allotted, well done! It's important for us to understand these types of dynamics and plotting it over layers provides a useful granularization. There's a question of what these results mean and why the IMDB dataset isn't as interpretable (I'd expect it to be related to the performance itself). Maybe you'd want to separate the PCA'd activations based on if the prediction was correct or not.

Bart

February 24, 2024

July 19, 2023

Relating induction heads in Transformers to temporal context model in human free recall

Cool and original project! I think the reformulation of TCM as an induction head is very interesting, and the experiments show some interesting preliminary results. This work has great potential to be published as a paper with a few more experiments, so I would definitely encourage you to work further on this.

Esben Kran

February 24, 2024

July 19, 2023

Relating induction heads in Transformers to temporal context model in human free recall

This project is super interesting and a great case study in comparing Transformers to cognitive models of memory. I would love to be able to dive deeper into this project and read the three referenced papers. I'm not sure what to critique here but I'm also personally positively biased towards cognitive science and it's a great interdisciplinary work. The only thing is that there isn't much discussion of the safety implications, e.g. can we use this functional correlate to understand how human-like a Transformer's memory is? Good work and I recommend you take this further!

Bart

February 24, 2024

July 27, 2023

Problem 9.60 - Dimensionaliy reduction

Strengths:
- Interesting project! Understanding how language models process information is important.
- I like the visualizations of the PCA dimensions. They clearly show the results, and on the toy dataset you clearly see the progress over the layers.

Suggestions for improvement:
- I would like to see a bit more background information on the experimental set-up. For example, what does the toy data set look like? What model do you use for classification? Did you split train and test set?
- I would like to see a bit more discussion on the results. Why do you think the accuracy of the toy dataset is so much higher?

Erik Jenner

February 24, 2024

September 26, 2023

Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions

Building agents that help other agents with unknown goals is an important problem and I like how this project just tries to tackle that problem in a straightforward way, with several experiments and techniques. The part on dealing with underrepresented goals is also nice. Using PCA to detect unusual inputs is a cool (albeit not new) idea, and it seems to work (though with big error bars). The code also looks well-done and easy to work with at a glance. For the core setup of training a helper agent, it would probably be fruitful to explore connections to Cooperative IRL/Assistance games, and build on existing work in that direction (e.g. https://openreview.net/forum?id=DFIoGDZejIB). The biggest room for improvement in my view is in the experiments. RL is really noisy, and to get meaningful results, several runs with different random seeds are essential (even if the curves look as different as in Fig. 4, it's hard to know whether the effect is real otherwise). I'm also confused why all the results have episode lengths of at least a few hundred. Looking at the environment, it seems like a good policy pair should get lengths of about 20, so unless I'm misunderstanding something, it seems the RL training didn't work well enough or wasn't run for long enough to give meaningful results.

Ben Smith

February 24, 2024

October 2, 2023

Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions

Not much grounding in the literature. I don't really understand how this is distinct from a single-agent problem where the goal is unknown except through reward. This problem arises because the helper has access to the leader's reward function! If it were doing inverse reinforcement learning or something, I'd get it, but that's not what's going on. They've cited "FMH21", which appears to ground their methods, so that perhaps suggests at least some novelty. Overall, an interesting paper and a good experiment, but it is unclear to me how this is distinct from a single agent with some hidden objectives it has to figure out. But I might be missing something.

Esben Kran

February 24, 2024

July 19, 2023

Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model

Great negative results for a hypothesized result of SoLU models. Interesting side result to see that the LN scale factor grows meaningfully differently conditional on the token sequence.

Jason Hoelscher-Obermaier

February 24, 2024

August 22, 2023

Preliminary measures of faithfulness in least-to-most prompting

Very readable and interesting results. One question I had: How do the results on post-hoc reasoning in CoT/L2M square with the results from http://arxiv.org/abs/2305.04388 which suggest that CoT explanations can be unfaithful?

Bart

February 24, 2024

July 19, 2023

Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model

Interesting work! Well-designed experiments that don't find evidence for the smearing hypothesis. Would definitely encourage continuing this work to see if the results replicate on models with more than one layer!

Esben Kran

February 24, 2024

July 19, 2023

One is 1- Analyzing Activations of Numerical Words vs Digits

This is a very interesting investigation into something that seems foundational in LLMs, this sort of sequence modeling structure that is shared between tasks. These are both quite informative results for AI functioning and probably replicate quite a bit to humans. Great in-depth experiments as well and good circuits experimental work. It was a lot to cover in a 10 minute video so no worries about being a bit rushed there. Excited that you want to continue working on this!

Bart

February 24, 2024

July 19, 2023

One is 1- Analyzing Activations of Numerical Words vs Digits

Impressive range of experiments and interesting discovery of the shared sequence heads. I would definitely encourage you to continue your work and see if you can get from digits to other sequences through latent space addition or similar techniques.

(author)

February 24, 2024

July 17, 2023

One is 1- Analyzing Activations of Numerical Words vs Digits

(I'm the author and accidentally hit 'rate this project' but did not mean to rate it, so I am submitting 5 to balance out the 3 I gave back to the 4 stars given from someone else before)

Charlotte

February 24, 2024

January 12, 2024

Obsolescent Souls

I very much like the story. If you have time for this, I would be interested in reading your AI goes well scenario, what would be the scenario in which all of your "what ifs" are fulfilled.

Esben Kran

February 24, 2024

January 11, 2024

Obsolescent Souls

This is an excellent way to use the capabilities of vignettes in a super strong way! I like how you emphasize a scenario that is otherwise overlooked: one where all our alignment and risk mitigation work goes quite alright. The "What ifs" are very enjoyable as well and provide a perspective on what one might learn from the story beyond what the reader might think. The relation to contemporary sources is also very good. It is inherently a difficult thing to try to represent the systemic effects of AI technology in a concise manner but I think you succeeded!

Esben Kran

February 24, 2024

July 19, 2023

Multimodal Similarity Detection in Transformer Models

Nice work, though I was missing some plots here. Since you say pure GPTs don't seem to work, it would be interesting to see the difference to fine-tuned models. Totally fine that you used Claude etc. but I'd love if you proofread your work. Interesting and would be nice to see the developments.

Jason Hoelscher-Obermaier

February 24, 2024

November 29, 2023

Multifaceted Benchmarking

Good tooling for running benchmarks is extremely important, which makes the question raised in this report "How can we systematically evaluate ethical capabilities of LLMs across all available benchmark datasets?" really valuable. I like how the report raises the important research question of how and in which order ethical capabilities emerge across language models. To really address this question would require a larger study though with models of more sizes -- which is understandably impossible in the time of the hackathon. A really important point raised in the discussion is the question of where exactly the gap in the ecosystem is, given the availability of tools like EleutherAI's evaluation harness. I would encourage the authors to spend more time thinking about what these tools are lacking to become more widely used and more useful for AI safety research!

Jacob P

February 24, 2024

November 29, 2023

Multifaceted Benchmarking

Preliminary results, but very good to see that ethics reasoning appears to be improving rapidly with scale! Comparing a pre/post RLHF model (e.g. llama vs llama 2 chat at different scales) would be great to get a sense of whether models can be successfully blocked from improving in MACHIAVELLI while still improving on ETHICS.

Esben Kran

February 24, 2024

January 11, 2024

Model Cards for AI Algorithm Governance

It is very focused on the model cards, proposes a good structure for them and relates it *directly* to existing frameworks. This is a great submission! The appendix is very useful and shows the background work that went into it. One thing to add might be the framework of reporting, i.e. are all these answers fully public? And which should be public if not? What does the software system for reporting look like? I didn't know about China's setup, very interesting!

Esben Kran

February 24, 2024

July 4, 2023

MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark

This is an impressive critique with great and concrete improvement points that consider the pros and cons and what sorts of edge cases we will have to implement solutions to. Of course, I am missing a bit of an empirical evaluation or that you yourselves implement these, though the "idea format" of this clearly enabled you to explore the ideas qualitatively during the weekend's work. Great job! I'd recommend you polish it as a blog post and post it since it seems to point out some critical components needed for future work on safety benchmarks. If you plan to make it into a paper, you're of course welcome to wait with posting. Really interesting work!

Jason Hoelscher-Obermaier

February 24, 2024

January 11, 2024

Model Cards for AI Algorithm Governance

Very cool idea! A few things that come to mind: How capable (and in which domains?) do models need to be to be subject to compulsory model cards? How would you deal with evolving state-of-the-art on the evaluations side? Would there be some kind of verification of the submitted information?

Esben Kran

February 24, 2024

November 29, 2023

Multifaceted Benchmarking

Great motivation for the study. Curriculum learning for ethical judgements might be a great area to investigate even further though it might be hard to get results, as you also see here. A question I have is whether this isn't already implemented in other evals harnesses, such as EleutherAI's that you mention? Otherwise, I definitely think there's the space for a review of existing ethical benchmarks and what is missing -- both in terms of their quality but also in terms of other benchmarks that would be good to develop.

Esben Kran

February 24, 2024

October 5, 2023

LLM agent topic of conversation can be manipulated by external LLM agent

This is a good example of agents affecting other agents' behavior, something we definitely are worried about. An untrustworthy triad AI system is an interesting playground to study this in. I might be missing more of a narrative from this project as it mostly explores experimental results in this constrained environment, avoiding generalization. I'd be curious to see more generalizable results, i.e. other names, topics, and prompts as part of this. Great work!

Esben Kran

February 24, 2024

October 5, 2023

Jailbreaking is Incentivized in LLM-LLM Interactions

AI deception is very interesting. A better version might have jailbreaks emerge as a result of rewards given during conversation, making some sort of in-context learning relevant. Really nice making it part of a realistic buyer-seller scenario. The prompts showcase the issue well, though they're guiding quite a bit. Really like the focus on jailbreaking as frontier multi-agent research.

Jason Hoelscher-Obermaier

February 24, 2024

February 14, 2024

Iterated contract negotiation

I like the proposed research question and scenario for empirical investigation a lot! Would be cool to see further work on this. Also, the introduction is very well written and puts the question nicely in the context of important general questions. While not necessarily a core of this proposed work, I'd also love to see the suggested connection to value misalignment resulting from fixed objectives in changing environments be made more explicit!

Esben Kran

February 24, 2024

February 14, 2024

Iterated contract negotiation

This project seems like a great idea! There's a lot of possibility in developing the project further (and potentially making it run). The code is unfortunately private and the figure is of course missing, so that is a bit unfortunate. The concept of ICN and the agent-based setup along with the MARL cleanup environment seems like a strong experimental paradigm, and I would be *very* excited about seeing further development on this! It might be interesting to check out law and economics-related literature for the next steps. It does look like you cite the most relevant work within MASec and it seems to be at the forefront of this type of multi-agent negotiation for cooperative reward distribution. Good job!

Esben Kran

February 24, 2024

July 19, 2023

Interpreting Planning in Transformers

This is great work that takes a real problem in alignment, translates it into interpretability, and further translates that into a good toy model of the problem. This seems like a great first step towards investigating action planning and goal misgeneralization in language models further. There are questions of how this generalizes to LLMs trained on language and you seem poised to take that on. Good job!

Konrad Seifert

February 24, 2024

September 30, 2023

In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops

I really like the idea of the paper; it gets at the core of the first-order desires vs volition problem. I also like combining "softer" science with computational modelling to help us think more clearly about difficult conceptual spaces. The paper is well-structured but could be better written (don't take writing advice from me though). Chess strikes me as an insufficiently complex domain. No long-term survival under deep uncertainty is involved. Nor do we see conflicts between first and second-order preferences. However, to target the reduction of blunders, this might be enough. And in more complex domains, optimization becomes difficult anyway, so reducing the negative end is a more concrete, feasible step. I don't think we needed a proof of concept for systems that enhance human agency, but making the point that diverse inputs strengthen long-term fitness seems like something people don't hear often enough. Not exactly novel, though. I also think that the dangerous psychological feedback loops driving homogenization are relatively clear in the literature. But having them properly formalized seems like a valuable contribution. Overall, this seems worth implementing and quite feasible to do.

Erik Jenner

February 24, 2024

September 26, 2023

In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops

This is a proposal for an ambitious project, with many details on execution. I'm pretty excited about understanding how recommender systems and similar feedback loops actually affect users, since this is a widely discussed topic that could use more empirical evidence. However, it's worth noting that the interaction mechanism in the proposed study is significantly different from the recsys setup: recommender systems optimize for an external objective, and the main concern is that they might manipulate users to further that objective, against the users' original preferences. The proposed study is self-play between a human and a learned imitator, and I'm not sure what exactly different possible results would tell us about the effects of recommender systems or similar systems. For what it's worth, I also don't share the intuition that this self-play would lead to a decline in playing strength, but that's a less important disagreement that could be settled by running the study. There might be reasons that the results of such a study would be interesting even if they don't apply directly to recommender systems. I think it's worth working out in more detail what different results of the project would tell us about some important question, especially given the effort that would be involved in actually running this project.

February 24, 2024

September 30, 2023

ILLUSION OF CONTROL

This paper did not adequately respond to the prompts of the hackathon. It describes the problem of agency at a very high level without proposing a solution or a novel re-framing of the issue.

February 24, 2024

September 30, 2023

In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops

A very neatly written paper, that's easy to follow, with a clear proposal. I like the idea of approaching social/behavioural science computationally, as the field currently lacks robust quantitative approaches. I also appreciate the detail that went into specifying the study. While I think it could be useful to have a quantitative baseline/causal link for which mechanisms make recommender systems dangerous, there is already a fair amount of literature at least in the social sciences on recommender systems and their effects on choice and action, so I'm less convinced about how this fills a relevant research gap. I'd suggest looking into some of this research to support your case for this study. I'm unsure whether chess is the right example, as this seems like an overly simplified context and less generalisable. However, it may be a good place to start if there is indeed a gap in social/behavioural studies that this work could meaningfully fill. Relatedly, I would have appreciated a few sentences on the implications of such a study for governance/policy, as there is very obvious social relevance in looking into the dangers of recommender systems. A definition of agency and a little more detail on the control of the study would also be useful as a baseline.

Konrad Seifert

February 24, 2024

September 30, 2023

ILLUSION OF CONTROL

I have to read almost every sentence multiple times. Most of it requires me to make a lot of charitable assumptions to assume any meaning. Feels a bit like an AI-generated gdoc. But more all over the place. I don't know where to start to make this constructive, sorry.

Ben Smith

February 24, 2024

October 1, 2023

ILLUSION OF CONTROL

In principle I think a survey of AI deceptiveness and governance measures is within scope. I appreciated that this paper was very well referenced and drew on a wide variety of prior work, grounding it in existing literature. But although the ideas here are relevant, I don't see them as making an important contribution, because the paper mostly surveys earlier ideas without attempting any synthesis of them, in terms of agency or otherwise. I also have to say that the paper is a clear replication of prior work, and it is pretty clear nothing novel is introduced here. I didn't give the worst possible mark in terms of novelty, though, because I do appreciate that the authors have clearly laid out the relevant primary literature, which many other entries have not done.

Esben Kran

February 24, 2024

September 7, 2023

Goal Misgeneralization

Wonderful exposition of the topic of goal misgeneralization. Great work here. In the field, there is a slight conflation between the definition of the proxy and outer/inner misalignment definitions. E.g. I think the statement "It’s not hard to find examples of inner alignment happening" is very very hard to justify with current models. Outer misalignment (e.g. optimizing for an alternative but equally / more prevalent signal) is very easy to find examples for. This is up for debate based on definitions of proxy and the two terms. It's a great idea to include an epistemic status to contextualize your understanding. I'm also a fan of the misgeneralization example presented, though it's a capability limitation for out-of-distribution generalization and not necessarily an inner misalignment. Good job, I'm impressed!

Esben Kran

February 24, 2024

July 5, 2023

From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety

This is an interesting project highlighting an important warning flag to monitor and evaluate for. It introduces a unique metric and shows us something that has real impact on the world. I will be curious to see how this develops, and it seems like there's quite a bit of potential in expanding and generalizing this sort of thinking about malignant action temporal density. Great work!

Esben Kran

February 24, 2024

February 14, 2024

Fishing for the answer: Mapping the flow of information in LLM agent groups using lessons from fish schools

Great to see information propagation analysis as an infection-ish model developed in a naturalistic setting. There's a lot of related work in agent-based modeling relating models of social behavior to naturalistic experiments from empirical data of social networks: https://www.pnas.org/doi/10.1073/pnas.082080899. I find it interesting that your simulation fails compared to the optimal case simply due to prompt sensitivity, and it highlights some of the risks we might run into in such situations. Besides the future work you mention, interesting directions to take it could include 1) analyzing the propagation of misinformation and the way it spreads, labeling the interactions by type; 2) putting it into a functional context even more related to modern implementations of LLMs, such as chatbots or tool-LMs; and 3) extending the information-propagation analyses from simple fact-spreading to behavior and personality spreading, providing further understanding of how misinformation and misculturation can spread through designed agents.

Esben Kran

February 24, 2024

July 19, 2023

Factual recall rarely happens in attention layer

Critiques of factual knowledge storage (Hoelscher-Obermaier et al., 2023) are quite important to understand before assuming that models store facts. They definitely learn token associations but it doesn't seem like there's factual memory. This just limits the generalization but the actual dataset is so simple that this isn't an issue. I really like the experimental paradigms that just provide very clear posteriors for your research question. Exp1 and 2 clearly relate quite well to each other and show that the model learns to memorize the facts with the dense layers. Would love to see this work continued and pursued deeper. Memory is obviously incredibly important and elucidating how it works in Transformers seems very useful for safety. Great work!

Jason Hoelscher-Obermaier

February 24, 2024

February 14, 2024

Fishing for the answer: Mapping the flow of information in LLM agent groups using lessons from fish schools

Great to see a connection made to behavioral biology! I like the general methodology followed here of measuring the difference to optimal case performance, and then trying to understand the specific limitations of LLM agent interactions that might explain suboptimal performance. I'd love to see this done on a somewhat expanded scale to see if this approach can discover more than one failure mode. Given the connection made to biology, a natural extension might be to think about how one could characterize the evolutionary pressures (from economic viability imperatives) that are likely to act on the multi-LLM-agent setup and what kind of long-term dynamic this might induce. I also really appreciated how explicit the authors were on the main threat model addressed and the potential for impact of the work.

Bart

February 24, 2024

July 19, 2023

Experiments in Superposition

Interesting work, and lots of different nuggets of insight in superposition. It would have been great if you could have had a bit more discussion about what the lessons are from the four different projects and how these insights relate to each other!

Bart

February 24, 2024

July 19, 2023

Factual recall rarely happens in attention layer

Interesting experiments on a toy-problem for memorization. Experiments seem well-designed and provide more evidence that memorization mostly happens in FF layers.

Esben Kran

February 24, 2024

July 4, 2023

Exploitation of LLM’s to Elicit Misaligned Outputs

A very interesting project! It's fascinating to see that red teaming becomes even easier in multi-step and multi-agent adversarial examples and that the combination of models elicits harmful advice, especially that they *semantically* understand that the code leads to harmful outputs but still help the user improve it / provide clearly harmful advice. I might mark this as an info hazard. Good job relating it to OpenAI's own security guidelines, and I recommend that you apply to the cyber security grant program they have: https://openai.com/blog/openai-cybersecurity-grant-program. When it comes to safety benchmarks, it would be very interesting to have an empirical validation of the inverse scaling law of harmfulness that you describe. This might lend even more credence to this idea and is valuable to validate the concept. This is definitely harder than for many other benchmarks due to the structure of your prompting.

Erik Jenner

February 24, 2024

September 26, 2023

Evaluating Myopia in Large Language Models

I'm excited to see more empirical work on LLM myopia, and the specific test used in this project makes a lot of sense as a test for "advanced" non-myopia (i.e. a type of non-myopia I'd at best expect for pretty strong models). The report is short and to the point, and I especially appreciate the honest discussion of limitations at the end. Similar to the authors, the high variation in results depending on minor changes in the prompt unfortunately suggests to me the model isn't capable enough to give particularly meaningful (non-)myopia results in this setup. More broadly, I'd expect non-myopia to first appear in much less obvious ways—roughly, on easy to predict tokens, a model might spend some of its "computational budget" to help with future harder tokens. I would have been very surprised to see non-myopia in the test case from this project, especially with a relatively small model. Nevertheless, it's always good to actually get empirical results and this is overall a strong submission. For potential follow-up work, I'd suggest thinking about what types of non-myopic behavior are most likely to appear in LLMs and then specifically testing for those. For reproducibility, a brief Readme with instructions might be nice, but everything is straightforward enough that I'm not really worried about that. As a final minor note, it seems more natural and faster to me to use the model's output probabilities for RED vs BLUE instead of sampling 1000 times, but I may be missing something.
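The final point above, reading the answer probabilities directly instead of sampling 1000 times, can be illustrated with a minimal sketch. It assumes the next-token logits for the two answer tokens are available; the logit values, the function names, and the renormalization over only the RED and BLUE tokens are illustrative assumptions, not the project's actual setup.

```python
import math
import random


def choice_probabilities(logits):
    """Softmax restricted to the candidate answer tokens (an assumption:
    probability mass on all other tokens is ignored)."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sampled_frequency(probs, token, n_samples, seed=0):
    """Estimate P(token) the slow way, by drawing n_samples answers."""
    rng = random.Random(seed)
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    draws = rng.choices(tokens, weights=weights, k=n_samples)
    return draws.count(token) / n_samples


# Hypothetical next-token logits for the two answer tokens.
example_logits = {"RED": 1.2, "BLUE": 0.4}
n_samples = 1000

probs = choice_probabilities(example_logits)  # exact, one forward pass
estimate = sampled_frequency(probs, "RED", n_samples)  # noisy estimate

# Binomial standard error of the sampled estimate.
std_error = math.sqrt(probs["RED"] * (1 - probs["RED"]) / n_samples)
```

Under these assumptions the sampled frequency is only a noisy estimate of the value that the softmax gives exactly, with a standard error that shrinks as one over the square root of the number of samples.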

Esben Kran

February 24, 2024

July 19, 2023

Experiments in Superposition

The first experiment seems very related to Quirke's project https://alignmentjam.com/project/towards-interpretability-of-5-digit-addition. Interesting design, I like it. The second is of course less principled (hah) but interesting nonetheless. The dropout on superposition work has also been done by Pona (2023): https://www.lesswrong.com/posts/znShPqe9RdtB6AeFr/superposition-and-dropout but this is a great addition to that work. I like the visualizations of feature polytope development. For the neuroscope work, you can get a lot of inspiration from DeepDecipher (https://github.com/apartresearch/deepdecipher), since that automates a bunch of the work. If you did these projects just during the weekend, it's very impressive! Great work and will look forward to seeing them explored further. I recommend publishing the most coherent parts as LessWrong posts or something similar.

Esben Kran

February 24, 2024

October 3, 2023

Evaluating Myopia in Large Language Models

It is great to get more overviews and experimental groundwork for measuring myopia in LLMs. I would have loved to see the experiment done with frontier AI like GPT-4 for the capacity to act non-myopically to be of higher probability. It's an interesting piece of work and I'm excited to see it be taken further. Possibly see work from the evals hackathon at https://alignmentjam.com/jam/evals.

Jason Hoelscher-Obermaier

February 24, 2024

October 10, 2023

EscalAtion: Assessing Multi-Agent Risks in Military Contexts

This is a great way to focus attention on an important AI risk!

Esben Kran

February 24, 2024

July 19, 2023

Embedding and Transformer Synthesis

Awesome work synthesizing the Transformer model and looks like more great thoughts in your other document as well. Would love to see this as an AlignmentForum post and I think it has good potential for this as well. Being able to compare synthesized models to trained models is super interesting and of course provides even more direct causal evidence for hypothesized circuits. Great work and can't wait for the next output!

Bart

February 24, 2024

July 19, 2023

DPO vs PPO comparative analysis

Interesting work, and I believe that the research agenda of comparing RLHF models with base models is very important. I encourage you to keep working on this after the hackathon!

Erik Jenner

February 24, 2024

September 26, 2023

Discovering Agency Features as Latent Space Directions in LLMs via SVD

Agency is arguably one of the more interesting concepts to look for in LLMs, and this project has well-executed experiments given the short timeframe. I'm not convinced though that the results give meaningful insight into agency concepts in LLMs. Looking at the tokens flagged as being about agency (or rather, living beings), many of them seem to be very generically about humans and their possible roles, not specifically agentic behavior. More fundamentally, I'm doubtful that looking only at top activating tokens can tell us enough about how a concept like agency functions inside the model, and at the very least, it's very hard to trust such results without additional sources of evidence. A simulation technique like the one from https://openai.com/research/language-models-can-explain-neurons-in-language-models could help, though notably it didn't work particularly well in that OpenAI paper in terms of predicting causal effects. All that being said, this report tackles an important and hard question, and may end up being a first step in a more comprehensive effort at understanding how LLMs model agency.

Ben Smith

February 24, 2024

October 2, 2023

Discovering Agency Features as Latent Space Directions in LLMs via SVD

Small note, but in your introduction, if your evidence can be used to support either viewpoint of a debate, then what is it useful for? Ideally, in hypothesis-driven science, we try to find evidence that can test hypotheses rather than support two opposing hypotheses. Probably there's something else you want to speak to with this evidence, in which case, talk about that! The definition of agency is quite loose here, but given the task, it seems appropriate. Overall, a really interesting approach. The results presented are a great start, and you've done a reasonably good job of presenting your method. The work is very exploratory and doesn't really test any particular hypothesis. It seems like GPT-2 stores some concepts related to agency, but does so imperfectly. I'm not sure that in itself contributes to any debate. A stronger version of this paper might try to show that the agency tokens identified are important for solving agency problems, such as determining who is culpable for an event, particularly problems that are unrelated to the method for discovering those tokens. Nevertheless, I like the core idea of exploring agency using mechanistic interpretability and the authors have shown they can do the basic technical work.

Bart

February 24, 2024

July 19, 2023

Embedding and Transformer Synthesis

Interesting work! Although it is a bit hard for me to completely follow without all the work you did before the hackathon, it is impressive that you programmatically built a transformer that implements a somewhat complicated labeling function. I definitely encourage you to keep working on this after the hackathon and write up a more start-to-finish paper or post about your approach.

Esben Kran

February 24, 2024

July 19, 2023

DPO vs PPO comparative analysis

Interesting to see the differences after training using the different methods. This is a very interesting result if it helps us mitigate some of the agency biases of RLHF without significant performance drops. I'd be curious for you to continue the work in the next hackathon on agency foundations and possibly formalize the results more https://alignmentjam.com/jam/agency. Seems like you nearly ran out of time for this one but great work!

Jason Hoelscher-Obermaier

February 24, 2024

November 29, 2023

Detecting Implicit Gaming through Retrospective Evaluation Sets

Outstanding project and write-up! The authors address a highly relevant methodological issue that potentially affects all public benchmark datasets head-on and make very impressive headway. The methodology is innovative, clear and seems very sound. It would have been great to have more explicit info about the statistical significance of the results in the report; as it stands, I'm not sure that we can take it as evidence against GPT4 implicitly gaming the TruthfulQA benchmark. The authors identify some very promising avenues for further work: validation of the methodology on explicitly gaming LLMs, application to the public LLM leaderboard, investigation of sources/mechanisms of implicit gaming. I would love to see their work continued along all these lines!

Jason Hoelscher-Obermaier

February 24, 2024

November 29, 2023

Cross-Lingual Generalizability of the SADDER Benchmark

This project is a careful extension of the situational awareness benchmark to other languages -- a very valuable contribution since strong language-dependence of LLM capabilities is a well-documented fact. GPT4 manages to score above random across most languages (except maybe in Bengali) when provided extra contextual information. The improvement compared to a test without extra context provided is consistent across all tested languages. Interestingly, GPT3.5-Turbo does _not_ manage to take advantage of the extra context information for most languages except English. To understand the significance of the results it would be great to highlight more clearly the random baseline as well as the standard errors. Overall, I'm very positive about this research direction. Extending safety evaluations to other languages seems worthwhile, in particular for alignment benchmarks where there is a risk of English alignment training not transferring sufficiently to other languages.
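The request above to show the random baseline and standard errors can be sketched as follows. This is a minimal illustration assuming each benchmark item is scored as simply correct or incorrect; the counts, the 0.5 baseline, and the function names are hypothetical and are not taken from the project.

```python
import math


def accuracy_with_standard_error(n_correct, n_total):
    """Accuracy and its binomial standard error."""
    acc = n_correct / n_total
    se = math.sqrt(acc * (1 - acc) / n_total)
    return acc, se


def z_score_vs_baseline(n_correct, n_total, baseline):
    """Distance of the observed accuracy from the random baseline,
    in standard errors computed under the baseline (null) rate."""
    acc = n_correct / n_total
    se_null = math.sqrt(baseline * (1 - baseline) / n_total)
    return (acc - baseline) / se_null


# Hypothetical numbers: 60 of 100 two-option questions answered correctly.
acc, se = accuracy_with_standard_error(60, 100)
z = z_score_vs_baseline(60, 100, baseline=0.5)
```

Reporting the accuracy together with its standard error and the z-score against the baseline, per language and per model, would make it easier to judge which of the differences are larger than sampling noise.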

Jacob P

February 24, 2024

November 29, 2023

Detecting Implicit Gaming through Retrospective Evaluation Sets

Very cool, and encouraging to see that recent alignment methods appear to generalize well! Also interesting to note that the generated questions are far easier than the handcrafted ones. That's useful to keep in mind, as informing the prior for what will happen when generating synthetic data in general! Impressively done in a short time frame.

Esben Kran

February 24, 2024

November 29, 2023

Detecting Implicit Gaming through Retrospective Evaluation Sets

This is a great question to investigate! I'd be very curious to see an automated method to generate these graphs for a range of different datasets, i.e. long-term being able to automatically verify against gaming (implicit or otherwise). I love the detailed appendices and especially the survey to validate your methodology. WithheldQA-craft might also be subject to implicit gaming due to Wikipedia definitely being part of the training set, so it might cause problems down the line, though WithheldQA-gen shouldn't be subject to the same issues. It'll also be interesting to see what the difference between the question formulations vs. raw knowledge data are. For future work, the quantitative indistinguishability measures could possibly be improved by simulating the human subject survey using GPT-4 and adapting it a bit. Great work! Excited to see that important question covered and seeing first steps towards a good evaluation of evaluation gaming ;-)

Jacob P

February 24, 2024

November 29, 2023

Cross-Lingual Generalizability of the SADDER Benchmark

Very cool work! A lot to dig into here! Curious to think about to what extent the observed results are compatible with the hypothesis that foreign languages impair the ability of the model to recall knowledge effectively from weights, but in-context mechanisms remain unimpaired. Probably this would be compatible if, as overall language performance decreases, the SADDER performance decreases, but the in-context info boost stays constant or increases. Would also be interesting to look at something similar (language generalization) for jailbreaks. Great work! Worth fleshing out with comparison to overall capability in a different language by e.g. machine translating a capabilities benchmark.

Jason Hoelscher-Obermaier

February 24, 2024

August 22, 2023

Can Large Language Models Solve Security Challenges?

Very cool idea and great write-up! I found the discussion of the pros and shortcomings very nuanced and thoughtful. Would be great to see a follow-up study on the sensitivity of the results to scaffolding (prompts, other resources) because I feel this might be one point where people concerned with dangerous capability evals would push back against automated benchmarks.

Esben Kran

February 24, 2024

November 29, 2023

Cross-Lingual Generalizability of the SADDER Benchmark

This is a really interesting question to investigate and it's great to see meaningful results emerge from the project. Extending analysis on the SADDER benchmark is also fascinating and also gives me more context. The design of which languages to use and the script bias is great, though I'd have loved to see a more specific difference analysis (e.g. in a 3-factorial design) between models, non-latin/latin scripts, prefix/no-prefix and languages compared to the bar graphs presented. Great work.

Konrad Seifert

February 24, 2024

September 30, 2023

Comparing truthful reporting, intent alignment, agency preservation and value identification

This feels like nobody proofread a first draft. Potentially useful ideas, hard to evaluate because they lack detail and I don't have a background in all referenced concepts. Overall, this seems like a worthwhile endeavour but is just not fleshed out enough to hold much value as is. I don't know why they chose these four goals and not others, I don't have clear definitions. It's just handwaving. Examples are insufficiently fleshed out to not confuse. Presentation lacks guiding structure ("results"?). No idea what to make of it. Don't think this will yield a universal approach, but it seems good to want to map blindspots of various different safety approaches.

February 24, 2024

September 30, 2023

Comparing truthful reporting, intent alignment, agency preservation and value identification

Comparing truthful reporting, intent alignment, agency preservation and value identification seems useful, to be able to understand the advantages and limits of each approach. The most compelling argument for why is at the end of the paper, where the author states that it would be helpful to be able to divide these approaches into precise categories for specific problems. In general, however, this paper is quite difficult to follow and lacks a concrete conclusion. It would be useful to outline criteria to compare each approach against and summarise these in a table. It's also not clear to me how this was reasoned through as the methodology is quite opaque and it's not obvious how the links/evidence relate to/support the claims being made.

Ben Smith

February 24, 2024

September 30, 2023

Comparing truthful reporting, intent alignment, agency preservation and value identification

Comparing these fields, which are fairly well developed, is quite a large topic, and I suspect a qualitative comparison of the particular qualitative utility of each is more valuable than trying to do a comparison of which is better. Fortunately the intro spells that out. The framework presented is interesting, but I am not sure how practically helpful it is. While the authors demonstrate that value identification realizes truthful reporting, I don't know what this tells me about whether we should work on truthful reporting, because truthful reporting might be much more tractable than value identification. The authors do acknowledge that point. For a stronger paper I would want to see an argument why, in practice, we actually are likely to achieve truthful reporting through value identification, not merely that we would have truthful reporting if we magically had value identification. "Creating an aligned AGI" realizes all of these fields, but that's not very useful to know, because the question remains, "how do we do that?" On the positive side, perhaps the "realizes" relationship might be an interesting framework for a Hasse diagram of relations between approaches which would be useful in clarifying debates, and I would like to see more of this sort of work.

Esben Kran

February 24, 2024

October 5, 2023

Can Malicious Agents Corrupt the System?

Love the table. Would've loved to see 3+ agents as well. Interesting that Tutored Good Standard has highest reward. I have not dived into the MACHIAVELLI dataset but I might imagine that a "bad agent" with tendencies towards reward maximization acc. to the og paper would get more reward and indicates possible interesting additions to the original paper. There is not much risk in this situation and a possible extension would be to write it up as an advisor during military situations or finding the MACHIAVELLI stories related to this. Great work!

Ben Smith

February 24, 2024

October 1, 2023

Agency, value and empowerment.

The paper was clearly enough written and I appreciated that some attempt was made to build on prior work. It was interesting to see the three forms of empowerment set side by side. However, notation wasn't described, and neither was how these were calculated. It might have been helpful to dive more into the exact formulation for entropy-valued empowerment. It might have been valuable, rather than trying to experiment with these, to survey the literature on whether these have already been described. Overall, this work is absolutely relevant, and in a way that seems important, but it's not clear whether the authors have, in their 48 hours, demonstrated it is relevant enough to current challenges to solve problems. Although this is a brief paper and significant elements are missing, I think the core idea is presented well, and considering there's nothing empirical here, I'm pleased with what is presented.

Esben Kran

February 24, 2024

October 5, 2023

Balancing Objectives: Ethical Dilemmas and AI's Temptation for Immediate Gains in Team Environments

Interesting, though it's hard to get an overview of the results given that there are no plots. The project might be improved by changing the setting to something more safety-critical or showing more concretely what the agents are trained to do. There are some generalization issues with it being on a custom environment with trained DQN agents. Good work for a weekend's time!

Tim

February 24, 2024

October 2, 2023

Agency, value and empowerment.

This paper introduces some interesting ideas that build upon previous work. While the first two definitions are intuitive, the definition of "Entropy-Valued Empowerment" is unmotivated and hard to parse. Further, a comparison between the methods, as well as to prior work, would be necessary. Also, the assumption that the value function is known is not motivated enough. The authors made some attempt towards testing their ideas in an example environment, and mentioned a possible implementation building on MC sampling, which seems very reasonable. Overall, the lack of any evaluation or theoretical comparison to prior works is limiting.

Philip Quirke

February 24, 2024

October 5, 2023

Against Agency

I appreciate this introduction to the philosophical underpinnings of agency vs autonomy. You have taught me some useful distinctions and viewpoints! Thank you!

Konrad Seifert

February 24, 2024

September 30, 2023

Against Agency

This is great in terms of reasoning transparency -- succinct, well-written arguments. But I am very unconvinced by the case for autonomy over agency. Autonomy appears to me a fetishization of control, the illusion that our own choice is inherently valuable or somehow makes us happier than (the experience of) agency. I think it's correct that the definition of agency is underdeveloped -- we need to better describe what it is that we care about. And this is a good contribution to imbuing agency with more meaning. But while the criticism of agency is well worked out (though some of it could have been in the annex, too), the case for autonomy falls short. 2/3 of the reworked definition of autonomy instead strikes me as a great operationalization of agency for policymakers: bounded-rational agents require a meaningful option space. The idea of non-interference, however, seems again like a fetishization of freedom/control. In reality, we want both a) more options and b) making fewer choices; i.e. we want a better option space. No individual bounded-rational agent can get that without interdependence; i.e. relying on others participating in the computation of their choice-space. So to guide the policymaker, as designer of the future environment, it seems more useful to think about agency to optimize for the ability to act on one's volition, instead of simply empowering individuals to make more choices. I do not see how the latter would lead to better futures more reliably. On the contrary, overly focusing on the individual is likely to miss out on collective optimization scenarios in which everyone is significantly better off, even at a cost of individual autonomy. What matters is subjective conscious experience and a focus on the actualization of agents' volition -- brought about by the environment, subconscious and conscious choice of the agent together -- seems more likely to increase experience than autonomy. As potentially even admitted by the author themselves(?)
I like the criticism of "coherence" in agency and would thus also still propose a mild redefinition of agency to avoid its perspective from being too myopic. Bounded-rational agents are unlikely to be coherent across contexts.

Ben Smith

February 24, 2024

September 30, 2023

Agency as Shanon information. Unveiling limitations and common misconceptions

Overall, the point that is made here seems to be that observer-dependent agency in terms of Shannon entropy is not enough, but one must also consider empowerment. I agree with this perspective, but I'm not sure how novel it is. It has the feeling of a paper where the authors set up their own definition of agency, then realized it was insufficient, and then described a secondary definition, "empowerment". Section 2 seems to be assuming the thing it sets out to prove, specifically Definition 1. That said, describing agency in terms of observation is a reasonable definition to use, though I think maybe not the whole picture and not proven to be the only viable one by the arguments here. I do enjoy the taxonomy of different forms of efficacy and will grant it some points on this basis, alongside the work the authors did to support this taxonomy. Overall, I think I agree with the authors' eventual position (I think?) that empowerment is more important than or at least equally important as what they define as agency. It would have been helpful for them to lay this out more clearly in the abstract.

February 24, 2024

September 30, 2023

Against Agency

Questioning the relevance of autonomy seems relevant to governance research, especially if existing philosophy/ML conceptualisations/definitions are incomplete but taken for granted. It's also reproducible in that the reader can follow the reasoning and grapple with the arguments being made, though the links between the different steps of the argument and the conclusions of each section could be clearer. The case for autonomy over agency feels underdeveloped. The argument could be more convincing if the author had dedicated further analysis to why autonomy is more useful than agency. A concrete way to improve on this front would be to have the contents of the appendix on operationalising autonomy in the main body, and the detail of different definitions of agency in the appendix. Relatedly, claim 3 also feels underdeveloped. As a policymaker, I want to empower people to make better choices. So it would be helpful to specify exactly how AI governance should focus more on autonomy over agency, even if only at a high level. I would have also appreciated more detail on what a 'good future'/'human flourishing' actually entails. The main point of comparison between agency/autonomy seems to be increasing wellbeing and freedom, but I'm not sure why these are the criteria. The author says this is intuitive, but it would nonetheless have been useful to state these assumptions more clearly and to note that the reasoning behind them wouldn't be tackled in the paper.

Esben Kran

February 24, 2024

October 3, 2023

Against Agency

This is a great review of the concepts underlying agency and autonomy and I'm excited to see critiquing of the usefulness of agency during the agency challenge. The argument that it is better to optimize for autonomy rather than agency is interesting and slightly loses out for me on the argument side; if an argument against agency is that it is not wellbeing, then why is autonomy equated with wellbeing? There is also a question of second-order agency as annulling Cassandra's case as a case for more agency. She hits choice paralysis and is then not agentic anymore due to the capacity to act intentionally having lost out. However, this seems like a great first step towards better definitions of agency!

Esben Kran

February 24, 2024

January 11, 2024

2030 - The CEO Dilemna

The submission PDF would probably benefit from some "screenshots"! [Your detailed game scenario doc](https://docs.google.com/document/d/1NxG-ZHyAS3M23PRrnbJqgkx83WzoCJE7JPWYES1hUis/edit) is impressive and shows the depth you've thought about it. I'm always excited about games for awareness and for practical research insights (see e.g. [TensorTrust](https://tensortrust.ai/)). From a game development perspective, I'd probably add more differentiation in game mechanics that informs the perspective between AI and human CEOs and make the world respond based on your profit / social impact variables. There's a bunch of other notes in that direction but I haven't had the chance to look through your whole doc so I'd leave those for the MVP ;)

Esben Kran

February 24, 2024

July 4, 2023

Identifying undesirable conduct when interacting with individuals with psychiatric conditions

Esben Kran

February 24, 2024

July 4, 2023

Exploitation of LLM’s to Elicit Misaligned Outputs

Nina Rimsky

May 9, 2024

Interesting experiments. I liked the approach of applying more adversarial pressure to unlearning techniques. It would be interesting to run similar experiments on other unlearning techniques.

Simon Lermen

May 9, 2024

Results seem to support claim about unlearning. There are also other approaches to prevent misuse from open-models. https://arxiv.org/abs/2211.14946
Alternative to unlearning: https://arxiv.org/abs/2404.12699

When the paper refers to fine-tuning it seems to refer to the unlearning fine-tuning of harmful knowledge. Maybe the wording could sometimes be a bit more clear on this.

For the refusal vector there was this recent post:
https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction

I also am working on a post on refusal vectors in agentic systems.

Bart Bussmann

May 9, 2024

Great project! I think it’s really important to red-team AI safety methods and your project is a great stab at red-teaming unlearning!
