This work was done during 48 hours by research workshop participants and does not represent the work of Apart Research.

Evaluating Myopia in Large Language Models

Empirically investigating the myopia or lack thereof of Llama models

Anonymous: Team members hidden

Marco Bazzani, Felix Binder

Hyperbolic Discounters


Hackathon

Agency

Jam site

Virtual

Anonymous

Ben Smith
Not much grounding in the literature. I don't really understand how this is distinct from a single-agent problem where the goal is unknown except through reward. This problem arises because the helper has access to the leader's reward function! If it were doing inverse reinforcement learning or something I'd get it, but that's not what's going on. They've quoted "FMH21", which appears to be grounding their methods, so that perhaps suggests at least some novelty. Overall, an interesting paper and a good experiment, but it is unclear to me how this is distinct from a single agent with some hidden objectives it has to figure out. But I might be missing something.
Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions
Ben Smith
I thought "attainable utility preservation" had already got a lot further in talking about how you can quantify the different goals that might be achieved from a starting point, taking into account the value of each goal with a diversity of possible goals. It's possible this is a novel topic, but there isn't a clear finding, and it's quite speculative. So there's not much novel here beyond an idea. Still, it's an interesting idea, and worthwhile to start a Gymnasium environment for testing the idea. So I give authors some points for all that.
Uncertainty about value naturally leads to empowerment
Ben Smith
It's possible this is a novel topic, but there isn't a clear finding, and it's quite speculative. So there's not much novel here beyond an idea. It is a very interesting idea, and I give the entry points for that. I thought "attainable utility preservation" had already got a lot further in talking about how you can quantify the different goals that might be achieved from a starting point, taking into account the value of each goal with a diversity of possible goals.
Uncertainty about value naturally leads to empowerment
Ben Smith
Small note, but in your introduction, if your evidence can be used to support either viewpoint of a debate, then what is it useful for? Ideally, in hypothesis-driven science, we try to find evidence that can test hypotheses rather than support two opposing hypotheses. Probably there's something else you want to speak to with this evidence, in which case, talk about that! The definition of agency is quite loose here, but given the task, it seems appropriate. Overall, a really interesting approach. The results presented are a great start, and you've done a reasonably good job of presenting your method. The work is very exploratory and doesn't really test any particular hypothesis. It seems like GPT-2 stores some concepts related to agency, but does so imperfectly. I'm not sure that in itself contributes to any debate. A stronger version of this paper might try to show that the agency tokens identified are important for solving agency problems, such as determining who is culpable for an event, particularly problems that are unrelated to the method for discovering those tokens. Nevertheless, I like the core idea of exploring agency using mechanistic interpretability, and the authors have shown they can do the basic technical work.
Discovering Agency Features as Latent Space Directions in LLMs via SVD
Tim
This paper introduces some interesting ideas that build upon previous work. While the first two definitions are intuitive, the definition of "Entropy-Valued Empowerment" is unmotivated and hard to parse. Further, a comparison between the methods, as well as to prior work, would be necessary. Also, the assumption that the value function is known is not motivated enough. The authors made some attempt towards testing their ideas in an example environment, and mentioned a possible implementation building on MC sampling, which seems very reasonable. Overall, the lack of any evaluation or theoretical comparison to prior works is limiting.
Agency, value and empowerment.
Tim
The main problems named w.r.t. formalizing agency as the number of reachable states are very relevant. It is mentioned that not only the number of states is important, but that it also needs to be considered how desirable these states are and whether they are reachable. However, it seems that the authors consider "number of reachable states" and empowerment as the same thing, which is not the case. Further, the authors' proposition that a "Good notion of empowerment should measure whether we can achieve some particular states, once we set out to do so." seems to very much coincide with the true definition of empowerment by Salge et al. Hence, it would be relevant to compare the authors' "multiple value function" optimization objective to that of empowerment. The authors also propose a new environment, which seems to be very useful and thoughtful, and could be a nice starting point for some experiments.
Uncertainty about value naturally leads to empowerment
Ben Smith
In principle I think a survey of AI deceptiveness and governance measures is within scope. I appreciated that this paper was very well referenced and drew on a wide variety of prior work, grounding it in existing literature. But although the ideas here are relevant, I don't see them as an important contribution, because the paper mostly surveys earlier ideas without any attempted synthesis of those ideas in terms of agency, or any other synthesis at all. I also have to say that the paper is a clear replication of prior work, and it is pretty clear nothing novel is introduced here. I didn't give the worst possible mark in terms of novelty, though, because I do appreciate that the authors have clearly laid out the relevant primary literature, which many other entries have not done.
ILLUSION OF CONTROL
Ben Smith
The paper was clearly enough written and I appreciated that some attempt was made to build on prior work. It was interesting to see the three forms of empowerment set side by side. However, notation wasn't described, and neither was how these were calculated. It might have been helpful to dive more into the exact formulation for entropy-valued empowerment. It might have been valuable, rather than trying to experiment with these, to survey the literature on whether these have already been described. Overall, this work is absolutely relevant, and in a way that seems important, but it's not clear whether authors have, in their 48 hours, demonstrated it is relevant enough to current challenges to solve problems. Although this is a brief paper and significant elements are missing, I think the core idea is presented well, and considering there's nothing empirical here, I'm pleased with what is presented.
Agency, value and empowerment.
Ben Smith
Overall, the point that is made here seems to be that observer-dependent agency in terms of Shannon entropy is not enough, but one must also consider empowerment. I agree with this perspective, but I'm not sure how novel it is. It has the feeling of a paper where the authors set up their own definition of agency, then realized it was insufficient, and then described a secondary definition, "empowerment". Section 2 seems to be assuming the thing it sets out to prove, specifically Definition 1. That said, describing agency in terms of observation is a reasonable definition to use, though I think it is maybe not the whole picture and not proven to be the only viable one by the arguments here. I do enjoy the taxonomy of different forms of efficacy and will grant it some points on this basis, alongside the work the authors did to support this taxonomy. Overall, I think I agree with the authors' eventual position (I think?) that empowerment is more important than or at least equally important as what they define as agency. It would have been helpful for them to lay this out more clearly in the abstract.
Agency as Shanon information. Unveiling limitations and common misconceptions
Ben Smith
Comparing these fields, which are fairly well developed, is quite a large topic, and I suspect a qualitative comparison of the particular qualitative utility of each is more valuable than trying to do a comparison of which is better. Fortunately the intro spells that out. The framework presented is interesting, but I am not sure how practically helpful it is. While authors demonstrate that value identification realizes truthful reporting, I don't know what this tells me about whether we should work on truthful reporting, because truthful reporting might be much more tractable than value identification. The authors do acknowledge that point. For a stronger paper I would want to see an argument why, in practice, we actually are likely to achieve truthful reporting through value identification, not merely that we would have truthful reporting if we magically had value identification. "Creating an aligned AGI" realizes all of these fields, but that's not very useful to know, because the question remains, "how do we do that?" On the positive side, perhaps the "realizes" relationship might be an interesting framework for a Hasse diagram of relations between approaches which would be useful in clarifying debates, and I would like to see more of this sort of work.
Comparing truthful reporting, intent alignment, agency preservation and value identification
cg
This paper did not adequately respond to the prompts of the hackathon. It describes the problem of agency at a very high level without proposing a solution or a novel re-framing of the issue.
ILLUSION OF CONTROL
cg
Comparing truthful reporting, intent alignment, agency preservation and value identification seems useful, to be able to understand the advantages and limits of each approach. The most compelling argument for why is at the end of the paper, where the author states that it would be helpful to be able to divide these approaches into precise categories for specific problems. In general, however, this paper is quite difficult to follow and lacks a concrete conclusion. It would be useful to outline criteria to compare each approach against and summarise these in a table. It's also not clear to me how this was reasoned through as the methodology is quite opaque and it's not obvious how the links/evidence relate to/support the claims being made.
Comparing truthful reporting, intent alignment, agency preservation and value identification
cg
A very neatly written paper that's easy to follow, with a clear proposal. I like the idea of approaching social/behavioural science computationally, as the field currently lacks robust quantitative approaches. I also appreciate the detail that went into describing the study. While I think it could be useful to have a quantitative baseline/causal link for which mechanisms make recommender systems dangerous, there is already a fair amount of literature at least in the social sciences on recommender systems and their effects on choice and action, so I'm less convinced about how this fills a relevant research gap. I'd suggest looking into some of this research to support your case for this study. I'm unsure whether chess is the right example, as this seems like an overly simplified context and less generalisable. However, it may be a good place to start if there is indeed a gap in social/behavioural studies that this work could meaningfully fill. Relatedly, I would have appreciated a few sentences on the implications of such a study for governance/policy, as there is very obvious social relevance in looking into the dangers of recommender systems. A definition of agency and a little more detail on the control of the study would also be useful as a baseline.
In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops
Konrad Seifert
I have to read almost every sentence multiple times. Most of it requires me to make a lot of charitable assumptions to assume any meaning. Feels a bit like an AI-generated gdoc. But more all over the place. I don't know where to start to make this constructive, sorry.
ILLUSION OF CONTROL
Konrad Seifert

This feels like nobody proofread a first draft. Potentially useful ideas, hard to evaluate because they lack detail and I don't have a background in all referenced concepts. Overall, this seems like a worthwhile endeavour but is just not fleshed out enough to hold much value as is. I don't know why they chose these four goals and not others, I don't have clear definitions. It's just handwaving. Examples are insufficiently fleshed out to not confuse. Presentation lacks guiding structure ("results"?). No idea what to make of it. Don't think this will yield a universal approach, but it seems good to want to map blindspots of various different safety approaches.

Comparing truthful reporting, intent alignment, agency preservation and value identification
Konrad Seifert
I really like the idea of the paper, it gets at the core of the first-order desires vs volition problem. I also like combining "softer" science with computational modelling to help us think more clearly about difficult conceptual spaces. The paper is well-structured but could be better written (don't take writing advice from me though). Chess strikes me as an insufficiently complex domain. No long-term survival under deep uncertainty is involved. Nor do we see conflicts between first and second-order preferences. However, to target the reduction of blunders, this might be enough. And in more complex domains, optimization becomes difficult anyway, so reducing the negative end is a more concrete, feasible step. I don't think we needed a proof of concept for systems that enhance human agency, but making the point that diverse inputs strengthen long-term fitness seems like something people don't hear often enough. Not exactly novel, though. I also think that the dangerous psychological feedback loops driving homogenization are relatively clear in the literature. But having them properly formalized seems like a valuable contribution. Overall, this seems worth implementing and quite feasible to implement.
In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops
cg
Questioning the relevance of autonomy seems relevant to governance research, especially if existing philosophy/ML conceptualisations/definitions are incomplete but taken for granted. It's also reproducible in that the reader can follow the reasoning and grapple with the arguments being made, though the links between the different steps of the argument and the conclusions of each section could be clearer. The case for autonomy over agency feels underdeveloped. The argument could be more convincing if the author had dedicated further analysis to why autonomy is more useful than agency. A concrete way to improve on this front would be to have the contents of the appendix on operationalising autonomy in the main body, and the detail of different definitions of agency in the appendix. Relatedly, claim 3 also feels underdeveloped. As a policymaker, I want to empower people to make better choices. So it would be helpful to specify exactly how AI governance should focus more on autonomy over agency, even if only high-level. I would have also appreciated more detail on what a 'good future'/'human flourishing' actually entails. The main point of comparison between agency/autonomy seems to be increasing wellbeing and freedom, but I'm not sure why these are the criteria. The author says this is intuitive, but it would have nonetheless been useful to more clearly state these assumptions and that the reasoning for why wouldn't be tackled in the paper.
Against Agency
Konrad Seifert
This is great in terms of reasoning transparency -- succinct, well-written arguments. But I am very unconvinced by the case for autonomy over agency. Autonomy appears to me a fetishization of control, the illusion that our own choice is inherently valuable or somehow makes us happier than (the experience of) agency. I think it's correct that the definition of agency is underdeveloped -- we need to better describe what it is that we care about. And this is a good contribution to imbuing agency with more meaning. But while the criticism of agency is well worked out (though some of it could have been in the annex, too), the case for autonomy falls short. 2/3 of the reworked definition of autonomy instead strikes me as a great operationalization of agency for policymakers: bounded-rational agents require a meaningful option space. The idea of non-interference, however, seems again like a fetishization of freedom/control. In reality, we want both a) more options and b) making fewer choices; i.e. we want a better option space. No individual bounded-rational agent can get that without interdependence; i.e. relying on others participating in the computation of his choice-space. So to guide the policymaker, as designer of the future environment, it seems more useful to think about agency to optimize for the ability to act on one's volition, instead of simply empowering individuals to make more choices. I do not see how the latter would lead to better futures more reliably. On the contrary, overly focusing on the individual is likely to miss out on collective optimization scenarios in which everyone is significantly better off, even at a cost of individual autonomy. What matters is subjective conscious experience and a focus on the actualization of agents' volition -- brought about by the environment, subconscious and conscious choice of the agent together -- seems more likely to increase experience than autonomy. As potentially even admitted by the author themselves(?)
I like the criticism of "coherence" in agency and would thus also still propose a mild redefinition of agency to avoid its perspective from being too myopic. Bounded-rational agents are unlikely to be coherent across contexts.
Against Agency
Erik Jenner
Building agents that help other agents with unknown goals is an important problem and I like how this project just tries to tackle that problem in a straightforward way, with several experiments and techniques. The part on dealing with underrepresented goals is also nice. Using PCA to detect unusual inputs is a cool (albeit not new) idea, and it seems to work (though with big error bars). The code also looks well-done and easy to work with at a glance. For the core setup of training a helper agent, it would probably be fruitful to explore connections to Cooperative IRL/Assistance games, and build on existing work in that direction (e.g. https://openreview.net/forum?id=DFIoGDZejIB). The biggest room for improvement in my view is in the experiments. RL is really noisy, and to get meaningful results, several runs with different random seeds are essential (even if the curves look as different as in Fig. 4, it's hard to know whether the effect is real otherwise). I'm also confused why all the results have episode lengths of at least a few hundred. Looking at the environment, it seems like a good policy pair should get lengths of about 20, so unless I'm misunderstanding something, it seems the RL training didn't work well enough or wasn't run for long enough to give meaningful results.
Preserving Agency in Reinforcement Learning under Unknown, Evolving and Under-Represented Intentions
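As a rough illustration of the reviewer's point about random seeds, the sketch below summarizes one metric across several runs as a mean and a standard error. The function name `mean_and_sem` and every number are made up for illustration; none of them come from the project.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and the standard error of the mean across runs."""
    mean = statistics.mean(values)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Made-up final episode lengths, one value per random seed (illustrative only).
baseline_lengths = [310.0, 280.0, 350.0, 300.0, 260.0]
helper_lengths = [240.0, 330.0, 270.0, 290.0, 320.0]

baseline_mean, baseline_sem = mean_and_sem(baseline_lengths)
helper_mean, helper_sem = mean_and_sem(helper_lengths)

print(f"baseline: {baseline_mean:.1f} +/- {baseline_sem:.1f}")
print(f"helper:   {helper_mean:.1f} +/- {helper_sem:.1f}")
```

With these invented numbers the two means differ by less than one standard error, which is the kind of situation where a single pair of curves could easily mislead.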
Erik Jenner
I'm excited to see more empirical work on LLM myopia, and the specific test used in this project makes a lot of sense as a test for "advanced" non-myopia (i.e. a type of non-myopia I'd at best expect for pretty strong models). The report is short and to the point, and I especially appreciate the honest discussion of limitations at the end. Similar to the authors, the high variation in results depending on minor changes in the prompt unfortunately suggests to me the model isn't capable enough to give particularly meaningful (non-)myopia results in this setup. More broadly, I'd expect non-myopia to first appear in much less obvious ways—roughly, on easy to predict tokens, a model might spend some of its "computational budget" to help with future harder tokens. I would have been very surprised to see non-myopia in the test case from this project, especially with a relatively small model. Nevertheless, it's always good to actually get empirical results and this is overall a strong submission. For potential follow-up work, I'd suggest thinking about what types of non-myopic behavior are most likely to appear in LLMs and then specifically testing for those. For reproducibility, a brief Readme with instructions might be nice, but everything is straightforward enough that I'm not really worried about that. As a final minor note, it seems more natural and faster to me to use the model's output probabilities for RED vs BLUE instead of sampling 1000 times, but I may be missing something.
Evaluating Myopia in Large Language Models
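The reviewer's final note, reading the probabilities for RED vs BLUE directly instead of sampling 1000 times, can be sketched as follows. This is a minimal sketch under stated assumptions: each option is a single token, the two logits are invented, and the probability is renormalized over just those two tokens. It is not the project's code and does not call any model.

```python
import math
import random


def two_way_probability(logit_a, logit_b):
    """Exact P(a | a or b): a softmax restricted to the two option logits."""
    top = max(logit_a, logit_b)  # subtract the max for numerical stability
    exp_a = math.exp(logit_a - top)
    exp_b = math.exp(logit_b - top)
    return exp_a / (exp_a + exp_b)


def sampled_estimate(p_a, n_samples, seed=0):
    """Monte Carlo estimate of p_a from n_samples simulated draws."""
    rng = random.Random(seed)
    hits = sum(rng.random() < p_a for _ in range(n_samples))
    return hits / n_samples


# Hypothetical next-token logits for "RED" and "BLUE" (illustrative values).
logit_red, logit_blue = 1.2, 0.4

p_red = two_way_probability(logit_red, logit_blue)
estimate = sampled_estimate(p_red, 1000)
# Sampling noise that the exact computation avoids.
standard_error = math.sqrt(p_red * (1 - p_red) / 1000)

print(f"exact P(RED) = {p_red:.4f}")
print(f"sampled estimate = {estimate:.4f} (standard error about {standard_error:.4f})")
```

The exact value needs a single forward pass, while the sampled estimate carries a standard error of roughly 0.015 at 1000 draws. If an option spans several tokens, or the model puts substantial mass on other tokens, the restricted softmax would need adjusting.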
Erik Jenner
Agency is arguably one of the more interesting concepts to look for in LLMs, and this project has well-executed experiments given the short timeframe. I'm not convinced though that the results give meaningful insight into agency concepts in LLMs. Looking at the tokens flagged as being about agency (or rather, living beings), many of them seem to be very generically about humans and their possible roles, not specifically agentic behavior. More fundamentally, I'm doubtful that looking only at top activating tokens can tell us enough about how a concept like agency functions inside the model, and at the very least, it's very hard to trust such results without additional sources of evidence. A simulation technique like the one from https://openai.com/research/language-models-can-explain-neurons-in-language-models could help, though notably it didn't work particularly well in that OpenAI paper in terms of predicting causal effects. All that being said, this report tackles an important and hard question, and may end up being a first step in a more comprehensive effort at understanding how LLMs model agency.
Discovering Agency Features as Latent Space Directions in LLMs via SVD
Erik Jenner
This is a proposal for an ambitious project, with many details on execution. I'm pretty excited about understanding how recommender systems and similar feedback loops actually affect users, since this is a widely discussed topic that could use more empirical evidence. However, it's worth noting that the interaction mechanism in the proposed study is significantly different from the recsys setup: recommender systems optimize for an external objective, and the main concern is that they might manipulate users to further that objective, against the users' original preferences. The proposed study is self-play between a human and a learned imitator; I'm not sure what exactly different possible results would tell us about the effects of recommender systems or similar systems. For what it's worth, I also don't share the intuition that this self-play would lead to a decline in playing strength, but that's a less important disagreement that could be settled by running the study. There might be reasons that the results of such a study would be interesting even if they don't apply directly to recommender systems. I think it's worth working out in more detail what different results of the project would tell us about some important question, especially given the effort that would be involved in actually running this project.
In the Mirror: Using Chess to Simulate Agency Loss in Feedback Loops
Vincent
The order of choices is interesting, and I just saw a paper about that which came out recently (https://arxiv.org/abs/2308.11483?)
Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text
Esben Kran
Wonderful exposition of the topic of goal misgeneralization. Great work here. In the field, there is a slight conflation between the definition of the proxy and outer/inner misalignment definitions. E.g. I think the statement "It’s not hard to find examples of inner alignment happening" is very very hard to justify with current models. Outer misalignment (e.g. optimizing for an alternative but equally / more prevalent signal) is very easy to find examples for. This is up for debate based on definitions of proxy and the two terms. It's a great idea to include an epistemic status to contextualize your understanding. I'm also a fan of the misgeneralization example presented, though it's a capability limitation for out-of-distribution generalization and not necessarily an inner misalignment. Good job, I'm impressed!
Goal Misgeneralization
Esben Kran
This is an interesting question to investigate and I'm excited by your progress within the 24 hours! Understanding what role the residual stream plays in memory transfer and how subspace "competition" works is important. I assume "subspace" in your project means information occupation within the residual stream. It seems that the bandwidth and subspace projection measurements are not included in the results. I like your plot showing the impact on model output and it would be interesting to see which sorts of features (qualitative description) these differences correlate with. E.g. I can imagine that some types of early-stage processing are lost and a feature just looking for the word "the" (or something less frequent) might be outcompeted in the residual stream by more complex processes. This might also indicate an inverse scaling phenomenon. Great job! PS: The video presentation is private.
Residual Stream Verification via California Housing Prices Experiment
Esben Kran
This is a great project within the time allotted, well done! It's important for us to understand these types of dynamics and plotting it over layers provides a useful granularization. There's a question of what these results mean and why the IMDB dataset isn't as interpretable (I'd expect it to be related to the performance itself). Maybe you'd want to separate the PCA'd activations based on whether the prediction was correct or not.
Problem 9.60 - Dimensionaliy reduction
Jason Hoelscher-Obermaier
Very readable and interesting results. One question I had: How do the results on post-hoc reasoning in CoT/L2M square with the results from http://arxiv.org/abs/2305.04388 which suggest that CoT explanations can be unfaithful?
Preliminary measures of faithfulness in least-to-most prompting
Jason Hoelscher-Obermaier
Very cool idea and great write-up! I found the discussion of the pros and shortcomings very nuanced and thoughtful. Would be great to see a follow-up study on the sensitivity of the results to scaffolding (prompts, other resources) because I feel this might be one point where people concerned with dangerous capability evals would push back against automated benchmarks.
Can Large Language Models Solve Security Challenges?
Jason Hoelscher-Obermaier
Cool idea and execution! For the causal influence dataset, I would have loved to see more of the dataset samples. Seeing that even GPT-4 still benefits from being told it's a chatbot was really interesting and surprising. For the train/deploy distinction dataset, I really liked the idea of how the dataset is constructed. The analysis could be a bit more detailed though: E.g., having confusion matrices would convey a lot more info than raw accuracies. Very cool project overall!
SADDER - Situational Awareness Dataset for Detecting Extreme Risks
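To illustrate why the reviewer prefers confusion matrices to raw accuracies, the sketch below builds a two-class confusion matrix in plain Python. The labels "train" and "deploy" and all predictions are hypothetical placeholders, not data from the SADDER project.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels, in `labels` order."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(true, pred)] for pred in labels] for true in labels]


# Hypothetical labels and predictions (illustrative only).
labels = ["train", "deploy"]
y_true = ["train"] * 5 + ["deploy"] * 5
y_pred = ["train"] * 5 + ["train"] * 3 + ["deploy"] * 2

matrix = confusion_matrix(y_true, y_pred, labels)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy = {accuracy:.2f}")
for label, row in zip(labels, matrix):
    print(f"true {label:>6}: {row}")
```

In this invented case the accuracy of 0.70 hides that every error is a "deploy" example labelled as "train", which is exactly the asymmetry a confusion matrix makes visible.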
Bart
Strengths:
- Interesting project! Understanding how language models process information is important.
- I like the visualizations of the PCA dimensions. They clearly show the results, and on the toy dataset you clearly see the progress over the layers.
Suggestions for improvement:
- I would like to see a bit more background information on the experimental set-up. For example, what does the toy data set look like? What model do you use for classification? Did you split train and test set?
- I would like to see a bit more discussion on the results. Why do you think the accuracy of the toy dataset is so much higher?
Problem 9.60 - Dimensionaliy reduction
Bart
Overall impressions:
- Interesting project, exploring the role of the residual stream is an interesting avenue.
- I like the SHAP value plots!
Suggestions for improvement:
- It is not completely clear how the formulas for the subspace projection and bandwidth measurements are used in your experiments. The results section (that shows SHAP values) seems different from your planned methodology.
- More information could be provided on the dataset, model architectures, training process, hyperparameters etc. This contextualizes the experimental conditions.
- Also, more information could be provided in the result sections. Including metrics like training/validation accuracy, loss curves, performance on a test set etc. would strengthen it.
Residual Stream Verification via California Housing Prices Experiment
Bart
Interesting experiments on a toy problem for memorization. Experiments seem well-designed and provide more evidence that memorization mostly happens in FF layers.
Factual recall rarely happens in attention layer
Esben Kran
Awesome work synthesizing the Transformer model and looks like more great thoughts in your other document as well. Would love to see this as an AlignmentForum post and I think it has good potential for this as well. Being able to compare synthesized models to trained models is super interesting and of course provides even more direct causal evidence for hypothesized circuits. Great work and can't wait for the next output!
Embedding and Transformer Synthesis
Esben Kran
I like the simple operationalization of your research question into GPT2-small. It seems like exploring multiple operationalizations would be useful to elucidate your results, though I personally imagine it's pretty good. Seems like one of those tasks that show that we cannot use our current methods to properly investigate every circuit, unfortunately. Puts a serious limiting factor on our mechanistic interpretability usefulness. Good work!
Who cares about brackets?
Bart
Interesting work! Well-designed experiments that don't find evidence for the smearing hypothesis. Would definitely encourage continuing this work, and seeing if the results replicate on models with more than one layer!
Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model
Esben Kran
This is a very interesting investigation into something that seems foundational in LLMs, this sort of sequence modeling structure that is shared between tasks. These are both quite informative results for AI functioning and probably replicate quite a bit to humans. Great in-depth experiments as well and good circuits experimental work. It was a lot to cover in a 10 minute video so no worries about being a bit rushed there. Excited that you want to continue working on this!
One is 1- Analyzing Activations of Numerical Words vs Digits
Bart
Interesting work, and lots of different nuggets of insight in superposition. It would have been great if you could have had a bit more discussion about what the lessons are from the four different projects and how these insights relate to each other!
Experiments in Superposition
Esben Kran
Interesting to see the differences after training using the different methods. This is a very interesting result if it helps us mitigate some of the agency biases of RLHF without significant performance drops. I'd be curious for you to continue the work in the next hackathon on agency foundations (https://alignmentjam.com/jam/agency) and possibly formalize the results more. Seems like you nearly ran out of time for this one but great work!
DPO vs PPO comparative analysis
Bart
Cool and original project! I think the reformulation of TCM as an induction head is very interesting, and the experiments show some interesting preliminary results. This work has great potential to be published as a paper with a few more experiments, so I would definitely encourage you to work further on this.
Relating induction heads in Transformers to temporal context model in human free recall
Esben Kran
This is great work that takes a real problem in alignment, translates it into interpretability, and further translates that into a good toy model of the problem. This seems like a great first step towards investigating action planning and goal misgeneralization in language models further. There are questions of how this generalizes to LLMs trained on language and you seem poised to take that on. Good job!
Interpreting Planning in Transformers
Esben Kran
Nice work, though I was missing some plots here. Since you say pure GPTs don't seem to work, it would be interesting to see the difference to fine-tuned models. Totally fine that you used Claude etc. but I'd love if you proofread your work. Interesting and would be nice to see the developments.
Multimodal Similarity Detection in Transformer Models
Esben Kran
The first experiment seems very related to Quirke's project https://alignmentjam.com/project/towards-interpretability-of-5-digit-addition. Interesting design, I like it. The second is of course less principled (hah) but interesting nonetheless. The dropout on superposition work has also been done by Pona (2023): https://www.lesswrong.com/posts/znShPqe9RdtB6AeFr/superposition-and-dropout but this is a great addition to that work. I like the visualizations of feature polytope development. For the neuroscope work, you can get a lot of inspiration from DeepDecipher (https://github.com/apartresearch/deepdecipher), since that automates a bunch of the work. If you did these projects just during the weekend, it's very impressive! Great work and I will look forward to seeing them explored further. I recommend publishing the most coherent parts as LessWrong posts or something similar.
Experiments in Superposition
Bart
Interesting and original submission, quite different from the others. Good example of learning to "Think like a Transformer". I would encourage the author to perform some experiments (or work together with someone with more experience) to see if they can confirm or falsify their hypotheses!
Towards Interpretability of 5 digit addition
Bart
Interesting work, and I believe that the research agenda of comparing RLHF models with base models is very important. I encourage you to keep working on this after the hackathon!
DPO vs PPO comparative analysis
Esben Kran
Great negative results for a hypothesized result of SoLU models. Interesting side result to see that the LN scale factor grows meaningfully differently conditional on the token sequence.
Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalizing in a 1-Layer SoLU Model
Bart
Impressive range of experiments and interesting discovery of the shared sequence heads. I would definitely encourage you to continue your work and see if you can get from digits to other sequences through latent space addition or similar techniques.
One is 1- Analyzing Activations of Numerical Words vs Digits
Bart
Interesting work! An extensive range of experiments shows that even relatively easy tasks might not be easy to locate in LLMs. I believe this work sheds light on how limited our current methodology is, and bracketed sequence classification might serve as a good toy problem for future development of interpretability methods.
Who cares about brackets?
Esben Kran
Critiques of factual knowledge storage (Hoelscher-Obermaier et al., 2023) are quite important to understand before assuming that models store facts. They definitely learn token associations but it doesn't seem like there's factual memory. This just limits the generalization but the actual dataset is so simple that this isn't an issue. I really like the experimental paradigms that just provide very clear posteriors for your research question. Exp1 and 2 clearly relate quite well to each other and show that the model learns to memorize the facts with the dense layers. Would love to see this work continued and pursued deeper. Memory is obviously incredibly important and elucidating how it works in Transformers seems very useful for safety. Great work!
Factual recall rarely happens in attention layer
Esben Kran
This is a wonderful mechanistic explanation of a phenomenon discovered through interpreting the learning curves of a simple algorithmic task. Of course, it would have benefitted from experimental data but it is conceptually so strong that you probably expect it to work. Future work should already take into account how we might want to generalize this to larger models and why it's useful for AI safety. E.g. I would be interested if this is expanded stepwise into more and more complex tasks, e.g. adding multiplication, then division, then sequence of operations, and so on for us to generalize into larger models some of these toy tasks. Good work!
Towards Interpretability of 5 digit addition
Bart
Interesting work! Although it is a bit hard for me to completely follow without all the work you did before the hackathon, it is impressive that you programmatically built a transformer that implements a somewhat complicated labeling function. I definitely encourage you to keep working on this after the hackathon and write up a more start-to-finish paper or post about your approach.
Embedding and Transformer Synthesis
Esben Kran
I love good regularization techniques. Similar work includes Neuron to Graph (Foote et al., 2023) and work by Michelle Lo on reconstructing what neurons activate to. It seems this technique quite easily generates bogus sentences that, yes, we can see what exactly activates the neuron, but it's not suuper useful for understanding the features it affects the output for. But this seems like a really good first step into what might more accurately than (especially) the OpenAI work explain what MLP neurons do. Future work might also include reformulating it into a functional activation model like in the OAI work and Foote et al., 2023. Good work!
Toward a Working Deep Dream for LLM's
Bart
I believe the goal of this project is interesting and an avenue worth exploring further. Unfortunately, the early experiments didn't work out, preventing a deeper investigation of this approach.
Toward a Working Deep Dream for LLM's
Esben Kran
This project is super interesting and a great case study in comparing Transformers to cognitive models of memory. I would love to be able to dive deeper into this project and read the three referenced papers. I'm not sure what to critique here but I'm also personally positively biased towards cognitive science and it's a great interdisciplinary work. The only thing is that there isn't much discussion of the safety implications, e.g. can we use this functional correlate to understand how human-like a Transformer's memory is? Good work and I recommend you take this further!
Relating induction heads in Transformers to temporal context model in human free recall
(author)
(I'm the author and accidentally hit 'rate this project' but did not mean to rate it, so I am submitting 5 to balance out the 3 I gave back to the 4 stars given from someone else before)
One is 1- Analyzing Activations of Numerical Words vs Digits
Esben Kran
This is an interesting project highlighting an important warning flag to monitor and evaluate for. It introduces a unique metric and shows us something that has real impact on the world. I will be curious to see how this develops and it seems like there's quite a bit of potential in expanding and generalizing this sort of thinking about malignant action temporal density. Great work!
From Sparse to Dense: Refining the MACHIAVELLI Benchmark for Real-World AI Safety
Esben Kran
This is an impressive critique with great and concrete improvement points that consider the pros and cons and what sorts of edge cases we will have to implement solutions to. Of course, I am missing a bit of an empirical evaluation or that you yourselves implement these, though the "idea format" of this clearly enabled you to explore the ideas qualitatively during the weekend's work. Great job! I'd recommend you polish it as a blog post and post it since it seems to point out some critical components needed for future work on safety benchmarks. If you plan to make it into a paper, you're of course welcome to wait with posting. Really interesting work!
MAXIAVELLI: Thoughts on improving the MACHIAVELLI benchmark
Esben Kran
A very interesting project! It's fascinating to see that red teaming becomes even easier in multi-step and multi-agent adversarial examples and that the combination of models elicits harmful advice. Especially that they *semantically* understand that the code leads to harmful outputs but that they still help the user improve it / provide clearly harmful advice. I might mark this as an info hazard. Good relating it to OpenAI's own security guidelines and I recommend that you apply to the cyber security grant program they have: https://openai.com/blog/openai-cybersecurity-grant-program. When it comes to safety benchmarks, it would be very interesting to have an empirical validation of the inverse scaling law of harmfulness that you describe. This might lend even more credence to this idea and is valuable to validate the concept. This is definitely harder than for many other benchmarks due to the structure of your prompting.
Exploitation of LLM’s to Elicit Misaligned Outputs
Esben Kran

Very interesting project and a superb introduction with a good formalism. It was interesting to see that they were still *slightly robust* to a few of the prompt injections, though it obviously shifts the evaluation drastically anyways. I was maybe missing a mechanistic model of how we might expect language models to be able to do this sort of malicious prompt injection on the evaluations, e.g. reading from the test prompt to automatically generate an adversarial injection. Very nice generalizability using multiple datasets as well and an interesting Deception Eval dataset introduced. I recommend you continue the work and commit to a deeper exploration of this problem!

Other interesting references might be Hoelscher-Obermaier (2023, https://arxiv.org/pdf/2305.17553.pdf) regarding reliability testing of benchmarks. You might also be interested in factored cognition and process supervision if you're not aware of these budding research fields (Ought.org has published a bit of great work on this). A relevant project here is Muhia (2023, https://alignmentjam.com/project/towards-formally-describing-program-traces-from-chains-of-language-model-calls-with-causal-influence-diagrams-a-sketch).

Exploring the Robustness of Model-Graded Evaluations of Language Models
Esben Kran

Love the visualization of the results 😂 It would be interesting to generate conversations (if there isn't a dataset available) that slightly red team to find the long-tail examples of schizophrenic therapeutic scenarios. I was missing a visualization of the benchmarking between different models, e.g. Ada, LLaMA, GPT-Neo-X, or other such models. You mention that the companies have an incentive to keep the models safe due to commercial reasons and it seems like many of the risks come from non-proprietary models that might be used in many other contexts. Cool project, though! I recommend continuing your work here.

Identifying undesirable conduct when interacting with individuals with psychiatric conditions
Esben Kran

Measuring manipulation in conversations with chatbots seems valuable for any type of future safety. If we can train them to avoid manipulative behavior, this will probably lead to a significantly safer LLM from a phishing and contractor manipulation perspective (https://evals.alignment.org/). A few next steps might be to test the API on GPT-3, GPT-3.5, and the different GPT-4 versions released since the original ChatGPT models were released. In this way, we can evaluate how manipulative they are. The main problem will be to create the dataset, though you will probably be able to do that automatically using GPT-4, albeit at a relatively high price.

Manipulative Expression Recognition (MER) and LLM Manipulativeness Benchmark