This report investigates the automated identification of neurons which potentially correspond to a feature in a language model, using an initial dataset of maximum activation texts and word embeddings. This method could speed up the rate of interpretability research by flagging high potential feature neurons, and building on existing infrastructure such as Neuroscope. We show that this method is feasible for quantifying the level of semantic relatedness between maximum activating tokens on an existing dataset, performing basic interpretability analysis by comparing activations on synonyms, and generating prompt guidance for further avenues of human investigation. We also show that this method is generalisable across multiple language models and suggest areas of further exploration based on results.
Anonymous: Team members hidden
Michelle Wai Man Lo
Michelle Wai Man Lo
This feels like nobody proofread a first draft. Potentially useful ideas, hard to evaluate because they lack detail and I don't have a background in all referenced concepts. Overall, this seems like a worthwhile endeavour but is just not fleshed out enough to hold much value as is. I don't know why they chose these four goals and not others, I don't have clear definitions. It's just handwaiving. Examples are insufficiently fleshed out to not confuse. Presentation lacks guiding structure ("results"?). No idea what to make of it. Don't think this will yield a universal approach, but it seems good to want to map blindspots of various different safety approaches.
Very interesting project and a superb introduction with a good formalism. It was interesting to see that they were still *slightly robust* to a few of the prompt injections, though it obviously shifts the evaluation drastically anyways. I was maybe missing a mechanistic model of how we might expect language models to be able to do this sort of malicious prompt injection on the evaluations, e.g. reading from the test prompt to automatically generate an adversarial injection. Very nice generalizability using multiple datasets as well and an interesting Deception Eval dataset introduced. I recommend you continue the work and commit to a deeper exploration of this problem!
Other interesting references might be Hoelscher-Obermaier (2023, https://arxiv.org/pdf/2305.17553.pdf) regarding reliability testing of benchmarks. You might also be interested in factored cognition and process supervision if you're not aware of these budding research fields (Ought.org has published a bit of great work on this). A relevant project here is Muhia (2023, https://alignmentjam.com/project/towards-formally-describing-program-traces-from-chains-of-language-model-calls-with-causal-influence-diagrams-a-sketch).
Love the visualization of the results 😂 It would be interesting to generate conversations (if there isn't a dataset available) that slightly red team to find the long-tail examples of scizophrenic therapeutic scenarios. I was missing a visualization of the benchmarking between different models, e.g. Ada, LLaMA, GPT-Neo-X, or other such models. You mention that the companies have an incentive to keep the models safe due to commercial reasons and it seems like many of the risks come from non-proprietary models that might be used in many other contexts. Cool project, though! I recommend continuing your work here.
Measuring manipulation in conversations with chatbots seems valuable for any type of future safety. If we can train them to avoid manipulative behavior, this will probably lead to a significantly safer LLM from a phishing and contractor manipulation perspective (https://evals.alignment.org/). A few next steps might be to test the API on GPT-3, GPT-3.5, and the different GPT-4 versions released since the original ChatGPT models were released. In this way, we can evaluate how manipulative they are. The main problem will be to create the dataset, though you will probably be able to do that automatically using GPT-4, albeit at a relatively high price.