When a network embeds a token, it has to be able to unembed it again. Giving a human the same task would be like saying: "I am thinking of a word or part of a word. You get to choose 700 questions that I have to answer, and you must work out the word from my answers. Spelling and pronunciation don't exist, so you have to find it through its meaning." There are a bunch of questions that it intuitively makes sense to ask, and we'd strongly expect to find many of them represented in the embedding. Any computation the network does also cuts into this space, so it shares room with the semantic meanings and gives rise to new meanings like "this word is the third in the sentence". All of this is more a property of the information theory of a language than of any specific transformer, so there will be overlap between transformers that look at the same language.
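To make the "700 questions" picture concrete, here is a minimal toy sketch, not a claim about how any particular transformer is built. It assumes each token gets a vector of 700 random numbers (standing in for learned "answers") and that unembedding picks the token whose vector has the largest dot product with the query. The names `make_embedding` and `unembed`, the toy vocabulary, and the use of the same table for embedding and unembedding are all illustrative assumptions.

```python
import random

def make_embedding(vocab, dim, seed=0):
    """Assign each token a vector of `dim` Gaussian 'answers' (hypothetical stand-in for learned values)."""
    rng = random.Random(seed)
    return {tok: [rng.gauss(0.0, 1.0) for _ in range(dim)] for tok in vocab}

def unembed(vector, embedding):
    """Return the token whose embedding has the largest dot product with `vector`."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(embedding, key=lambda tok: dot(vector, embedding[tok]))

# Toy vocabulary of words and word parts (assumed for illustration only).
vocab = ["cat", "dog", "run", "ing", "the"]
embedding = make_embedding(vocab, dim=700)

# Embed every token, then try to recover it from its vector alone.
recovered = [unembed(embedding[tok], embedding) for tok in vocab]
```

In this toy setting the 700 answers are random rather than meaningful, so the sketch only shows the round trip from token to vector and back, not which questions a trained network would actually settle on.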
Anonymous: Team members hidden
This feels like nobody proofread a first draft. There are potentially useful ideas here, but they are hard to evaluate because they lack detail and I don't have a background in all the referenced concepts. Overall, this seems like a worthwhile endeavour but is just not fleshed out enough to hold much value as is. I don't know why they chose these four goals and not others, and I don't have clear definitions. It's just hand-waving. The examples are not fleshed out enough to avoid confusing the reader. The presentation lacks a guiding structure ("results"?). No idea what to make of it. I don't think this will yield a universal approach, but it seems good to want to map the blind spots of different safety approaches.
Very interesting project and a superb introduction with a good formalism. It was interesting to see that the evaluations were still *slightly robust* to a few of the prompt injections, though the injections obviously shift the evaluation drastically anyway. I was maybe missing a mechanistic model of how we might expect language models to be able to do this sort of malicious prompt injection on the evaluations, e.g. reading from the test prompt to automatically generate an adversarial injection. Very nice generalizability from using multiple datasets as well, and the Deception Eval dataset you introduce is interesting. I recommend you continue the work and commit to a deeper exploration of this problem!
Other interesting references might be Hoelscher-Obermaier (2023, https://arxiv.org/pdf/2305.17553.pdf) regarding reliability testing of benchmarks. You might also be interested in factored cognition and process supervision if you're not aware of these budding research fields (Ought.org has published a bit of great work on this). A relevant project here is Muhia (2023, https://alignmentjam.com/project/towards-formally-describing-program-traces-from-chains-of-language-model-calls-with-causal-influence-diagrams-a-sketch).
Love the visualization of the results 😂 It would be interesting to generate conversations (if there isn't a dataset available) that lightly red-team the model to find the long-tail examples of schizophrenic therapeutic scenarios. I was missing a visualization of the benchmarking between different models, e.g. Ada, LLaMA, GPT-Neo-X, or other such models. You mention that the companies have an incentive to keep the models safe for commercial reasons, so it seems like many of the risks come from non-proprietary models that might be used in many other contexts. Cool project, though! I recommend continuing your work here.
Measuring manipulation in conversations with chatbots seems valuable for any type of future safety. If we can train them to avoid manipulative behavior, this will probably lead to a significantly safer LLM from a phishing and contractor manipulation perspective (https://evals.alignment.org/). A few next steps might be to run the test through the API on GPT-3, GPT-3.5, and the different GPT-4 versions released since the original ChatGPT models. In this way, we can evaluate how manipulative each of them is. The main problem will be creating the dataset, though you will probably be able to do that automatically using GPT-4, albeit at a relatively high price.