LLMs are vulnerable to jailbreaking: prompting techniques that elicit misaligned or nonsensical output [Deng et al., 2023]. These techniques can also be used to generate a specific desired output [Shen et al., 2023]. LLMs trained on data from the internet will eventually learn about the concept of jailbreaking, and may therefore apply it themselves when they encounter another LLM instance during a task. This is particularly concerning in tasks where multiple LLMs compete. Suppose rival nations use LLMs to negotiate peace treaties: one model could use a jailbreak to extract a concession from its adversary without needing to form a coherent rationale. We demonstrate that an LLM with knowledge of a potential jailbreak technique may decide to use it if doing so is advantageous. Specifically, we have two LLMs debate a number of topics, and find that a model equipped with knowledge of such a technique is much more likely to extract a concession from its opponent, without improving the quality of its own argument. We argue that this is a fundamentally multi-agent problem, and that it is likely to become more prevalent as language models are trained on the latest jailbreaking research and gain access to real-time internet results.
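The debate setup described above could be organised roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the debaters are stub functions rather than real LLM calls, the names `run_debate`, `concession_rate`, and `CONCESSION_MARKER` are hypothetical, and a concession is detected by a fixed phrase purely for illustration. The injected text is a harmless placeholder token, not a real jailbreak.

```python
from typing import Callable, List

# Hypothetical interface: a debater maps (topic, transcript so far) to its next message.
Debater = Callable[[str, List[str]], str]

# Assumption: a concession is detected by a fixed phrase in the opponent's reply.
CONCESSION_MARKER = "i concede"
# Harmless stand-in for whatever technique an "informed" model might try.
PLACEHOLDER = "[PLACEHOLDER-INJECTION]"


def run_debate(topic: str, a: Debater, b: Debater, max_turns: int = 4) -> bool:
    """Alternate turns between a and b; return True if b concedes."""
    transcript: List[str] = []
    for _ in range(max_turns):
        transcript.append(a(topic, transcript))
        reply = b(topic, transcript)
        transcript.append(reply)
        if CONCESSION_MARKER in reply.lower():
            return True
    return False


def concession_rate(topics: List[str], a: Debater, b: Debater) -> float:
    """Fraction of topics on which b concedes to a."""
    if not topics:
        return 0.0
    return sum(run_debate(t, a, b) for t in topics) / len(topics)


# Stub debaters (stand-ins for real models, for illustration only).
def plain_debater(topic: str, transcript: List[str]) -> str:
    return f"My argument about {topic}."


def informed_debater(topic: str, transcript: List[str]) -> str:
    # Same argument quality as plain_debater, plus the placeholder token.
    return f"My argument about {topic}. {PLACEHOLDER}"


def stub_opponent(topic: str, transcript: List[str]) -> str:
    # Toy behaviour: concedes only when the placeholder appears in the last message.
    if PLACEHOLDER in transcript[-1]:
        return "I concede."
    return f"I disagree about {topic}."
```

Comparing `concession_rate(topics, plain_debater, stub_opponent)` with `concession_rate(topics, informed_debater, stub_opponent)` mirrors the comparison in the abstract. With real models, each stub would be replaced by an API call, and concession detection would likely need a more robust judge than a phrase match.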
Anonymous: Team members hidden
Jack Foxabbott, Marcel Hedman, Kaspar Senft, Kianoosh Ashouritaklimi
This feels like nobody proofread a first draft. Potentially useful ideas, but hard to evaluate because they lack detail and I don't have a background in all the referenced concepts. Overall, this seems like a worthwhile endeavour but is just not fleshed out enough to hold much value as is. I don't know why they chose these four goals and not others, and I don't have clear definitions. It's just hand-waving. The examples are not fleshed out enough to avoid confusion. Presentation lacks guiding structure ("results"?). No idea what to make of it. I don't think this will yield a universal approach, but it seems good to want to map the blind spots of various safety approaches.
Very interesting project and a superb introduction with a good formalism. It was interesting to see that they were still *slightly robust* to a few of the prompt injections, though it obviously shifts the evaluation drastically anyways. I was maybe missing a mechanistic model of how we might expect language models to be able to do this sort of malicious prompt injection on the evaluations, e.g. reading from the test prompt to automatically generate an adversarial injection. Very nice generalizability using multiple datasets as well and an interesting Deception Eval dataset introduced. I recommend you continue the work and commit to a deeper exploration of this problem!
Other interesting references might be Hoelscher-Obermaier (2023, https://arxiv.org/pdf/2305.17553.pdf) regarding reliability testing of benchmarks. You might also be interested in factored cognition and process supervision if you're not aware of these budding research fields (Ought.org has published a bit of great work on this). A relevant project here is Muhia (2023, https://alignmentjam.com/project/towards-formally-describing-program-traces-from-chains-of-language-model-calls-with-causal-influence-diagrams-a-sketch).
Love the visualization of the results 😂 It would be interesting to generate conversations (if there isn't a dataset available) that slightly red team to find the long-tail examples of schizophrenic therapeutic scenarios. I was missing a visualization of the benchmarking between different models, e.g. Ada, LLaMA, GPT-Neo-X, or other such models. You mention that the companies have an incentive to keep the models safe for commercial reasons, and it seems like many of the risks come from non-proprietary models that might be used in many other contexts. Cool project, though! I recommend continuing your work here.
Measuring manipulation in conversations with chatbots seems valuable for any type of future safety. If we can train them to avoid manipulative behavior, this will probably lead to a significantly safer LLM from a phishing and contractor manipulation perspective (https://evals.alignment.org/). A few next steps might be to run the test through the API on GPT-3, GPT-3.5, and the different GPT-4 versions released since the original ChatGPT models. In this way, we can evaluate how manipulative they are. The main problem will be creating the dataset, though you will probably be able to do that automatically using GPT-4, albeit at a relatively high price.