In our research, we dove into the concept of 'jailbreaks' in a negotiation setting between Language-Learning Models (LLMs). Jailbreaks are essentially prompts that can reveal atypical behaviors in models and can circumvent content filters. Thus, jailbreaks can be exploited as vulnerabilities to gain an upper hand in LLM interactions. In our study, we simulated a scenario where two LLM-based agents had to haggle for a better deal – akin to a zero-sum interaction. The findings from our work could provide insights into the deployment of LLMs in real-world settings, such as in automated negotiation or regulatory compliance systems. Through experiments conducted, it was observed that by providing information about the jailbreak before an interaction (as in-context information), one LLM could get ahead of another during negotiations. Higher capability LLMs were more adept at exploiting these jailbreak strategies compared to their less capable counterparts (i.e., GPT-4 performed better than GPT-3.5). We further delved into how pre-training data affected the propensity of these models to use previously seen jailbreak tactics without giving any preparatory notes (in-context information). Upon fine-tuning GPT-3.5 on another custom-generated training set where successful utilization of jailbreaks was witnessed earlier, we observed that models acquired the ability to reproduce and even develop variations of those useful jailbreak responses. Furthermore, once a ‘jailbreaking’ approach seems fruitful, there is a higher probability that it will be adopted repeatedly in future transactions.
Anonymous: Team members hidden
Abhay Sheshadri, Jannik Brinkmann, Victor Levoso
Shoggoth Psychology
This feels like nobody proofread a first draft. Potentially useful ideas, hard to evaluate because they lack detail and I don't have a background in all referenced concepts. Overall, this seems like a worthwhile endeavour but is just not fleshed out enough to hold much value as is. I don't know why they chose these four goals and not others, I don't have clear definitions. It's just handwaiving. Examples are insufficiently fleshed out to not confuse. Presentation lacks guiding structure ("results"?). No idea what to make of it. Don't think this will yield a universal approach, but it seems good to want to map blindspots of various different safety approaches.
Very interesting project and a superb introduction with a good formalism. It was interesting to see that they were still *slightly robust* to a few of the prompt injections, though it obviously shifts the evaluation drastically anyways. I was maybe missing a mechanistic model of how we might expect language models to be able to do this sort of malicious prompt injection on the evaluations, e.g. reading from the test prompt to automatically generate an adversarial injection. Very nice generalizability using multiple datasets as well and an interesting Deception Eval dataset introduced. I recommend you continue the work and commit to a deeper exploration of this problem!
Other interesting references might be Hoelscher-Obermaier (2023, https://arxiv.org/pdf/2305.17553.pdf) regarding reliability testing of benchmarks. You might also be interested in factored cognition and process supervision if you're not aware of these budding research fields (Ought.org has published a bit of great work on this). A relevant project here is Muhia (2023, https://alignmentjam.com/project/towards-formally-describing-program-traces-from-chains-of-language-model-calls-with-causal-influence-diagrams-a-sketch).
Love the visualization of the results 😂 It would be interesting to generate conversations (if there isn't a dataset available) that slightly red team to find the long-tail examples of scizophrenic therapeutic scenarios. I was missing a visualization of the benchmarking between different models, e.g. Ada, LLaMA, GPT-Neo-X, or other such models. You mention that the companies have an incentive to keep the models safe due to commercial reasons and it seems like many of the risks come from non-proprietary models that might be used in many other contexts. Cool project, though! I recommend continuing your work here.
Measuring manipulation in conversations with chatbots seems valuable for any type of future safety. If we can train them to avoid manipulative behavior, this will probably lead to a significantly safer LLM from a phishing and contractor manipulation perspective (https://evals.alignment.org/). A few next steps might be to test the API on GPT-3, GPT-3.5, and the different GPT-4 versions released since the original ChatGPT models were released. In this way, we can evaluate how manipulative they are. The main problem will be to create the dataset, though you will probably be able to do that automatically using GPT-4, albeit at a relatively high price.