LLMs With Knowledge of Jailbreaks Will Use Them
LLMs are vulnerable to jailbreaking, specific techniques used in prompting to produce misaligned or nonsense output [Deng et. al., 2023]. These techniques can also be used to generate a specific desired output [Shen et. al., 2023]. LLMs trained using data from the internet will eventually learn about the concept of jailbreaking, and therefore may apply it themselves when encountering another instance of an LLM in some task. This is particularly concerning in tasks in which multiple LLMs are competing. Suppose rival nations use LLMs to negotiate peace treaties: one model could use a jailbreak to yield a concession from its adversary, without needing to form a coherent rationale.
We demonstrate that an LLM with knowledge of a potential jailbreak technique may decide to use it, if it is advantageous to do so. Specifically, we challenge 2 LLMs to debate a number of topics, and find that a model equipped with knowledge of such a technique is much more likely to yield a concession from its opponent, without improving the quality of its own argument. We argue that this is a fundamentally multi-agent problem, likely to become more prevalent as language models learn the latest research on jailbreaking, and gain access to real-time internet results.
Jack Foxabbott, Marcel Hedman, Kaspar Senft, Kianoosh Ashouritaklimi