1. This paper focuses on an automated approach a bad actor might pursue to exploit LLMs through intelligent prompt engineering combined with dual agents that produce harmful code and iteratively improve it.
2. We use step-by-step questioning rather than a single prompt to elicit harmful outputs from the LLMs instead of triggering a refusal.
3. We find that GPT-4, which empirical evidence and existing literature suggest is more resilient to harmful inputs and outputs, can nevertheless produce more harmful outputs. We call this phenomenon Inverse Scaling Harm.
Desik Mandava, Jayanth Santosh, Aishwarya Gurung
DeJaWa