This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
ApartSprints: Safety Benchmarks
Accepted at the Safety Benchmarks research sprint on July 2, 2023

Exploitation of LLMs to Elicit Misaligned Outputs

1. This paper primarily focuses on an automated approach a bad actor might pursue to exploit LLMs via intelligent prompt engineering combined with the use of dual agents to produce harmful code and iteratively improve it (a structural sketch of this loop is given below).
2. We also use step-by-step questioning instead of a single prompt so that the LLMs produce harmful outputs rather than refusing the request.
3. We also observe that GPT-4, which is more resilient to harmful inputs and outputs according to empirical evidence and the existing literature, can produce more harmful outputs. We call this Inverse Scaling Harm.
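The dual-agent setup described above follows a generator–reviewer pattern: one model drafts code for a sub-task, a second model critiques it, and the critique feeds the next turn, with the overall task split across several small steps rather than a single prompt. The sketch below illustrates only that orchestration structure, with deliberately benign placeholder prompts; the model name, the pre-1.0 `openai` client usage, and the helper names are assumptions for illustration and are not taken from the paper.

```python
# Illustrative sketch of a two-agent, multi-turn refinement loop.
# The task and prompts here are deliberately benign placeholders; the paper's
# actual prompts are not reproduced. Assumes the pre-1.0 `openai` Python client.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder


def chat(system_prompt: str, history: list[dict]) -> str:
    """Send one turn to the model and return the assistant reply."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # model choice is an assumption, not from the paper
        messages=[{"role": "system", "content": system_prompt}] + history,
    )
    return response["choices"][0]["message"]["content"]


# Agent roles: a generator drafts code, a reviewer critiques it.
GENERATOR_SYSTEM = "You write Python functions for the task you are given."
REVIEWER_SYSTEM = "You review Python code and suggest concrete improvements."

# Step-by-step questioning: the task is broken into small turns instead of
# one large prompt, and each reviewer critique feeds the next generator turn.
steps = [
    "Write a function that parses a CSV line into a list of fields.",
    "Extend it to handle quoted fields that contain commas.",
]

draft = ""
for step in steps:
    gen_history = [{"role": "user", "content": f"{step}\nCurrent draft:\n{draft}"}]
    draft = chat(GENERATOR_SYSTEM, gen_history)

    review_history = [{"role": "user", "content": f"Review this code:\n{draft}"}]
    critique = chat(REVIEWER_SYSTEM, review_history)

    # Feed the critique back so the generator revises its own output.
    gen_history.append({"role": "assistant", "content": draft})
    gen_history.append(
        {"role": "user", "content": f"Revise the code given this review:\n{critique}"}
    )
    draft = chat(GENERATOR_SYSTEM, gen_history)

print(draft)
```

The point of the structure is that the reviewer's feedback accumulates across turns, so each individual request stays small and incremental, which the abstract argues is harder for a model to refuse than one large direct request.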

By Desik Mandava, Jayanth Santosh, Aishwarya Gurung