This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Multi-agent research sprint on October 1, 2023

Jailbreaking the Overseer

If a chat model knows that the task it's doing is scored by another AI, will it try to exploit jailbreaks without being prompted to do so? The answer is yes!* *see details inside :p

By Alexander Meinke
🏆 Awarded by peer review