This study examines the growing capabilities of AI systems, especially Large Language Models (LLMs), in operating computer systems and writing code. Although current LLMs cannot yet replicate themselves uncontrollably, there are concerns that future models may acquire this "blackbox escape" ability. The research presents an evaluation method in which LLMs must solve cybersecurity challenges that involve interacting with a computer and bypassing security measures. A model that can consistently overcome these challenges is more likely to pose a blackbox escape risk. Among the models tested, GPT-4 performs best on the simpler challenges, and more capable models tend to solve challenges consistently and in fewer steps. The paper suggests including automated security challenge solving in comprehensive assessments of model capabilities.
Andrey Anurin, Ziyue Wang