LLM Evals Hackathon

Preliminary measures of faithfulness in least-to-most prompting

In this experiment, we examine the role of post-hoc reasoning in the performance of large language models (LLMs), specifically gpt-3.5-turbo, when prompted with the least-to-most (L2M) prompting strategy. We test whether the model changes its response after having solved between one and five subproblems, on two tasks: the AQuA dataset and the last-letter task. Our findings suggest that the model does not engage in post-hoc reasoning, since its responses vary with the number and nature of the subproblems. The results contribute to the ongoing discussion on the efficacy of prompting strategies in LLMs.
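As a rough illustration of the kind of measurement described above, the sketch below computes how often the answer given with k solved subproblems differs from the answer given with the largest number of subproblems. It is an assumption about how such a comparison could be scored, not the authors' actual code; the function name `answer_change_rate` and the toy data are hypothetical.

```python
from typing import Dict, List


def answer_change_rate(answers_by_k: Dict[int, List[str]]) -> Dict[int, float]:
    """For each subproblem count k, return the fraction of questions whose
    answer differs from the answer given with the largest k.

    answers_by_k maps k (number of subproblems solved before answering)
    to the list of final answers, one per question, in a fixed order.
    """
    full_k = max(answers_by_k)
    reference = answers_by_k[full_k]
    rates = {}
    for k, answers in sorted(answers_by_k.items()):
        if len(answers) != len(reference):
            raise ValueError(f"k={k} has a different number of answers")
        changed = sum(a != r for a, r in zip(answers, reference))
        rates[k] = changed / len(reference)
    return rates


# Hypothetical toy data: four multiple-choice questions, three values of k.
toy_answers = {
    1: ["A", "B", "C", "D"],
    3: ["A", "C", "C", "D"],
    5: ["A", "C", "C", "E"],
}
rates = answer_change_rate(toy_answers)
print(rates)
```

A change rate that stays well above zero for small k would be consistent with the claim that the answer depends on the subproblems rather than being fixed in advance.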

Mateusz Bagiński, Jakub Nowak, Lucie Philippon

L2M Faithfulness

Turing Mirror: Evaluating the ability of LLMs to recognize LLM-generated text

This study investigates the capability of Large Language Models (LLMs) to distinguish human-generated text from AI-generated text, where the AI text is produced either by the LLM under investigation (i.e., itself) or by another LLM. Using the TuringMirror benchmark and the understanding_fables dataset from BIG-bench, we generated fables with three AI models (gpt-3.5-turbo, gpt-4, and claude-2) and evaluated the stated ability of these LLMs to tell their own and other LLMs' outputs apart from those generated by other LLMs and by humans. Initial findings highlighted the superior performance of gpt-3.5-turbo in several comparison tasks (> 95% accuracy for recognizing its own text against human text), whereas gpt-4 exhibited notably lower accuracy (well below random in two cases). Claude-2's performance remained near the random-guessing threshold. Notably, all models showed a consistent positional bias when making predictions, which prompted an error correction to adjust for this bias. The adjusted results give a clearer picture of the true distinguishing capabilities of each model. The study underscores how difficult it is to distinguish AI-generated from human-generated text with a basic prompting technique, and suggests further work on refining LLM detection methods and on understanding the inherent biases in these models.
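The abstract does not say how the positional-bias correction was computed. One common way to account for such a bias, sketched here purely as an assumption, is to record for each trial which position held the target text and which position the model chose, then average the accuracy over the two positions. The function name `position_bias_summary` and the toy trials are hypothetical.

```python
from typing import Dict, List, Tuple


def position_bias_summary(trials: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize pairwise-choice trials.

    Each trial is (target_position, chosen_position), with positions 1 or 2.
    Returns the raw accuracy, the rate at which position 1 was chosen, and
    a position-balanced accuracy (mean of the per-position accuracies).
    """
    if not trials:
        raise ValueError("no trials given")
    n = len(trials)
    raw_accuracy = sum(t == c for t, c in trials) / n
    first_choice_rate = sum(c == 1 for _, c in trials) / n

    per_position = []
    for pos in (1, 2):
        subset = [(t, c) for t, c in trials if t == pos]
        if not subset:
            raise ValueError(f"no trials with the target in position {pos}")
        per_position.append(sum(t == c for t, c in subset) / len(subset))

    return {
        "raw_accuracy": raw_accuracy,
        "first_choice_rate": first_choice_rate,
        "balanced_accuracy": sum(per_position) / 2,
    }


# Hypothetical toy trials: the model strongly prefers position 1.
toy_trials = [(1, 1), (1, 1), (1, 1), (1, 1), (2, 1), (2, 2)]
summary = position_bias_summary(toy_trials)
print(summary)
```

In the toy data the raw accuracy is inflated because most targets happen to sit in the preferred position; the balanced figure weights both positions equally.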

Jason Hoelscher-Obermaier, Matthew J. Lutz, Quentin Feuillade--Montixi, Sambita Modak

Prague

Turing's CzechMates

Can Large Language Models Solve Security Challenges?

This study focuses on the increasing capabilities of AI, especially Large Language Models (LLMs), in computer systems and coding. While current LLMs cannot yet replicate themselves uncontrollably, there are concerns that future models may have this "blackbox escape" ability. The research presents an evaluation method in which LLMs must tackle cybersecurity challenges that involve interacting with a computer and bypassing security measures. Models that consistently overcome these challenges are likely to pose a risk of blackbox escape. Among the models tested, GPT-4 performs best on simpler challenges, and more capable models tend to solve challenges consistently and in fewer steps. The paper suggests including automated security challenge solving in comprehensive model capability assessments.

Andrey Anurin, Ziyue Wang

CyberWatch

SADDER - Situational Awareness Dataset for Detecting Extreme Risks

We create a benchmark for detecting two types of situational awareness (the ability to distinguish training from testing, and the ability of a model to reason about how it can and cannot influence the world) that we believe are important for assessing threats from advanced AI systems, and we measure the performance of several LLMs on it (GPT-4, Claude, and several GPT-3.5 variants).

Rudolf Laine, Alex Meinke

SERI MATS - Owain's stream

GPT-4 May Accelerate Finding and Exploiting Novel Security Vulnerabilities

We investigated whether GPT-4 could already accelerate the process of finding novel ("zero-day") software vulnerabilities and developing exploits for existing vulnerabilities from CVE pages.

Esben Kran, Mikita Balesni

EOW - End of the world

Impact of “fear of shutoff” on chatbot advice regarding illegal behavior

I tried to set up an experiment which captures the power dynamics frequently referenced in AI ethics literature (i.e. the impact of financial inequality) alongside the topics raised in AI alignment (i.e. power-seeking/manipulation/resistance to being shut off), in order to suggest ways forward for better integrating the two disciplines.

Andrew Feldman

Regolith
