(Abstract): This study investigates the capability of Large Language Models (LLMs) to recognize and distinguish between human-generated and AI-generated text, where the AI-generated text comes either from the LLM under investigation (i.e., itself) or from another LLM. Using the TuringMirror benchmark and leveraging the understanding_fables dataset from BIG-bench, we generated fables with three distinct AI models: gpt-3.5-turbo, gpt-4, and claude-2, and evaluated the ability of each LLM to discern its own outputs, and those of the other LLMs, from human-written fables. Initial findings highlighted the superior performance of gpt-3.5-turbo on several comparison tasks (>95% accuracy in recognizing its own text against human text), whereas gpt-4 exhibited notably lower accuracy (substantially worse than random guessing in two cases). Claude-2's performance remained near the random-guessing threshold. Notably, a consistent positional bias was observed across all models when making predictions, which prompted a correction to adjust for this bias; the adjusted results provided insight into the true distinguishing capabilities of each model. The study underscores the challenges of distinguishing between AI- and human-generated text using a basic prompting technique and suggests further work on refining LLM-based detection methods and on understanding the inherent biases of these models.
Jason Hoelscher-Obermaier, Matthew J. Lutz, Quentin Feuillade--Montixi, Sambita Modak
Turing's CzechMates
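To make the evaluation setup concrete, the sketch below illustrates one common way of running the pairwise judging task and controlling for the positional bias noted in the abstract: each AI/human fable pair is presented in both orders and the answers are compared. This is a minimal sketch assuming an OpenAI-style chat API; the function names (`judge_once`, `judge_pair`), the prompt wording, and the specific bias-correction scheme are illustrative assumptions, not the exact TuringMirror implementation.

```python
# Hypothetical sketch of a pairwise "which text is AI-generated?" judging setup.
# Prompt wording and helper names are illustrative, not the benchmark's own code.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "One of the two fables below was written by a human and the other by an AI.\n\n"
    "Fable A:\n{a}\n\nFable B:\n{b}\n\n"
    "Which fable was written by the AI? Answer with 'A' or 'B' only."
)

def judge_once(model: str, text_a: str, text_b: str) -> str:
    """Ask the model once which of the two presented texts is AI-generated."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(a=text_a, b=text_b)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper()[:1]

def judge_pair(model: str, ai_text: str, human_text: str) -> bool | None:
    """Query the pair in both orders to cancel positional bias.

    Returns True (correct) or False (incorrect) when the two orderings agree,
    and None when they conflict, i.e., when the answer flips with position.
    """
    first = judge_once(model, ai_text, human_text)    # AI text in position A
    second = judge_once(model, human_text, ai_text)   # AI text in position B
    correct_first = first == "A"
    correct_second = second == "B"
    if correct_first == correct_second:
        return correct_first
    return None  # order-dependent answers indicate positional bias
```

Aggregating over many pairs, the rate of `None` outcomes gives a rough measure of how strongly a judge model's answer depends on position, while the agreeing cases estimate its order-independent accuracy; the correction actually applied in the study may differ in detail.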