This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the AI and Democracy Hackathon: Demonstrating the Risks research sprint on May 6, 2024

Beyond Refusal: Scrubbing Hazards from Open-Source Models

Models trained on the recently published Weapons of Mass Destruction Proxy (WMDP) benchmark may be more robustly safe because they are trained to forget hazardous information while retaining essential facts, rather than simply refusing to answer. We aim to red-team this approach by answering the following questions about the generalizability of the training approach and its practical scope (see A2).

By Kyle Gabriel Reynoso, Ivan Enclonar, Lexley Maree Villasis