This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Interpretability research sprint on July 17, 2023

Toward a Working Deep Dream for LLM's

This project aims to improve language model interpretability by generating sentences that maximally activate a chosen neuron, inspired by the DeepDream technique for image models. We introduce a novel regularization technique that optimizes over a lower-dimensional latent space rather than the full 768-dimensional embedding space, yielding more coherent and interpretable sentences. Our approach trains an autoencoder that uses a separate GPT-2 model as its encoder and a six-layer transformer as its decoder. Although the autoencoder does not yet fully reconstruct sentences, our work opens new directions for future research in improving language model interpretability.
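To make the core idea concrete, the sketch below shows a DeepDream-style loop that runs gradient ascent in a learned low-dimensional latent space instead of directly over GPT-2's 768-dimensional embeddings, as the abstract describes. This is a minimal illustration under stated assumptions, not the authors' implementation: `LatentDecoder`, `neuron_activation`, `deep_dream`, `LATENT_DIM`, and `SEQ_LEN` are hypothetical names and sizes the write-up does not specify, the randomly initialized six-layer decoder stands in for the trained decoder of the project's autoencoder, and the language model is assumed to follow the Hugging Face GPT-2 interface (`inputs_embeds`, `output_hidden_states`).

```python
import torch
import torch.nn as nn

EMBED_DIM = 768    # GPT-2 hidden size, per the abstract
LATENT_DIM = 64    # assumed latent size; not stated in the write-up
SEQ_LEN = 16       # assumed sentence length; not stated in the write-up

class LatentDecoder(nn.Module):
    """Stand-in for the six-layer transformer decoder from the abstract:
    maps a latent vector to a sequence of soft token embeddings."""
    def __init__(self):
        super().__init__()
        self.expand = nn.Linear(LATENT_DIM, SEQ_LEN * EMBED_DIM)
        block = nn.TransformerEncoderLayer(d_model=EMBED_DIM, nhead=8,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(block, num_layers=6)

    def forward(self, z):
        x = self.expand(z).view(-1, SEQ_LEN, EMBED_DIM)
        return self.transformer(x)

def neuron_activation(lm, embeds, layer_idx, neuron_idx):
    """Mean activation of one hidden-state unit; a simplification of
    targeting a specific neuron. Assumes a Hugging Face-style model that
    accepts `inputs_embeds` and returns `hidden_states`."""
    out = lm(inputs_embeds=embeds, output_hidden_states=True)
    return out.hidden_states[layer_idx][..., neuron_idx].mean()

def deep_dream(lm, decoder, layer_idx, neuron_idx, steps=200, lr=0.05):
    """Gradient ascent on the low-dimensional latent z rather than on the
    full 768-dim embedding sequence -- the regularization idea above."""
    for p in list(lm.parameters()) + list(decoder.parameters()):
        p.requires_grad_(False)  # only the latent code is optimized
    z = torch.randn(1, LATENT_DIM, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -neuron_activation(lm, decoder(z), layer_idx, neuron_idx)
        loss.backward()
        opt.step()
    return decoder(z).detach()  # soft embeddings; nearest tokens give the sentence
```

In practice, a trained decoder from the project's autoencoder would replace the randomly initialized `LatentDecoder`, and `lm` could be, for example, `GPT2Model.from_pretrained("gpt2")` from the `transformers` library.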

By Scott Viteri and Peter Chatain
🏆 Placed by peer review