This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Interpretability research sprint on July 17, 2023

Preliminary Steps Toward Investigating the “Smearing” Hypothesis for Layer Normalization in a 1-Layer SoLU Model

SoLU activation functions have been shown to make large language models more interpretable by incentivizing a fraction of features to align with the standard basis. However, this comes at the cost of suppressing other features. We investigate this problem through experiments suggested in Nanda’s 2023 work “200 Concrete Open Problems in Mechanistic Interpretability”. We conduct three main experiments: (1) we examine how the LayerNorm scale factor changes across a variety of input prompts; (2) we measure the logit effects of ablating neurons with relatively low activations; (3) also using ablations, we attempt to find tokens where “the direct logit attribution (DLA) of the MLP layer is high, but no single neuron is high”.
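The first two experiments can be illustrated with a short sketch. The following is not the authors’ code; it is a minimal example assuming the TransformerLens library, its `solu-1l` checkpoint, and its standard hook names, with a hypothetical neuron index standing in for the sweep over low-activation neurons.

```python
# Minimal sketch: probe the final LayerNorm scale factor and ablate one MLP neuron
# in a 1-layer SoLU model, assuming TransformerLens and its "solu-1l" checkpoint.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("solu-1l")

prompt = "The quick brown fox jumps over the lazy dog"
tokens = model.to_tokens(prompt)
logits, cache = model.run_with_cache(tokens)

# Experiment 1: inspect the final LayerNorm scale factor at each token position.
ln_scale = cache["ln_final.hook_scale"]  # shape: [batch, pos, 1]
print("LayerNorm scale per position:", ln_scale.squeeze(-1))

# Experiment 2: ablate a single MLP neuron and measure the effect on the logits.
neuron_idx = 0  # hypothetical index; the real experiments sweep low-activation neurons

def ablate_neuron(post_acts, hook):
    # Zero out one neuron's post-activation at every position.
    post_acts[:, :, neuron_idx] = 0.0
    return post_acts

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.0.mlp.hook_post", ablate_neuron)],
)
logit_diff = (logits - ablated_logits)[0, -1]
print("Max change in final-token logits:", logit_diff.abs().max().item())
```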

By Mateusz Bagiński, Kunvar Thaman, Rohan Gupta, Alana Xiang, j1ng3r