SoLU activation functions have been shown to make large language models more interpretable by incentivizing a fraction of features to align with the standard basis. However, this comes at the cost of suppressing other features. We investigate this problem using experiments suggested in Nanda’s 2023 work “200 Concrete Open Problems in Mechanistic Interpretability”. We conduct three main experiments: (1) we investigate how the LayerNorm scale factor changes across a variety of input prompts; (2) we investigate the logit effects of ablating neurons with relatively low activation; (3) also using ablations, we attempt to find tokens where “the direct logit attribution (DLA) of the MLP layer is high, but no single neuron is high”.
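As an illustration of the ablation setup, the sketch below zero-ablates a single MLP neuron in a pretrained SoLU model and measures the resulting change in the final-token logits. This is a minimal sketch assuming TransformerLens and its pretrained `solu-2l` checkpoint; the prompt, layer, and neuron index are illustrative placeholders, not the ones used in our experiments.

```python
# Minimal sketch of a single-neuron zero-ablation, assuming TransformerLens.
# The model name, layer, neuron index, and prompt are illustrative only.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("solu-2l")  # pretrained SoLU model

prompt = "The Eiffel Tower is located in the city of"
tokens = model.to_tokens(prompt)

LAYER, NEURON = 1, 42  # hypothetical low-activation neuron to ablate

def zero_ablate_neuron(value, hook):
    # value: [batch, pos, d_mlp] post-activation of the MLP layer
    value[:, :, NEURON] = 0.0
    return value

# Clean and ablated forward passes
clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(f"blocks.{LAYER}.mlp.hook_post", zero_ablate_neuron)],
)

# Logit effect of the ablation at the final token position
logit_diff = (ablated_logits - clean_logits)[0, -1]
top_changes = logit_diff.abs().topk(5)
for idx in top_changes.indices:
    token_str = model.tokenizer.decode([idx.item()])
    print(f"{token_str!r}: {logit_diff[idx].item():+.3f}")
```

The same hook-based pattern extends to the other two experiments: caching the `hook_scale` activations of each block’s LayerNorms records the scale factors across prompts, and per-neuron DLA can be computed by projecting each neuron’s output through the MLP output weights and the unembedding.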
Mateusz Bagiński, Kunvar Thaman, Rohan Gupta, Alana Xiang, j1ng3r
SoLUbility