This work was done during one weekend by research workshop participants and does not represent the work of Apart Research.
Accepted at the Mechanistic Interpretability Hackathon research sprint on January 25, 2023

Distillation by duplication: The importance of layer selection

The layers of a network are chained together in a pipeline: each layer knows how to decode the information passed to it by the previous layer and how to process it further, ultimately leading to a prediction. We therefore hypothesise that, on the one hand, it may be beneficial to copy consecutive layers from the teacher to the student, since they can already decode each other's output. On the other hand, copying layers that are far apart may transfer knowledge of different processing steps, while the connections between them can be learnt more easily.
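The two selection strategies contrasted above can be illustrated with a small sketch. It only computes which teacher layer indices would be copied; the function names `consecutive_layers` and `spread_layers`, the `start` parameter, and the even-spacing rule are assumptions made for illustration and are not taken from the authors' implementation.

```python
def consecutive_layers(n_teacher: int, n_student: int, start: int = 0) -> list[int]:
    """Pick a contiguous block of teacher layer indices (hypothetical helper)."""
    if not 0 < n_student <= n_teacher:
        raise ValueError("student must have between 1 and n_teacher layers")
    if not 0 <= start <= n_teacher - n_student:
        raise ValueError("block does not fit inside the teacher")
    return list(range(start, start + n_student))


def spread_layers(n_teacher: int, n_student: int) -> list[int]:
    """Pick evenly spaced teacher layer indices (hypothetical helper).

    Assumption: the first and last teacher layers are always included
    when the student has at least two layers.
    """
    if not 0 < n_student <= n_teacher:
        raise ValueError("student must have between 1 and n_teacher layers")
    if n_student == 1:
        return [0]
    # Integer arithmetic keeps the indices distinct and avoids float rounding.
    return [(i * (n_teacher - 1)) // (n_student - 1) for i in range(n_student)]


if __name__ == "__main__":
    # A 12-layer teacher distilled into a 4-layer student.
    print(consecutive_layers(12, 4))  # [0, 1, 2, 3]
    print(spread_layers(12, 4))       # [0, 3, 7, 11]
```

In a duplication-based setup, the chosen indices would determine which teacher layers initialise the student; how the copied layers are then fine-tuned is outside the scope of this sketch.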

By Roksana Goworek, Paul Martin, Jonathan Frennert