This work was completed in 48 hours by research workshop participants and does not represent the work of Apart Research.

Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small

We identify the broad structure of a circuit in GPT-2 small that is associated with correctly predicting a gendered pronoun given the subject of a rhetorical question. Progress towards identifying this circuit is made using a variety of existing tools, namely Conmy's Automatic Circuit Discovery and Nanda's Exploratory Analysis tools. We present this report not only as a preliminary understanding of the broad structure of a gendered-pronoun circuit, but also, perhaps, as a structured, re-implementable procedure for identifying circuits for other tasks in large transformer language models. Further work is warranted to refine the proposed circuit and to better understand the associated human-interpretable algorithm.
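The task and metric implied by the abstract can be made concrete with a short sketch. The following is a minimal, illustrative example assuming the TransformerLens library (which underpins Nanda's Exploratory Analysis tools); the prompt and the pronoun pair are hypothetical stand-ins for the authors' actual dataset, and the logit difference is a common choice of metric for this kind of analysis rather than necessarily the exact one used in the report.

```python
from transformer_lens import HookedTransformer

# Load GPT-2 small, the model studied in this report.
model = HookedTransformer.from_pretrained("gpt2")

# An illustrative rhetorical-question prompt whose subject implies
# a gendered pronoun; the authors' actual dataset is not shown here.
prompt = "So Sarah is a great friend, isn't"

# Token ids for the two candidate pronouns (note the leading space).
she_id = model.to_single_token(" she")
he_id = model.to_single_token(" he")

# Forward pass; take the logits at the final position, where the
# pronoun is predicted.
logits = model(prompt, return_type="logits")
final_logits = logits[0, -1]

# Logit difference between the correct and incorrect pronoun: a
# standard scalar metric for exploratory circuit analysis.
logit_diff = (final_logits[she_id] - final_logits[he_id]).item()
print(f"logit(' she') - logit(' he') = {logit_diff:.3f}")
```

From a metric like this one, the techniques named in the abstract (for example, activation patching from Nanda's exploratory-analysis tooling, or Conmy's Automatic Circuit Discovery) can be used to localise which attention heads and MLPs contribute to the prediction.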

Chris Mathwin, Guillaume Corlouer

Hackathon

Mechanistic Interpretability Hackathon

Sunday, January 22, 2023

Jam site

LEAH Mechanistic Interpretability Hackathon

LEAH is hosting the mechanistic interpretability hackathon for the London universities: Imperial, King's, and UCL.

London