All Jams, Events & Projects

See what participants have created
Fuzzing Large Language Models
Towards Formally Describing Program Traces from Chains of Language Model Calls with Causal Influence Diagrams: A Sketch
It Ain't Much but it's ONNX Work
OthelloScope
Apr 2023
Improving TransformerLens Head Detector
Dropout Incentivizes Privileged Bases
Solving the CNN Mech Int Challenge
AutoAdminsteredAntidotes: Circuit detection in a poisoned model for MNIST classification
Othello Mechint playground
Detecting Phase Transitions
Exploring OthelloGPT
Algorithmic Explanation: A method for measuring interpretations of neural networks
AI Governance
AI and Democracy: Balancing Risks and Opportunities to Maintain Meaningful Human Control
AI Governance
AI Impact Assessments
Interpretability
Investigating Neuron Behaviour via Dataset Example Pruning and Local Search
Nov 2022
Thinkathon
New AI organization brainstorm
Thinkathon
Risk Defense Initiative
Thinkathon
AI Safety unionization for bottom-up governance
Thinkathon
AI Safety Talent Pool Identification
Thinkathon
Analysis of upcoming AGI companies
Thinkathon
Diversity in AI safety
Thinkathon
Critique of OpenAI's alignment plan
Thinkathon
Simon's Time-Off Newsletter
Thinkathon
ChatGPT Alignment Talent Search
Thinkathon
AI Safety Subproblems for Software Engineering Researchers
Thinkathon
Catalogue of AI safety
Thinkathon
Authority bias to ChatGPT
Oversight
Reverse Word Wizards: Pitting Language Models Against the Art of Reversal
Oversight
Player Of Games
Oversight
Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques
Oversight
Automated Model Oversight Using CoTP
Oversight
Physics Guided Deep Learning Interpretation
Oversight
Can you keep a secret?
Oversight
Sustainable Fashion Brand Language Learning Model 1
Mechanistic
Soft Prompts are a Convex Set
Mechanistic
Automated Identification of Potential Feature Neurons
Mechanistic
Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small
Mechanistic
We Discovered An Neuron
Mechanistic
TraCR-Supported Mechanistic Interpretability
Mechanistic
$B$ Confident Bro: Discovering Latent Knowledge In Language Models Without Supervision
Mechanistic
Distillation by duplication: The importance of layer selection
Mechanistic
Attention Phrenology: A spatial classification of attention heads
Mechanistic
Iterative summarization interpretability
Mechanistic
Investigating Agent Behavior In different RL methods
Mechanistic
The Start of Investigating a 1-Layer SoLU Model
Mechanistic
Trafo Mech Int on the web!
Mechanistic
One Attention Head Is All You Need for Sorting Fixed-Length Lists
Mechanistic
In search of linguistic concepts: investigating BERT's context vectors
Mechanistic
Interactive Layerscope
AI Testing
Trojan detection and implementation on transformers
AI Testing
Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3
AI Testing
Investigating Training Dynamics via Token Loss Trajectories
AI Testing
Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing
AI Testing
Evaluating Critical Level Of Perturbations Required To Achieve Certain Fail Rate
AI Testing
Formal Verification for Paren-balance checking
AI Testing
Model Hubris: On the Presumptuousness of Large Language Models
AI Testing
This Is Fine(-tuning): A benchmark testing LLMs robustness against bad fine-tuning data
AI Testing
LLM benchmarking through specifically-aligned feedback
Interpretability
Probing Conceptual Knowledge on Solved Games
Interpretability
Model editing hazards at the example of ROME
Nov 2022
Interpretability
Backup Transformer Heads are Robust to Ablation Distribution
Nov 2022
Interpretability
An Intuitive Logic for Understanding Autoregressive Language Models
Interpretability
Top-Down Interpretability Through Eigenspectra
Nov 2022
Interpretability
An Informal Investigation of Indirect Object Identification in Mistral GPT2-Small Battlestar
Nov 2022
Interpretability
Mechanisms of Causal Reasoning
Nov 2022
Interpretability
Caught Red-Bandit
Nov 2022
Interpretability
Natural language descriptions for natural language directions
Interpretability
Trying to make GPT2 dream
Interpretability
Visualizing the effect prompt design has on text-davinci-002 mode collapse and social biases
Interpretability
Optimising image patches to change RL-agent behaviour
Interpretability
Finding unusual neuron sets by activation vector distance
Interpretability
How to find the minimum of a list - Transformer Edition
Interpretability
Alignment Jam : Gradient-based Interpretability of Quantum-inspired neural networks
Nov 2022
Interpretability
War is 15% conflic, 15% DragonMagazine
Nov 2022
Interpretability
Interpreting Catastrophic Failure Modes in OpenAI’s Whisper
Interpretability
Algorithmic bit-wise boolean task on a transformer
Nov 2022
Interpretability
Interpretability at a glance
Interpretability
Neurons and Attention Heads that Look for Sentence Structure in GPT2
Interpretability
Sparsity Lens
Interpretability
Observing and Validating Induction heads in SOLU-8l-old
Nov 2022
Interpretability
Regularly Oversimplifying Neural Networks
LLM Hackathon
Simulating an Alien
LLM Hackathon
Wording influences truthfulness
LLM Hackathon
Reasoning with Chain of Thought
Oct 2022
LLM Hackathon
Reducing hindsight neglect with "Let's think step by step"
Oct 2022
LLM Hackathon
All Fish are Trees
Oct 2022
LLM Hackathon
Soliciting criminal advice from LLMs
LLM Hackathon
Agreeableness vs. Truthfulness