All Jams, Events & Projects

See what participants have created
Thinkathon
New AI organization brainstorm
Thinkathon
Risk Defense Initiative
Thinkathon
AI Safety unionization for bottom-up governance
Thinkathon
AI Safety Talent Pool Identification
Thinkathon
Analysis of upcoming AGI companies
Thinkathon
Diversity in AI safety
Thinkathon
Critique of OpenAI's alignment plan
Thinkathon
Simon's Time-Off Newsletter
Thinkathon
ChatGPT Alignment Talent Search
Thinkathon
AI Safety Subproblems for Software Engineering Researchers
Thinkathon
Catalogue of AI safety
Thinkathon
Authority bias to ChatGPT
Oversight
Reverse Word Wizards: Pitting Language Models Against the Art of Reversal
Oversight
Player Of Games
Oversight
Automated Sandwiching: Efficient Self-Evaluations of Conversation-Based Scalable Oversight Techniques
Oversight
Automated Model Oversight Using CoTP
Oversight
Physics Guided Deep Learning Interpretation
Oversight
Can you keep a secret?
Oversight
Sustainable Fashion Brand Language Learning Model 1
Mechanistic
Soft Prompts are a Convex Set
Mechanistic
Automated Identification of Potential Feature Neurons
Mechanistic
Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small
Mechanistic
We Discovered An Neuron
Mechanistic
TraCR-Supported Mechanistic Interpretability
Mechanistic
$B$ Confident Bro: Discovering Latent Knowledge In Language Models Without Supervision
Mechanistic
Distillation by duplication: The importance of layer selection
Mechanistic
Attention Phrenology: A spatial classification of attention heads
Mechanistic
Iterative summarization interpretability
Mechanistic
Investigating Agent Behavior In different RL methods
Mechanistic
The Start of Investigating a 1-Layer SoLU Model
Mechanistic
Trafo Mech Int on the web!
Mechanistic
One Attention Head Is All You Need for Sorting Fixed-Length Lists
Mechanistic
In search of linguistic concepts: investigating BERT's context vectors
Mechanistic
Interactive Layerscope
AI Testing
Trojan detection and implementation on transformers
AI Testing
Counting Letters, Chaining Premises & Solving Equations: Exploring Inverse Scaling Problems with GPT-3
AI Testing
Investigating Training Dynamics via Token Loss Trajectories
AI Testing
Discovering Latent Knowledge in Language Models Without Supervision - extensions and testing
AI Testing
Evaluating Critical Level Of Perturbations Required To Achieve Certain Fail Rate
AI Testing
Formal Verification for Paren-balance checking
AI Testing
Model Hubris: On the Presumptuousness of Large Language Models
AI Testing
This Is Fine(-tuning): A benchmark testing LLMs robustness against bad fine-tuning data
AI Testing
LLM benchmarking through specifically-aligned feedback
Interpretability
Probing Conceptual Knowledge on Solved Games
Interpretability
Model editing hazards at the example of ROME
Nov 2022
Interpretability
Backup Transformer Heads are Robust to Ablation Distribution
Nov 2022
Interpretability
Investigating Neuron Behaviour via Dataset Example Pruning and Local Search
Nov 2022
Interpretability
An Intuitive Logic for Understanding Autoregressive Language Models
Interpretability
Top-Down Interpretability Through Eigenspectra
Nov 2022
Interpretability
An Informal Investigation of Indirect Object Identification in Mistral GPT2-Small Battlestar
Nov 2022
Interpretability
Mechanisms of Causal Reasoning
Nov 2022
Interpretability
Caught Red-Bandit
Nov 2022
Interpretability
Natural language descriptions for natural language directions
Interpretability
Trying to make GPT2 dream
Interpretability
Visualizing the effect prompt design has on text-davinci-002 mode collapse and social biases
Interpretability
Optimising image patches to change RL-agent behaviour
Interpretability
Finding unusual neuron sets by activation vector distance
Interpretability
How to find the minimum of a list - Transformer Edition
Interpretability
Alignment Jam: Gradient-based Interpretability of Quantum-inspired neural networks
Nov 2022
Interpretability
War is 15% conflic, 15% DragonMagazine
Nov 2022
Interpretability
Interpreting Catastrophic Failure Modes in OpenAI’s Whisper
Interpretability
Algorithmic bit-wise boolean task on a transformer
Nov 2022
Interpretability
Interpretability at a glance
Interpretability
Neurons and Attention Heads that Look for Sentence Structure in GPT2
Interpretability
Sparsity Lens
Interpretability
Observing and Validating Induction heads in SOLU-8l-old
Nov 2022
Interpretability
Regularly Oversimplifying Neural Networks
LLM Hackathon
Simulating an Alien
LLM Hackathon
Wording influences truthfulness
LLM Hackathon
Reasoning with Chain of Thought
Oct 2022
LLM Hackathon
Reducing hindsight neglect with "Let's think step by step"
Oct 2022
LLM Hackathon
All Fish are Trees
Oct 2022
LLM Hackathon
Soliciting criminal advice from LLMs
LLM Hackathon
Agreeableness vs. Truthfulness