
The Self-Hating Attention Head: A Deep Dive in …
Jul 4, 2025 · TL;DR: gpt2-small's head L1H5 directs attention to semantically similar tokens and actively suppresses self-attention. The head computes attention purely based on token …
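A rough sketch of how one might probe this kind of claim with TransformerLens (my own illustration, not the post's code): inspect L1H5's attention pattern on an arbitrary prompt, measure how little mass falls on the diagonal (self-attention), and use repeated tokens as a crude stand-in for the "semantically similar" targets the post describes.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # gpt2-small
prompt = "The cat sat on the mat because the cat was tired."  # arbitrary example
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

LAYER, HEAD = 1, 5
pattern = cache["pattern", LAYER][0, HEAD]  # [query_pos, key_pos]; each row sums to 1

# Self-attention: how much each query position attends to itself.
print("mean self-attention of L1H5:", pattern.diagonal().mean().item())

# Attention to earlier occurrences of the same token, a crude proxy for the
# "semantically similar" targets described in the post.
toks = tokens[0]
repeats = (toks[:, None] == toks[None, :]) & torch.tril(
    torch.ones(len(toks), len(toks), dtype=torch.bool), diagonal=-1
)
print("total mass on repeated earlier tokens:", pattern[repeats].sum().item())
```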
We Inspected Every Head In GPT-2 Small using SAEs So You Don't Have To
Mar 5, 2024 · Using our SAEs, we inspect the roles of every attention head in GPT-2 small, discovering a wide range of previously unidentified behaviors. We manually examined every …
Attention SAEs Scale to GPT-2 Small — AI Alignment Forum
Feb 3, 2024 · We feel pretty convinced that attention SAEs extract interpretable features, and allow for interesting exploration of what attention layers have learned. Now we focus on …
An Extremely Opinionated Annotated List of My Favourite …
One of my favourite phenomena is when someone puts out an exciting paper that gets a lot of attention yet has some subtle flaws, and follow-up work identifies and clarifies these.
Attention Output SAEs Improve Circuit Analysis - AI Alignment Forum
Jun 21, 2024 · Rob designed and built the tool to discover attention feature circuits on arbitrary prompts with recursive DFA, and performed automated circuit scans for examples of attention …
Sparse Autoencoders Work on Attention Layer Outputs - AI Alignment Forum
We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention …
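For readers unfamiliar with the setup, here is a minimal sketch of an Anthropic-style sparse autoencoder applied to attention-output activations; the class and function names, dimensions, and L1 coefficient are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnSAE(nn.Module):
    """Sparse autoencoder over attention-output activations (e.g. hook_z, concatenated)."""

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_in, d_hidden) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def forward(self, x: torch.Tensor):
        # Encoder: ReLU gives sparse, non-negative feature activations.
        feats = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decoder: reconstruct the original activation from the features.
        x_hat = feats @ self.W_dec + self.b_dec
        return x_hat, feats

def sae_loss(x, x_hat, feats, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return F.mse_loss(x_hat, x) + l1_coeff * feats.abs().sum(dim=-1).mean()

# Toy usage: gpt2-small's attention output has d_model = 768; a 16x expansion
# factor and random activations stand in for real cached activations here.
sae = AttnSAE(d_in=768, d_hidden=768 * 16)
acts = torch.randn(32, 768)
x_hat, feats = sae(acts)
loss = sae_loss(acts, x_hat, feats)
```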
Polysemantic Attention Head in a 4-Layer Transformer - AI Alignment Forum
Nov 9, 2023 · This post provides evidence that attention heads play a complex role within a model's computation, and that simplifying an attention head to a single behaviour …
Thought Anchors: Which LLM Reasoning Steps Matter? - AI Alignment Forum
Jul 2, 2025 · We identify "receiver heads": attention heads that tend to pinpoint and narrow attention to a small set of sentences. These heads reveal sentences receiving disproportionate …
Attention Output SAEs - AI Alignment Forum
Jun 21, 2024 · We perform a qualitative study of the features computed by attention layers, and find multiple feature families: long-range context, short-range context and induction features, …
The positional embedding matrix and previous-token heads: how …
Aug 9, 2023 · Looking at the attention patterns of L4H11 more carefully, we can see right away that there is a qualitative difference between how it implements previous-token attention and …
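One way to look at the positional contribution in isolation (a sketch of my own, under the simplifying assumption that LayerNorm and token embeddings can be ignored) is to score query/key position pairs using only the positional embedding matrix and L4H11's QK weights in TransformerLens, then check how much of the resulting pattern lands on the previous token.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # gpt2-small
LAYER, HEAD, N = 4, 11, 64  # L4H11, restricted to the first 64 positions

# Query/key vectors computed from positional embeddings only
# (ignoring token embeddings and LayerNorm, so this is only an approximation).
W_pos = model.W_pos[:N]                                        # [N, d_model]
q = W_pos @ model.W_Q[LAYER, HEAD] + model.b_Q[LAYER, HEAD]    # [N, d_head]
k = W_pos @ model.W_K[LAYER, HEAD] + model.b_K[LAYER, HEAD]    # [N, d_head]

scores = (q @ k.T) / model.cfg.d_head ** 0.5                   # [N, N]
causal_mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
pattern = scores.masked_fill(causal_mask, float("-inf")).softmax(dim=-1)

# Fraction of positional-only attention that lands on the previous token.
print("mean attention to previous token:", pattern.diagonal(offset=-1).mean().item())
```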