LLM Labyrinth — reconstruct a language model's circuit

The maze starts hidden. You know only the answer (top) and the input tokens (bottom). Click a node to reveal what it connects to and explore inward; tap the ⊕ badge to add a feature to your circuit. Find the smallest set of features sufficient to produce the answer — and avoid the decoys that lead to other answers.

scroll to zoom · drag to pan

What am I looking at?

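This is a real attribution graph of the open model gemma-2-2b, traced with sparse transcoder features (Gemma Scope) and sourced from Neuronpedia. Each door is a node in that graph: a feature, or a labelled group of features, that the model actually used on this prompt. The edges are direct attributions: estimates of how much one node's activation contributes to the next on this prompt, which makes them causal-ish rather than correlations.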
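As a rough mental model, an attribution graph can be held as a weighted directed graph, and "revealing" a node means listing its strongest direct inputs. The sketch below is illustrative only: the class name, the `reveal` method, the node names and the weights are all hypothetical, and it is not the Neuronpedia data format or this page's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class AttributionGraph:
    # incoming[target] maps each source node to its direct attribution weight.
    incoming: dict = field(default_factory=dict)

    def add_edge(self, source: str, target: str, weight: float) -> None:
        self.incoming.setdefault(target, {})[source] = weight
        self.incoming.setdefault(source, {})  # make sure the source is a known node

    def reveal(self, node: str, top_k: int = 3) -> list:
        """Return the strongest direct inputs to `node`, ranked by |weight|."""
        inputs = self.incoming.get(node, {})
        ranked = sorted(inputs.items(), key=lambda kv: abs(kv[1]), reverse=True)
        return ranked[:top_k]


# Toy graph with made-up names and weights (assumptions, not real gemma-2-2b data).
graph = AttributionGraph()
graph.add_edge("feature_A", "answer", 0.7)
graph.add_edge("decoy_B", "answer", -0.2)
graph.add_edge("error_node", "answer", 0.4)
graph.add_edge("token_0", "feature_A", 0.9)

revealed = graph.reveal("answer", top_k=2)
print(revealed)
```

Ranking by absolute weight is a design choice made here for illustration: attribution weights can be negative, and a strongly negative input is still an important connection.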

Honesty notes, because this is interpretability: the graph is a linear approximation under the local replacement model (MLPs swapped for transcoders, plus an error term). It is valid around this one prompt and is not the literal forward pass. The v1 score is an attribution-weight proxy for sufficiency, not a true ablation. The grey fog nodes are error regions where the transcoders fail to reconstruct the MLP output and the trace genuinely breaks down. That is the field's most important caveat: interpretability is lossy.
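To make "attribution-weight proxy for sufficiency" concrete, here is one way such a proxy could be computed. The exact v1 formula is not stated here, so everything below is an assumption: the score is taken to be the share of input-to-answer path weight that survives when only the chosen features are kept, the graph is assumed acyclic with non-negative weights, and all names and numbers are invented.

```python
from collections import defaultdict
from itertools import combinations


def path_flow(edges, sources, target, allowed):
    """Sum, over all source->target paths, of the product of edge weights,
    using only intermediate nodes in `allowed`. Assumes an acyclic graph."""
    incoming = defaultdict(list)
    for src, dst, weight in edges:
        incoming[dst].append((src, weight))
    keep = set(allowed) | set(sources) | {target}
    memo = {}

    def flow(node):
        if node in sources:
            return 1.0
        if node not in memo:
            memo[node] = sum(w * flow(s) for s, w in incoming[node] if s in keep)
        return memo[node]

    return flow(target)


def proxy_score(edges, sources, target, circuit):
    """Fraction of total path weight retained by `circuit` (a proxy, not an ablation)."""
    all_nodes = {n for src, dst, _ in edges for n in (src, dst)}
    full = path_flow(edges, sources, target, all_nodes)
    return path_flow(edges, sources, target, circuit) / full if full else 0.0


def smallest_sufficient(edges, sources, target, candidates, threshold=0.9):
    """Brute-force the smallest feature set whose proxy score reaches `threshold`."""
    for size in range(len(candidates) + 1):
        for subset in combinations(sorted(candidates), size):
            if proxy_score(edges, sources, target, set(subset)) >= threshold:
                return set(subset)
    return None


# Toy graph: (source, target, weight). Names and weights are made up.
edges = [
    ("tok_a", "f1", 0.8), ("tok_a", "decoy", 0.5), ("tok_b", "f2", 0.6),
    ("f1", "f2", 0.5), ("f1", "answer", 0.5), ("f2", "answer", 1.0),
    ("decoy", "answer", 0.2),
]
sources = {"tok_a", "tok_b"}

score_both = proxy_score(edges, sources, "answer", {"f1", "f2"})
best = smallest_sufficient(edges, sources, "answer", {"f1", "f2", "decoy"})
print(round(score_both, 3), best)
```

A proxy like this only redistributes weights already present in the linearised graph. A true ablation would rerun the model with the other features removed, which is why the two can disagree. The brute-force search is exponential in the number of candidates and is only practical for toy graphs.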