About LLM Labyrinth
LLM Labyrinth is a puzzle played over a real attribution graph of a language model. When a model like gemma-2-2b answers a question, it does so through an internal circuit of features, interpretable units of computation recovered with sparse transcoders (Gemma Scope). This game hands you that circuit and asks you to reconstruct it: pick the smallest set of features that is still sufficient to produce the answer.
It is mechanistic interpretability turned into a game: not a passive graph viewer, but an active predict → assemble → score loop. The circuit data is sourced from Neuronpedia.
How to play
- The model was given a prompt, "the capital of the state containing Dallas is", and answered with the single token "Austin".
- The graph shows the features it used. Input tokens sit at the bottom; the answer logit at the top; the doors (features and grouped concepts) in between.
- Click doors to assemble the circuit you believe is sufficient to produce the answer, then Submit.
- You are scored on two axes: sufficiency (how much of the answer's attribution your circuit recovers) and minimality (fewer doors is better).
How the scoring works
The answer logit emits one unit of "credit" that flows backward through the graph along normalized attribution weights, but it may only pass through the doors you selected. Whatever credit reaches the input tokens is the fraction of the answer your circuit recovers. Selecting every reference door recovers 100%; dropping a load-bearing hop strands the credit and your score falls.
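The credit flow described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the game's actual scoring code: the graph is assumed to be acyclic, weights are assumed non-negative and are normalized per node over its incoming edges, and every node name, weight, and the function name recovered_fraction are hypothetical.

```python
def recovered_fraction(incoming, inputs, answer, selected):
    """Fraction of one unit of credit that travels from `answer` back to `inputs`.

    incoming: dict mapping a node to {upstream_node: attribution_weight}.
    Assumptions (not from the game): acyclic graph, non-negative weights.
    """
    allowed = set(selected) | set(inputs)  # credit may enter only these nodes
    memo = {}

    def reach(node):
        if node in inputs:
            return 1.0  # credit that lands on an input token counts as recovered
        if node in memo:
            return memo[node]
        sources = incoming.get(node, {})
        total = sum(sources.values())
        frac = 0.0
        if total > 0:
            for src, weight in sources.items():
                if src in allowed:  # unselected doors strand their share
                    frac += (weight / total) * reach(src)
        memo[node] = frac
        return frac

    return reach(answer)


# Hypothetical toy graph: two input tokens, three doors, one answer logit.
inputs = {"Dallas", "capital"}
incoming = {
    "Austin": {"say_austin": 0.75, "texas": 0.25},
    "say_austin": {"texas": 0.5, "say_capital": 0.5},
    "texas": {"Dallas": 1.0},
    "say_capital": {"capital": 1.0},
}

all_doors = {"texas", "say_capital", "say_austin"}
full = recovered_fraction(incoming, inputs, "Austin", all_doors)
without_texas = recovered_fraction(incoming, inputs, "Austin", all_doors - {"texas"})
```

In this toy graph, selecting all three doors recovers the full unit of credit, while dropping the hypothetical "texas" door strands both its direct edge to the answer and its share of "say_austin", leaving 0.375.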
Honesty notes (this is interpretability)
- The graph is a linear approximation under the local-replacement model (MLPs are swapped for transcoders plus an error term), valid around this one prompt. It is not the literal forward pass.
- The v1 score is an attribution-weight proxy for sufficiency, not a true ablation. A later phase will replace it with real interventions.
- The grey fog nodes are error regions where the transcoders fail to reconstruct and the trace genuinely breaks down, a reminder that interpretability is lossy.
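To make the "transcoders plus an error term" note concrete, here is a minimal sketch of one replaced MLP layer at a single token position. It assumes the error term is simply the residual between the true MLP output and the transcoder reconstruction, computed on this one prompt. All numbers, matrices, and the function name local_replacement are invented for illustration, and a plain ReLU stands in for whatever activation the real transcoders use.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def local_replacement(x, mlp_out, w_enc, w_dec):
    """Return (feature activations, reconstruction, error term) for one position.

    Assumption: error = true MLP output - transcoder reconstruction,
    computed for this specific input x and held fixed afterwards.
    """
    acts = [max(a, 0.0) for a in matvec(w_enc, x)]  # ReLU stand-in
    recon = matvec(w_dec, acts)
    error = [m - r for m, r in zip(mlp_out, recon)]
    return acts, recon, error


# Hypothetical values: 2-dimensional residual stream, 3 transcoder features.
x = [1.0, -2.0]                      # MLP input on this prompt
mlp_out = [0.75, 1.5]                # pretend true MLP output on this prompt
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_dec = [[0.5, 0.0, 0.0], [0.0, 1.0, 1.0]]

acts, recon, error = local_replacement(x, mlp_out, w_enc, w_dec)
replaced = [r + e for r, e in zip(recon, error)]  # matches mlp_out on this prompt
```

Here the reconstruction covers only part of the output and the error term carries the rest. A large error term is what a fog node represents: the sum still matches the model on this prompt, but the part carried by the error has no feature-level explanation.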
Who it's for
Anyone curious how language models actually work, and the mechanistic interpretability community in particular. Every door carries a plain-English label on top and the real feature id, activation, and a link to its Neuronpedia dashboard underneath. Flip the raw switch to peel the labels and reason from the underlying features.