🧬 LLM Labyrinth

by lilywhitegames

About LLM Labyrinth

LLM Labyrinth is a puzzle played over a real attribution graph of a language model. When a model like gemma-2-2b answers a question, it does so through an internal circuit of features, interpretable units of computation recovered with sparse transcoders (Gemma Scope). This game hands you that circuit and asks you to reconstruct it: pick the smallest set of features that is still sufficient to produce the answer.

It is mechanistic interpretability turned into a game: not a passive graph viewer, but an active predict → assemble → score loop. The circuit data is sourced from Neuronpedia.

▶ Play the first level

How to play

  1. The model was given a prompt, “the capital of the state containing Dallas is”, and answered with the single token “Austin”.
  2. The graph shows the features it used. Input tokens sit at the bottom; the answer logit at the top; the doors (features and grouped concepts) in between.
  3. Click doors to assemble the circuit you believe is sufficient to produce the answer, then Submit.
  4. You are scored on two axes: sufficiency (how much of the answer's attribution your circuit recovers) and minimality (fewer doors is better).

How the scoring works

The answer logit emits one unit of “credit” that flows backward through the graph along normalized attribution weights, but it may only pass through the doors you selected. Whatever credit reaches the input tokens is the fraction of the answer your circuit recovers. Selecting every reference door recovers 100%; dropping a load-bearing hop strands the credit and your score falls.
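For the curious, here is a minimal sketch of that credit flow in Python. It is an illustration under stated assumptions, not the game's actual scorer: the toy graph, the node names, and the weights are invented, the weights are assumed non-negative, and "normalized" is read as "each node's incoming weights are rescaled to sum to 1".

```python
from typing import Dict, Set

# Hypothetical toy graph (not real Neuronpedia data):
# node -> {upstream source: raw attribution weight}.
EDGES: Dict[str, Dict[str, float]] = {
    "logit:Austin": {"door:say_capital": 3.0, "door:texas": 1.0},
    "door:say_capital": {"door:texas": 1.0, "tok:capital": 1.0},
    "door:texas": {"tok:Dallas": 2.0},
}
TOKENS: Set[str] = {"tok:capital", "tok:Dallas"}
ANSWER = "logit:Austin"


def sufficiency(
    edges: Dict[str, Dict[str, float]],
    tokens: Set[str],
    selected: Set[str],
    answer: str,
) -> float:
    """Fraction of the answer's one unit of credit that reaches input tokens."""
    memo: Dict[str, float] = {}

    def reach(node: str) -> float:
        # Credit that arrives at an input token counts as recovered.
        if node in tokens:
            return 1.0
        # Credit sent to a door the player did not select is stranded.
        if node != answer and node not in selected:
            return 0.0
        if node in memo:
            return memo[node]
        sources = edges.get(node, {})
        total = sum(sources.values())
        # Assumption: weights are non-negative and normalized per node.
        value = 0.0
        if total > 0:
            value = sum(w / total * reach(src) for src, w in sources.items())
        memo[node] = value
        return value

    return reach(answer)


all_doors = {"door:say_capital", "door:texas"}
full = sufficiency(EDGES, TOKENS, all_doors, ANSWER)
partial = sufficiency(EDGES, TOKENS, {"door:say_capital"}, ANSWER)
print(f"all doors: {full:.3f} with {len(all_doors)} doors")
print(f"one door:  {partial:.3f} with 1 door")
```

In this toy graph, selecting both doors recovers all of the credit. Dropping the texas door strands every path that ran through it, so the one-door circuit is smaller but recovers only part of the answer. That is the trade-off between sufficiency and minimality.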

Honesty notes (this is interpretability)

Who it's for

Anyone curious how language models actually work, and the mechanistic interpretability community in particular. Every door carries a plain-English label on top and the real feature id, activation, and a link to its Neuronpedia dashboard underneath. Flip the raw switch to peel off the labels and reason from the underlying features.

← back to the labyrinth Β· privacy