How Transformers Actually Work: A Visual Guide for Developers


Every major language model you’ve heard of (GPT, Claude, Gemini, Llama, Mistral) runs on the same fundamental architecture: the transformer. It was introduced in a 2017 paper called “Attention Is All You Need,” and it reshaped the field. Before transformers, sequence models like RNNs processed words one at a time, like reading a sentence through a keyhole. Transformers read the whole page at once.

If you build on top of LLMs, understanding what’s happening inside them isn’t optional curiosity; it’s the difference between debugging effectively and guessing. This post walks through the transformer architecture piece by piece, using analogies instead of equations.

The architecture at 30,000 feet

The original transformer has two halves: an encoder (reads input) and a decoder (produces output). Think of it like a translator: the encoder understands the source language, and the decoder writes in the target language.

┌─────────────┐      ┌─────────────┐
│   ENCODER   │ ───► │   DECODER   │
│ (understand)│      │  (generate) │
└─────────────┘      └─────────────┘

But here’s the thing: most modern LLMs (GPT, Claude, Llama) are decoder-only. They don’t have a separate encoder. They just predict the next token, over and over, using only the decoder stack. The encoder-decoder design still shows up in models built for translation or summarization (like T5), but for general-purpose text generation, decoder-only won.

Why? Because “predict the next word” turns out to be a shockingly powerful training objective. You don’t need labeled data. You just need text, and the internet has plenty of that.
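To make the “no labeled data” point concrete, here is a minimal sketch of how plain text can supply its own training examples. Whitespace splitting stands in for a real tokenizer, and `next_token_pairs` is a hypothetical helper name, not something from a real library.

```python
def next_token_pairs(tokens):
    """Turn one token sequence into (context, target) training pairs."""
    # Every prefix of the text is an input; the token that follows it is the label.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

Six tokens give five training examples, and nobody had to annotate anything: the text itself provides the answers.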

Step 1: Tokenization and embeddings

Before a transformer sees your prompt, the prompt is broken into tokens: chunks that might be words, subwords, or individual characters. “unhappiness” might become ["un", "happiness"]. (For the full story, see how tokenizers work.)

Each token is then converted into an embedding: a dense vector of numbers (typically 768 to 12,288 dimensions depending on model size) that captures meaning. “King” and “queen” end up as nearby vectors. “King” and “toaster” don’t. These embeddings are learned during training, not hand-crafted. (More on this in how embeddings work.)

"The cat sat" → ["The", "cat", "sat"]
                      ↓
              [ [0.12, -0.34, ...],    ← embedding for "The"
                [0.87,  0.21, ...],    ← embedding for "cat"
                [-0.05, 0.63, ...] ]   ← embedding for "sat"
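In code, the lookup pictured above is little more than indexing into a table. The sketch below is a toy under stated assumptions: the three-word vocabulary, the 4-dimensional size, and the random values are all made up for illustration, whereas a real model learns the table during training.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary; real vocabularies are far larger.
vocab = {"The": 0, "cat": 1, "sat": 2}
d_model = 4  # toy size; real models use hundreds to thousands of dimensions

# One row per token id. Random here; learned during training in a real model.
embedding_table = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in vocab]

def embed(tokens):
    """Map tokens to ids, then ids to rows of the embedding table."""
    return [embedding_table[vocab[t]] for t in tokens]

vectors = embed(["The", "cat", "sat"])
print(len(vectors), "vectors of size", len(vectors[0]))
```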

At this point, the model has a list of vectors, one per token. But there’s a problem: these vectors don’t know what order they’re in.

Step 2: Positional encoding - giving tokens a sense of place

Transformers process all tokens in parallel, not sequentially. That’s what makes them fast, but it also means the model has no built-in notion of word order. Without extra information, “The cat sat on the mat” and “The mat sat on the cat” would look identical to the model.

Positional encodings fix this. They’re vectors added to each token embedding that encode position information. Think of it like seat numbers at a concert: the music (embedding) tells you what each instrument sounds like, and the seat number (position) tells you where it’s sitting in the orchestra.

The original paper used fixed sine/cosine patterns. Many modern models instead use learned positional embeddings or rotary positional embeddings (RoPE); RoPE encodes relative distances between tokens rather than absolute positions. The details vary, but the purpose is the same: let the model know that token 3 comes after token 2.

After this step, each token’s vector carries two pieces of information: what it means and where it sits.
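For the curious, here is a small sketch of the fixed sine/cosine scheme from the original paper, where even dimensions use a sine and odd dimensions a cosine of `position / 10000^(i / d_model)`. The function names and the toy 4-dimensional input are assumptions made for illustration; learned embeddings and RoPE work differently and are not shown.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Fixed sine/cosine encoding for one position (d_model must be even)."""
    encoding = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding

def add_positions(embeddings):
    """Add the encoding for position p to the p-th token embedding."""
    d_model = len(embeddings[0])
    return [
        [x + pe for x, pe in zip(vec, sinusoidal_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]

# The same embedding at two different positions now yields two different vectors.
same_token = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
positioned = add_positions(same_token)
print(positioned[0] != positioned[1])
```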

Step 3: Self-attention - the core innovation

This is the big one. Self-attention is the mechanism that lets every token look at every other token and decide: “How relevant are you to me?”

Here’s the analogy. Imagine you’re at a crowded party and someone says the word “bank.” To understand what they mean, you look around the conversation for context clues. If someone nearby said “river,” you think riverbank. If someone said “money,” you think financial bank. Self-attention is the model doing exactly this, for every token, simultaneously.

Mechanically, each token produces three vectors from its embedding:

  • Query (Q): “What am I looking for?”
  • Key (K): “What do I contain?”
  • Value (V): “What information do I carry?”

Every token’s Query is compared against every token’s Key (its own included) to produce an attention score, a measure of relevance. The scores are normalized so they sum to 1, then used to create a weighted blend of all the Value vectors. The result is a new representation of each token that’s been enriched by context from the entire sequence.

Token: "bank"

  Q("bank") asks: "Who's relevant to me?"
      ↓
  Compares against K("river"), K("the"), K("near"), K("money")...
      ↓
  Scores:  river=0.7,  the=0.02,  near=0.1,  money=0.18
      ↓
  Weighted sum of V("river"), V("the"), V("near"), V("money")
      ↓
  New enriched representation of "bank" (leaning toward riverbank)

The critical insight: after self-attention, the vector for “bank” is no longer just about the word “bank” in isolation. It’s been reshaped by its context. This is how transformers build understanding: not from dictionary definitions, but from relationships between tokens.
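The score-then-blend steps can be sketched in a few lines. This follows the scaled dot-product form from the original paper (dot product, divide by the square root of the vector size, softmax), but the 2-dimensional vectors are hand-picked assumptions so that “river” matches best. In a real model the Q, K, and V vectors come from learned projections of the embeddings, which are skipped here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the value vectors, weighted by relevance.
    blended = [
        sum(w * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]
    return weights, blended

# Hypothetical 2-d vectors, chosen by hand so that "river" matches the query best.
query_bank = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 1.0], [1.0, 0.0]]    # river, the, near
values = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]  # river, the, near
weights, new_bank = attend(query_bank, keys, values)
print([round(w, 2) for w in weights])
print([round(x, 2) for x in new_bank])
```

Most of the weight lands on “river,” so the new vector for “bank” is dominated by the “river” value vector.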

In decoder-only models, there’s one extra rule: causal masking. Each token can only attend to tokens that came before it (and itself), never to future tokens. This prevents the model from “cheating” during training by peeking at the tokens it is supposed to predict. It’s like writing an essay where you can re-read what you’ve already written, but you can’t skip ahead.
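One common way to implement the mask, sketched below, is to set the scores for future positions to negative infinity before the softmax so they end up with zero weight. The all-equal scores are a made-up input chosen to make the effect easy to see.

```python
import math

def causal_attention_weights(scores):
    """Row i holds token i's raw scores; hide columns j > i, then softmax."""
    weights = []
    for i, row in enumerate(scores):
        masked = [-math.inf if j > i else s for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a 3-token sequence (all equal for clarity).
scores = [[1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0],
          [1.0, 1.0, 1.0]]
for row in causal_attention_weights(scores):
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0]
# [0.5, 0.5, 0.0]
# [0.33, 0.33, 0.33]
```

The first token can only see itself, the second splits its attention across two tokens, and only the last token sees the whole sequence.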

Step 4: Multi-head attention - looking at things from multiple angles

A single attention pass captures one type of relationship. But language is layered. In the sentence “The animal didn’t cross the street because it was too tired,” the word “it” relates to “animal” syntactically, to “tired” semantically, and to “street” spatially.

Multi-head attention runs several attention operations in parallel, each with its own learned Q, K, V projections. Each “head” can specialize in a different type of relationship: one might track grammar, another coreference, another proximity.

┌──────────────────────────────────────┐
│         Multi-Head Attention         │
│                                      │
│  Head 1: grammar relationships       │
│  Head 2: coreference ("it"→"animal") │
│  Head 3: proximity/position          │
│  Head 4: semantic similarity         │
│  ...                                 │
│  Head N: (learned, not labeled)      │
│                                      │
│  All heads → concatenate → project   │
└──────────────────────────────────────┘

The outputs of all heads are concatenated and projected back down to the model’s hidden dimension. The model learns what each head should focus on during training; nobody hand-assigns “you do grammar.”

Typical models use 12 to 128 attention heads depending on size.
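The split-and-concatenate plumbing can be sketched as follows. To keep it short, this assumes Q, K, and V have already been projected, slices each vector into equal chunks (one per head), and leaves out both the learned projections and the final output projection. The helper names and the 3-token input are hypothetical.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over whole sequences (lists of vectors)."""
    scale = math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
        out.append([sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))])
    return out

def split_heads(X, n_heads):
    """Slice each d_model-sized vector into n_heads chunks of equal size."""
    head_dim = len(X[0]) // n_heads
    return [[x[h * head_dim:(h + 1) * head_dim] for x in X] for h in range(n_heads)]

def multi_head_attention(Q, K, V, n_heads):
    heads = [
        attention(q, k, v)
        for q, k, v in zip(split_heads(Q, n_heads), split_heads(K, n_heads), split_heads(V, n_heads))
    ]
    # Concatenate the heads' outputs back into one vector per token.
    return [sum((head[t] for head in heads), []) for t in range(len(Q))]

# Hypothetical input: 3 tokens, d_model = 4, split into 2 heads of size 2.
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(X, X, X, n_heads=2)
print(len(out), len(out[0]))
```

Each head only ever sees its own slice, which is what lets different heads pick up different patterns. The output has the same shape as the input: one 4-dimensional vector per token.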

Step 5: Feed-forward layers - the thinking step

After attention gathers context, each token passes through a feed-forward network (FFN): a small neural network applied to each token position separately. If attention is “gathering information from the room,” the FFN is “sitting down to think about what you heard.”

These layers are where a lot of the model’s factual knowledge is believed to be stored. They typically expand the hidden dimension by 4x, apply a nonlinear activation, then project back down. The same operation, with the same learned weights, is applied at every token position.

  Attention output (per token)
        ↓
  Expand: 4096 → 16384
        ↓
  Activation (GELU / SwiGLU)
        ↓
  Contract: 16384 → 4096
        ↓
  Feed-forward output
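Here is the expand, activate, contract pipeline as a rough sketch. The sizes are shrunk to 4 and 16 (keeping the 4x ratio), the weights are random rather than learned, and the activation is the common tanh approximation of GELU; SwiGLU is structured differently and is not shown.

```python
import math
import random

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def feed_forward(x, w_up, b_up, w_down, b_down):
    hidden = [gelu(h) for h in linear(x, w_up, b_up)]  # expand 4x, then activate
    return linear(hidden, w_down, b_down)              # contract back to d_model

random.seed(0)
d_model, d_hidden = 4, 16  # toy sizes that keep the 4x expansion ratio

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

w_up, b_up = random_matrix(d_hidden, d_model), [0.0] * d_hidden
w_down, b_down = random_matrix(d_model, d_hidden), [0.0] * d_model

# The same weights are applied to every token position independently.
sequence = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.2, 0.2, 0.2]]
outputs = [feed_forward(token, w_up, b_up, w_down, b_down) for token in sequence]
print(len(outputs), len(outputs[0]))
```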

Step 6: Layer normalization and residual connections - keeping things stable

Deep networks have a problem: as signals pass through dozens of layers, values can explode or vanish. Two mechanisms prevent this:

Residual connections add the input of each sub-layer back to its output. Instead of replacing the signal, each layer refines it. Think of it like editing a document: you don’t rewrite from scratch each time, you make incremental improvements to the existing draft.

Layer normalization rescales the values at each layer to keep them in a stable range. It’s the thermostat of the network: without it, training becomes unstable.

  Input ──────────────────────┐
    ↓                         │ (residual / skip connection)
  [ Attention or FFN ]        │
    ↓                         │
  + ◄─────────────────────────┘
    ↓
  Layer Norm
    ↓
  Output

Modern models often apply normalization before the sub-layer (pre-norm) rather than after (post-norm). The goal is the same: stability during training.
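The two orderings are easy to compare side by side. In this sketch, `layer_norm` leaves out the learned gain and bias that real implementations include, and `double` is a hypothetical stand-in for an attention or FFN sub-layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (learned gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    """Original ordering: run the sub-layer, add the residual, then normalize."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    """Pre-norm ordering: normalize first, then add the sub-layer output to x."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

# Hypothetical stand-in for attention or the FFN.
def double(v):
    return [2.0 * a for a in v]

x = [1.0, 2.0, 3.0, 4.0]
print(post_norm_block(x, double))
print(pre_norm_block(x, double))
```

Notice that in the pre-norm version the input `x` travels down the skip path untouched, so a sub-layer that outputs zeros leaves the signal exactly as it was.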

The full forward pass

Let’s put it all together. Here’s what happens when a decoder-only transformer processes a sequence:

Input tokens
    ↓
┌─────────────────────────┐
│  Token Embeddings       │
│  + Positional Encoding  │
└────────────┬────────────┘
             ↓
┌─────────────────────────┐ ─┐
│  Multi-Head Attention   │  │
│  (causal mask applied)  │  │
│  + Residual + LayerNorm │  │
├─────────────────────────┤  ├── × N layers (e.g., 32, 80, 128)
│  Feed-Forward Network   │  │
│  + Residual + LayerNorm │  │
└────────────┬────────────┘ ─┘
             ↓
┌─────────────────────────┐
│  Final LayerNorm        │
│  → Linear projection    │
│  → Softmax over vocab   │
└────────────┬────────────┘
             ↓
    Next token probabilities

The model picks the most likely next token (or samples from the distribution), appends it to the sequence, and runs the whole thing again. That’s autoregressive generation: one token at a time, each time re-attending to everything before it.
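The loop itself is short. In the sketch below, `TOY_MODEL` is a made-up lookup table standing in for a trained transformer (it only looks at the last token, which a real model would not do), and `generate` is a hypothetical helper that supports both greedy picking and sampling.

```python
import random

def generate(next_token_probs, prompt, max_new_tokens, greedy=True, seed=0):
    """Autoregressive loop: predict, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)  # dict: token -> probability
        if greedy:
            token = max(probs, key=probs.get)  # most likely token
        else:
            token = rng.choices(list(probs), weights=list(probs.values()))[0]
        tokens.append(token)
    return tokens

# Hypothetical stand-in for a trained model: a fixed lookup on the last token.
TOY_MODEL = {
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"the": 1.0},
    "mat": {"the": 1.0},
}

def toy_next_token_probs(tokens):
    return TOY_MODEL[tokens[-1]]

print(generate(toy_next_token_probs, ["the"], max_new_tokens=4))
# ['the', 'cat', 'sat', 'the', 'cat']
```

Passing `greedy=False` samples from the distribution instead, which is roughly what temperature-style decoding settings build on.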

This is also where the KV cache becomes essential. Without it, the model would recompute the Key and Value vectors for all previous tokens at every step. The KV cache stores those vectors from prior steps, so only the new token needs full computation. For more on how this plays out at scale, see LLM inference explained and continuous batching.
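A bare-bones version of the idea, for a single attention head, might look like this. The `KVCache` class and the hand-written query, key, and value triples are assumptions for illustration; production inference engines manage this memory in far more elaborate ways.

```python
import math

class KVCache:
    """Stores keys and values from earlier steps so they are computed only once."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        """Append the new token's key/value, then attend over everything cached."""
        self.keys.append(key)
        self.values.append(value)
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, cached)) / scale for cached in self.keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        return [
            sum(e / total * v[d] for e, v in zip(exps, self.values))
            for d in range(len(value))
        ]

cache = KVCache()
# Hypothetical per-token (query, key, value) triples, as if from learned projections.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [2.0, 3.0]),
    ([0.0, 1.0], [0.0, 1.0], [4.0, 5.0]),
    ([1.0, 1.0], [1.0, 1.0], [6.0, 7.0]),
]
outputs = [cache.step(q, k, v) for q, k, v in steps]
print(len(cache.keys), outputs[0])
```

Each call does work proportional to the number of cached tokens rather than redoing every earlier step. The causal mask comes for free, because the cache never contains a future token.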

Why this matters for developers

You don’t need to implement a transformer from scratch to build with LLMs. But knowing the architecture helps you reason about:

  • Context windows: why there’s a token limit (attention cost grows quadratically with sequence length)
  • Prompt engineering: why token order and phrasing matter (attention patterns are sensitive to position and wording)
  • Cost: why longer prompts cost more (more tokens = more computation per layer, per head)
  • Hallucinations: the model is always just predicting the next plausible token, not retrieving facts from a database
  • Fine-tuning: what you’re actually changing (the weights in attention projections and FFN layers)
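The quadratic point in the first bullet is easy to check with back-of-the-envelope arithmetic. The function below just counts query-key comparisons for one full pass; the layer and head counts are illustrative values, not the specification of any particular model.

```python
def attention_score_count(seq_len, n_layers, n_heads):
    """Query-key comparisons in one full forward pass, ignoring causal-mask savings."""
    return seq_len * seq_len * n_layers * n_heads

# Illustrative sizes only.
for n in (1_000, 2_000, 4_000):
    print(n, attention_score_count(n, n_layers=32, n_heads=32))
```

Doubling the prompt length quadruples the number of comparisons, which is a large part of why context windows have limits and long prompts cost more.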

The transformer is a surprisingly simple architecture: attention, feed-forward, normalize, repeat. The magic isn’t in any single component. It’s in stacking these layers deep and training on enormous amounts of text. Understanding the building blocks gives you a mental model for everything built on top of them.