Jun 17, 2026 · 7 min read

How Transformers Actually Work — A Visual Guide for Developers

Every major language model you’ve heard of — GPT, Claude, Gemini, Llama, Mistral — runs on the same fundamental architecture: the transformer. It was introduced in a 2017 paper called “Attention Is All You Need,” and it genuinely changed everything. Before transformers, sequence models like RNNs processed words one at a time, like reading a sentence through a keyhole. Transformers read the whole page at once.

If you build on top of LLMs, understanding what’s happening inside them isn’t optional curiosity — it’s the difference between debugging effectively and guessing. This post walks through the transformer architecture piece by piece, using analogies instead of equations.

The architecture at 30,000 feet

The original transformer has two halves: an encoder (reads input) and a decoder (produces output). Think of it like a translator: the encoder understands the source language, the decoder writes in the target language.

┌─────────────┐      ┌─────────────┐
│   ENCODER   │ ───► │   DECODER   │
│ (understand)│      │  (generate) │
└─────────────┘      └─────────────┘

But here’s the thing — most modern LLMs (GPT, Claude, Llama) are decoder-only. They don’t have a separate encoder. They just predict the next token, over and over, using only the decoder stack. The encoder-decoder design still shows up in models built for translation or summarization (like T5), but for general-purpose text generation, decoder-only won.

Why? Because “predict the next word” turns out to be a shockingly powerful training objective. You don’t need labeled data. You just need text — and the internet has plenty of that.

Step 1: Tokenization and embeddings

Before a transformer sees your prompt, it gets broken into tokens — chunks that might be words, subwords, or individual characters. “unhappiness” might become ["un", "happiness"]. (For the full story, see how tokenizers work.)

Each token is then converted into an embedding — a dense vector of numbers (typically 768 to 12,288 dimensions depending on model size) that captures meaning. “King” and “queen” end up as nearby vectors. “King” and “toaster” don’t. These embeddings are learned during training, not hand-crafted. (More on this in how embeddings work.)

"The cat sat" → ["The", "cat", "sat"]
                      ↓
              [ [0.12, -0.34, ...],    ← embedding for "The"
                [0.87,  0.21, ...],    ← embedding for "cat"
                [-0.05, 0.63, ...] ]   ← embedding for "sat"
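
As a rough sketch of the lookup step, the snippet below maps tokens to ids and ids to rows of an embedding table. The three-word vocabulary, the 4-dimensional vectors, and the random values are all made up for illustration; a real model uses a trained tokenizer and learned weights.

```python
import random

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = {"The": 0, "cat": 1, "sat": 2}
EMBED_DIM = 4  # illustrative only; real models use hundreds to thousands


def make_embedding_table(vocab_size, dim, seed=0):
    """Random stand-in for a learned embedding matrix (one row per token id)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def embed(tokens, vocab, table):
    """Map each token to its id, then look up that id's row in the table."""
    return [table[vocab[tok]] for tok in tokens]


table = make_embedding_table(len(VOCAB), EMBED_DIM)
vectors = embed(["The", "cat", "sat"], VOCAB, table)
print(len(vectors), len(vectors[0]))  # one vector per token, EMBED_DIM numbers each
```

The point of the sketch is that an embedding layer is just a table lookup: the "meaning" lives entirely in the numbers that training puts into the rows.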

At this point, the model has a list of vectors — one per token. But there’s a problem: these vectors don’t know what order they’re in.

Step 2: Positional encoding — giving tokens a sense of place

Transformers process all tokens in parallel, not sequentially. That’s what makes them fast, but it also means the model has no built-in notion of word order. Without extra information, “The cat sat on the mat” and “The mat sat on the cat” would look like the same collection of tokens.

Positional encodings fix this. They’re vectors added to each token embedding that encode position information. Think of it like seat numbers at a concert — the music (embedding) tells you what each instrument sounds like, and the seat number (position) tells you where it’s sitting in the orchestra.

The original paper used fixed sine/cosine patterns. Some later models (GPT-2, for example) learn a separate vector for each absolute position, while many recent models use rotary positional embeddings (RoPE), which rotate the Query and Key vectors inside attention so that scores depend on the relative distance between tokens rather than on absolute positions. The details vary, but the purpose is the same: let the model know that token 3 comes after token 2.

After this step, each token’s vector carries two pieces of information: what it means and where it sits.
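
To make the “added vector” idea concrete, here is a minimal sketch of the fixed sine/cosine scheme from the original paper. The function names and the tiny 4-dimensional embeddings are assumptions for illustration, and learned or rotary schemes work differently.

```python
import math


def sinusoidal_position(pos, dim):
    """Fixed sine/cosine encoding for a single position."""
    vec = []
    for i in range(dim):
        # Dimensions come in pairs (2k, 2k+1) that share one frequency.
        freq = 1.0 / (10000 ** ((i - i % 2) / dim))
        vec.append(math.sin(pos * freq) if i % 2 == 0 else math.cos(pos * freq))
    return vec


def add_positions(embeddings):
    """Add the position vector to each token embedding, element by element."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, sinusoidal_position(pos, dim))]
        for pos, emb in enumerate(embeddings)
    ]


# The same token twice: identical embeddings before, different vectors after.
same_token = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
with_pos = add_positions(same_token)
print(with_pos[0] != with_pos[1])
```

Two copies of the same token start out indistinguishable and come out different, which is exactly what the model needs in order to tell positions apart.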

Step 3: Self-attention — the core innovation

This is the big one. Self-attention is the mechanism that lets every token look at every other token and decide: “How relevant are you to me?”

Here’s the analogy. Imagine you’re at a crowded party and someone says the word “bank.” To understand what they mean, you look around the conversation for context clues. If someone nearby said “river,” you think riverbank. If someone said “money,” you think financial bank. Self-attention is the model doing exactly this — for every token, simultaneously.

Mechanically, each token produces three vectors from its embedding:

Query (Q): “What am I looking for?”
Key (K): “What do I contain?”
Value (V): “What information do I carry?”

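Every token’s Query is compared against the Key of every token in the sequence (itself included) by taking a dot product, which produces a raw attention score: a measure of relevance. The scores are scaled and passed through a softmax so that they are positive and sum to 1, and those weights are used to create a weighted blend of all the Value vectors. The result is a new representation of each token that’s been enriched by context from the entire sequence.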

Token: "bank"

  Q("bank") asks: "Who's relevant to me?"
      ↓
  Compares against K("river"), K("the"), K("near"), K("money")...
      ↓
  Scores:  river=0.7,  the=0.02,  near=0.1,  money=0.18
      ↓
  Weighted sum of V("river"), V("the"), V("near"), V("money")
      ↓
  New enriched representation of "bank" (leaning toward riverbank)

The critical insight: after self-attention, the vector for “bank” is no longer just about the word “bank” in isolation. It’s been reshaped by its context. This is how transformers build understanding — not from dictionary definitions, but from relationships between tokens.
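
The Query/Key/Value steps above can be sketched in a few lines of plain Python. This is a toy version under stated assumptions: the vectors for “river” and “bank” are hand-picked, whereas a real model derives Q, K, and V from learned projections of the embeddings and works on large tensors.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of per-token vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs, weight_rows = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted blend of all Value vectors, one output dimension at a time.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
        weight_rows.append(weights)
    return outputs, weight_rows


# Assumed toy vectors: token 0 is "river", token 1 is "bank".
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
bank_query = [[2.0, 0.0]]  # points in the same direction as the "river" key

outputs, weights = attention(bank_query, keys, values)
print(weights[0], outputs[0])
```

Because the query for “bank” lines up with the key for “river,” most of the weight lands on the “river” value, and the blended output leans that way.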

In decoder-only models, there’s one extra rule: causal masking. Each token can only attend to tokens that came before it (and itself), never to future tokens. This prevents the model from “cheating” during generation by peeking at the answer. It’s like writing an essay where you can re-read what you’ve already written, but you can’t skip ahead.
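
Here is a small sketch of what the causal rule does to the attention weights. It assumes the raw scores have already been computed; production code usually gets the same result by setting masked scores to negative infinity before the softmax, and the function name here is made up.

```python
import math


def causal_softmax(score_rows):
    """Row i may only use columns 0..i; every later column gets weight 0."""
    weights = []
    for i, row in enumerate(score_rows):
        visible = row[: i + 1]  # this token and everything before it
        peak = max(visible)
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        hidden = [0.0] * (len(row) - i - 1)  # future tokens are invisible
        weights.append([e / total for e in exps] + hidden)
    return weights


# Three tokens with equal raw scores, so only the mask shapes the result.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
for row in causal_softmax(scores):
    print(row)
```

The first token can only attend to itself, the second splits its attention over two tokens, and only the last token sees the whole sequence.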

Step 4: Multi-head attention — looking at things from multiple angles

A single attention pass captures one type of relationship. But language is layered. In the sentence “The animal didn’t cross the street because it was too tired,” the word “it” relates to “animal” syntactically, to “tired” semantically, and to “street” spatially.

Multi-head attention runs several attention operations in parallel, each with its own learned Q, K, V projections. Each “head” can specialize in a different type of relationship — one might track grammar, another coreference, another proximity.

┌────────────────────────────────────────┐
│          Multi-Head Attention          │
│                                        │
│  Head 1: grammar relationships         │
│  Head 2: coreference ("it"→"animal")   │
│  Head 3: proximity/position            │
│  Head 4: semantic similarity           │
│  ...                                   │
│  Head N: (learned, not labeled)        │
│                                        │
│  All heads → concatenate → project     │
└────────────────────────────────────────┘

The outputs of all heads are concatenated and projected back down to the model’s hidden dimension. The model learns what each head should focus on during training — nobody hand-assigns “you do grammar.”

Typical models use 12 to 128 attention heads depending on size.
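
The split-and-concatenate structure can be sketched as follows. This is a simplification, not a full implementation: it skips the learned Q, K, V projections and the final output projection, and simply slices each vector into equal parts so that every head attends over its own slice.

```python
import math


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def split_heads(vectors, n_heads):
    """Slice each token vector into n_heads equal parts; result[h][t] is one slice."""
    head_dim = len(vectors[0]) // n_heads
    return [[vec[h * head_dim:(h + 1) * head_dim] for vec in vectors]
            for h in range(n_heads)]


def multi_head_attention(q, k, v, n_heads):
    """Run attention separately per head, then concatenate the head outputs."""
    per_head = [attend(qh, kh, vh) for qh, kh, vh in
                zip(split_heads(q, n_heads), split_heads(k, n_heads), split_heads(v, n_heads))]
    return [[x for head in per_head for x in head[t]] for t in range(len(q))]


# Three tokens, hidden size 4, two heads of size 2 (toy numbers).
x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(x, x, x, n_heads=2)
print(len(out), len(out[0]))
```

Each head computes its own attention weights from its own slice, which is why different heads are free to pick up different relationships.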

Step 5: Feed-forward layers — the thinking step

After attention gathers context, each token passes through a feed-forward network (FFN) — a small, independent neural network applied to each token position separately. If attention is “gathering information from the room,” the FFN is “sitting down to think about what you heard.”

These layers are where a lot of the model’s factual knowledge is believed to be stored. They typically expand the hidden dimension by 4x, apply a nonlinear activation, then project back down. Every token goes through the same operation with the same shared learned weights; the only thing that differs is the input vector.

  Attention output (per token)
        ↓
  Expand: 4096 → 16384
        ↓
  Activation (GELU / SwiGLU)
        ↓
  Contract: 16384 → 4096
        ↓
  Feed-forward output
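
A minimal sketch of the expand, activate, contract pattern, using a hidden size of 8 instead of 4096 so it runs instantly. The random weights stand in for learned ones, the helper names are made up, and GELU is written with its common tanh approximation.

```python
import math
import random


def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def linear(vec, weights, bias):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def make_ffn(hidden, expansion=4, seed=0):
    """Random stand-ins for the learned up- and down-projection weights."""
    rng = random.Random(seed)
    inner = hidden * expansion
    w_up = [[rng.gauss(0.0, 0.1) for _ in range(hidden)] for _ in range(inner)]
    w_down = [[rng.gauss(0.0, 0.1) for _ in range(inner)] for _ in range(hidden)]
    return w_up, [0.0] * inner, w_down, [0.0] * hidden


def ffn(token_vec, params):
    w_up, b_up, w_down, b_down = params
    expanded = linear(token_vec, w_up, b_up)    # hidden -> 4 * hidden
    activated = [gelu(x) for x in expanded]     # nonlinearity
    return linear(activated, w_down, b_down)    # 4 * hidden -> hidden


params = make_ffn(hidden=8)
tokens = [[0.1] * 8, [0.1] * 8, [-0.3] * 8]
outputs = [ffn(t, params) for t in tokens]  # same weights at every position
print(outputs[0] == outputs[1])
```

Identical inputs at different positions give identical outputs, because the FFN never looks at neighboring tokens; all mixing across positions happens in attention.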

Step 6: Layer normalization and residual connections — keeping things stable

Deep networks have a problem: as signals pass through dozens of layers, values can explode or vanish. Two mechanisms prevent this:

Residual connections add the input of each sub-layer back to its output. Instead of replacing the signal, each layer refines it. Think of it like editing a document — you don’t rewrite from scratch each time, you make incremental improvements to the existing draft.

Layer normalization rescales the values at each layer to keep them in a stable range. It’s the thermostat of the network — without it, training becomes chaotic.

  Input ──────────────────────┐
    ↓                         │ (residual / skip connection)
  [ Attention or FFN ]        │
    ↓                         │
  + ◄─────────────────────────┘
    ↓
  Layer Norm
    ↓
  Output

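The diagram shows the original post-norm arrangement, with normalization after the residual addition. Most modern LLMs instead apply normalization before the sub-layer (pre-norm), which tends to make very deep stacks easier to train. The goal is the same either way: stability during training.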
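
The two orderings are easy to compare in code. The sketch below leaves out layer norm’s learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def post_norm_block(x, sublayer):
    """Original ordering: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_norm_block(x, sublayer):
    """Pre-norm ordering: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(vec):
    """Hypothetical stand-in for attention or the feed-forward network."""
    return [2.0 * v for v in vec]


x = [1.0, 2.0, 3.0, 4.0]
post = post_norm_block(x, toy_sublayer)
pre = pre_norm_block(x, toy_sublayer)
print(post)
print(pre)
```

In the pre-norm version the input `x` passes through the addition untouched, so the residual path stays a clean running sum from the first layer to the last.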

The full forward pass

Let’s put it all together. Here’s what happens when a decoder-only transformer processes a sequence:

Input tokens
    ↓
┌─────────────────────────┐
│  Token Embeddings       │
│  + Positional Encoding  │
└────────────┬────────────┘
             ↓
┌─────────────────────────┐ ─┐
│  Multi-Head Attention   │  │
│  (causal mask applied)  │  │
│  + Residual + LayerNorm │  │
├─────────────────────────┤  ├── × N layers (e.g., 32, 80, 128)
│  Feed-Forward Network   │  │
│  + Residual + LayerNorm │  │
└────────────┬────────────┘ ─┘
             ↓
┌─────────────────────────┐
│  Final LayerNorm        │
│  → Linear projection    │
│  → Softmax over vocab   │
└────────────┬────────────┘
             ↓
    Next token probabilities

The model picks the most likely next token (or samples from the distribution), appends it to the sequence, and runs the whole thing again. That’s autoregressive generation — one token at a time, each time re-attending to everything before it.
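
The loop itself is short. In this sketch, `next_token_probs` is a hypothetical callable standing in for the whole model, and `toy_model` is a hand-written lookup table rather than anything learned; the `<eos>` stop token is also an assumption.

```python
import random


def generate(next_token_probs, prompt, max_new_tokens, greedy=True, seed=0):
    """Autoregressive loop: predict, append, repeat.

    next_token_probs takes the tokens so far and returns {token: probability}.
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        if greedy:
            choice = max(probs, key=probs.get)  # most likely token
        else:
            choice = rng.choices(list(probs), weights=list(probs.values()))[0]
        if choice == "<eos>":  # assumed end-of-sequence marker
            break
        tokens.append(choice)
    return tokens


def toy_model(tokens):
    """Lookup keyed on the last token only; a real model attends to all of them."""
    table = {
        "The": {"cat": 0.7, "mat": 0.3},
        "cat": {"sat": 0.9, "<eos>": 0.1},
        "sat": {"<eos>": 1.0},
        "mat": {"<eos>": 1.0},
    }
    return table[tokens[-1]]


print(generate(toy_model, ["The"], max_new_tokens=5))  # ['The', 'cat', 'sat']
```

Switching `greedy=False` samples from the distribution instead, which is roughly what temperature and top-p settings control in a real API.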

This is also where the KV cache becomes essential. Without it, the model would recompute the Key and Value vectors for every previous token at every step. The KV cache stores the Key and Value vectors from prior steps so that only the new token needs full computation. For more on how this plays out at scale, see LLM inference explained and continuous batching.
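
A minimal single-head sketch of the idea, with a made-up `KVCache` class and hand-picked vectors. Real inference engines keep one cache per layer and per head and store it as tensors on the accelerator.

```python
import math


class KVCache:
    """Keeps the Key and Value vectors of every token processed so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def attend(self, query, key, value):
        """Store the new token's K/V, then attend from its query over the cache."""
        self.keys.append(key)
        self.values.append(value)
        scale = math.sqrt(len(key))
        scores = [sum(q * k for q, k in zip(query, ks)) / scale for ks in self.keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        return [sum(w * v[j] for w, v in zip(weights, self.values))
                for j in range(len(value))]


cache = KVCache()
# One (query, key, value) triple per generation step; values are illustrative.
steps = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
    ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0]),
]
for q, k, v in steps:
    out = cache.attend(q, k, v)
print(len(cache.keys), out)
```

Each step only computes a Query, Key, and Value for the newest token. The cache holds nothing but earlier tokens, so the causal rule is satisfied without an explicit mask, and the price is memory that grows with every generated token.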

Why this matters for developers

You don’t need to implement a transformer from scratch to build with LLMs. But knowing the architecture helps you reason about:

Context windows — why there’s a token limit (attention is quadratic in sequence length)
Prompt engineering — why token order and phrasing matter (attention patterns are sensitive to position and wording)
Cost — why longer prompts cost more (more tokens = more computation per layer, per head)
Hallucinations — the model is always just predicting the next plausible token, not retrieving facts from a database
Fine-tuning — what you’re actually changing (the weights in attention projections and FFN layers)
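
To put a number on the context-window and cost points, this back-of-the-envelope sketch counts how many query-key scores a full forward pass computes. The layer and head counts are illustrative, not taken from any specific model, and the count ignores causal masking and caching.

```python
def attention_score_entries(seq_len, n_layers, n_heads):
    """Query-key scores in one full forward pass (mask and cache ignored)."""
    return seq_len * seq_len * n_layers * n_heads


base = attention_score_entries(1_000, n_layers=32, n_heads=32)
for n in (1_000, 2_000, 4_000):
    entries = attention_score_entries(n, n_layers=32, n_heads=32)
    print(n, entries, entries // base)
```

Doubling the prompt length quadruples the attention work, which is the practical meaning of “quadratic in sequence length.”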

The transformer is a surprisingly simple architecture — attention, feed-forward, normalize, repeat. The magic isn’t in any single component. It’s in stacking these layers deep and training on enormous amounts of text. Understanding the building blocks gives you a mental model for everything built on top of them.
