Every major language model you've heard of (GPT, Claude, Gemini, Llama, Mistral) runs on the same fundamental architecture: the transformer. It was introduced in a 2017 paper called "Attention Is All You Need," and it quickly became the default architecture for language modeling. Before transformers, sequence models like RNNs processed words one at a time, like reading a sentence through a keyhole. Transformers read the whole page at once.
If you build on top of LLMs, understanding what's happening inside them isn't optional curiosity: it's the difference between debugging effectively and guessing. This post walks through the transformer architecture piece by piece, using analogies instead of equations.
The architecture at 30,000 feet
The original transformer has two halves: an encoder (reads input) and a decoder (produces output). Think of it like a translator: the encoder understands the source language, the decoder writes in the target language.
┌─────────────┐      ┌─────────────┐
│   ENCODER   │ ───► │   DECODER   │
│ (understand)│      │ (generate)  │
└─────────────┘      └─────────────┘
But here's the thing: most modern LLMs (GPT, Claude, Llama) are decoder-only. They don't have a separate encoder. They just predict the next token, over and over, using only the decoder stack. The encoder-decoder design still shows up in models built for translation or summarization (like T5), but for general-purpose text generation, decoder-only won.
Why? Because "predict the next word" turns out to be a shockingly powerful training objective. You don't need labeled data. You just need text, and the internet has plenty of that.
Step 1: Tokenization and embeddings
Before a transformer sees your prompt, the prompt is broken into tokens: chunks that might be words, subwords, or individual characters. "unhappiness" might become ["un", "happiness"]. (For the full story, see how tokenizers work.)
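To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary is made up for illustration, and production tokenizers (BPE and similar) learn their vocabulary from data and use different matching rules, so treat this as a toy rather than how any real model tokenizes.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            tokens.append(word[start])
            start += 1
    return tokens

# Hypothetical vocabulary, chosen only to reproduce the example in the text.
TOY_VOCAB = {"un", "happiness", "happy", "ness", "cat", "sat"}

print(tokenize("unhappiness", TOY_VOCAB))  # ['un', 'happiness']
```

The character fallback is what lets a tokenizer handle words it has never seen: anything can be spelled out from smaller pieces.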
Each token is then converted into an embedding: a dense vector of numbers (typically 768 to 12,288 dimensions depending on model size) that captures meaning. "King" and "queen" end up as nearby vectors. "King" and "toaster" don't. These embeddings are learned during training, not hand-crafted. (More on this in how embeddings work.)
"The cat sat" → ["The", "cat", "sat"]
        ↓
[ [0.12, -0.34, ...],   ← embedding for "The"
  [0.87, 0.21, ...],    ← embedding for "cat"
  [-0.05, 0.63, ...] ]  ← embedding for "sat"
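As a rough illustration of "nearby vectors," the sketch below looks tokens up in a small table and compares them with cosine similarity. The three-dimensional values are invented for the example; real embeddings are learned and far larger.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
EMBEDDINGS = {
    "king":    [0.90, 0.80, 0.10],
    "queen":   [0.85, 0.90, 0.15],
    "toaster": [0.10, -0.20, 0.95],
}

def embed(tokens):
    """Look up one vector per token."""
    return [EMBEDDINGS[token] for token in tokens]

def cosine_similarity(a, b):
    """1.0 means same direction, 0.0 means unrelated directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

king, queen, toaster = embed(["king", "queen", "toaster"])
print(round(cosine_similarity(king, queen), 3))
print(round(cosine_similarity(king, toaster), 3))
```

With these made-up numbers, the first similarity comes out close to 1 and the second close to 0, which is the geometric version of "related" versus "unrelated."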
At this point, the model has a list of vectors, one per token. But there's a problem: these vectors don't know what order they're in.
Step 2: Positional encoding, giving tokens a sense of place
Transformers process all tokens in parallel, not sequentially. That's what makes them fast, but it also means the model has no built-in notion of word order. "The cat sat on the mat" and "The mat sat on the cat" would look identical.
Positional encodings fix this. They're vectors added to each token embedding that encode position information. Think of it like seat numbers at a concert: the music (embedding) tells you what each instrument sounds like, and the seat number (position) tells you where it's sitting in the orchestra.
The original paper used fixed sine/cosine patterns. Some later models instead learn a vector for each absolute position, and many modern models use rotary positional embeddings (RoPE), which rotate the query and key vectors so that attention scores depend on the relative distance between tokens rather than on absolute positions. The details vary, but the purpose is the same: let the model know that token 3 comes after token 2.
After this step, each token's vector carries two pieces of information: what it means and where it sits.
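For the fixed sine/cosine scheme from the original paper, a small sketch looks like this. It uses plain Python lists to stay dependency-free, and the function names are my own; learned encodings and RoPE work differently and are not shown.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Fixed sine/cosine position vector; dims 2k and 2k+1 share a frequency."""
    encoding = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(embeddings):
    """Add the position vector for index `pos` to the embedding at `pos`."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vector, sinusoidal_encoding(pos, d_model))]
        for pos, vector in enumerate(embeddings)
    ]

# The same token embedding at two positions ends up as two different vectors.
same_token_twice = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
first, second = add_positions(same_token_twice)
print(first != second)  # True
```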
Step 3: Self-attention, the core innovation
This is the big one. Self-attention is the mechanism that lets every token look at every other token and decide: "How relevant are you to me?"
Here's the analogy. Imagine you're at a crowded party and someone says the word "bank." To understand what they mean, you look around the conversation for context clues. If someone nearby said "river," you think riverbank. If someone said "money," you think financial bank. Self-attention is the model doing exactly this, for every token, simultaneously.
Mechanically, each token produces three vectors from its embedding:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I carry?"
Every token's Query is compared against every token's Key (including its own) with a dot product to produce an attention score, a measure of relevance. The scores are normalized with a softmax so they sum to 1, then used to create a weighted blend of all the Value vectors. The result is a new representation of each token that's been enriched by context from the entire sequence.
Token: "bank"
Q("bank") asks: "Who's relevant to me?"
        ↓
Compares against K("river"), K("the"), K("near"), K("money")...
        ↓
Scores: river=0.7, the=0.02, near=0.1, money=0.18
        ↓
Weighted sum of V("river"), V("the"), V("near"), V("money")
        ↓
New enriched representation of "bank" (leaning toward riverbank)
The critical insight: after self-attention, the vector for "bank" is no longer just about the word "bank" in isolation. It's been reshaped by its context. This is how transformers build understanding: not from dictionary definitions, but from relationships between tokens.
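The query/key/value mechanics can be sketched for a single query as scaled dot-product attention. The two-dimensional vectors below are invented stand-ins for the projected vectors of "river", "the", "near", and "money"; a real model produces them with learned projection matrices.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    # Blend the value vectors, one output dimension at a time.
    blended = [
        sum(w * value[j] for w, value in zip(weights, values))
        for j in range(len(values[0]))
    ]
    return weights, blended

# Made-up vectors for the context tokens "river", "the", "near", "money".
keys = [[2.0, 0.0], [0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
query_bank = [1.0, 0.0]

weights, new_bank = attend(query_bank, keys, values)
print([round(w, 2) for w in weights])
```

The key for "river" has the largest dot product with the query, so it receives the largest weight and its value dominates the blended result.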
In decoder-only models, there's one extra rule: causal masking. Each token can only attend to tokens that came before it (and itself), never to future tokens. This prevents the model from "cheating" during training by peeking at the token it is supposed to predict, and it matches generation, where future tokens don't exist yet. It's like writing an essay where you can re-read what you've already written, but you can't skip ahead.
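One common way to implement the mask is to set the scores for future positions to negative infinity before the softmax, which drives their weights to exactly zero. The sketch below assumes the queries and keys are already computed and only returns the attention weights.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(queries, keys):
    """Row i holds the weights token i places on tokens 0..n-1."""
    n = len(queries)
    d_k = len(queries[0])
    rows = []
    for i in range(n):
        scores = []
        for j in range(n):
            if j > i:
                scores.append(NEG_INF)  # future token: masked out
            else:
                dot = sum(q * k for q, k in zip(queries[i], keys[j]))
                scores.append(dot / math.sqrt(d_k))
        rows.append(softmax(scores))
    return rows

# Made-up vectors for a three-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(tokens, tokens)
for row in weights:
    print([round(w, 2) for w in row])
```

The printed matrix is lower-triangular: the first token can only attend to itself, the second to the first two, and so on.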
Step 4: Multi-head attention, looking at things from multiple angles
A single attention pass captures one type of relationship. But language is layered. In the sentence "The animal didn't cross the street because it was too tired," the word "it" relates to "animal" syntactically, to "tired" semantically, and to "street" spatially.
Multi-head attention runs several attention operations in parallel, each with its own learned Q, K, V projections. Each "head" can specialize in a different type of relationship: one might track grammar, another coreference, another proximity.
┌────────────────────────────────────────┐
│          Multi-Head Attention          │
│                                        │
│  Head 1: grammar relationships         │
│  Head 2: coreference ("it"→"animal")   │
│  Head 3: proximity/position            │
│  Head 4: semantic similarity           │
│  ...                                   │
│  Head N: (learned, not labeled)        │
│                                        │
│  All heads → concatenate → project     │
└────────────────────────────────────────┘
The outputs of all heads are concatenated and projected back down to the model's hidden dimension. The model learns what each head should focus on during training; nobody hand-assigns "you do grammar."
Typical models use 12 to 128 attention heads depending on size.
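The split-then-concatenate flow can be sketched as below. To keep it short, this assumes Q, K, and V have already been projected and slices them into heads, and it leaves out the learned output projection and the causal mask. Those omissions are simplifications of mine, not how a full layer is built.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over a whole sequence (lists of vectors)."""
    d_k = len(Q[0])
    output = []
    for q in Q:
        weights = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        )
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output

def split_heads(X, n_heads):
    """Slice each token vector into n_heads equal chunks, grouped by head."""
    d_head = len(X[0]) // n_heads
    return [
        [row[h * d_head:(h + 1) * d_head] for row in X]
        for h in range(n_heads)
    ]

def multi_head_attention(Q, K, V, n_heads):
    heads = [
        attention(q, k, v)
        for q, k, v in zip(
            split_heads(Q, n_heads), split_heads(K, n_heads), split_heads(V, n_heads)
        )
    ]
    # Concatenate the per-head outputs back into one vector per token.
    return [sum((head[t] for head in heads), []) for t in range(len(Q))]

# Made-up 4-dimensional vectors for a three-token sequence, split into 2 heads.
X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(X, X, X, n_heads=2)
print(len(out), len(out[0]))  # 3 4
```

Each head only ever sees its own slice of the vectors, which is why different heads are free to pick up different relationships.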
Step 5: Feed-forward layers, the thinking step
After attention gathers context, each token passes through a feed-forward network (FFN): a small, independent neural network applied to each token position separately. If attention is "gathering information from the room," the FFN is "sitting down to think about what you heard."
These layers are where a lot of the model's factual knowledge is believed to be stored. They typically expand the hidden dimension by about 4x, apply a nonlinear activation, then project back down. The same operation, with the same learned weights, is applied to every token position independently.
Attention output (per token)
        ↓
Expand: 4096 → 16384
        ↓
Activation (GeLU / SwiGLU)
        ↓
Contract: 16384 → 4096
        ↓
Feed-forward output
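A scaled-down sketch of the expand, activate, contract pattern is shown here with a hidden size of 4 and an inner size of 16 instead of 4096 and 16384. The weights are random placeholders, and the activation is the common tanh approximation of GELU; SwiGLU variants add a gating branch that is not shown.

```python
import math
import random

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, weights, bias):
    """Dense layer; `weights` has one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def feed_forward(x, w_up, b_up, w_down, b_down):
    hidden = [gelu(h) for h in linear(x, w_up, b_up)]  # expand + activate
    return linear(hidden, w_down, b_down)              # contract

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# Placeholder weights; a trained model would have learned these.
rng = random.Random(0)
d_model, d_ff = 4, 16
w_up, b_up = random_matrix(d_ff, d_model, rng), [0.0] * d_ff
w_down, b_down = random_matrix(d_model, d_ff, rng), [0.0] * d_model

token = [0.5, -1.0, 0.25, 2.0]
print(len(feed_forward(token, w_up, b_up, w_down, b_down)))  # 4
```

Note that `feed_forward` takes a single token vector: there is no mixing across positions here, which is the job attention already did.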
Step 6: Layer normalization and residual connections, keeping things stable
Deep networks have a problem: as signals pass through dozens of layers, values can explode or vanish. Two mechanisms prevent this:
Residual connections add the input of each sub-layer back to its output. Instead of replacing the signal, each layer refines it. Think of it like editing a document: you don't rewrite from scratch each time, you make incremental improvements to the existing draft.
Layer normalization rescales the values at each layer to keep them in a stable range. It's the thermostat of the network: without it, training becomes chaotic.
Input ──────────────────┐
  │                     │  (residual / skip connection)
[ Attention or FFN ]    │
  │                     │
  + ◄───────────────────┘
  │
Layer Norm
  │
Output
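The diagram above shows the original post-norm arrangement, where normalization follows the residual addition. Most modern LLMs instead apply normalization before the sub-layer (pre-norm), which tends to make deep stacks easier to train. Either way, the goal is stability during training.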
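Both stabilizers, and the two orderings, fit in a few lines. This sketch omits the learned scale and shift parameters that real layer normalization carries, and `sublayer` is a stand-in for either attention or the FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]

def post_norm_block(x, sublayer):
    """Original ordering: sub-layer, add the residual, then normalize."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    """Pre-norm ordering: normalize, sub-layer, then add the residual."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

# Hypothetical sub-layer that just doubles its input.
def double(v):
    return [2.0 * value for value in v]

x = [1.0, 2.0, 3.0, 4.0]
print([round(v, 3) for v in post_norm_block(x, double)])
print([round(v, 3) for v in pre_norm_block(x, double)])
```

In the pre-norm version the input `x` reaches the output untouched through the residual path, which is one intuition for why it is gentler on very deep stacks.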
The full forward pass
Let's put it all together. Here's what happens when a decoder-only transformer processes a sequence:
       Input tokens
             │
┌─────────────────────────┐
│ Token Embeddings        │
│ + Positional Encoding   │
└────────────┬────────────┘
             │
┌─────────────────────────┐  ─┐
│ Multi-Head Attention    │   │
│ (causal mask applied)   │   │
│ + Residual + LayerNorm  │   │
├─────────────────────────┤   ├── × N layers (e.g., 32, 80, 128)
│ Feed-Forward Network    │   │
│ + Residual + LayerNorm  │   │
└────────────┬────────────┘  ─┘
             │
┌─────────────────────────┐
│ Final LayerNorm         │
│ → Linear projection     │
│ → Softmax over vocab    │
└────────────┬────────────┘
             │
  Next token probabilities
The model picks the most likely next token (or samples from the distribution), appends it to the sequence, and runs the whole thing again. That's autoregressive generation: one token at a time, each time re-attending to everything before it.
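The generation loop itself is short. In the sketch below, `toy_model` is a hypothetical stand-in for a full transformer forward pass: it returns one logit per vocabulary entry and simply favors "previous token plus one." Only the greedy (most likely token) strategy is shown; sampling would draw from `probs` instead.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(next_token_logits, prompt, max_new_tokens, eos_token=None):
    """Greedy autoregressive decoding over integer token ids."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = softmax(next_token_logits(tokens))  # forward pass on everything so far
        next_token = probs.index(max(probs))        # pick the most likely token
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens

def toy_model(tokens):
    """Hypothetical model with a 5-token vocabulary."""
    vocab_size = 5
    favored = (tokens[-1] + 1) % vocab_size
    return [3.0 if t == favored else 0.0 for t in range(vocab_size)]

print(generate(toy_model, [0], max_new_tokens=4))  # [0, 1, 2, 3, 4]
```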
This is also where the KV cache becomes essential. Without it, the model would recompute attention for all previous tokens at every step. The KV cache stores the Key and Value vectors from prior steps so only the new token needs full computation. For more on how this plays out at scale, see LLM inference explained and continuous batching.
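A minimal sketch of the caching idea, for a single attention head: each generation step supplies only the new token's query, key, and value, and the cache keeps every earlier key and value around. The class name and method are illustrative, not any library's API.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class KVCache:
    """Stores keys and values from earlier steps for one attention head."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, query, key, value):
        """Cache the new token's key/value, then attend over everything cached."""
        self.keys.append(key)
        self.values.append(value)
        d_k = len(query)
        weights = softmax(
            [sum(q * k for q, k in zip(query, cached)) / math.sqrt(d_k)
             for cached in self.keys]
        )
        return [
            sum(w * v[j] for w, v in zip(weights, self.values))
            for j in range(len(value))
        ]

# Made-up vectors for three generation steps.
cache = KVCache()
first = cache.step([1.0, 0.0], [1.0, 0.0], [5.0, 7.0])
cache.step([0.0, 1.0], [0.0, 1.0], [1.0, 1.0])
cache.step([1.0, 1.0], [1.0, 1.0], [0.0, 2.0])
print(first, len(cache.keys))  # [5.0, 7.0] 3
```

The trade-off is memory: the cache grows by one key and one value per token, per head, per layer, which is why long contexts are expensive to serve.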
Why this matters for developers
You don't need to implement a transformer from scratch to build with LLMs. But knowing the architecture helps you reason about:
- Context windows: why there's a token limit (attention is quadratic in sequence length)
- Prompt engineering: why token order and phrasing matter (attention patterns are sensitive to position and wording)
- Cost: why longer prompts cost more (more tokens = more computation per layer, per head)
- Hallucinations: the model is always just predicting the next plausible token, not retrieving facts from a database
- Fine-tuning: what you're actually changing (the weights in attention projections and FFN layers)
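The "quadratic in sequence length" point is easy to check with back-of-the-envelope arithmetic. The sketch below counts query-key comparisons for one full pass, using a hypothetical model with 32 layers and 32 heads and ignoring the savings from causal masking and from optimized attention kernels.

```python
def attention_scores(seq_len, n_layers, n_heads):
    """Query-key comparisons for one pass: every token against every token."""
    return seq_len * seq_len * n_layers * n_heads

# Hypothetical model shape: 32 layers, 32 heads per layer.
for n in (1_000, 2_000, 4_000):
    print(n, attention_scores(n, 32, 32))
```

Doubling the prompt length quadruples the count, which is the basic reason context length is a hard budget rather than a soft one.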
The transformer is a surprisingly simple architecture: attention, feed-forward, normalize, repeat. The magic isn't in any single component. It's in stacking these layers deep and training on enormous amounts of text. Understanding the building blocks gives you a mental model for everything built on top of them.