How Tokenizers Work — Why 'strawberry' Has 3 Tokens


You type “strawberry” into ChatGPT. One word. Ten letters. But the model doesn’t see one word — it sees three tokens:

"strawberry" → ['str', 'aw', 'berry']

Why? Because AI models don’t read text the way you do. They read tokens. And the tool that chops your text into those tokens — the tokenizer — quietly shapes everything from how much your API call costs to whether your prompt fits in the context window.

Let’s crack it open.

What Is a Token, Really?

A token is the smallest unit of text a language model works with. It’s not a word. It’s not a character. It’s somewhere in between — a chunk that the model learned to recognize during training.

Here’s how different inputs get tokenized (using GPT-4’s tokenizer, cl100k_base):

"hello"          → ['hello']           — 1 token
"Hello!"         → ['Hello', '!']      — 2 tokens
"strawberry"     → ['str', 'aw', 'berry'] — 3 tokens
"unbelievable"   → ['un', 'believ', 'able'] — 3 tokens
"🎉"             → ['🎉']              — 1 token
"    "           → ['    ']            — 1 token (a run of spaces is merged, but it still isn't free!)

Notice the pattern: common words stay whole, uncommon words get split, and whitespace eats tokens too. The tokenizer is optimizing for the most efficient representation of the text it was trained on.

BPE: The Algorithm Behind the Magic

Most modern tokenizers use Byte Pair Encoding (BPE). The name sounds intimidating, but the idea is beautifully simple.

Imagine you’re starting with individual characters:

Step 0: ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']

BPE looks at your entire training corpus and asks: which two adjacent characters appear together most often? Let’s say it’s r + r. Merge them:

Step 1: ['s', 't', 'r', 'a', 'w', 'b', 'e', 'rr', 'y']

Now repeat. Maybe b + e is next most common:

Step 2: ['s', 't', 'r', 'a', 'w', 'be', 'rr', 'y']

Then be + rr → berr, then berr + y → berry, then st + r → str, and so on.
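
The merge loop above can be sketched in a few lines of Python. This is a toy trainer under simplifying assumptions: it works on characters rather than bytes, skips pre-tokenization, and breaks ties by first occurrence. The function name `learn_bpe_merges` and the tiny corpus are made up for illustration.

```python
from collections import Counter

def learn_bpe_merges(corpus_words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Represent every word as a tuple of symbols, starting from single characters.
    word_counts = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite each word, fusing every occurrence of the winning pair.
        new_counts = Counter()
        for symbols, count in word_counts.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_counts[tuple(merged)] += count
        word_counts = new_counts
    return merges

# Hypothetical mini-corpus: "berry" is frequent, so its pieces merge first.
toy_corpus = ["berry", "berry", "berry", "strawberry", "blueberry", "straw"]
merges = learn_bpe_merges(toy_corpus, num_merges=4)
print(merges)
```

On this corpus the four learned merges assemble "berry" piece by piece, because every pair inside it is more frequent than any pair in "straw" or "blue".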

After thousands of merges, you end up with a vocabulary — a fixed set of tokens the model knows. GPT-4 uses roughly 100,000 tokens in its vocabulary. Claude 3.5 uses about 100,000 as well.

The key insight: BPE builds tokens from the bottom up based on frequency. Common words like “the” become single tokens. Rare words like “defenestration” get split into pieces. This is why “strawberry” — despite being a normal English word — gets split. The subword chunks str, aw, and berry were each more statistically useful across the training data than keeping “strawberry” as one token.
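
To see how a fixed merge table produces those chunks, here is a minimal encoder sketch. It replays merges in the order they were learned, which is the usual BPE convention. The merge table below is hypothetical and hand-written to mirror the walkthrough; it is not taken from any real tokenizer.

```python
def bpe_encode(word, merges):
    """Split `word` into subword tokens by replaying merge rules in priority order."""
    # Earlier merges have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table, loosely following the walkthrough above.
toy_merges = [("r", "r"), ("b", "e"), ("be", "rr"), ("berr", "y"),
              ("s", "t"), ("st", "r"), ("a", "w")]

print(bpe_encode("strawberry", toy_merges))  # ['str', 'aw', 'berry']
print(bpe_encode("raspberry", toy_merges))   # ['r', 'a', 's', 'p', 'berry']
```

Note how "raspberry" reuses the learned `berry` chunk but falls back to single characters for the part the table has never seen.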

Why Different Models Tokenize Differently

Here’s the same sentence through different tokenizers:

Input: "Let's tokenize this sentence!"

GPT-4 (cl100k):    ['Let', "'s", ' token', 'ize', ' this', ' sentence', '!']  — 7 tokens
Claude 3.5:         ['Let', "'s", ' tokenize', ' this', ' sentence', '!']      — 6 tokens
LLaMA 3:           ['Let', "'s", ' token', 'ize', ' this', ' sentence', '!']  — 7 tokens

Tokenizers differ for a few reasons:

  • Different training data — a tokenizer trained on more code will have tokens like func and => as single units
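  • Different vocabulary sizes — bigger vocab = fewer tokens per text, but a larger embedding table and output layer for the model to store
  • Different pre-processing — Google’s SentencePiece treats the input as a raw character stream with whitespace as an ordinary symbol (no pre-tokenization), while OpenAI’s tiktoken splits on regex patterns first and then runs byte-level BPE inside each piece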

This matters more than you’d think. When a model’s context window is “128K tokens,” the actual amount of text that fits depends entirely on which tokenizer is counting.
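
The regex pre-splitting step mentioned above is easy to picture with a small sketch. The pattern here is a simplified stand-in written for this article, not the pattern tiktoken actually ships (the real ones are longer and use Unicode property classes). It only shows the idea: cut the text into word-like pieces, with the leading space attached, before any merging happens.

```python
import re

# Simplified, assumed pattern in the spirit of GPT-2-style pre-tokenization.
PRETOKENIZE = re.compile(
    r"'(?:s|t|re|ve|m|ll|d)"   # common English contractions
    r"| ?[^\W\d_]+"            # optional leading space + letters
    r"| ?\d+"                  # optional leading space + digits
    r"| ?(?:[^\s\w]|_)+"       # optional leading space + punctuation/symbols
    r"|\s+"                    # any remaining whitespace
)

def pretokenize(text):
    """Split text into pieces; BPE merges would then run inside each piece."""
    return PRETOKENIZE.findall(text)

pieces = pretokenize("Let's tokenize this sentence!")
print(pieces)  # ['Let', "'s", ' tokenize', ' this', ' sentence', '!']
```

Because merges never cross piece boundaries, the choice of pattern alone can change the final token count for the same text.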

Why Token Count Matters (Your Wallet Cares)

Tokens are the currency of the LLM world. They affect three things directly:

💰 Pricing — API providers charge per token. At GPT-4o’s rate of $2.50/1M input tokens, inefficient tokenization literally costs more money. The same English text might be 1,000 tokens in one model and 1,300 in another.

📏 Context limits — A 128K context window sounds huge, but if your tokenizer is verbose, you fit less actual text. Code tends to tokenize poorly (lots of special characters), which is why code-heavy prompts fill up faster than prose.

⚡ Speed — More tokens = more computation during inference. Each token passes through every layer of the model and gets stored in the KV cache. Fewer tokens means faster responses and less VRAM usage.
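
To put numbers on the pricing point, here is a back-of-the-envelope sketch using the $2.50 per 1M input tokens quoted above. The request volume is a made-up figure, and applying one price to both token counts is a simplification, since different models usually have different rates.

```python
PRICE_PER_MILLION_INPUT_TOKENS = 2.50  # USD, the GPT-4o input rate quoted above

def input_cost_usd(token_count, price_per_million=PRICE_PER_MILLION_INPUT_TOKENS):
    """Cost in USD of sending `token_count` input tokens."""
    return token_count / 1_000_000 * price_per_million

requests_per_month = 1_000_000  # hypothetical traffic
for label, tokens in [("compact tokenizer", 1_000), ("verbose tokenizer", 1_300)]:
    monthly = input_cost_usd(tokens * requests_per_month)
    print(f"{label}: {tokens} tokens/request -> ${monthly:,.2f}/month")
# compact tokenizer: 1000 tokens/request -> $2,500.00/month
# verbose tokenizer: 1300 tokens/request -> $3,250.00/month
```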

Want to check token counts before sending a prompt? Use a token counter tool — it’ll save you from context window surprises.

tiktoken: OpenAI’s Tokenizer

OpenAI open-sourced their tokenizer library, tiktoken, and it’s become the go-to tool for counting tokens:

import tiktoken

# "gpt-4" resolves to the cl100k_base encoding; the IDs below are specific to it
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("strawberry")
print(tokens)        # [496, 675, 15717]
print(len(tokens))   # 3

# Decode back to see the pieces
for t in tokens:
    print(enc.decode([t]))
# 'str'
# 'aw'
# 'berry'

It’s fast (the core is written in Rust), and it gives you exact token counts for OpenAI models without calling the API. Different models use different encodings: GPT-4o uses o200k_base, while GPT-4 used cl100k_base, so the same string can split differently between them.

Anthropic’s New Tokenizer in Claude Opus 4.7

Anthropic recently overhauled their tokenizer for Claude Opus 4.7, and the numbers are significant: the new tokenizer produces roughly 35% more tokens for the same input compared to Claude 3.5 Sonnet.

Why would they increase token count? A few reasons:

  • Finer granularity — smaller tokens give the model more precise control over generation, especially for code and multilingual text
  • Better multilingual support — previous tokenizers were heavily English-optimized; splitting into smaller units helps non-Latin scripts
  • Improved reasoning — more tokens per “thought” means more compute per concept, which can improve quality on complex tasks

The tradeoff is real though. That 200K context window in Opus 4.7 holds less raw text than the same window in Claude 3.5 Sonnet. A document that was 50,000 tokens before might now be 67,500 tokens. Plan accordingly.
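
The arithmetic behind that warning, as a small sketch. It takes the roughly 35% figure above at face value and assumes the inflation is uniform across your text, which real documents will not follow exactly. The function names are illustrative.

```python
def inflated_token_count(old_tokens, inflation=0.35):
    """Token count once the tokenizer emits `inflation` more tokens for the same text."""
    return round(old_tokens * (1 + inflation))

def effective_window(window_tokens, inflation=0.35):
    """Context window expressed in old-tokenizer tokens."""
    return int(window_tokens / (1 + inflation))

print(inflated_token_count(50_000))  # 67500
print(effective_window(200_000))     # 148148
```

In other words, a 200K window under the new tokenizer holds about as much text as a 148K window did under the old one.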

The Weird Edge Cases

Tokenizers produce some genuinely surprising results:

" hello"  → [' hello']     — 1 token (leading space is part of the token!)
"hello "  → ['hello', ' '] — 2 tokens (trailing space is separate)
"hello\n" → ['hello', '\n'] — 2 tokens

"123456"  → ['123', '456']  — 2 tokens (numbers get chunked)
"$19.99"  → ['$', '19', '.', '99'] — 4 tokens

"café"    → ['caf', 'é']   — 2 tokens
"naïve"   → ['na', 'ï', 've'] — 3 tokens (accented chars split)

A few practical takeaways:

  • Leading spaces matter. " Hello" and "Hello" tokenize differently. This is why prompt formatting matters.
  • Numbers are expensive. GPT-4’s tokenizer chunks digits into groups of up to three, and each group becomes its own token. Long numbers or IDs eat through your context fast.
  • Non-English text uses more tokens. Chinese, Japanese, Arabic — they all tokenize less efficiently than English in most current models. A sentence that’s 20 tokens in English might be 40+ tokens in Japanese.
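
One reason behind the non-English penalty is visible without any tokenizer at all. Byte-level BPE tokenizers such as tiktoken start from UTF-8 bytes, and many scripts need two to four bytes per character. The sketch below only counts bytes; it is a rough proxy for the starting point before merges, not an actual token count.

```python
def utf8_bytes(text):
    """Number of UTF-8 bytes, i.e. base symbols a byte-level tokenizer starts from."""
    return len(text.encode("utf-8"))

samples = {
    "English": "hello",
    "French": "café",
    "Japanese": "こんにちは",
    "Arabic": "مرحبا",
    "Emoji": "🎉",
}
for language, text in samples.items():
    print(f"{language:9} {len(text)} chars -> {utf8_bytes(text)} UTF-8 bytes")
```

If the training data contained little text in a given script, few merges exist to fold those bytes back together, so more of them survive as separate tokens.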

What This Means for You

If you’re building with LLMs, tokenization isn’t just trivia — it’s operational:

  1. Count tokens before sending — don’t guess. Use tiktoken or your provider’s tokenizer library.
  2. Budget for overhead — system prompts, conversation history, and formatting all consume tokens. Leave headroom.
  3. Watch for model switches — changing from GPT-4 to Claude (or vice versa) changes your token economics. The same prompt will produce a different token count and a different bill.
  4. Compress when possible — shorter variable names in code, concise instructions, and trimmed context all reduce token count.
  5. Know your tokenizer — when Anthropic ships a new tokenizer with Opus 4.7, your existing prompts might suddenly be 35% more expensive in token terms.
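
Points 1 and 2 boil down to simple bookkeeping. Here is a minimal sketch of a budget check, with hypothetical numbers and a made-up helper name; the token counts themselves would come from your provider's tokenizer.

```python
def remaining_budget(context_window, system_tokens, history_tokens,
                     prompt_tokens, reserved_for_output):
    """Tokens left in the window after everything you plan to send and receive."""
    used = system_tokens + history_tokens + prompt_tokens + reserved_for_output
    return context_window - used

left = remaining_budget(
    context_window=128_000,     # advertised window size
    system_tokens=1_200,        # system prompt
    history_tokens=30_000,      # earlier conversation turns
    prompt_tokens=85_000,       # the new user message plus attachments
    reserved_for_output=4_000,  # room for the model's reply
)
print(left)  # 7800

if left < 0:
    print("Over budget: trim history or shorten the prompt before sending.")
```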

The tokenizer is the first thing that touches your text and the last thing you usually think about. Now you know what it’s doing — and why “strawberry” will never be one token.