Jun 10, 2026 · 5 min read

How Embeddings Work — The Math Behind Semantic Search, Explained Simply

Every time you search for something and the results just get it — even though you didn’t use the exact right words — there’s a good chance embeddings are doing the heavy lifting behind the scenes. They’re one of the most important building blocks in modern AI, and they’re simpler than you think.

Text In, Numbers Out

Computers don’t understand words. They understand numbers. An embedding is a way to convert a piece of text — a word, a sentence, an entire paragraph — into a list of numbers called a vector.

Think of it like GPS coordinates for meaning. Just as latitude and longitude pin a physical location on a map, an embedding pins a piece of text to a location in “meaning space.” The key insight: texts with similar meanings end up close together.

The sentences “How do I reset my password?” and “I forgot my login credentials” would land near each other in this space, even though they share almost no words. Meanwhile, “The cat sat on the mat” would be far away from both.

A typical embedding might look like this:

"How do I reset my password?"
→ [0.021, -0.187, 0.442, 0.033, ..., -0.091]  // 768 or 1536 numbers

Those hundreds of numbers aren’t random. Together they encode aspects of meaning such as topic, tone, intent, and specificity, learned from training on massive amounts of text. A single dimension rarely maps to one human-readable concept; the meaning lives in the pattern across all of them. The model figures out what matters on its own. You just get the coordinates.

Why “King − Man + Woman = Queen”

The most famous embedding demo goes like this: take the vector for “king,” subtract “man,” add “woman,” and the nearest remaining word vector (once the three input words are excluded) is “queen.” This works because the model has learned that the relationship between king and queen mirrors the relationship between man and woman: they differ along a “gender” direction in the vector space while sharing a “royalty” direction.

This isn’t a party trick. It reveals that embeddings capture relationships between concepts, not just individual meanings. That’s what makes them so powerful for search, recommendations, and reasoning.
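The arithmetic is easier to see with a toy example. The sketch below uses hand-made 2-D vectors whose two axes stand for "royalty" and "gender"; the values and the `analogy` helper are invented for illustration and do not come from any real model, where such directions are learned and spread over many dimensions.

```python
import math

# Hand-made toy vectors: [royalty, gender]. Values are invented for illustration.
toy_vectors = {
    "king":   [0.9,  0.8],
    "queen":  [0.9, -0.8],
    "man":    [0.1,  0.8],
    "woman":  [0.1, -0.8],
    "prince": [0.7,  0.8],
    "apple":  [-0.5, 0.0],
}

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)  # queen
```

Subtracting "man" removes the gender component of "king", adding "woman" puts the opposite one back, and the royalty component is left untouched, which is why the result lands on "queen" in this toy space.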

Measuring Closeness: Cosine Similarity

If embeddings are coordinates, we need a way to measure how close two points are. The standard tool is cosine similarity — it measures the angle between two vectors rather than the raw distance.

Why the angle? Because it focuses on direction (meaning) rather than magnitude (length). Two vectors pointing the same way get a score near 1.0 (very similar). Perpendicular vectors score 0 (unrelated). Opposite vectors score -1.

cosine_similarity("happy", "joyful")  → 0.92
cosine_similarity("happy", "laptop")  → 0.11
cosine_similarity("happy", "sad")     → 0.38  // related, but different

Notice “happy” and “sad” aren’t at -1. They’re related concepts (both emotions), just opposite in sentiment, and the math captures that nuance. The scores above are illustrative; exact values depend on the model.
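For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. Here is a minimal pure-Python sketch; the function name and the tiny 2-D inputs are chosen for illustration, and in practice you would likely use a numerical library instead.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same direction, different length: score is approximately 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular: score is 0
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Opposite direction: score is approximately -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

The first pair shows why the angle is used rather than raw distance: doubling a vector's length does not change its score.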

When you hear about vector databases, this is exactly what they’re optimized for: storing millions of embeddings and finding the closest ones fast, typically with approximate nearest-neighbor indexes, using cosine similarity (or related metrics such as dot product or Euclidean distance).
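To make the lookup concrete, here is a brute-force sketch of what such a search computes: score the query against every stored vector and keep the best few. The `normalize` and `top_k` helpers and the 3-D vectors are invented for illustration. A real vector database avoids scanning everything, but the ranking it aims to return is the same.

```python
import math

def normalize(v):
    """Scale a vector to length 1 so that dot product equals cosine similarity."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def top_k(query, index, k=2):
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    q = normalize(query)
    scored = [
        (doc_id, sum(a * b for a, b in zip(q, vec)))
        for doc_id, vec in index.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-D vectors, invented for illustration and stored pre-normalized.
index = {
    "reset-password": normalize([0.9, 0.1, 0.0]),
    "forgot-login":   normalize([0.8, 0.3, 0.1]),
    "cat-on-mat":     normalize([0.0, 0.2, 0.9]),
}

results = top_k([1.0, 0.2, 0.0], index, k=2)
top_ids = [doc_id for doc_id, _ in results]
print(top_ids)  # ['reset-password', 'forgot-login']
```

Normalizing once at indexing time is a common design choice: after that, a plain dot product gives the same ranking as cosine similarity.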

Where Embeddings Get Used

Once you can turn text into meaning-aware coordinates and measure closeness, a whole world of applications opens up:

Semantic search — Find results by meaning, not keywords. A user searches “eco-friendly packaging” and finds documents about “sustainable wrapping materials.”
RAG (Retrieval-Augmented Generation) — Embed your documents, embed the user’s question, retrieve the closest chunks, and feed them to an LLM as context. This is how you build AI that answers questions about your data. See our local RAG pipeline guide for a hands-on walkthrough.
Clustering — Group similar support tickets, articles, or feedback automatically. No labels needed — just embed everything and let the geometry do the grouping.
Recommendations — “Users who liked this article also liked…” becomes “find articles whose embeddings are closest to what this user has engaged with.”
Deduplication — Detect near-duplicate content even when the wording differs.
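Clustering and deduplication from the list above both come down to "group items whose vectors are close enough." The sketch below is one simple way to do that, a greedy pass with a similarity threshold. The `group_by_similarity` helper, the threshold of 0.95, and the ticket vectors are all assumptions made for illustration; production systems often use dedicated clustering algorithms instead.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def group_by_similarity(vectors, threshold=0.95):
    """Greedy grouping: an item joins the first group whose first member
    is at least `threshold` similar to it, otherwise it starts a new group."""
    groups = []
    for item_id, vec in vectors.items():
        for group in groups:
            representative = vectors[group[0]]
            if cosine(vec, representative) >= threshold:
                group.append(item_id)
                break
        else:
            groups.append([item_id])
    return groups

# Toy 3-D vectors standing in for embedded support tickets (invented values).
ticket_vectors = {
    "ticket-1": [0.90, 0.10, 0.05],  # "can't log in"
    "ticket-2": [0.88, 0.12, 0.04],  # "login fails"
    "ticket-3": [0.05, 0.95, 0.10],  # "refund request"
    "ticket-4": [0.07, 0.90, 0.15],  # "money back please"
}

groups = group_by_similarity(ticket_vectors, threshold=0.95)
print(groups)  # [['ticket-1', 'ticket-2'], ['ticket-3', 'ticket-4']]
```

A high threshold behaves like near-duplicate detection; lowering it produces broader topic-style groups.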

If you’re building any of these, understanding how tokenizers break text into pieces before embedding helps you reason about chunk sizes and input limits.

Popular Embedding Models

Not all embedding models are equal. They differ in quality, speed, dimensions, and cost:

| Model | Provider | Dimensions | Notes |
| --- | --- | --- | --- |
| `text-embedding-3-small` | OpenAI | 1536 | Great balance of quality and cost. API-based. |
| `text-embedding-3-large` | OpenAI | 3072 | Higher quality, higher cost. Supports dimension reduction. |
| `nomic-embed-text` | Nomic AI | 768 | Open-weight, runs locally via Ollama. Strong performer for its size. |
| `embed-v4.0` | Cohere | 1536 (default) | Excellent multilingual support. API-based. Smaller output sizes, including 1024, are selectable. |
| `mxbai-embed-large` | mixedbread.ai | 1024 | Open-weight, solid benchmark scores. |

The open-weight models are particularly interesting — you can run them on your own hardware with zero API costs and full data privacy. If you’ve got Ollama set up, you’re minutes away from generating embeddings locally.

Generating Embeddings in Practice

With the OpenAI API:

from openai import OpenAI
client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="How do embeddings work?"
)

vector = response.data[0].embedding  # list of 1536 floats

With Ollama (local, free):

import ollama

response = ollama.embed(
    model="nomic-embed-text",
    input="How do embeddings work?"
)

vector = response["embeddings"][0]  # list of 768 floats

Both give you a list of floats. What you do next — store them in a vector database, compute similarities, build a search index — is where the real application begins.

For a deeper look at what happens when an LLM processes tokens and generates output, check out how inference actually works.

Things That Trip People Up

Embeddings aren’t the same as tokens. Tokenizers chop text into pieces. Embeddings convert those pieces (or whole passages) into a single vector that represents the overall meaning.

Dimension count ≠ quality. A 768-dimension model can outperform a 1536-dimension one depending on training data and architecture. Benchmark scores on tasks similar to yours matter more than the raw dimension count.

You can’t mix models. An embedding from nomic-embed-text lives in a completely different vector space than one from text-embedding-3-small. You must use the same model for both indexing and querying.
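One way to guard against this mistake is to store the model name next to the vectors and check it on every write and query. The `EmbeddingIndex` class below is a hypothetical in-memory sketch, not the API of any particular vector database.

```python
class EmbeddingIndex:
    """Hypothetical in-memory index that remembers which model produced its vectors."""

    def __init__(self, model_name, dimensions):
        self.model_name = model_name
        self.dimensions = dimensions
        self.vectors = {}

    def _check(self, vector, model_name):
        # The name check matters even when dimensions happen to match:
        # two models with the same vector size still use unrelated spaces.
        if model_name != self.model_name:
            raise ValueError(
                f"index was built with {self.model_name!r}, "
                f"got a vector from {model_name!r}"
            )
        if len(vector) != self.dimensions:
            raise ValueError(
                f"expected {self.dimensions} dimensions, got {len(vector)}"
            )

    def add(self, doc_id, vector, model_name):
        self._check(vector, model_name)
        self.vectors[doc_id] = vector

    def validate_query(self, vector, model_name):
        self._check(vector, model_name)
        return True

index = EmbeddingIndex("nomic-embed-text", 768)
index.add("doc-1", [0.0] * 768, "nomic-embed-text")

error_message = None
try:
    index.validate_query([0.0] * 1536, "text-embedding-3-small")
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Switching models later means re-embedding every document and rebuilding the index, so it helps to fail loudly rather than return meaningless scores.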

Chunk size matters. Embedding an entire book into one vector loses detail. Embedding single sentences loses context. Most RAG systems chunk documents into 200–500 token passages as a sweet spot — our RAG pipeline guide covers chunking strategies in detail.
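As a rough sketch of chunking, the function below splits a document into fixed-size windows. Two assumptions to flag: it counts whitespace-separated words as a stand-in for real tokens (a real pipeline would count with the embedding model's tokenizer), and it overlaps neighboring chunks by a few words so a sentence cut at a boundary still appears whole somewhere, which is a common option rather than a requirement. The `chunk_text` name and default sizes are illustrative.

```python
def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into windows of `chunk_size` words, each sharing
    `overlap` words with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()  # crude stand-in for tokenization
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

# A synthetic 700-word document, just to show the window boundaries.
document = " ".join(f"word{i}" for i in range(700))
chunks = chunk_text(document, chunk_size=300, overlap=50)
print([len(c.split()) for c in chunks])  # [300, 300, 200]
```

Each chunk is then embedded on its own, so the chunk size directly controls how much text one vector has to summarize.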

The Big Picture

Embeddings are the bridge between human language and machine math. They let you ask “what’s similar to this?” in a way that respects meaning, not just spelling. Every semantic search engine, every RAG pipeline, every recommendation system that feels like it “gets” you — embeddings are doing the quiet work underneath.

The best part: you don’t need a PhD to use them. Pick a model, call an API (or run one locally), and start building. The math is elegant, but the API is just a function call.
