AI terminology explained in plain English, for developers who build things.
Models & Architecture
LLM (Large Language Model) — An AI model trained on huge amounts of text to predict the next token, which lets it generate fluent text and code. Examples: Claude, GPT-5, Gemma 4.
MoE (Mixture of Experts) — Architecture where only a subset of parameters activate per token. GLM-5.1 has 754B total but only 40B active. Makes large models efficient.
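The routing idea can be sketched with toy numbers — a scalar "activation" and hand-picked gate scores stand in for the real vectors and learned router:

```python
def moe_forward(x, experts, gate_scores, k=2):
    """Toy MoE layer: pick the top-k scoring experts for this token and
    run only them; the remaining experts' parameters stay idle.
    x is a scalar here; real models use vectors and a learned gate."""
    top = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)[:k]
    total = sum(gate_scores[i] for i in top)
    # Weighted combination of only the selected experts' outputs
    return sum(gate_scores[i] / total * experts[i](x) for i in top)

experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 1]
out = moe_forward(3.0, experts, gate_scores=[0.1, 0.6, 0.3])  # runs experts 1 and 2 only
```

Only two of the three experts execute, which is why a model can be huge in total parameters but cheap per token.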
Dense model — All parameters activate for every token. Mistral Large 2 (123B) is dense. Simpler but needs more compute per token.
Parameters — The learned weights inside a model, measured in billions (B). More parameters generally means more capability, but also more memory and compute to run.
Open-weight / Open-source — Model weights are publicly available, so you can download and run them locally. (Strictly, "open-weight" means only the weights are released; "open-source" implies the training code and data are open too — the terms are often used loosely.) Examples: Qwen 3.5, Kimi K2.5, DeepSeek.
Frontier model — The most capable models available. Currently: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro.
Tokens & Context
Token — The unit models process. Roughly 1 token ≈ 0.75 words or ~4 characters. “Hello world” = 2 tokens.
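That rule of thumb is easy to code. This is a character-count heuristic only — real tokenizers use byte-pair encoding and vary by model, so use your provider's tokenizer for exact counts:

```python
def estimate_tokens(text: str) -> int:
    # ~4 characters per token heuristic; an approximation, not a real
    # BPE tokenizer (which counts "Hello world" as 2 tokens, not 3)
    return max(1, round(len(text) / 4))

estimate_tokens("Hello world")  # 11 chars -> 3 by this heuristic
```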
Context window — Maximum tokens a model can process in one request. Ranges from 4K to 1M+ tokens. Larger = can read more code at once.
Input tokens — What you send to the model (your prompt, code, context).
Output tokens — What the model generates (the response). Usually more expensive than input.
Techniques
RAG (Retrieval-Augmented Generation) — Fetching relevant documents and including them in the prompt so the model can answer questions about your data.
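A minimal sketch of the pattern, with naive word-overlap scoring standing in for the embedding search a real system would use:

```python
def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by words shared with the question (a toy stand-in for
    embedding similarity) and keep the top k."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_rag_prompt(question: str, docs: list[str]) -> str:
    # Paste the retrieved documents into the prompt as grounding context
    context = "\n---\n".join(retrieve(question, docs))
    return (f"Using only the context below, answer the question.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

The model never sees your whole corpus — only the few documents most relevant to this question.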
Fine-tuning — Retraining a model on your specific data to change its behavior or knowledge. More expensive than RAG.
Embeddings — Converting text into number arrays that capture meaning. Used for semantic search and RAG.
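A toy example of why embeddings enable semantic search — 3-dimensional hand-made vectors here; real embedding models output hundreds to thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Standard closeness measure for embeddings: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

cat = [0.9, 0.2, 0.0]
dog = [0.8, 0.3, 0.1]
car = [0.1, 0.0, 0.9]
# Related meanings land close together in the vector space
assert cosine_similarity(cat, dog) > cosine_similarity(cat, car)
```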
Prompt engineering — Writing instructions that get the best output from AI models.
Prompt caching — Reusing an identical prompt prefix across requests so the provider can skip reprocessing it. Can cut input costs by up to 90%.
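Caching matches on exact byte-for-byte prefixes, so the main technique is ordering: keep the stable parts first and identical across calls. A generic sketch — the API mechanics differ by provider:

```python
def build_cacheable_prompt(system: str, reference_docs: str, question: str) -> str:
    # Stable prefix: byte-identical on every request, so it can be cached
    stable = f"{system}\n\n{reference_docs}"
    # Varying tail: only this part misses the cache
    return f"{stable}\n\nQuestion: {question}"

p1 = build_cacheable_prompt("You are a code reviewer.", "<docs>", "Review foo()")
p2 = build_cacheable_prompt("You are a code reviewer.", "<docs>", "Review bar()")
# Both requests share the same cacheable prefix
```

Putting the question first and the docs last would break the shared prefix and defeat the cache.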
FIM (Fill-in-the-Middle) — Model predicts code between a prefix and suffix. How Codestral does autocomplete.
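The prompt layout looks roughly like this. Sentinel token names vary by model — the StarCoder-style ones below are for illustration only, not any particular model's vocabulary:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    # The model sees code before and after the cursor, then generates the middle
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(2, 3))",
)
# The model's completion fills the gap, e.g. "a + b"
```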
Chain-of-thought — Asking the model to reason step by step before answering. Improves accuracy on complex tasks.
Infrastructure
Vector database — Database that stores embeddings and finds similar ones. Examples: Pinecone, Qdrant, Weaviate, Chroma.
Inference — Running a model to generate output. “Inference cost” = the cost of generating responses.
Quantization — Reducing model precision (e.g., 16-bit → 4-bit) to use less memory. Trades small quality loss for 4x less VRAM. See our local AI guides.
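Back-of-the-envelope VRAM math for the weights — the 20% overhead factor for activations and KV cache is a rough assumption:

```python
def vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Weights need params * (bits / 8) bytes; the overhead factor is a
    rough allowance for activations and KV cache."""
    return params_billion * (bits / 8) * overhead

# A 70B model: 16-bit needs ~168 GB, 4-bit ~42 GB -- the 4x saving above
full = vram_gb(70, 16)       # ~168 GB: multiple datacenter GPUs
quantized = vram_gb(70, 4)   # ~42 GB: within reach of workstation GPUs
```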
VRAM — GPU memory. Determines what models you can run locally. See how much VRAM you need.
Ollama — Tool for running AI models locally with one command.
Tools & Workflows
AI coding agent — Tool that autonomously writes, edits, and debugs code. Examples: Claude Code, Aider, Kimi CLI.
Agentic coding — AI that plans, executes, tests, and iterates autonomously. Beyond simple code generation.
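The loop described above, sketched with placeholder llm() and run_tests() callables — both hypothetical stand-ins for a real model call and a real test runner:

```python
def agent_loop(task, llm, run_tests, max_iters=5):
    """Plan/act/test/iterate: generate code, run the tests, feed
    failures back to the model, repeat until green or out of budget."""
    code = llm(f"Write code for: {task}")
    for _ in range(max_iters):
        passed, feedback = run_tests(code)
        if passed:
            return code
        # Feed the failure output back so the model can self-correct
        code = llm(f"The tests failed:\n{feedback}\n\nFix this code:\n{code}")
    return code  # best effort after max_iters
```

The test feedback loop is what separates agentic coding from one-shot generation: the model gets to see its own failures.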
Agent Swarm — Multiple AI agents working in parallel on the same task. Kimi K2.5’s signature feature.
Model routing — Sending simple tasks to cheap models and complex tasks to expensive ones. Saves 40-60% on API costs.
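A toy router — a real one might use a classifier or a cheap model's own confidence score, and the model names here are placeholders:

```python
def route(prompt: str) -> str:
    """Send long or complexity-flagged prompts to the expensive model,
    everything else to the cheap one. Heuristics and names are illustrative."""
    complex_markers = ("refactor", "debug", "architecture", "prove")
    if len(prompt) > 2000 or any(m in prompt.lower() for m in complex_markers):
        return "frontier-model"   # expensive, capable
    return "small-model"          # cheap, fast

route("Rename this variable")           # -> "small-model"
route("Debug this race condition")      # -> "frontier-model"
```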
OpenRouter — Unified API gateway for 300+ AI models through one endpoint.
Compliance
GDPR — EU data protection regulation. Affects how you use AI APIs with personal data.
EU AI Act — EU regulation classifying AI systems by risk level. Full enforcement August 2026.
DPA (Data Processing Agreement) — Contract required under GDPR when a third party processes personal data on your behalf.
Related: Best AI Coding Tools 2026 · How to Choose an AI Coding Agent · Best Free AI APIs 2026