Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
Every team building AI agents faces the same question: run them in the cloud (OpenAI, Anthropic, Google APIs) or self-host with open-source models on your own infrastructure?
The answer depends on your constraints: budget, privacy requirements, latency needs, and team size. Here are the real numbers.
Cost comparison
| Setup | Monthly cost | Quality | Latency |
|---|---|---|---|
| Cloud API (GPT-4o) | $50-500 (usage-based) | Frontier | 1-5s |
| Cloud API (GPT-4o-mini) | $5-50 (usage-based) | Good | 0.5-2s |
| Self-hosted (Qwen3 8B on VPS) | $20 fixed | Good for most tasks | 2-8s |
| Self-hosted (DeepSeek R1 14B on GPU) | $50-100 fixed | Strong reasoning | 3-10s |
| Local (Ollama on your machine) | $0 (electricity) | Varies by model | 1-15s |
| Hybrid (local + cloud fallback) | $10-50 | Best of both | 1-5s |
The crossover point: if you're spending more than $100/month on API calls, self-hosting starts to make financial sense. Below that, cloud APIs are cheaper once you factor in the time spent managing infrastructure.
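To sanity-check that crossover for your own workload, you can compare projected API spend against a fixed hosting bill plus maintenance time. The traffic figures, token price, and ops-hour assumptions below are illustrative, not provider quotes:

```python
def monthly_api_cost(requests_per_day: int, tokens_per_request: int,
                     price_per_million_tokens: float) -> float:
    """Approximate monthly API spend in USD, assuming a 30-day month."""
    tokens_per_month = requests_per_day * tokens_per_request * 30
    return tokens_per_month / 1_000_000 * price_per_million_tokens

def self_hosting_wins(api_cost: float, hosting_cost: float,
                      ops_hours: float = 4, hourly_rate: float = 50) -> bool:
    """Self-hosting only 'wins' if it beats the API bill plus the
    maintenance time it adds (assumed 4 h/month at $50/h)."""
    return hosting_cost + ops_hours * hourly_rate < api_cost

# Example: 2,000 requests/day at ~1,500 tokens each, $5 per million tokens
cost = monthly_api_cost(2000, 1500, 5.0)   # 90M tokens -> $450/month
print(round(cost), self_hosting_wins(cost, hosting_cost=80))
```

Note how the maintenance term raises the effective crossover well above the raw hosting price, which is why the $100/month rule of thumb is conservative.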
Privacy comparison
| Concern | Cloud API | Self-hosted |
|---|---|---|
| Data leaves your network | Yes | No |
| Provider can read your data | Depends on ToS | No |
| GDPR compliant | With DPA | By default |
| HIPAA compliant | Some providers | You control it |
| Data retention | Provider's policy | Your policy |
| Audit trail | Provider's logs | Your logs |
If you handle sensitive data (healthcare, legal, financial), self-hosting is often the only option that satisfies compliance. See our GDPR compliance guide and self-hosted AI guide.
The hybrid approach
The practical answer for most teams: use both.
```python
async def route_request(message: str, sensitivity: str) -> str:
    # run_local_agent / run_cloud_agent are your own thin wrappers
    # around Ollama and the cloud provider SDKs.
    if sensitivity == "high":
        # Sensitive data stays local
        return await run_local_agent(message, model="qwen3-8b")
    elif sensitivity == "complex":
        # Complex reasoning goes to a frontier model
        return await run_cloud_agent(message, model="claude-sonnet-4")
    else:
        # Routine tasks use a cheap cloud API
        return await run_cloud_agent(message, model="gpt-4o-mini")
```
This gives you:
- Privacy for sensitive data (local)
- Frontier quality for hard problems (cloud)
- Low cost for routine tasks (cheap cloud)
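The router still needs a `sensitivity` label from somewhere. A minimal heuristic is to flag obvious PII before anything leaves the network; the patterns and the 200-word "complex" threshold here are illustrative assumptions you would tune for your own compliance scope:

```python
import re

# Hypothetical PII patterns -- extend for your own compliance scope.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US SSN-like
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),     # card-number-like
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),    # email address
]

def classify_sensitivity(message: str) -> str:
    """Return 'high' if the message looks like it contains PII,
    'complex' for long analytical prompts, else 'low'."""
    if any(p.search(message) for p in PII_PATTERNS):
        return "high"
    if len(message.split()) > 200:
        return "complex"
    return "low"

print(classify_sensitivity("Patient SSN is 123-45-6789"))  # high
print(classify_sensitivity("Summarize this meeting"))      # low
```

Regex screening is cheap but crude; teams with strict requirements usually layer a local NER model or a dedicated PII scanner on top.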
Self-hosting options
| Platform | GPU | RAM | Models it can run | Monthly cost |
|---|---|---|---|---|
| Vultr VPS | No | 8-32 GB | Qwen3 8B, Phi-4 | $20-80 |
| RunPod GPU | Yes (A40/A100) | 48-80 GB | Any model | $50-300 |
| Contabo VPS | No | 8-60 GB | Qwen3 8B-27B, DeepSeek 14B | $5-40 |
| Hetzner dedicated | No | 64 GB | Up to 30B quantized | $40-80 |
| Your Mac (M-series) | Unified memory | 16-192 GB | Up to 70B | $0 |
For getting started with self-hosted models, see our Ollama guide and VRAM requirements guide.
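To judge which row of the table fits a given model, a common rule of thumb is that weights at 4-bit quantization take about half a gigabyte per billion parameters, plus overhead for the KV cache and runtime buffers. The 20% overhead factor below is an assumption; real usage varies with context length and runtime:

```python
def approx_memory_gb(params_billions: float, bits_per_weight: int = 4,
                     overhead: float = 1.2) -> float:
    """Rule-of-thumb memory estimate: quantized weights plus ~20%
    overhead for KV cache and buffers (an assumption, not a spec)."""
    weight_gb = params_billions * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

print(approx_memory_gb(8))    # ~4.8 GB -> fits a small VPS
print(approx_memory_gb(70))   # ~42 GB  -> needs a big GPU or unified memory
```

This is why an 8B model runs comfortably on a $20 VPS while 70B-class models only appear in the RunPod and Mac rows.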
Decision framework
Choose cloud APIs when:
- You need frontier model quality (GPT-4o, Claude Sonnet)
- Your usage is under $100/month
- You don't handle sensitive data
- You want zero infrastructure management
Choose self-hosted when:
- Privacy/compliance is non-negotiable
- You have predictable, high-volume usage
- You need full control over the model and infrastructure
- You have DevOps capacity to maintain it
Choose hybrid when:
- You have mixed sensitivity levels
- You want cost optimization
- You need both frontier quality and privacy
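The checklist above can be condensed into a small helper. The branching order and the $100/month threshold are this article's rough heuristics, not hard rules:

```python
def choose_deployment(monthly_api_spend: float,
                      handles_sensitive_data: bool,
                      has_devops_capacity: bool) -> str:
    """Map the decision framework to a recommendation."""
    if handles_sensitive_data and has_devops_capacity:
        return "self-hosted"
    if handles_sensitive_data:
        return "hybrid"   # keep sensitive traffic local, the rest in the cloud
    if monthly_api_spend > 100 and has_devops_capacity:
        return "self-hosted"
    return "cloud"

print(choose_deployment(40, False, False))   # cloud
print(choose_deployment(300, True, True))    # self-hosted
print(choose_deployment(80, True, False))    # hybrid
```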
Related: Best Cloud GPU Providers · Self-Hosted AI for Enterprise · Ollama Complete Guide · AI GDPR Guide · How Much VRAM for AI Models · AI Agent Cost Management · Deploy AI Agents to Production