Enterprise teams are moving AI workloads in-house. The reasons: data sovereignty, cost control at scale, and regulatory compliance. Hereβs the architecture.
Why enterprises self-host
| Concern | Cloud API | Self-hosted |
|---|---|---|
| Data leaves your network | Yes | No |
| GDPR compliance | Requires DPA + SCCs | Automatic |
| Cost at scale | $10K+/month | Fixed hardware cost |
| Model control | Provider decides updates | You control versions |
| Uptime | Depends on provider | You control SLA |
| Customization | Limited | Full (fine-tuning, custom models) |
The architecture
βββββββββββββββββββββββββββ
β Load Balancer β
β (Nginx / Traefik) β
ββββββββββββ¬βββββββββββββββ
β
ββββββββββββββββββΌβββββββββββββββββ
β β β
βββββββββββΌβββββββ ββββββββΌββββββββ ββββββββΌββββββββ
β vLLM Instance β β vLLM Instance β β vLLM Instance β
β (GPU Node 1) β β (GPU Node 2) β β (GPU Node 3) β
β Qwen 3.5 27B β β Devstral 24B β β Codestral 22B β
ββββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β β β
ββββββββββββββββββΌβββββββββββββββββ
β
ββββββββββββΌβββββββββββββββ
β OpenAI-Compatible β
β API β
ββββββββββββ¬βββββββββββββββ
β
βββββββββββββββββββββββΌβββββββββββββββββββββββ
β β β
ββββββΌββββββ ββββββββΌβββββββ βββββββΌβββββββ
β Aider β β Continue.dev β β OpenCode β
β Terminal β β VS Code β β Terminal β
ββββββββββββ βββββββββββββββ ββββββββββββββ
Hardware sizing
| Team size | Hardware | Models | Monthly cost |
|---|---|---|---|
| 1-5 devs | Mac Mini M4 32GB | Qwen 3.5 27B + Codestral | ~$50 (amortized) |
| 5-15 devs | RTX 4090 workstation | Qwen 27B + Codestral | ~$105 (amortized) |
| 15-50 devs | 2x A100 server | Devstral 2 123B | ~$500 (cloud) |
| 50+ devs | GPU cluster | Multiple models | Custom |
See our GPU memory planning guide for exact VRAM calculations and our inference cost calculator for break-even analysis.
Model selection
| Use case | Model | Why |
|---|---|---|
| Coding agent | Devstral Small 24B | Best coding quality at 24B, 256K context |
| Autocomplete | Codestral 22B | Purpose-built for FIM, fastest |
| General chat | Qwen 3.5 27B | Best all-rounder at this size |
| Reasoning | DeepSeek R1 14B | Best reasoning at small size |
| Frontier quality | Mistral Large 2 123B | Single-node, EU-based company |
All available via Ollama (easy) or vLLM (production).
Inference engine
Use vLLM for multi-user production serving. It provides:
- OpenAI-compatible API (drop-in replacement)
- Continuous batching for high throughput
- Prefix caching for shared prompts
- Tensor parallelism for multi-GPU
For single-user or small teams, Ollama is simpler. See our inference engine comparison.
Security
Self-hosting eliminates third-party data transfer but introduces new responsibilities:
- Network isolation β AI servers should be on a separate VLAN
- Access control β API keys or OAuth for the inference endpoint
- Model integrity β verify checksums when downloading weights
- Logging β observability for audit trails
- Updates β youβre responsible for patching and model updates
See our AI security checklist.
Connecting developer tools
Once your inference server is running, connect your teamβs tools:
# Aider
aider --model openai/devstral-small --openai-api-base http://ai-server:8000/v1
# Continue.dev (VS Code)
# Set provider to "openai" with baseURL "http://ai-server:8000/v1"
# OpenCode
opencode --provider ollama --model devstral-small:24b
See our free AI coding server guide for step-by-step team setup.
The hybrid approach
Most enterprises donβt go 100% self-hosted. The practical approach:
- Self-host for routine coding (autocomplete, simple edits) β free after hardware
- Cloud API for complex tasks (architecture decisions, security reviews) β Claude Opus or GPT-5
- EU provider for sensitive data β Mistral API with EU data residency
This gives you cost control on 80% of usage while keeping access to frontier models for the hardest 20%.
Total cost of ownership
When comparing self-hosted vs cloud API, include all costs:
| Cost category | Cloud API | Self-hosted |
|---|---|---|
| Model access | Per-token pricing | Free (open weights) |
| Hardware | $0 | $50-500/mo (amortized) |
| Electricity | $0 | $20-50/mo for GPU server |
| Ops time | $0 | 2-4 hrs/week |
| Updates | Automatic | Manual |
| Scaling | Automatic | Manual |
For a team of 10 developers doing 50K requests/day:
- Cloud API (Claude Sonnet): ~$300/month
- Self-hosted (Devstral 24B on A100): ~$200/month (cloud GPU) or ~$100/month (own hardware)
The break-even point is typically around 20K requests/day. Below that, cloud APIs are simpler and cheaper. Above that, self-hosting starts to make financial sense.
Getting started
The fastest path to self-hosted AI:
- Install Ollama on any machine (5 minutes)
- Pull a model:
ollama pull devstral-small:24b - Connect Aider or Continue.dev: point to
http://localhost:11434 - Start coding with free, private AI
Graduate to vLLM when you need multi-user serving, and to dedicated GPU hardware when you need consistent performance.
Related: Self-Hosted AI for GDPR Β· How to Set Up a Free AI Coding Server Β· How to Serve LLMs with vLLM Β· Best AI Coding Agents for Privacy Β· Best Cloud GPU Providers