πŸ€– AI Tools
Β· 4 min read

Self-Hosted AI for Enterprise β€” Complete Architecture Guide (2026)


Enterprise teams are moving AI workloads in-house. The reasons: data sovereignty, cost control at scale, and regulatory compliance. Here’s the architecture.

Why enterprises self-host

ConcernCloud APISelf-hosted
Data leaves your networkYesNo
GDPR complianceRequires DPA + SCCsAutomatic
Cost at scale$10K+/monthFixed hardware cost
Model controlProvider decides updatesYou control versions
UptimeDepends on providerYou control SLA
CustomizationLimitedFull (fine-tuning, custom models)

The architecture

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚     Load Balancer        β”‚
                    β”‚    (Nginx / Traefik)     β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
              β”‚                β”‚                β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”
    β”‚  vLLM Instance  β”‚ β”‚ vLLM Instance β”‚ β”‚ vLLM Instance β”‚
    β”‚  (GPU Node 1)   β”‚ β”‚ (GPU Node 2)  β”‚ β”‚ (GPU Node 3)  β”‚
    β”‚  Qwen 3.5 27B   β”‚ β”‚ Devstral 24B  β”‚ β”‚ Codestral 22B β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚                β”‚                β”‚
              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚    OpenAI-Compatible     β”‚
                    β”‚         API              β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                               β”‚
         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
         β”‚                     β”‚                      β”‚
    β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”        β”Œβ”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”
    β”‚  Aider   β”‚        β”‚ Continue.dev β”‚        β”‚  OpenCode  β”‚
    β”‚ Terminal β”‚        β”‚   VS Code    β”‚        β”‚  Terminal  β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Hardware sizing

Team sizeHardwareModelsMonthly cost
1-5 devsMac Mini M4 32GBQwen 3.5 27B + Codestral~$50 (amortized)
5-15 devsRTX 4090 workstationQwen 27B + Codestral~$105 (amortized)
15-50 devs2x A100 serverDevstral 2 123B~$500 (cloud)
50+ devsGPU clusterMultiple modelsCustom

See our GPU memory planning guide for exact VRAM calculations and our inference cost calculator for break-even analysis.

Model selection

Use caseModelWhy
Coding agentDevstral Small 24BBest coding quality at 24B, 256K context
AutocompleteCodestral 22BPurpose-built for FIM, fastest
General chatQwen 3.5 27BBest all-rounder at this size
ReasoningDeepSeek R1 14BBest reasoning at small size
Frontier qualityMistral Large 2 123BSingle-node, EU-based company

All available via Ollama (easy) or vLLM (production).

Inference engine

Use vLLM for multi-user production serving. It provides:

For single-user or small teams, Ollama is simpler. See our inference engine comparison.

Security

Self-hosting eliminates third-party data transfer but introduces new responsibilities:

  • Network isolation β€” AI servers should be on a separate VLAN
  • Access control β€” API keys or OAuth for the inference endpoint
  • Model integrity β€” verify checksums when downloading weights
  • Logging β€” observability for audit trails
  • Updates β€” you’re responsible for patching and model updates

See our AI security checklist.

Connecting developer tools

Once your inference server is running, connect your team’s tools:

# Aider
aider --model openai/devstral-small --openai-api-base http://ai-server:8000/v1

# Continue.dev (VS Code)
# Set provider to "openai" with baseURL "http://ai-server:8000/v1"

# OpenCode
opencode --provider ollama --model devstral-small:24b

See our free AI coding server guide for step-by-step team setup.

The hybrid approach

Most enterprises don’t go 100% self-hosted. The practical approach:

  • Self-host for routine coding (autocomplete, simple edits) β€” free after hardware
  • Cloud API for complex tasks (architecture decisions, security reviews) β€” Claude Opus or GPT-5
  • EU provider for sensitive data β€” Mistral API with EU data residency

This gives you cost control on 80% of usage while keeping access to frontier models for the hardest 20%.

Total cost of ownership

When comparing self-hosted vs cloud API, include all costs:

Cost categoryCloud APISelf-hosted
Model accessPer-token pricingFree (open weights)
Hardware$0$50-500/mo (amortized)
Electricity$0$20-50/mo for GPU server
Ops time$02-4 hrs/week
UpdatesAutomaticManual
ScalingAutomaticManual

For a team of 10 developers doing 50K requests/day:

  • Cloud API (Claude Sonnet): ~$300/month
  • Self-hosted (Devstral 24B on A100): ~$200/month (cloud GPU) or ~$100/month (own hardware)

The break-even point is typically around 20K requests/day. Below that, cloud APIs are simpler and cheaper. Above that, self-hosting starts to make financial sense.

Getting started

The fastest path to self-hosted AI:

  1. Install Ollama on any machine (5 minutes)
  2. Pull a model: ollama pull devstral-small:24b
  3. Connect Aider or Continue.dev: point to http://localhost:11434
  4. Start coding with free, private AI

Graduate to vLLM when you need multi-user serving, and to dedicated GPU hardware when you need consistent performance.

Related: Self-Hosted AI for GDPR Β· How to Set Up a Free AI Coding Server Β· How to Serve LLMs with vLLM Β· Best AI Coding Agents for Privacy Β· Best Cloud GPU Providers