Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.
Running AI models locally is great — until your MacBook fans sound like a jet engine and inference takes 30 seconds per response. The fix? Spin up a GPU instance in the cloud, install Ollama, and run any model you want at full speed.
In this tutorial, I’ll show you how to deploy Ollama on a Vultr GPU server in under 5 minutes. You’ll have a fully functional AI inference server that you can SSH into, hit via API, or use as a backend for your apps.
Why Vultr for GPU Inference?
Vultr offers bare-metal and cloud GPU instances with NVIDIA A100 and A40 GPUs. What makes them great for AI workloads:
- No long-term commitment — pay hourly, destroy when done
- Fast provisioning — servers ready in under 60 seconds
- Global locations — 32 data centers worldwide
- Simple pricing — no hidden egress fees for reasonable usage
If you’re comparing options, check out our best cloud GPU providers roundup.
Getting Started
First, you’ll need a Vultr account. New accounts get generous credits to test GPU instances without risk:
That’s enough to run an A100 instance for over 5 days straight — plenty of time to experiment.
Step 1: Create a GPU Instance
- Log into Vultr and click Deploy New Server
- Select Cloud GPU as the server type
- Choose your GPU:
| GPU | VRAM | Price/hr | Best For |
|---|---|---|---|
| NVIDIA A100 80GB | 80GB | $1.85/hr | Large models (70B+), multi-model serving |
| NVIDIA A40 48GB | 48GB | $1.10/hr | Mid-size models (13B-34B) |
| NVIDIA L40S 48GB | 48GB | $1.24/hr | Good balance of price/performance |
- Pick Ubuntu 22.04 as the OS
- Choose the closest data center to you
- Add your SSH key (or use password auth)
- Click Deploy Now
Your server will be ready in about 60 seconds. Copy the IP address.
Step 2: SSH Into Your Server
ssh root@YOUR_SERVER_IP
Verify the GPU is detected:
nvidia-smi
You should see your A100 (or whichever GPU you picked) with driver info and available VRAM.
Step 3: Install Ollama
One command:
curl -fsSL https://ollama.com/install.sh | sh
Ollama installs in about 10 seconds on a fresh server. It automatically detects NVIDIA GPUs and configures CUDA.
Verify it’s running:
ollama --version
Step 4: Pull a Model
Now pull whatever model you want to run. Here are popular choices based on VRAM:
# Small and fast (needs ~4GB VRAM)
ollama pull qwen3:8b
# Medium powerhouse (needs ~26GB VRAM)
ollama pull qwen3:32b
# Large flagship (needs ~40GB VRAM)
ollama pull llama3.1:70b-q4_K_M
For a full breakdown of model memory requirements, see how much VRAM do AI models need.
Pull time depends on the model size — expect 1-3 minutes for most models on Vultr’s fast network.
Step 5: Test Your Deployment
Run an interactive chat:
ollama run qwen3:8b
Or test the API directly:
curl http://localhost:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Explain Docker in 3 sentences",
"stream": false
}'
You should get a response in 1-2 seconds on GPU. That same model on CPU would take 15-20 seconds.
Step 6: Expose the API (Optional)
If you want to access Ollama from your local machine or other apps, configure it to listen on all interfaces:
sudo systemctl edit ollama
Add:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Then restart:
sudo systemctl restart ollama
Now you can hit it from your local machine:
curl http://YOUR_SERVER_IP:11434/api/generate -d '{
"model": "qwen3:8b",
"prompt": "Hello from my laptop!",
"stream": false
}'
Security note: Add firewall rules to restrict access to your IP only. In the Vultr dashboard, go to Firewall and create a group that only allows port 11434 from your IP.
Cost Comparison: Running Different Models
Here’s what it actually costs to run models on Vultr GPU instances:
| Model | GPU Needed | Hourly Cost | Monthly (24/7) | Tokens/sec |
|---|---|---|---|---|
| Qwen 3 8B | A40 48GB | $1.10/hr | ~$792/mo | ~80 tok/s |
| Qwen 3 32B | A100 80GB | $1.85/hr | ~$1,332/mo | ~45 tok/s |
| Llama 3.1 70B (Q4) | A100 80GB | $1.85/hr | ~$1,332/mo | ~25 tok/s |
Pro tip: Don’t run 24/7 unless you need to. Spin up for development sessions and destroy when done. A typical 8-hour dev day on an A100 costs about $15.
If you want to run Qwen 3 locally on your own hardware instead, that’s free — just slower without a dedicated GPU.
Cleanup: Destroy When Done
When you’re finished, destroy the instance from the Vultr dashboard. You only pay for active time. No lingering charges.
This is the biggest advantage over buying hardware. Need an A100 for 2 hours? That’s $3.70 total. Try buying an A100 for that price.
What’s Next?
Once you have Ollama running on Vultr, you can:
- Use it as a backend for your apps via the OpenAI-compatible API
- Run multiple models simultaneously (if VRAM allows)
- Set up a reverse proxy with auth for team access
- Connect it to your local RAG pipeline
For a deep dive into everything Ollama can do, check the complete Ollama guide.
FAQ
How much does it cost to run Ollama on Vultr?
The cheapest GPU option is around $1.10/hr (A40). For occasional use, expect $10-30/month. If you only need small models, a high-RAM CPU instance ($0.10-0.30/hr) works too — just slower. The $250 credit covers extensive testing.
Can I run multiple models at once?
Yes, if you have enough VRAM. On an A100 80GB, you could run a 7B model (~4GB) and a 32B model (~20GB quantized) simultaneously. Ollama handles multi-model serving automatically.
Is Vultr faster than running on my local machine?
Almost certainly, unless you have a desktop RTX 4090 or better. An A100 delivers 2-5x the throughput of consumer GPUs for LLM inference, especially on larger models.
Should I use Vultr or just use an API like OpenAI?
If you need privacy, customization (fine-tuning, system prompts without restrictions), or predictable costs at high volume, self-hosting on Vultr wins. For low-volume, casual use, APIs are simpler. At around 1M+ tokens/day, self-hosting becomes cheaper.