Apertus is one of the few fully open foundation models you can actually run end-to-end on your own hardware without signing any agreements, requesting access, or depending on an external API. Every model size is freely downloadable from HuggingFace under Apache 2.0, and the Transformers library has supported it natively since v4.56.0.
This guide covers every practical way to get Apertus running locally, from a quick test on your laptop to a production-ready vLLM deployment. I’ll be specific about hardware requirements because nothing’s more frustrating than downloading a 140GB model only to find it won’t fit in memory.
Hardware requirements by model size
Let’s start with what you actually need. These are minimum requirements for inference (not training):
| Model | BF16 VRAM | INT4 Quantized | Minimum hardware |
|---|---|---|---|
| 0.5B | ~1 GB | ~0.5 GB | Any modern laptop CPU |
| 1.5B | ~3 GB | ~1.5 GB | Laptop with 8GB RAM |
| 4B | ~8 GB | ~3 GB | RTX 3060 12GB or M1 Mac |
| 8B | ~16 GB | ~6 GB | RTX 4090 24GB or M2 Pro |
| 70B | ~140 GB | ~40 GB | 2-4x A100 80GB or 8x RTX 4090 |
The 0.5B model genuinely runs on a CPU. It’s slow, but it works. The 4B is the sweet spot for most developers because it fits on consumer GPUs while still being useful. The 70B is for serious deployments with proper hardware.
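The numbers in the table follow from simple arithmetic: parameter count times bytes per parameter. Here is a rough sketch of that estimate. `estimate_weight_gb` is a hypothetical helper, not part of any library, and it counts the weights only. KV cache, activations, and quantization metadata come on top, which is why the INT4 column in the table runs higher than the raw figure.

```python
def estimate_weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    return params_billion * bytes_per_param


for size in (0.5, 1.5, 4, 8, 70):
    bf16 = estimate_weight_gb(size, 16)
    int4 = estimate_weight_gb(size, 4)
    print(f"{size}B: BF16 ~{bf16:g} GB, INT4 ~{int4:g} GB (weights only)")
```

Treat the result as a floor, not a budget: leave a few GB of headroom for the context you plan to run.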
Method 1: HuggingFace Transformers (simplest)
This is the fastest way to get Apertus running. You need Python 3.10+ and Transformers v4.56.0 or later.
Install dependencies
pip install -U transformers torch accelerate
Run the 4B Instruct model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "swiss-ai/Apertus-v1.1-4B-Instruct"
device = "cuda"  # use "cpu" if no GPU, or "mps" for Apple Silicon

tokenizer = AutoTokenizer.from_pretrained(model_name)
# torch_dtype="auto" loads the weights in the precision they were saved in
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").to(device)

messages = [
    {"role": "user", "content": "Explain how the EU AI Act affects open-source models."}
]
# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(device)

# do_sample=True makes sure temperature and top_p actually take effect
output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.8, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)
The Swiss AI team recommends temperature=0.8 and top_p=0.9 for the best output quality.
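If you'd rather not hard-code the `device` string, a small helper can pick it at runtime. This is a minimal sketch: `pick_device` is a hypothetical name, and it only uses PyTorch's standard availability checks. It falls back to "cpu" when PyTorch isn't installed at all.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, so no accelerator to use
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent on very old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed to `.to(device)` in place of the hard-coded value.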
Run the 8B model with automatic device mapping
For the 8B model, you might want to split across CPU and GPU if you don’t have 16GB of VRAM:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "swiss-ai/Apertus-8B-Instruct-2509"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # automatically splits across available devices
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "Write a summary of Swiss data protection law in German."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Move inputs to wherever the first model shard lives
inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(model.device)
# do_sample=True makes sure temperature and top_p actually take effect
output = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.8, top_p=0.9)
print(tokenizer.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True))
Setting device_map="auto" lets Transformers figure out how to distribute the model across your GPU, CPU, and even disk if necessary. It’s slower than pure GPU inference, but it works.
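If automatic placement fills your GPU to the brim, `from_pretrained` also accepts a `max_memory` mapping alongside `device_map`. The sketch below builds one. `build_max_memory` is a hypothetical helper, and the 2 GiB headroom and the memory sizes are assumptions you should adjust for your own machine.

```python
def build_max_memory(gpu_gib: dict[int, int], cpu_gib: int, headroom_gib: int = 2) -> dict:
    """Build a max_memory mapping, leaving headroom on each GPU for activations and KV cache."""
    limits = {idx: f"{max(total - headroom_gib, 1)}GiB" for idx, total in gpu_gib.items()}
    limits["cpu"] = f"{cpu_gib}GiB"
    return limits


# Assumed setup: one 12 GiB GPU (index 0) and 32 GiB of system RAM for offload.
max_memory = build_max_memory({0: 12}, cpu_gib=32)
print(max_memory)
```

Pass the result as `from_pretrained(model_name, device_map="auto", torch_dtype="auto", max_memory=max_memory)`.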
Method 2: Quantized models for limited hardware
If you don’t have a beefy GPU, quantization is your friend. Apertus provides official quantized checkpoints, which is a big deal. These aren’t community GPTQ/GGUF quantizations of unknown quality. They’re quantization-aware distilled checkpoints created by the original team.
Available quantization formats
- FP8 and NVFP4A16: Optimized for vLLM inference on NVIDIA GPUs
- INT3, INT4, INT6: Optimized for Apple Silicon via MLX
Running the INT4 quantized 4B on Apple Silicon
If you’re on a Mac with Apple Silicon, you can use the MLX checkpoints:
pip install mlx-lm
from mlx_lm import load, generate
model, tokenizer = load("swiss-ai/Apertus-v1.1-4B-Instruct-MLX-INT4")
messages = [{"role": "user", "content": "What is sovereign AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
The INT4 version of the 4B model uses around 3GB of memory. That’s comfortable on any M1/M2/M3 Mac.
Running quantized models with bitsandbytes
For NVIDIA GPUs with limited VRAM, you can also load with 4-bit quantization on the fly. This needs the bitsandbytes package (pip install bitsandbytes):
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)

model_name = "swiss-ai/Apertus-8B-Instruct-2509"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)
This lets you fit the 8B model into around 6GB of VRAM. Quality drops slightly compared to full precision, but it’s very usable for testing and development.
Method 3: vLLM for production serving
If you’re deploying Apertus as a service (internal API, RAG backend, or production application), vLLM is the way to go. It gives you an OpenAI-compatible API with continuous batching and optimized inference.
Basic vLLM setup
pip install vllm
vllm serve "swiss-ai/Apertus-v1.1-4B-Instruct"
That’s it. You now have an OpenAI-compatible API running at http://localhost:8000. Any code that works with the OpenAI SDK works with this.
Call it with curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "swiss-ai/Apertus-v1.1-4B-Instruct",
    "messages": [
      {"role": "user", "content": "Summarize the GDPR in three paragraphs."}
    ],
    "temperature": 0.8,
    "top_p": 0.9,
    "max_tokens": 1024
  }'
Call it with the OpenAI Python SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.chat.completions.create(
    model="swiss-ai/Apertus-v1.1-4B-Instruct",
    messages=[
        {"role": "user", "content": "Explain DORA regulation for banks."}
    ],
    temperature=0.8,
    top_p=0.9,
)
print(response.choices[0].message.content)
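If you don't want the `openai` package as a dependency, the same endpoint can be called with the standard library alone. This is a minimal sketch, assuming the `vllm serve` command from earlier is listening on localhost:8000. `build_chat_request` and `chat` are hypothetical helper names, and the payload mirrors the curl example.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "swiss-ai/Apertus-v1.1-4B-Instruct",
                       base_url: str = "http://localhost:8000/v1") -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.8,
        "top_p": 0.9,
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Explain DORA regulation for banks."))` should print the reply. Building the request separately from sending it makes the payload easy to inspect or log.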
Serving the 70B model with tensor parallelism
For the 70B model, you’ll need multiple GPUs. vLLM handles this with tensor parallelism:
vllm serve "swiss-ai/Apertus-70B-Instruct-2509" \
--tensor-parallel-size 4 \
--max-model-len 4096
This splits the model across 4 GPUs. With 4x A100 80GB, you can serve the full BF16 model. With 4x RTX 4090 (24GB each = 96GB total), you’d want the FP8 quantized version:
vllm serve "swiss-ai/Apertus-70B-Instruct-2509-vLLM-FP8" \
--tensor-parallel-size 4 \
--max-model-len 4096
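To see why the RTX 4090 setup needs FP8, it helps to work out how much weight memory each GPU ends up holding. The sketch below assumes the weights shard evenly across GPUs, which is only an approximation of what tensor parallelism does, and it ignores KV cache and activations. `per_gpu_weight_gb` is a hypothetical helper.

```python
def per_gpu_weight_gb(params_billion: float, bits_per_param: int, tensor_parallel_size: int) -> float:
    """Weight memory each GPU holds when the model is sharded evenly."""
    total_gb = params_billion * bits_per_param / 8
    return total_gb / tensor_parallel_size


setups = [
    ("BF16 on A100 80GB", 16, 80),
    ("BF16 on RTX 4090 24GB", 16, 24),
    ("FP8 on RTX 4090 24GB", 8, 24),
]
for label, bits, gpu_gb in setups:
    share = per_gpu_weight_gb(70, bits, tensor_parallel_size=4)
    verdict = "fits" if share < gpu_gb else "does not fit"
    print(f"{label}: {share:g} GB of weights per GPU, {verdict} (before KV cache)")
```

Whatever is left on each card after the weights goes to the KV cache, which is why a short `--max-model-len` helps on smaller GPUs.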
Using the official NVFP4 checkpoint
For maximum efficiency on NVIDIA hardware, the NVFP4A16 checkpoint is optimized specifically for vLLM:
vllm serve "swiss-ai/Apertus-v1.1-4B-Instruct-vLLM-NVFP4A16"
This gives you roughly 2x the throughput compared to BF16 with minimal quality loss.
Method 4: Docker
If you prefer containerized deployment:
docker model run hf.co/swiss-ai/Apertus-v1.1-4B-Instruct
This uses Docker’s built-in model runner feature and handles GPU passthrough automatically.
Method 5: SGLang (alternative to vLLM)
SGLang is another high-performance inference engine that supports Apertus:
pip install sglang
python3 -m sglang.launch_server \
--model-path "swiss-ai/Apertus-v1.1-4B-Instruct" \
--host 0.0.0.0 \
--port 8000
SGLang can offer better throughput than vLLM for certain workloads, particularly those involving structured outputs or complex prompt caching patterns.
Tips for getting the best results
After running Apertus for a while, here are some practical tips:
Use the Instruct models. The base models are pretrained checkpoints meant for further fine-tuning. For direct conversation and tasks, always use the -Instruct variants.
Stick to the recommended sampling parameters. temperature=0.8 and top_p=0.9 were specifically tuned during post-training. Going too low on temperature makes the output repetitive. Going too high makes it incoherent.
Don’t expect GPT-5 quality. This is an open model family in the 0.5B-70B range. Calibrate your expectations accordingly. It’s great for multilingual tasks, summarization, translation, and general Q&A. It’s not going to ace complex multi-step reasoning or advanced code generation.
The 4B model is surprisingly capable. The v1.1 4B was trained using pre-training distillation from the 8B model, which means it punches above its weight for its parameter count. For development and testing, start here.
Quantized checkpoints are official. Unlike most open models where you rely on community quantizations, the Apertus team provides their own quantization-aware distilled checkpoints. Use these over third-party quantizations when available.
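To make the sampling tip concrete, here is a toy illustration of what temperature and top_p do to a next-token distribution. This is a sketch of the general technique in plain Python, not the code Transformers or vLLM actually run, and the four logits are made up.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits divided by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> list[float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a four-token vocabulary
probs = apply_temperature(logits, temperature=0.8)
print([round(p, 3) for p in top_p_filter(probs, top_p=0.9)])
```

Lower temperatures concentrate probability on the top token, which is where the repetitiveness comes from. Higher ones flatten the distribution, and top_p then trims the unlikely tail so the flattening doesn't let junk tokens through.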
Model identifiers reference
Here’s a quick reference for all the HuggingFace model IDs you’ll need:
| Model | HuggingFace ID |
|---|---|
| 0.5B Instruct | swiss-ai/Apertus-v1.1-0.5B-Instruct |
| 1.5B Instruct | swiss-ai/Apertus-v1.1-1.5B-Instruct |
| 4B Instruct | swiss-ai/Apertus-v1.1-4B-Instruct |
| 8B Instruct | swiss-ai/Apertus-8B-Instruct-2509 |
| 70B Instruct | swiss-ai/Apertus-70B-Instruct-2509 |
| 8B FP8 (vLLM) | swiss-ai/Apertus-8B-Instruct-2509-vLLM-FP8 |
| 70B FP8 (vLLM) | swiss-ai/Apertus-70B-Instruct-2509-vLLM-FP8 |
| 4B INT4 (MLX) | swiss-ai/Apertus-v1.1-4B-Instruct-MLX-INT4 |
| 8B INT4 (MLX) | swiss-ai/Apertus-8B-Instruct-2509-MLX-INT4 |
FAQ
Can I run Apertus on a MacBook?
Yes. The 0.5B runs on any Mac. The 4B with INT4 quantization uses about 3GB and runs well on any Apple Silicon Mac. The 8B with INT4 is comfortable on M2 Pro/Max with 32GB+ unified memory. The 70B isn’t realistic on a Mac unless you have an M2 Ultra with 192GB.
Do I need to request access or sign up for anything?
No. All models are publicly available on HuggingFace. Just pip install transformers and load the model. No gating, no access requests, no API keys.
How fast is inference on consumer hardware?
On an RTX 4090 with the 4B model at BF16, expect around 50-80 tokens per second. With FP8 quantization, you can push that higher. The 8B model on the same card does roughly 30-50 tok/s. On Apple Silicon M3 Max with the INT4 4B model, expect 20-40 tok/s depending on context length.
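Tokens per second is easier to reason about as wall-clock time for a full reply. A quick sketch using the ranges above: `generation_seconds` is a hypothetical helper, and it assumes a steady decode rate and ignores prompt processing.

```python
def generation_seconds(new_tokens: int, tokens_per_second: float) -> float:
    """Time to decode new_tokens at a steady rate (ignores prompt processing)."""
    return new_tokens / tokens_per_second


setups = [
    ("4B BF16, RTX 4090", 50, 80),
    ("8B BF16, RTX 4090", 30, 50),
    ("4B INT4, M3 Max", 20, 40),
]
for label, low, high in setups:
    fastest = generation_seconds(1024, high)
    slowest = generation_seconds(1024, low)
    print(f"{label}: {fastest:.0f}-{slowest:.0f} s for a 1024-token reply")
```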
Is there an Ollama model for Apertus?
As of June 2026, Apertus isn’t in the default Ollama library. However, you can use Docker’s model runner (docker model run) which supports HuggingFace models directly. You can also convert the model to GGUF format manually for use with llama.cpp or Ollama, but the official quantized checkpoints via Transformers or vLLM are a better path.
Can I serve Apertus behind an API for my team?
Absolutely. vLLM gives you a production-ready OpenAI-compatible API with a single command. Point your existing OpenAI SDK code at it and things just work. For higher availability, put it behind a load balancer and run multiple vLLM instances.
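For a quick experiment with several instances before you set up a real load balancer, client-side round-robin is enough. This is a sketch only: the class name and both addresses are hypothetical, and it does no health checking or retries, which a proper load balancer would.

```python
import itertools


class RoundRobinEndpoints:
    """Cycle through a fixed list of base URLs, one per request."""

    def __init__(self, base_urls: list[str]):
        if not base_urls:
            raise ValueError("at least one base URL is required")
        self._cycle = itertools.cycle(base_urls)

    def next_url(self) -> str:
        return next(self._cycle)


# Hypothetical addresses: two vLLM instances started on different ports.
endpoints = RoundRobinEndpoints(["http://localhost:8000/v1", "http://localhost:8001/v1"])
for _ in range(4):
    print(endpoints.next_url())
```

Each call to `endpoints.next_url()` can be passed as `base_url` when constructing an OpenAI client for the next request.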
What’s the context window size?
The v1.1 models (0.5B, 1.5B, 4B) support 4096 token sequences. The 8B and 70B models also use 4096 tokens by default. This is shorter than some competitors, so plan accordingly for long-document tasks. You may need chunking strategies for documents that exceed this limit.
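One simple chunking strategy is a sliding window over the token IDs with a small overlap, so sentences cut at a boundary still appear whole in one of the chunks. The sketch below works on plain lists of integers. `chunk_tokens` is a hypothetical helper, and the reply and template budgets are assumptions to adjust for your prompts.

```python
def chunk_tokens(token_ids: list[int], max_len: int, overlap: int = 0) -> list[list[int]]:
    """Split token_ids into windows of at most max_len, each sharing `overlap` tokens with the previous one."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end
    return chunks


# Leave room for the instruction and the reply inside the 4096-token window.
budget = 4096 - 1024 - 256  # assumed: 1024 reply tokens, 256 tokens of prompt scaffolding
fake_document = list(range(10_000))  # stand-in for real tokenizer output
chunks = chunk_tokens(fake_document, max_len=budget, overlap=128)
print(len(chunks), [len(c) for c in chunks])
```

In practice the IDs would come from `tokenizer(text)["input_ids"]`, and each chunk can be turned back into text with `tokenizer.decode(chunk)` before it goes into the chat template.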