How to Run Apertus Locally: Complete Setup Guide (All Sizes)


Apertus is one of the few fully open foundation models you can actually run end-to-end on your own hardware without signing any agreements, requesting access, or depending on an external API. Every model size is freely downloadable from HuggingFace under Apache 2.0, and the Transformers library has supported it natively since v4.56.0.

This guide covers every practical way to get Apertus running locally, from a quick test on your laptop to a production-ready vLLM deployment. I’ll be specific about hardware requirements because nothing’s more frustrating than downloading a 140GB model only to find it won’t fit in memory.

Hardware requirements by model size

Let’s start with what you actually need. These are minimum requirements for inference (not training):

| Model | BF16 VRAM | INT4 quantized | Minimum hardware |
| --- | --- | --- | --- |
| 0.5B | ~1 GB | ~0.5 GB | Any modern laptop CPU |
| 1.5B | ~3 GB | ~1.5 GB | Laptop with 8GB RAM |
| 4B | ~8 GB | ~3 GB | RTX 3060 12GB or M1 Mac |
| 8B | ~16 GB | ~6 GB | RTX 4090 24GB or M2 Pro |
| 70B | ~140 GB | ~40 GB | 2-4x A100 80GB or 8x RTX 4090 |

The 0.5B model genuinely runs on a CPU. It’s slow, but it works. The 4B is the sweet spot for most developers because it fits on consumer GPUs while still being useful. The 70B is for serious deployments with proper hardware.
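
If you want to sanity-check the BF16 column yourself, the arithmetic is just parameter count times bytes per parameter. Here's a rough sketch of that calculation. The function name and the precision table are my own, and it only counts the weights, so treat the result as a lower bound (the KV cache, activations, and runtime overhead come on top):

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes / 1e9 bytes-per-GB simplifies to this.
    return params_billion * BYTES_PER_PARAM[precision]


for size in (0.5, 1.5, 4, 8, 70):
    bf16 = weight_memory_gb(size, "bf16")
    int4 = weight_memory_gb(size, "int4")
    print(f"{size}B: ~{bf16:g} GB in BF16, ~{int4:g} GB in INT4 (weights only)")
```

The BF16 results line up with the table. The INT4 figures in the table are somewhat higher than this raw estimate for the larger models, which I'd put down to overhead beyond the weights themselves.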

Method 1: HuggingFace Transformers (simplest)

This is the fastest way to get Apertus running. You need Python 3.10+ and a recent Transformers version.

Install dependencies

pip install -U transformers torch accelerate

Run the 4B Instruct model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "swiss-ai/Apertus-v1.1-4B-Instruct"
device = "cuda"  # use "cpu" if no GPU, or "mps" for Apple Silicon

tokenizer = AutoTokenizer.from_pretrained(model_name)
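# torch_dtype="auto" loads the checkpoint's own precision instead of defaulting to float32
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").to(device)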

messages = [
    {"role": "user", "content": "Explain how the EU AI Act affects open-source models."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(device)

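# do_sample=True makes sure sampling is on, so temperature and top_p actually take effect
output = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.8, top_p=0.9)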
response = tokenizer.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)

The Swiss AI team recommends temperature=0.8 and top_p=0.9 for the best output quality.
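
If you're wondering what those two knobs actually do, here's a toy illustration in plain Python. It is a simplified sketch of temperature scaling followed by top-p (nucleus) filtering on a made-up four-token vocabulary, not the Transformers implementation, and the logits are invented for the example:

```python
import math


def apply_temperature(logits, temperature):
    """Divide logits by the temperature, then softmax them into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
probs = apply_temperature(logits, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.9)
print(nucleus)  # token 3 (the least likely) is dropped before sampling
```

Lower temperatures concentrate probability on the top token, and top_p trims the unlikely tail before a token is drawn from what remains.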

Run the 8B model with automatic device mapping

For the 8B model, you might want to split across CPU and GPU if you don’t have 16GB of VRAM:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "swiss-ai/Apertus-8B-Instruct-2509"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # automatically splits across available devices
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "Write a summary of Swiss data protection law in German."}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt", add_special_tokens=False).to(model.device)

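# do_sample=True makes sure sampling is on, so temperature and top_p actually take effect
output = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.8, top_p=0.9)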
print(tokenizer.decode(output[0][len(inputs.input_ids[0]):], skip_special_tokens=True))

Setting device_map="auto" lets Transformers figure out how to distribute the model across your GPU, CPU, and even disk if necessary. It’s slower than pure GPU inference, but it works.
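
To see where things actually ended up, a model loaded with a device map exposes the placement as model.hf_device_map, a dict from module names to devices. A small sketch for summarizing it follows. The helper name is mine, and the example dict is a hypothetical placement so the snippet runs without downloading anything:

```python
from collections import Counter


def summarize_device_map(device_map: dict) -> Counter:
    """Count how many modules were placed on each device."""
    return Counter(str(device) for device in device_map.values())


# Hypothetical placement for illustration; with a real model, pass model.hf_device_map.
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": "cpu",
    "model.layers.2": "disk",
    "lm_head": "cpu",
}
print(summarize_device_map(example_map))
```

If most layers land on "cpu" or "disk", expect generation to be slow, and consider a smaller or quantized model instead.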

Method 2: Quantized models for limited hardware

If you don’t have a beefy GPU, quantization is your friend. Apertus provides official quantized checkpoints, which is a big deal. These aren’t community GPTQ/GGUF quantizations of unknown quality. They’re quantization-aware distilled checkpoints created by the original team.

Available quantization formats

  • FP8 and NVFP4A16: Optimized for vLLM inference on NVIDIA GPUs
  • INT3, INT4, INT6: Optimized for Apple Silicon via MLX

Running the INT4 quantized 4B on Apple Silicon

If you’re on a Mac with Apple Silicon, you can use the MLX checkpoints:

pip install mlx-lm

from mlx_lm import load, generate

model, tokenizer = load("swiss-ai/Apertus-v1.1-4B-Instruct-MLX-INT4")

messages = [{"role": "user", "content": "What is sovereign AI?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)

The INT4 version of the 4B model uses around 3GB of memory. That’s comfortable on any M1/M2/M3 Mac.

Running quantized models with bitsandbytes

For NVIDIA GPUs with limited VRAM, you can also load with 4-bit quantization on the fly:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_quant_type="nf4",
)

model_name = "swiss-ai/Apertus-8B-Instruct-2509"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto",
)

This lets you fit the 8B model into around 6GB of VRAM. Quality drops slightly compared to full precision, but it’s very usable for testing and development.

Method 3: vLLM for production serving

If you’re deploying Apertus as a service (internal API, RAG backend, or production application), vLLM is the way to go. It gives you an OpenAI-compatible API with continuous batching and optimized inference.

Basic vLLM setup

pip install vllm
vllm serve "swiss-ai/Apertus-v1.1-4B-Instruct"

That’s it. You now have an OpenAI-compatible API running at http://localhost:8000. Any code that works with the OpenAI SDK works with this.

Call it with curl

curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "swiss-ai/Apertus-v1.1-4B-Instruct",
    "messages": [
      {"role": "user", "content": "Summarize the GDPR in three paragraphs."}
    ],
    "temperature": 0.8,
    "top_p": 0.9,
    "max_tokens": 1024
  }'

Call it with the OpenAI Python SDK

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="swiss-ai/Apertus-v1.1-4B-Instruct",
    messages=[
        {"role": "user", "content": "Explain DORA regulation for banks."}
    ],
    temperature=0.8,
    top_p=0.9,
)
print(response.choices[0].message.content)

Serving the 70B model with tensor parallelism

For the 70B model, you’ll need multiple GPUs. vLLM handles this with tensor parallelism:

vllm serve "swiss-ai/Apertus-70B-Instruct-2509" \
  --tensor-parallel-size 4 \
  --max-model-len 4096

This splits the model across 4 GPUs. With 4x A100 80GB, you can serve the full BF16 model. With 4x RTX 4090 (24GB each = 96GB total), you’d want the FP8 quantized version:

vllm serve "swiss-ai/Apertus-70B-Instruct-2509-vLLM-FP8" \
  --tensor-parallel-size 4 \
  --max-model-len 4096
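
The reasoning behind those two setups is simple division. Here's a back-of-the-envelope sketch, assuming tensor parallelism splits the weights roughly evenly across GPUs. The function names and the 80% headroom figure are my own assumptions (you need room left over for the KV cache and activations), not vLLM settings:

```python
def per_gpu_weight_gb(params_billion: float, bytes_per_param: float, num_gpus: int) -> float:
    """Weight memory per GPU in GB, assuming an even split across GPUs."""
    return params_billion * bytes_per_param / num_gpus


def fits(params_billion, bytes_per_param, num_gpus, gpu_memory_gb, headroom=0.8):
    """True if each GPU's weight shard fits within `headroom` of its memory.

    The 0.8 default is an assumed safety margin, not a measured value.
    """
    shard = per_gpu_weight_gb(params_billion, bytes_per_param, num_gpus)
    return shard <= gpu_memory_gb * headroom


print(fits(70, 2.0, 4, 80))  # 70B BF16 on 4x A100 80GB: 35 GB per GPU
print(fits(70, 2.0, 4, 24))  # 70B BF16 on 4x RTX 4090: 35 GB per GPU, too big
print(fits(70, 1.0, 4, 24))  # 70B FP8 on 4x RTX 4090: 17.5 GB per GPU
```

The FP8 case on 24GB cards fits, but not by much, which is one reason to keep --max-model-len modest there.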

Using the official NVFP4 checkpoint

For maximum efficiency on NVIDIA hardware, the NVFP4A16 checkpoint is optimized specifically for vLLM:

vllm serve "swiss-ai/Apertus-v1.1-4B-Instruct-vLLM-NVFP4A16"

This gives you roughly 2x the throughput compared to BF16 with minimal quality loss.

Method 4: Docker

If you prefer containerized deployment:

docker model run hf.co/swiss-ai/Apertus-v1.1-4B-Instruct

This uses Docker’s built-in model runner feature and handles GPU passthrough automatically.

Method 5: SGLang (alternative to vLLM)

SGLang is another high-performance inference engine that supports Apertus:

pip install sglang

python3 -m sglang.launch_server \
  --model-path "swiss-ai/Apertus-v1.1-4B-Instruct" \
  --host 0.0.0.0 \
  --port 8000

SGLang can offer better throughput than vLLM for certain workloads, particularly those involving structured outputs or complex prompt caching patterns.

Tips for getting the best results

After running Apertus for a while, here are some practical tips:

Use the Instruct models. The base models are pretrained checkpoints meant for further fine-tuning. For direct conversation and tasks, always use the -Instruct variants.

Stick to the recommended sampling parameters. temperature=0.8 and top_p=0.9 were specifically tuned during post-training. Going too low on temperature makes the output repetitive. Going too high makes it incoherent.

Don’t expect GPT-5 quality. This is an open model family in the 0.5B-70B range. Calibrate your expectations accordingly. It’s great for multilingual tasks, summarization, translation, and general Q&A. It’s not going to ace complex multi-step reasoning or advanced code generation.

The 4B model is surprisingly capable. The v1.1 4B was trained using pre-training distillation from the 8B model, which means it punches above its weight for its parameter count. For development and testing, start here.

Quantized checkpoints are official. Unlike most open models where you rely on community quantizations, the Apertus team provides their own quantization-aware distilled checkpoints. Use these over third-party quantizations when available.

Model identifiers reference

Here’s a quick reference for all the HuggingFace model IDs you’ll need:

| Model | HuggingFace ID |
| --- | --- |
| 0.5B Instruct | swiss-ai/Apertus-v1.1-0.5B-Instruct |
| 1.5B Instruct | swiss-ai/Apertus-v1.1-1.5B-Instruct |
| 4B Instruct | swiss-ai/Apertus-v1.1-4B-Instruct |
| 8B Instruct | swiss-ai/Apertus-8B-Instruct-2509 |
| 70B Instruct | swiss-ai/Apertus-70B-Instruct-2509 |
| 8B FP8 (vLLM) | swiss-ai/Apertus-8B-Instruct-2509-vLLM-FP8 |
| 70B FP8 (vLLM) | swiss-ai/Apertus-70B-Instruct-2509-vLLM-FP8 |
| 4B INT4 (MLX) | swiss-ai/Apertus-v1.1-4B-Instruct-MLX-INT4 |
| 8B INT4 (MLX) | swiss-ai/Apertus-8B-Instruct-2509-MLX-INT4 |

FAQ

Can I run Apertus on a MacBook?

Yes. The 0.5B runs on any Mac. The 4B with INT4 quantization uses about 3GB and runs well on any Apple Silicon Mac. The 8B with INT4 is comfortable on M2 Pro/Max with 32GB+ unified memory. The 70B isn’t realistic on a Mac unless you have an M2 Ultra with 192GB.

Do I need to request access or sign up for anything?

No. All models are publicly available on HuggingFace. Just pip install transformers and load the model. No gating, no access requests, no API keys.

How fast is inference on consumer hardware?

On an RTX 4090 with the 4B model at BF16, expect around 50-80 tokens per second. With FP8 quantization, you can push that higher. The 8B model on the same card does roughly 30-50 tokens per second. On an Apple Silicon M3 Max with the INT4 4B model, expect 20-40 tokens per second depending on context length.

Is there an Ollama model for Apertus?

As of June 2026, Apertus isn’t in the default Ollama library. However, you can use Docker’s model runner (docker model run), which supports HuggingFace models directly. You can also convert the model to GGUF format manually for use with llama.cpp or Ollama, but the official quantized checkpoints for MLX or vLLM are a better path.

Can I serve Apertus behind an API for my team?

Absolutely. vLLM gives you a production-ready OpenAI-compatible API with a single command. Point your existing OpenAI SDK code at it and things just work. For higher availability, put it behind a load balancer and run multiple vLLM instances.
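
If you don't have a load balancer in place yet, client-side rotation is a simple stand-in while you test with more than one instance. This is a minimal sketch of that substitute, not a replacement for a real load balancer (it has no health checks or retries). The class name and host names are hypothetical:

```python
from itertools import cycle


class RoundRobinEndpoints:
    """Hand out base URLs in rotation, one per request."""

    def __init__(self, base_urls):
        if not base_urls:
            raise ValueError("at least one base URL is required")
        self._urls = cycle(list(base_urls))

    def next_url(self) -> str:
        return next(self._urls)


# Hypothetical hosts, each running its own `vllm serve` instance.
endpoints = RoundRobinEndpoints(["http://vllm-a:8000/v1", "http://vllm-b:8000/v1"])
picked = [endpoints.next_url() for _ in range(3)]
print(picked)
```

Each value from next_url() can be passed as base_url when constructing the OpenAI client for that request.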

What’s the context window size?

The v1.1 models (0.5B, 1.5B, 4B) support 4096-token sequences. The 8B and 70B models also use 4096 tokens by default. This is shorter than some competitors, so plan accordingly for long-document tasks. You may need a chunking strategy for documents that exceed this limit.
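
Here's one way chunking could look: a sliding window over token IDs with a small overlap so context isn't cut mid-thought. This is a sketch under my own assumptions. The function name and the overlap scheme are illustrative, and with a real model you'd get the IDs from the tokenizer and pick a max_tokens well below 4096 to leave room for the prompt and the generated answer:

```python
def chunk_token_ids(token_ids, max_tokens, overlap=0):
    """Split a list of token IDs into windows of at most max_tokens.

    Consecutive windows share `overlap` tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_tokens])
        if start + max_tokens >= len(token_ids):
            break  # this window already reached the end of the input
    return chunks


ids = list(range(10))  # stand-in for real token IDs
chunks = chunk_token_ids(ids, max_tokens=4, overlap=1)
print(chunks)
```

You can then summarize or query each chunk separately and combine the results in a final pass.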