
How to Run Poolside Laguna XS.2 Locally β€” Setup Guide (2026)


Some links in this article are affiliate links. We earn a commission at no extra cost to you when you purchase through them. Full disclosure.

Laguna XS.2 is one of the few coding-specific models you can actually run locally with good performance. At 33B total parameters with only 3B active (Mixture-of-Experts), it fits on consumer hardware while delivering the coding quality of Poolside’s RLCEF training pipeline. The Apache 2.0 license means no restrictions on local use, fine-tuning, or commercial deployment.

This guide covers everything: hardware requirements, downloading the weights, setting up inference with vLLM and llama.cpp, quantization options, and cloud GPU alternatives for when local hardware is not enough.

For background on the model itself, see our Laguna XS.2 complete guide and What is Poolside AI.

Hardware requirements

XS.2 is a 33B parameter MoE model. All 33B parameters need to be loaded into memory, even though only 3B are active during inference. This means memory requirements are closer to a 33B model than a 3B model, but inference speed is closer to a 3B model.
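
As a rough back-of-envelope check, you can estimate weight memory from the parameter count and the bytes per weight at each precision. This sketch ignores KV cache, activations, and runtime overhead, so treat the results as lower bounds:

# Rough weight-memory estimate for a 33B-parameter model.
# Ignores KV cache, activations, and runtime overhead.
PARAMS = 33e9  # total parameters (all experts must be resident)

bytes_per_weight = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in bytes_per_weight.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{precision}: ~{gb:.1f} GB of weights")
# FP16: ~66 GB, INT8: ~33 GB, INT4: ~16.5 GB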

Minimum requirements

Setup | VRAM/RAM | Quantization | Expected speed
NVIDIA RTX 3060 (12 GB) | 12 GB VRAM | INT4 (GPTQ/AWQ) | 30-40 tok/s
NVIDIA RTX 4060 Ti (16 GB) | 16 GB VRAM | INT4 | 40-50 tok/s
Apple M1/M2 (16 GB) | 16 GB unified | Q4_K_M (GGUF) | 25-35 tok/s
CPU only (32 GB RAM) | 32 GB RAM | Q4_K_M | 5-10 tok/s

Recommended setups

Setup | VRAM/RAM | Quantization | Expected speed
NVIDIA RTX 4090 (24 GB) | 24 GB VRAM | INT8 or FP16 | 60-80 tok/s
Apple M3/M4 Pro (36 GB) | 36 GB unified | Q6_K or Q8_0 | 40-55 tok/s
Apple M3/M4 Max (64 GB) | 64 GB unified | FP16 | 50-65 tok/s
2x NVIDIA A100 (80 GB each) | 160 GB VRAM | FP16 | 100+ tok/s

The sweet spot for most developers is an RTX 4090 or an Apple Silicon Mac with 32+ GB unified memory. Both handle XS.2 comfortably with good quantization.

Download the weights

XS.2 weights are available on HuggingFace. You need huggingface-hub installed:

pip install huggingface-hub

# Download the full model (FP16)
huggingface-cli download poolside/laguna-xs.2 --local-dir ./laguna-xs2

# Or download a specific quantized version (if available)
huggingface-cli download poolside/laguna-xs.2-GPTQ --local-dir ./laguna-xs2-gptq

The full FP16 model is approximately 66 GB. Quantized versions are smaller:

  • GPTQ INT4: ~17 GB
  • AWQ INT4: ~17 GB
  • GGUF Q4_K_M: ~19 GB
  • GGUF Q8_0: ~35 GB

Download the quantization format that matches your inference engine: GPTQ/AWQ for vLLM, GGUF for llama.cpp.
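
If you prefer to script the download instead of using the CLI, the same huggingface-hub package exposes snapshot_download. A minimal sketch, using the repo IDs from this guide (adjust them if the published names differ):

from huggingface_hub import snapshot_download

# Download the full FP16 weights (adjust repo_id if the published name differs).
snapshot_download(
    repo_id="poolside/laguna-xs.2",
    local_dir="./laguna-xs2",
)

# Or fetch only a single GGUF file instead of the whole repo.
snapshot_download(
    repo_id="poolside/laguna-xs.2-GGUF",
    allow_patterns=["*Q4_K_M.gguf"],
    local_dir="./models",
)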

Option 1: vLLM

vLLM is the best option for NVIDIA GPUs. It supports MoE models natively, handles batching efficiently, and exposes an OpenAI-compatible API.

Install vLLM

pip install vllm

Requires CUDA 11.8+ and an NVIDIA GPU with compute capability 7.0+.
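
A quick way to confirm your GPU meets that bar is to query PyTorch, assuming you already have a CUDA build of torch installed:

import torch

# Prints the compute capability, e.g. (8, 6) for an RTX 3060.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    print("OK for vLLM" if (major, minor) >= (7, 0) else "GPU too old for vLLM")
else:
    print("No CUDA device visible to PyTorch")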

Start the server

# FP16 (needs 24+ GB VRAM)
vllm serve poolside/laguna-xs.2 \
  --dtype float16 \
  --max-model-len 8192 \
  --port 8000

# INT4 quantized (needs 12+ GB VRAM)
vllm serve poolside/laguna-xs.2-GPTQ \
  --dtype float16 \
  --quantization gptq \
  --max-model-len 8192 \
  --port 8000

# AWQ quantized
vllm serve poolside/laguna-xs.2-AWQ \
  --dtype float16 \
  --quantization awq \
  --max-model-len 8192 \
  --port 8000

Test the server

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "poolside/laguna-xs.2",
    "messages": [
      {"role": "user", "content": "Write a Python decorator that retries a function up to 3 times with exponential backoff."}
    ],
    "max_tokens": 1024
  }'

Use with Python

import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="poolside/laguna-xs.2",
    messages=[
        {"role": "user", "content": "Write a TypeScript function that validates an email address using a regex and returns a typed result."}
    ],
    max_tokens=1024
)

print(response.choices[0].message.content)

For a comparison of inference engines, see our Ollama vs llama.cpp vs vLLM guide.

Option 2: llama.cpp

llama.cpp runs on Apple Silicon, NVIDIA GPUs, and CPU. It uses GGUF format and supports Metal acceleration on Mac.

Install llama.cpp

# macOS with Homebrew
brew install llama.cpp

# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DLLAMA_METAL=ON  # For Mac Metal support
cmake --build build --config Release

Download GGUF weights

huggingface-cli download poolside/laguna-xs.2-GGUF \
  laguna-xs.2-Q4_K_M.gguf \
  --local-dir ./models

Run the server

llama-server \
  --model ./models/laguna-xs.2-Q4_K_M.gguf \
  --ctx-size 8192 \
  --n-gpu-layers 99 \
  --port 8080

The --n-gpu-layers 99 flag offloads all layers to the GPU (Metal on Mac, CUDA on NVIDIA). Reduce this number if you run out of VRAM.
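
If you would rather embed the model in a Python script than run llama-server, the llama-cpp-python bindings expose the same knobs (n_gpu_layers, n_ctx). A minimal sketch, assuming llama-cpp-python is installed with GPU support:

from llama_cpp import Llama

# n_gpu_layers=-1 offloads every layer to the GPU (Metal or CUDA);
# lower it if you run out of VRAM, just like --n-gpu-layers for llama-server.
llm = Llama(
    model_path="./models/laguna-xs.2-Q4_K_M.gguf",
    n_ctx=8192,
    n_gpu_layers=-1,
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that flattens a nested list."}],
    max_tokens=512,
    temperature=0.2,
)
print(result["choices"][0]["message"]["content"])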

Test it

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "laguna-xs.2",
    "messages": [
      {"role": "user", "content": "Write a Rust function that reads a JSON config file and deserializes it into a typed struct."}
    ]
  }'

Option 3: Ollama

If Laguna XS.2 is available in the Ollama library, setup is trivial:

ollama pull laguna-xs.2
ollama run laguna-xs.2

Check the Ollama model library for availability. If it is not listed yet, you can create a custom Modelfile pointing to the GGUF weights:

FROM ./models/laguna-xs.2-Q4_K_M.gguf

PARAMETER temperature 0.2
PARAMETER num_ctx 8192

SYSTEM "You are an expert software engineer. Write clean, correct, well-tested code."
ollama create laguna-xs2 -f Modelfile
ollama run laguna-xs2
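
Ollama also exposes an OpenAI-compatible endpoint on its default port 11434, so the same client code used with vLLM works here too. A quick sanity check (the model name matches the ollama create command above):

import openai

# Ollama's OpenAI-compatible API lives at /v1 on its default port 11434.
client = openai.OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="laguna-xs2",
    messages=[{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}],
    max_tokens=512,
)
print(response.choices[0].message.content)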

Connecting to coding tools

Once you have a local server running (vLLM, llama.cpp, or Ollama), connect it to your coding tools:

Aider

# With vLLM or llama.cpp server
aider --model openai/laguna-xs.2 --openai-api-base http://localhost:8000/v1

# With Ollama
aider --model ollama/laguna-xs2

Continue (VS Code)

Add to your Continue config (~/.continue/config.json):

{
  "models": [
    {
      "title": "Laguna XS.2 (Local)",
      "provider": "openai",
      "model": "laguna-xs.2",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "not-needed"
    }
  ]
}

OpenCode

OPENAI_API_BASE=http://localhost:8000/v1 \
OPENAI_API_KEY=not-needed \
opencode --model laguna-xs.2

Cloud GPU alternatives

If your local hardware cannot handle XS.2, or you want faster inference than your machine provides, cloud GPUs are an option. You rent a GPU by the hour, run the model, and shut it down when you are done.

RunPod

RunPod offers on-demand GPU instances at competitive hourly rates. For XS.2:

  • RTX 4090 (24 GB): Handles INT8 or FP16 comfortably. Good for individual use.
  • A100 (80 GB): Overkill for XS.2 but gives you headroom for larger context windows and batching.
  • H100 (80 GB): Maximum performance if you need it.

RunPod provides Docker templates with vLLM pre-installed, so you can have XS.2 running in minutes:

  1. Create a RunPod account at runpod.io
  2. Launch a GPU pod with the vLLM template
  3. SSH in and start the model: vllm serve poolside/laguna-xs.2 --dtype float16
  4. Connect your local tools to the pod’s IP address (see the sketch below)
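
For step 4, pointing your tools at the pod works the same way as pointing at localhost, just with the pod’s address. A minimal sketch; the host and port below are placeholders for whatever address RunPod assigns your pod:

import openai

# Replace POD_HOST/POD_PORT with the address or proxy URL shown in the RunPod console.
POD_HOST = "your-pod-hostname-or-ip"
POD_PORT = 8000

client = openai.OpenAI(base_url=f"http://{POD_HOST}:{POD_PORT}/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="poolside/laguna-xs.2",
    messages=[{"role": "user", "content": "Write a SQL query that finds duplicate email addresses in a users table."}],
    max_tokens=512,
)
print(response.choices[0].message.content)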

RunPod also supports serverless endpoints β€” deploy XS.2 once and pay only for actual inference time, with automatic scaling to zero when idle.

For a broader comparison of cloud GPU providers, see our best cloud GPU providers guide.

Other cloud GPU options

  • AWS (EC2 with NVIDIA GPUs): More expensive but integrates with your existing AWS infrastructure
  • Lambda Labs: Good availability of A100 and H100 instances
  • Vast.ai: Marketplace model with competitive pricing from individual GPU owners
  • Google Cloud (A100/H100): Enterprise option with GCP integration

Performance tuning

Context length

XS.2 supports various context lengths depending on your deployment. Start with 8192 tokens and increase if needed:

# vLLM β€” increase context
vllm serve poolside/laguna-xs.2 --max-model-len 16384

# llama.cpp β€” increase context
llama-server --model ./laguna-xs.2-Q4_K_M.gguf --ctx-size 16384

Longer context uses more memory. On a 24 GB GPU with INT4 quantization, you can comfortably run 16K context. On 12 GB, stick to 8K.

Temperature for coding

For code generation, use low temperature:

response = client.chat.completions.create(
    model="poolside/laguna-xs.2",
    messages=[{"role": "user", "content": "..."}],
    temperature=0.1,  # Low for deterministic code
    max_tokens=2048
)

Use temperature 0.0-0.2 for code generation and 0.3-0.5 for brainstorming alternative approaches. Never go above 0.7 for code; the output becomes unreliable.

Batch processing

If you are processing multiple files or generating code for multiple functions, vLLM’s batching gives you significant throughput improvements:

# Send multiple requests β€” vLLM batches them automatically
import asyncio
import openai

client = openai.AsyncOpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

async def generate(prompt):
    response = await client.chat.completions.create(
        model="poolside/laguna-xs.2",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024
    )
    return response.choices[0].message.content

async def main():
    prompts = [
        "Write a Python function to parse ISO 8601 dates.",
        "Write a Go function to compute SHA-256 of a file.",
        "Write a TypeScript function to debounce an async callback."
    ]
    results = await asyncio.gather(*[generate(p) for p in prompts])
    for r in results:
        print(r)
        print("---")

asyncio.run(main())

Troubleshooting

Out of memory

If you get CUDA OOM errors:

  1. Use a more aggressive quantization (INT4 instead of INT8)
  2. Reduce --max-model-len to 4096 or 2048
  3. Add --gpu-memory-utilization 0.85 to vLLM to leave headroom
  4. For llama.cpp, reduce --n-gpu-layers to offload some layers to CPU

Slow generation on Mac

If llama.cpp is slow on Apple Silicon:

  1. Ensure Metal is enabled: build with -DLLAMA_METAL=ON
  2. Check that all layers are on GPU: use --n-gpu-layers 99
  3. Use Q4_K_M quantization β€” it is optimized for Metal
  4. Close other GPU-intensive applications

Model not loading

If the model fails to load:

  1. Verify the download completed: check file sizes against HuggingFace
  2. Ensure you have the right format: GGUF for llama.cpp, safetensors for vLLM
  3. Check available memory: the full model needs to fit in VRAM/RAM
  4. Try a smaller quantization if memory is tight

FAQ

Can I run Laguna M.1 locally?

No. M.1 is a 225B parameter model with proprietary weights. Even if the weights were available, you would need multiple high-end GPUs (4-8x A100 80GB or equivalent). Only XS.2 is available for local deployment.

What is the best quantization for XS.2?

For NVIDIA GPUs with 24 GB VRAM: INT8 gives the best quality-to-speed ratio. For 12 GB VRAM: INT4 (GPTQ or AWQ). For Apple Silicon with 32 GB: Q6_K (GGUF). For 16 GB: Q4_K_M (GGUF). Avoid Q2 or Q3 quantizations β€” the quality drop is too significant for coding tasks where precision matters.

How does local XS.2 compare to OpenRouter?

Same model, same outputs. The difference is latency (local is faster for single requests, OpenRouter may be faster for first-token with their infrastructure), privacy (local keeps all code on your machine), and cost (local is free after hardware, OpenRouter is free but requires internet). Use local for privacy-sensitive code and offline work. Use OpenRouter when you do not want to manage infrastructure.

Can I fine-tune XS.2 locally?

Yes, but fine-tuning requires more memory than inference. Full fine-tuning of the 33B model needs ~80 GB of GPU memory. LoRA fine-tuning needs ~24-40 GB depending on rank. QLoRA can work on a single 24 GB GPU. Use a cloud GPU if your local hardware is not sufficient β€” RunPod A100 instances work well for this.
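
A minimal QLoRA sketch with the transformers/peft/bitsandbytes stack is below. The target_modules list is a placeholder, since the exact projection names for XS.2 depend on its architecture, and the repo ID assumes the HuggingFace name used earlier in this guide:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model in 4-bit (NF4) so it fits on a single 24 GB GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "poolside/laguna-xs.2",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("poolside/laguna-xs.2")

# target_modules is a placeholder; inspect model.named_modules() to find
# the actual attention/MLP projection names for this architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()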

Does XS.2 work on Windows?

Yes. vLLM runs on Windows with WSL2 and CUDA. llama.cpp has native Windows builds. Ollama has a Windows installer. The setup is slightly more involved than Mac or Linux but fully supported. Use WSL2 for the most reliable experience with vLLM.

How much disk space do I need?

The full FP16 model is ~66 GB. INT4 quantized versions are ~17 GB. GGUF Q4_K_M is ~19 GB. Plan for at least 20 GB of free disk space for a quantized version, or 70 GB for the full model. Downloads from HuggingFace may temporarily need double the space during extraction.