
How to Run Falcon Models Locally with Ollama (2026)


Falcon from TII (UAE) is fully open source and runs locally with Ollama. Here's the setup for Falcon 2 and the new Falcon H1R hybrid model.

Which Falcon model to pick

| Model | Size | RAM needed | Best for |
|---|---|---|---|
| Falcon H1R 7B | ~5 GB | 6 GB | Reasoning, math, coding on budget hardware |
| Falcon 2 11B | ~7 GB | 8 GB | General purpose, multilingual |
| Falcon 40B | ~25 GB | 32 GB | Best quality (needs beefy hardware) |

For most developers, Falcon H1R 7B is the best pick: it outperforms models up to 47B on reasoning while needing only about 6 GB of RAM, which most modern laptops have.
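As a rough pre-flight check, the RAM figures from the table above can be encoded in a small shell helper. This is only a sketch: the function names `ram_needed_gb` and `fits_in_ram` are made up for illustration, `falcon2` and `falcon:40b` are the Ollama tags used in this article, `falcon-h1r` is the custom model name created later in this guide, and the thresholds are the table's estimates rather than measured values.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: RAM thresholds copied from the table above (estimates).

# Print the RAM (in GB) the article lists for a given model tag.
ram_needed_gb() {
  case "$1" in
    falcon-h1r) echo 6 ;;
    falcon2)    echo 8 ;;
    falcon:40b) echo 32 ;;
    *)          return 1 ;;   # unknown model
  esac
}

# Succeed if MODEL fits into RAM_GB according to the table.
fits_in_ram() {
  local need
  need=$(ram_needed_gb "$1") || return 2
  [ "$2" -ge "$need" ]
}

# Example: fits_in_ram falcon2 16 && echo "falcon2 should fit"
```

The check is deliberately coarse: it ignores memory used by the operating system and by the context window, so treat a "fits" result as a starting point only.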

Setup

# Install Ollama
brew install ollama  # Mac
brew services start ollama  # Mac: start the background server
# or: curl -fsSL https://ollama.com/install.sh | sh  # Linux

# Falcon 2 (11B, general purpose)
ollama pull falcon2

# Falcon 40B (needs 32GB+ RAM)
ollama pull falcon:40b

# Test
ollama run falcon2 "Explain microservices architecture"
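Once a model is pulled, it can also be queried over Ollama's local HTTP API instead of the interactive CLI, which is handy for scripts. The sketch below assumes the server is listening on the default port 11434 and that the prompt contains no double quotes or backslashes; the function names `build_payload` and `ask_ollama` are illustrative, not part of Ollama.

```shell
#!/usr/bin/env bash
# Build a minimal JSON body for Ollama's /api/generate endpoint.
# Assumption: the prompt contains no double quotes or backslashes
# (use a tool such as jq when real escaping is needed).
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

# Send one non-streaming request to a local Ollama server (default port 11434).
ask_ollama() {
  curl -s http://localhost:11434/api/generate -d "$(build_payload "$1" "$2")"
}

# Example: ask_ollama falcon2 "Explain microservices architecture"
```

With `"stream": false` the server returns a single JSON object whose `response` field holds the generated text.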

Falcon H1R via GGUF

Falcon H1R 7B isn't in the official Ollama library yet, but GGUF files are available from both TII and Unsloth on HuggingFace:

# Download GGUF from HuggingFace
huggingface-cli download tiiuae/Falcon-H1R-7B-GGUF --local-dir ./falcon-h1r

# Or use Unsloth's optimized quantizations
huggingface-cli download unsloth/Falcon-H1R-7B-GGUF --local-dir ./falcon-h1r
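The exact GGUF file names inside those repositories may differ from the ones used in the commands that follow, so it can help to look up the file for a given quantization before running anything. A minimal sketch, where `find_gguf` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# List GGUF files under DIR whose name contains QUANT (case-insensitive).
find_gguf() {
  find "$1" -type f -iname "*$2*.gguf" | sort
}

# Example: find_gguf ./falcon-h1r q5_k_m
```

Use the path it prints in place of the file name in the llama.cpp and Modelfile examples if they do not match.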

Run with llama.cpp:

./llama-cli -m ./falcon-h1r/falcon-h1r-7b-q5_k_m.gguf \
  -p "Solve: what is the sum of all prime numbers less than 20?" \
  -n 1000

Run with MLX (Apple Silicon):

TII provides an official local deployment guide for MLX + OpenWebUI, which gives you a clean chat interface on Mac.

Create a custom Ollama model

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./falcon-h1r/falcon-h1r-7b-q5_k_m.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 65536
SYSTEM You are a helpful coding assistant with strong reasoning abilities.
EOF

# Import into Ollama
ollama create falcon-h1r -f Modelfile

# Now use it like any Ollama model
ollama run falcon-h1r "Debug this Python function..."
aider --model ollama/falcon-h1r
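A 65,536-token context window makes Ollama reserve more memory than a smaller one, so on tighter machines it can be worth creating a variant with a lower `num_ctx`. The sketch below generates the same Modelfile with the path and context size as parameters; `make_modelfile`, the 8192 default, and the `falcon-h1r-16k` name are illustrative assumptions, not values recommended by TII or Ollama.

```shell
#!/usr/bin/env bash
# Print a Modelfile for GGUF_PATH with context window NUM_CTX (default 8192).
# The default of 8192 is an arbitrary illustration, not an official recommendation.
make_modelfile() {
  cat << EOF
FROM $1
PARAMETER temperature 0.2
PARAMETER num_ctx ${2:-8192}
SYSTEM You are a helpful coding assistant with strong reasoning abilities.
EOF
}

# Example:
#   make_modelfile ./falcon-h1r/falcon-h1r-7b-q5_k_m.gguf 16384 > Modelfile
#   ollama create falcon-h1r-16k -f Modelfile
```

Keeping the generator in a function makes it easy to create several variants of the same weights and compare memory use with `ollama ps`.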

The Mamba advantage

Falcon H1R's hybrid Transformer-Mamba architecture has a unique property: the Mamba (State Space Model) component has no theoretical limit on context length. Unlike pure transformers where attention scales quadratically with sequence length, Mamba scales linearly.

In practice this means:

  • 256K context supported out of the box
  • Constant per-token cost in the Mamba layers: processing the 200,000th token costs the same as the 1st (the attention layers still keep a key-value cache that grows with context)
  • 1,500 tok/s per GPU at batch size 64, nearly 2x faster than Qwen3-8B

This makes Falcon H1R particularly good for tasks that need long context: analyzing large codebases, processing long documents, or multi-turn debugging sessions.
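To make the linear-versus-quadratic difference concrete, the following sketch compares how cost grows when the context is stretched from one length to another. It is a back-of-the-envelope illustration of the scaling laws only, not a benchmark of Falcon H1R, and `scaling_growth` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Compare how cost grows when context goes from N1 to N2 tokens:
# a linear scan (Mamba-style) grows by N2/N1,
# full attention over the sequence grows by (N2/N1)^2.
scaling_growth() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    r = b / a
    printf "linear x%d, quadratic x%d\n", r, r * r
  }'
}

# Example: scaling_growth 4096 262144
```

Going from a 4K context (4,096 tokens) to 256K (262,144 tokens) prints `linear x64, quadratic x4096`: the same 64-fold increase in length costs 64 times more for a linear component but 4,096 times more for quadratic attention.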

Hardware requirements

| Hardware | Falcon H1R 7B | Falcon 2 11B | Falcon 40B |
|---|---|---|---|
| MacBook Air M2 8GB | ✅ ~25 tok/s | ✅ ~18 tok/s | ❌ |
| MacBook Pro M3 16GB | ✅ ~30 tok/s | ✅ ~25 tok/s | ❌ |
| Mac Mini M4 Pro 48GB | ✅ ~35 tok/s | ✅ ~30 tok/s | ✅ ~12 tok/s |
| RTX 4090 24GB | ✅ ~45 tok/s | ✅ ~35 tok/s | ❌ (VRAM) |

For Falcon 40B and other models that exceed your local VRAM, cloud GPU providers offer 48GB+ GPU instances on demand.

See our VRAM guide for exact memory calculations and GPU vs CPU guide for when you need a GPU.
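To see which row of the table applies to your machine, you can read the installed memory from the shell. A small sketch, assuming Linux exposes `/proc/meminfo` and macOS provides `sysctl hw.memsize`; `total_ram_gb` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print total system RAM in whole GB (rounded down).
total_ram_gb() {
  if [ -r /proc/meminfo ]; then
    # Linux: MemTotal is reported in kB
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
  else
    # macOS: hw.memsize is reported in bytes
    echo $(( $(sysctl -n hw.memsize) / 1024 / 1024 / 1024 ))
  fi
}

# Example: echo "This machine has about $(total_ram_gb) GB of RAM"
# On NVIDIA GPUs, VRAM can be read with:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader
```

On Linux the reported value is usually slightly below the installed amount because the kernel reserves some memory, so an 8 GB machine may print 7.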

Connect to coding tools

Aider

aider --model ollama/falcon2

Continue.dev (VS Code)

{
  "models": [{
    "title": "Falcon 2 Local",
    "provider": "ollama",
    "model": "falcon2"
  }]
}

OpenCode

opencode --provider ollama --model falcon2

Falcon vs other local models at similar sizes

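| Model | Params | Reasoning | Coding | Speed |
|---|---|---|---|---|
| Falcon H1R 7B | 7B | ✅ Best at 7B | Good | Fast |
| Falcon 2 11B | 11B | Good | Good | Fast |
| Qwen3 8B | 8B | Good | Good | Fast |
| Yi-Coder 9B | 9B | Decent | ✅ Best coding | Fast |
| Gemma 4 9B | 9B | Good | Good | Fast |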

Falcon H1R 7B wins on reasoning at this size, Yi-Coder 9B wins on coding, and Qwen3 8B is the best all-rounder.

Falcon H1R: the hybrid architecture

What makes Falcon H1R special is its hybrid architecture combining:

  • State Space Models (SSM): efficient long-sequence processing, linear scaling
  • Traditional attention: precise token-level reasoning

This hybrid approach gives it reasoning capabilities that pure transformer models at 7B can't match. It's similar to how Qwen 3.6 Plus uses hybrid linear attention + MoE, but at a much smaller scale.

Troubleshooting

  • Model not found: check exact name: ollama list
  • Slow performance: verify GPU is used: ollama ps
  • Out of memory: try Falcon H1R 7B instead of Falcon 2 11B

See our Ollama troubleshooting guide for all common errors.
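The "model not found" check can be scripted by filtering the output of `ollama list`. The sketch below assumes the usual layout of that output (a header row, then one model per line with the name in the first column); `has_model` is an illustrative helper, not an Ollama command.

```shell
#!/usr/bin/env bash
# Read `ollama list` output on stdin and succeed if MODEL is installed.
# Matches either the full tag (falcon2:latest) or the bare name (falcon2).
has_model() {
  awk -v m="$1" '
    NR > 1 {
      split($1, parts, ":")          # parts[1] is the name without the tag
      if ($1 == m || parts[1] == m) found = 1
    }
    END { exit !found }
  '
}

# Example: ollama list | has_model falcon2 || ollama pull falcon2
```

Because the helper only reads text from standard input, it can be tried on saved output without a running Ollama server.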
