May 8, 2026 · 4 min read

How to Run Falcon Models Locally with Ollama (2026)

Falcon from TII (UAE) is an open-weight model family that runs locally with Ollama (check each model’s license for the exact terms). Here’s the setup for Falcon 2 and the new Falcon H1R hybrid model.

Which Falcon model to pick

Model	Size	RAM needed	Best for
Falcon H1R 7B	~5 GB	6 GB	Reasoning, math, coding on budget hardware
Falcon 2 11B	~7 GB	8 GB	General purpose, multilingual
Falcon 40B	~25 GB	32 GB	Best quality (needs beefy hardware)

For most developers, Falcon H1R 7B is the best pick: it is reported to outperform models up to 47B on reasoning benchmarks while fitting in about 6 GB of RAM, which most modern laptops can spare.

Setup

# Install Ollama
brew install ollama  # Mac
brew services start ollama  # Mac: start the background server (or run: ollama serve)
# or: curl -fsSL https://ollama.com/install.sh | sh  # Linux

# Falcon 2 (11B, general purpose)
ollama pull falcon2

# Falcon 40B (needs 32GB+ RAM)
ollama pull falcon:40b

# Test
ollama run falcon2 "Explain microservices architecture"
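Besides the CLI, a running Ollama server also exposes an HTTP API, which is handy for scripting. The sketch below assumes the server is listening on its default address (http://localhost:11434) and that falcon2 has already been pulled. The helper names generate_payload and ask_falcon are hypothetical, and the payload builder does no JSON escaping, so keep prompts free of double quotes and backslashes.

```shell
# Build the JSON body for Ollama's /api/generate endpoint.
# $1 = model name, $2 = prompt (assumed to contain no double quotes or backslashes).
generate_payload() {
  printf '{"model": "%s", "prompt": "%s", "stream": false}' "$1" "$2"
}

# Send a single non-streaming request to a local Ollama server (default port assumed).
ask_falcon() {
  curl -s http://localhost:11434/api/generate -d "$(generate_payload "$1" "$2")"
}

# Example (requires a running server):
#   ask_falcon falcon2 "Explain microservices architecture"
```

The reply is a single JSON object; the generated text is in its response field.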

Falcon H1R via GGUF

Falcon H1R 7B isn’t in the official Ollama library yet, but GGUF files are available from both TII and Unsloth on HuggingFace:

# Download GGUF from HuggingFace
huggingface-cli download tiiuae/Falcon-H1R-7B-GGUF --local-dir ./falcon-h1r

# Or use Unsloth's optimized quantizations
huggingface-cli download unsloth/Falcon-H1R-7B-GGUF --local-dir ./falcon-h1r

Run with llama.cpp:

# Adjust the filename to match the quantization you actually downloaded
./llama-cli -m ./falcon-h1r/falcon-h1r-7b-q5_k_m.gguf \
  -p "Solve: what is the sum of all prime numbers less than 20?" \
  -n 1000

Run with MLX (Apple Silicon):

TII provides an official local deployment guide for MLX + OpenWebUI, which gives you a clean chat interface on Mac.

Create a custom Ollama model

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./falcon-h1r/falcon-h1r-7b-q5_k_m.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 65536
SYSTEM You are a helpful coding assistant with strong reasoning abilities.
EOF

# Import into Ollama
ollama create falcon-h1r -f Modelfile

# Now use it like any Ollama model
ollama run falcon-h1r "Debug this Python function..."

# Or point Aider at it
aider --model ollama/falcon-h1r
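A 65,536-token window reserves noticeably more memory than a short one, so on small machines it can help to keep a few variants of the model with different num_ctx values. A minimal sketch, assuming a hypothetical make_modelfile helper and the same GGUF path used with llama.cpp above; the 8192 default is an arbitrary illustrative choice:

```shell
# Print a Modelfile for a local GGUF file.
# $1 = path to the GGUF file, $2 = context window in tokens (defaults to 8192).
make_modelfile() {
  gguf_path="$1"
  num_ctx="${2:-8192}"
  cat << EOF
FROM $gguf_path
PARAMETER temperature 0.2
PARAMETER num_ctx $num_ctx
SYSTEM You are a helpful coding assistant with strong reasoning abilities.
EOF
}

# Example: write a 16K-context variant, then import it under its own name.
#   make_modelfile ./falcon-h1r/falcon-h1r-7b-q5_k_m.gguf 16384 > Modelfile.16k
#   ollama create falcon-h1r-16k -f Modelfile.16k
```

Each variant shows up as a separate entry in ollama list, so you can switch between a small and a large context window by model name.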

The Mamba advantage

Falcon H1R’s hybrid Transformer-Mamba architecture changes how cost grows with context. In a pure transformer, attention compute scales quadratically with sequence length. The Mamba (State Space Model) component scales linearly and carries a fixed-size state, so it has no built-in limit on context length.

In practice this means:

256K context supported out of the box
Fixed-size state in the Mamba layers: processing the 200,000th token costs about the same as the 1st there (the attention layers still keep a cache that grows with context)
A reported 1,500 tok/s per GPU at batch size 64: nearly 2x faster than Qwen3-8B

This makes Falcon H1R particularly good for tasks that need long context: analyzing large codebases, processing long documents, or multi-turn debugging sessions.
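To see why context length matters so much for attention, it helps to work out how large a key/value cache gets. The sketch below uses a hypothetical kv_cache_bytes helper and a made-up configuration (32 layers, 8 KV heads, head dimension 128, 16-bit values); these are illustrative numbers, not Falcon H1R's actual layout.

```shell
# Bytes of KV cache a pure-attention model holds for a given context length.
# Args: tokens layers kv_heads head_dim bytes_per_value
# The factor 2 covers keys plus values.
kv_cache_bytes() {
  echo $(( 2 * $1 * $2 * $3 * $4 * $5 ))
}

# Illustrative configuration only (NOT Falcon H1R's real layout):
# 32 layers, 8 KV heads, head dimension 128, 16-bit values.
kv_cache_bytes 1 32 8 128 2        # bytes for one token
kv_cache_bytes 200000 32 8 128 2   # bytes for a 200,000-token context
```

With these assumed numbers the cache costs 128 KiB per token, which adds up to roughly 24 GiB at 200,000 tokens. A state space layer carries a fixed-size state instead, so the less a hybrid model relies on attention caches, the more slowly its memory grows with context.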

Hardware requirements

Hardware	Falcon H1R 7B	Falcon 2 11B	Falcon 40B
MacBook Air M2 8GB	✅ ~25 tok/s	✅ ~18 tok/s	❌
MacBook Pro M3 16GB	✅ ~30 tok/s	✅ ~25 tok/s	❌
Mac Mini M4 Pro 48GB	✅ ~35 tok/s	✅ ~30 tok/s	✅ ~12 tok/s
RTX 4090 24GB	✅ ~45 tok/s	✅ ~35 tok/s	❌ (VRAM)

For Falcon 40B and other models that exceed your local VRAM, cloud GPU providers offer 48GB+ GPU instances on demand.

See our VRAM guide for exact memory calculations and our GPU vs CPU guide for when you need a GPU.
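The sizes in the tables can be sanity-checked with a back-of-the-envelope rule: file size is roughly parameter count times average bits per weight, divided by 8. The sketch below assumes a hypothetical est_model_gb helper and illustrative bit widths (around 5 to 5.5 bits per weight for 5-bit quantizations). Real GGUF files differ somewhat, and the context cache needs memory on top.

```shell
# Rough size in GB of a quantized model: parameters x bits per weight / 8.
# $1 = parameters in billions, $2 = average bits per weight (assumed value).
est_model_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

est_model_gb 7 5.5    # 7B at an assumed ~5.5 bits per weight
est_model_gb 40 5     # 40B at an assumed ~5 bits per weight
```

Under those assumptions the estimates land close to the ~5 GB and ~25 GB figures listed for Falcon H1R 7B and Falcon 40B, which is why the 40B model does not fit in 24 GB of VRAM.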

Connect to coding tools

Aider

aider --model ollama/falcon2

Continue.dev (VS Code)

{
  "models": [{
    "title": "Falcon 2 Local",
    "provider": "ollama",
    "model": "falcon2"
  }]
}

OpenCode

opencode --provider ollama --model falcon2

Falcon vs other local models at similar sizes

Model	Params	Reasoning	Coding	Speed
Falcon H1R 7B	7B	✅ Best at 7B	Good	Fast
Falcon 2 11B	11B	Good	Good	Fast
Qwen3 8B	8B	Good	Good	Fast
Yi-Coder 9B	9B	Decent	✅ Best coding	Fast
Gemma 4 9B	9B	Good	Good	Fast

Falcon H1R 7B wins on reasoning at this size. Yi-Coder 9B wins on coding. Qwen3 8B is the best all-rounder.

Falcon H1R: the hybrid architecture

What makes Falcon H1R special is its hybrid architecture combining:

State Space Models (SSM) — efficient long-sequence processing, linear scaling
Traditional attention — precise token-level reasoning

This hybrid approach is what gives it reasoning results that pure transformer models at 7B struggle to match. It’s similar to how Qwen 3.6 Plus uses hybrid linear attention + MoE, but at a much smaller scale.

Troubleshooting

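Model not found: check the exact name with ollama list
Slow performance: check whether the GPU is in use with ollama ps (the PROCESSOR column shows the CPU/GPU split)
Out of memory: switch from Falcon 2 11B to Falcon H1R 7B, or lower num_ctx in your Modelfile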

See our Ollama troubleshooting guide for all common errors.
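The "model not found" check is easy to script. A minimal sketch, assuming a hypothetical has_model helper that reads ollama list output on stdin and matches the model name before the tag (so falcon2 matches falcon2:latest):

```shell
# Succeeds if the model name ($1, without tag) appears in `ollama list`
# output supplied on stdin. The first line is treated as the header row.
has_model() {
  awk -v name="$1" '
    NR > 1 { split($1, parts, ":"); if (parts[1] == name) found = 1 }
    END    { exit !found }
  '
}

# Example (requires Ollama to be installed and running):
#   ollama list | has_model falcon2 || ollama pull falcon2
```

Matching on the full name rather than a prefix keeps falcon from being mistaken for falcon2.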