Falcon, from TII (UAE), is an open-weight model family that runs locally with Ollama. Here's the setup for Falcon 2 and the new Falcon H1R hybrid model.
Which Falcon model to pick
| Model | Size | RAM needed | Best for |
|---|---|---|---|
| Falcon H1R 7B | ~5 GB | 6 GB | Reasoning, math, coding on budget hardware |
| Falcon 2 11B | ~7 GB | 8 GB | General purpose, multilingual |
| Falcon 40B | ~25 GB | 32 GB | Best quality (needs beefy hardware) |
For most developers, Falcon H1R 7B is the best pick: it reportedly outperforms models up to 47B on reasoning benchmarks while fitting in about 6 GB of RAM.
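If you want to script the choice, here is a minimal sketch that maps free RAM to the models that should fit, using the "RAM needed" column from the table above. The function name is made up for illustration, and `falcon-h1r` is the custom tag created later in this guide, not an official Ollama library name.

```shell
# Hypothetical helper: list the Falcon models that fit in the given RAM (whole GB).
# Thresholds come from the "RAM needed" column in the table above.
models_that_fit() {
  ram_gb=$1
  fits=""
  if [ "$ram_gb" -ge 6 ]; then fits="falcon-h1r"; fi          # custom tag, see the GGUF section
  if [ "$ram_gb" -ge 8 ]; then fits="$fits falcon2"; fi
  if [ "$ram_gb" -ge 32 ]; then fits="$fits falcon:40b"; fi
  echo "${fits:-none}"
}

models_that_fit 16   # prints: falcon-h1r falcon2
```

Treat the thresholds as a starting point: other applications and a large context window also compete for the same memory.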
Setup
# Install Ollama
brew install ollama # Mac
# or: curl -fsSL https://ollama.com/install.sh | sh # Linux
# Falcon 2 (11B, general purpose)
ollama pull falcon2
# Falcon 40B (needs 32GB+ RAM)
ollama pull falcon:40b
# Test
ollama run falcon2 "Explain microservices architecture"
Falcon H1R via GGUF
Falcon H1R 7B isn't in the official Ollama library yet, but GGUF files are available from both TII and Unsloth on Hugging Face:
# Download GGUF from HuggingFace
huggingface-cli download tiiuae/Falcon-H1R-7B-GGUF --local-dir ./falcon-h1r
# Or use Unsloth's optimized quantizations
huggingface-cli download unsloth/Falcon-H1R-7B-GGUF --local-dir ./falcon-h1r
Run with llama.cpp:
./llama-cli -m ./falcon-h1r/falcon-h1r-7b-q5_k_m.gguf \
-p "Solve: what is the sum of all prime numbers less than 20?" \
-n 1000
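GGUF repositories do not all name their files the same way, so the exact file name may differ from the one in the command above. A small sketch for locating the quantization you want after the download finishes; `find_gguf` is a hypothetical helper, and it assumes the quantization label (for example `q5_k_m`) appears somewhere in the file name.

```shell
# Hypothetical helper: print the first .gguf file under a directory whose name
# contains the given quantization label (case-insensitive match).
find_gguf() {
  find "$1" -type f -iname "*$2*.gguf" | sort | head -n 1
}

# Usage (assumes the download directory from the commands above):
#   ./llama-cli -m "$(find_gguf ./falcon-h1r q5_k_m)" -p "Hello" -n 100
```

If the helper prints nothing, list the directory and pick the file by hand.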
Run with MLX (Apple Silicon):
TII provides an official local deployment guide for MLX + OpenWebUI, which gives you a clean chat interface on Mac.
Create a custom Ollama model
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./falcon-h1r/falcon-h1r-7b-q5_k_m.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 65536
SYSTEM You are a helpful coding assistant with strong reasoning abilities.
EOF
# Import into Ollama
ollama create falcon-h1r -f Modelfile
# Now use it like any Ollama model
ollama run falcon-h1r "Debug this Python function..."
aider --model ollama/falcon-h1r
The Mamba advantage
Falcon H1R's hybrid Transformer-Mamba architecture has a useful property: the Mamba (State Space Model) component has no theoretical limit on context length. Unlike pure transformers, where attention cost scales quadratically with sequence length, Mamba scales linearly.
In practice this means:
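- 256K context supported out of the box
- Constant-size state in the Mamba layers: for those layers, processing the 200,000th token costs the same as the 1st (the attention layers still keep a KV cache that grows with context)
- A reported 1,500 tok/s per GPU at batch size 64, nearly 2x faster than Qwen3-8B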
This makes Falcon H1R particularly good for tasks that need long context: analyzing large codebases, processing long documents, or multi-turn debugging sessions.
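To put the quadratic versus linear difference in numbers, here is a back-of-the-envelope sketch. It ignores constant factors and assumes the target length is a whole multiple of the baseline; the function name is made up for illustration. Keep in mind that a hybrid model still pays the quadratic cost in its attention layers, so the real picture sits between the two figures.

```shell
# Relative cost of growing the context from a baseline length to a target length.
# Attention compares every pair of tokens (quadratic); an SSM scan visits each
# token once (linear). Integer arithmetic only.
scaling_factor() {
  base=$1
  target=$2
  ratio=$(( target / base ))
  echo "linear=${ratio}x quadratic=$(( ratio * ratio ))x"
}

scaling_factor 4096 262144   # prints: linear=64x quadratic=4096x
```

Going from a 4K to a 256K context is 64 times more work for a linear layer, but roughly 4,096 times more for full attention.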
Hardware requirements
| Hardware | Falcon H1R 7B | Falcon 2 11B | Falcon 40B |
|---|---|---|---|
| MacBook Air M2 8GB | ~25 tok/s | ~18 tok/s | ❌ |
| MacBook Pro M3 16GB | ~30 tok/s | ~25 tok/s | ❌ |
| Mac Mini M4 Pro 48GB | ~35 tok/s | ~30 tok/s | ~12 tok/s |
| RTX 4090 24GB | ~45 tok/s | ~35 tok/s | ❌ (not enough VRAM) |
For Falcon 40B and other models that exceed your local VRAM, cloud GPU providers offer 48GB+ GPU instances on demand.
See our VRAM guide for exact memory calculations and our GPU vs CPU guide for when you need a GPU.
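For a quick sanity check before downloading, the weight size can be approximated as parameters (in billions) times bits per weight, divided by 8. The sketch below is a rough estimate only: the bits-per-weight figures in the examples are assumptions about typical quantization levels (not published numbers for these files), and the result excludes the KV cache and runtime overhead.

```shell
# Rough estimate of quantized weight size in GB:
#   size_gb ~= params_in_billions * bits_per_weight / 8
# Excludes KV cache, activations, and runtime overhead.
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 5.5    # prints: 4.8   (7B at an assumed ~5.5 bits per weight)
estimate_gb 11 4.5   # prints: 6.2   (11B at an assumed ~4.5 bits per weight)
estimate_gb 40 4.5   # prints: 22.5  (40B at an assumed ~4.5 bits per weight)
```

These land a little below the sizes in the first table, which is expected since actual files carry extra metadata and some tensors are stored at higher precision.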
Connect to coding tools
Aider
aider --model ollama/falcon2
Continue.dev (VS Code)
{
"models": [{
"title": "Falcon 2 Local",
"provider": "ollama",
"model": "falcon2"
}]
}
OpenCode
opencode --provider ollama --model falcon2
Falcon vs other local models at similar sizes
| Model | Params | Reasoning | Coding | Speed |
|---|---|---|---|---|
| Falcon H1R 7B | 7B | Best at 7B | Good | Fast |
| Falcon 2 11B | 11B | Good | Good | Fast |
| Qwen3 8B | 8B | Good | Good | Fast |
| Yi-Coder 9B | 9B | Decent | Best coding | Fast |
| Gemma 4 9B | 9B | Good | Good | Fast |
Falcon H1R 7B wins on reasoning at this size. Yi-Coder 9B wins on coding. Qwen3 8B is the best all-rounder.
Falcon H1R: the hybrid architecture
What makes Falcon H1R special is its hybrid architecture combining:
- State Space Models (SSM): efficient long-sequence processing, linear scaling
- Traditional attention: precise token-level reasoning
This hybrid approach gives it reasoning capabilities that pure transformer models at 7B can't match. It's similar to how Qwen 3.6 Plus uses hybrid linear attention + MoE, but at a much smaller scale.
Troubleshooting
- Model not found: check the exact name with `ollama list`
- Slow performance: verify the GPU is used with `ollama ps`
- Out of memory: try Falcon H1R 7B instead of Falcon 2 11B
See our Ollama troubleshooting guide for all common errors.
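The "model not found" check can also be scripted. This is a minimal sketch with a hypothetical helper name; it assumes `ollama list` prints a header row followed by one model per line with the name in the first column.

```shell
# Hypothetical helper: read `ollama list` output on stdin and succeed if the
# given model appears in the first column, with or without a ":latest" tag.
has_model() {
  awk -v m="$1" 'NR > 1 && ($1 == m || $1 == m ":latest") { found = 1 } END { exit !found }'
}

# Usage: pull the model only if it is missing.
#   ollama list | has_model falcon2 || ollama pull falcon2
```

The helper only inspects text, so it works just as well on saved output when you are debugging someone else's machine.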