May 20, 2026 · 5 min read

How to Run Microsoft Fara-7B Locally — Complete Setup Guide

Fara-7B is Microsoft’s open-source Computer Use Agent — a 7B model that can browse the web autonomously from screenshots. Here’s how to run it on your own hardware.

Hardware requirements

Setup	VRAM	RAM	Speed
bf16 (unquantized)	16GB	32GB	Fast
Q8 quantized	10GB	16GB	Fast
Q4 quantized	6GB	16GB	Medium
CPU-only (GGUF Q4)	—	16GB	Slow

Recommended: NVIDIA RTX 4090, A6000, or any GPU with 16GB+ VRAM. Apple Silicon Macs with 16GB+ unified memory also work via llama.cpp.
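As a rough sanity check on the table above, the weights alone take about parameter count times bytes per parameter. The sketch below assumes a flat 7 billion parameters and idealized bit widths; real quantized files use slightly more bits per weight, and the KV cache, vision inputs, and runtime overhead come on top, which is why the VRAM column is higher than these figures.

```python
# Rough weight-memory estimate: parameter count times bytes per parameter.
# The bytes-per-parameter values are idealized approximations, not measured sizes.
BYTES_PER_PARAM = {
    "bf16": 2.0,  # 16 bits per weight
    "q8": 1.0,    # about 8 bits per weight
    "q4": 0.5,    # about 4 bits per weight
}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Return the approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: ~{weight_memory_gb(7e9, precision):.1f} GB of weights")
```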

Method 1: Official setup (vLLM)

This is Microsoft’s recommended approach. It requires a Linux machine with an NVIDIA GPU.

# Clone the Fara repository
git clone https://github.com/microsoft/fara.git
cd fara

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install Fara and dependencies
pip install -e .
playwright install

# Install vLLM (the quotes stop the shell from treating >= as an output redirect)
pip install "vllm>=0.10.0"

Start the model server:

vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto
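Before running a task, it can help to confirm the server is actually serving the model. vLLM exposes an OpenAI-compatible API, so a minimal check is to list the models at /v1/models. The helper names below are illustrative, and the base URL assumes the port 5000 used in the command above.

```python
import json
import urllib.request


def list_model_ids(models_payload: dict) -> list:
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_models(base_url: str = "http://localhost:5000") -> dict:
    """Fetch the model list from a running OpenAI-compatible server."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return json.load(resp)
```

With the server running, `print(list_model_ids(fetch_models()))` should include `microsoft/Fara-7B`. A connection error usually means the server is still loading weights or is listening on a different port.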

If you run out of VRAM on a single GPU:

# Use tensor parallelism across 2 GPUs
vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto --tensor-parallel-size 2

Run a task:

# Simple task
fara-cli --task "search for the cheapest flight from NYC to London next Tuesday"

# With a specific starting URL
fara-cli --task "find the return policy" --url "https://example-store.com"

The CLI opens a browser, takes screenshots, sends them to the model, and executes the predicted actions.

Method 2: Ollama (easiest)

If you just want to chat with the model or test it quickly:

# Pull the GGUF quantized version (Hugging Face repos need the hf.co/ prefix)
ollama pull hf.co/bartowski/microsoft_Fara-7B-GGUF

# Or a specific quantization
ollama pull hf.co/bartowski/microsoft_Fara-7B-GGUF:Q6_K_L

Note: Ollama gives you the raw model for text/image inference, but doesn’t include the browser automation framework. For actual computer use, you need the official setup (Method 1) or a custom integration.
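For a quick image test against Ollama, you can build a request for its /api/chat endpoint, which accepts base64-encoded images in an `images` list on a message. This is a minimal sketch: the function name is illustrative, the model name you pass must match whatever `ollama list` shows on your machine, and whether image input works depends on the GGUF you pulled.

```python
import base64


def build_ollama_chat_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/chat request body with one attached image."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "stream": False,  # return one JSON object instead of a stream of chunks
        "messages": [
            {"role": "user", "content": prompt, "images": [image_b64]},
        ],
    }
```

Send the payload with `requests.post("http://localhost:11434/api/chat", json=payload)`; 11434 is Ollama's default port.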

Method 3: llama.cpp (Mac/CPU)

For Apple Silicon Macs or CPU-only setups:

# Download a GGUF quantization
# Q6_K_L (6.5GB) — recommended quality/size balance
wget https://huggingface.co/bartowski/microsoft_Fara-7B-GGUF/resolve/main/Fara-7B-Q6_K_L.gguf

# Run with llama.cpp
./llama-server -m Fara-7B-Q6_K_L.gguf -c 4096 --port 5000

Quantization options:

File	Size	Quality	Use case
Q8_0	8.1GB	Near-perfect	If you have the VRAM
Q6_K_L	6.5GB	Excellent	Recommended default
Q6_K	6.3GB	Very good	Slightly smaller
Q4_K_M	4.5GB	Good	8GB VRAM/RAM constrained
Q4_K_S	4.2GB	Acceptable	Minimum viable quality
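To turn the table into a quick decision, here is a small sketch that picks the largest file fitting your memory. The sizes are copied from the table above; the 2GB headroom for context and runtime overhead is an assumption you should tune for your setup.

```python
from typing import Optional

# File sizes in GB, copied from the quantization table.
QUANT_SIZES_GB = {
    "Q8_0": 8.1,
    "Q6_K_L": 6.5,
    "Q6_K": 6.3,
    "Q4_K_M": 4.5,
    "Q4_K_S": 4.2,
}


def pick_quant(memory_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest quantization that fits, or None if nothing fits.

    headroom_gb is an assumed allowance for context and runtime overhead.
    """
    budget = memory_gb - headroom_gb
    fitting = {name: size for name, size in QUANT_SIZES_GB.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(pick_quant(8.0))   # Q4_K_M
print(pick_quant(16.0))  # Q8_0
```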

Method 4: Docker (sandboxed, recommended for safety)

Microsoft provides a Docker-based setup via Magentic-UI for safe web execution:

# Clone and run with Docker
git clone https://github.com/microsoft/fara.git
cd fara

# Build the Docker image
docker build -t fara-agent .

# Run with GPU access
docker run --gpus all -p 5000:5000 fara-agent

This is the safest approach — the browser runs inside the container, isolated from your host system.

Connecting to a browser

Fara-7B needs a browser to interact with. The official setup uses Playwright:

import base64

import requests
from playwright.sync_api import sync_playwright

MAX_STEPS = 50  # cap the loop so a stuck agent cannot run forever

def run_task(task: str, start_url: str = "https://google.com"):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # headless=False to watch
        page = browser.new_page()
        page.goto(start_url)

        for _ in range(MAX_STEPS):
            # Take a screenshot (PNG bytes) and base64-encode it for the data URL
            screenshot = page.screenshot()
            screenshot_b64 = base64.b64encode(screenshot).decode("ascii")

            # Send to Fara-7B for the next action
            response = requests.post(
                "http://localhost:5000/v1/chat/completions",
                json={
                    "model": "microsoft/Fara-7B",
                    "messages": [
                        {"role": "system", "content": f"Task: {task}"},
                        {"role": "user", "content": [
                            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}}
                        ]}
                    ]
                },
                timeout=120,
            )
            response.raise_for_status()

            # parse_action is not defined here: supply your own helper that turns the
            # model's reply into a dict such as {"type": "left_click", "x": 640, "y": 360}
            action = parse_action(response.json())

            if action["type"] == "terminate":
                break
            elif action["type"] == "left_click":
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            # ... handle other actions

        browser.close()
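The loop above calls parse_action without defining it. Here is one possible sketch. It assumes the model's reply text contains a single JSON object with a "type" key, which is an assumption for illustration and not the documented Fara-7B output format, so check the GitHub repo for the real template before relying on it.

```python
import json


def parse_action(completion: dict) -> dict:
    """Pull the first JSON object out of an OpenAI-style chat completion.

    Assumes the reply embeds an action like {"type": "left_click", "x": 1, "y": 2}.
    """
    text = completion["choices"][0]["message"]["content"]
    start = text.find("{")
    if start == -1:
        raise ValueError(f"No JSON object found in model reply: {text!r}")
    # raw_decode parses one JSON value and ignores any trailing text.
    action, _ = json.JSONDecoder().raw_decode(text[start:])
    if "type" not in action:
        raise ValueError(f"Action is missing a 'type' field: {action!r}")
    return action
```

Raising on malformed replies, instead of guessing, keeps the agent from clicking somewhere arbitrary when the model output drifts from the expected shape.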

Performance tips

Use bf16 precision — Fara-7B was trained in bf16. Using fp16 or lower can degrade action accuracy.
Keep context short — Only send the last 3-5 screenshots as history. The full 128K context is rarely needed.
Set a viewport size — 1280×720 or 1920×1080. Consistent resolution helps the model predict coordinates accurately.
Use headless mode for speed — headless=True runs the browser without a visible window, which trims overhead from each step of the loop.
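The "keep context short" tip can be sketched as a small history filter. It assumes your messages use the OpenAI-style format from the browser loop, where screenshots arrive as `image_url` content parts; the function names and the default of 3 are illustrative.

```python
def has_image(message: dict) -> bool:
    """True if the message carries at least one image_url content part."""
    content = message.get("content")
    return isinstance(content, list) and any(
        part.get("type") == "image_url" for part in content
    )


def trim_history(messages: list, max_screenshots: int = 3) -> list:
    """Drop all but the most recent screenshot messages, keeping text-only ones."""
    image_indices = [i for i, m in enumerate(messages) if has_image(m)]
    if max_screenshots > 0:
        to_drop = set(image_indices[:-max_screenshots])
    else:
        to_drop = set(image_indices)
    return [m for i, m in enumerate(messages) if i not in to_drop]
```

To pin the resolution, Playwright's `browser.new_page(viewport={"width": 1280, "height": 720})` sets the viewport at page creation.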

Safety considerations

Running an AI agent that can click, type, and navigate the web requires caution:

Sandbox it — Use Docker or a VM. Don’t run on your main browser session.
URL allowlist — Restrict which domains the agent can visit.
Watch for critical points — Fara-7B is trained to pause before purchases, logins, and form submissions. Respect these pauses.
Set a step limit — Cap the number of actions (e.g., 50) to prevent infinite loops.
Don’t store credentials — Let the agent ask you to type passwords rather than providing them in the prompt.
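Two of these points, the URL allowlist and the step limit, are easy to sketch in code. The domain below is a placeholder borrowed from the earlier CLI example, and the helper name is illustrative.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example-store.com"}  # placeholder: list the sites you trust
MAX_STEPS = 50  # upper bound on actions per task


def is_allowed(url: str, allowed_domains=ALLOWED_DOMAINS) -> bool:
    """Allow a URL only if its host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )
```

One way to enforce the allowlist in Playwright is request routing, for example `page.route("**/*", lambda route: route.continue_() if is_allowed(route.request.url) else route.abort())`. MAX_STEPS can bound the agent loop with `for step in range(MAX_STEPS)`.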

Troubleshooting

“CUDA out of memory”: Use --tensor-parallel-size 2 to split the model across two GPUs with vLLM, or switch to a Q4 GGUF quantization with Ollama or llama.cpp (Methods 2 and 3).

Actions are inaccurate (clicking wrong spots): Ensure your screenshot resolution matches what the model expects. Use 1280×720 or 1920×1080. Avoid scaling/DPI issues.
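A common cause of offset clicks is a HiDPI display: the screenshot is captured in device pixels while Playwright's mouse works in viewport (CSS) pixels. Here is a minimal sketch of the conversion, assuming the model returns coordinates in screenshot pixels; the function name is illustrative.

```python
def scale_point(x: float, y: float, screenshot_size: tuple, viewport_size: tuple) -> tuple:
    """Map a point from screenshot pixels to viewport (CSS) pixels."""
    shot_w, shot_h = screenshot_size
    view_w, view_h = viewport_size
    return x * view_w / shot_w, y * view_h / shot_h


# A 2x display: the screenshot is 2560x1440 but the viewport is 1280x720.
print(scale_point(1280, 720, (2560, 1440), (1280, 720)))  # (640.0, 360.0)
```

If the screenshot and viewport sizes already match, the function returns the point unchanged, so it is safe to apply on every click.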

Model outputs gibberish instead of actions: Make sure you’re using the correct prompt format (system message with task + image input). Check the GitHub repo for the exact template.

FAQ

Can I run Fara-7B on a Mac?

Yes, via llama.cpp with GGUF quantization. A 16GB M2/M3/M4 Mac runs the Q4 version comfortably. You’ll need to build your own browser integration since the official CLI targets Linux + NVIDIA.

Does it work with Firefox or just Chrome?

The official setup uses Chromium via Playwright, but since Fara-7B works from screenshots, it’s browser-agnostic. You can use any browser — just capture screenshots and execute actions programmatically.

Can I fine-tune it for my specific website?

Yes. The MIT license allows fine-tuning. Capture trajectories of tasks on your site, format them as training data, and fine-tune with QLoRA. This can significantly improve accuracy for your specific UI.
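As a starting point for capturing trajectories, here is a sketch that serializes each step to one JSON line. The record schema (task, step, screenshot_path, action) is hypothetical; your fine-tuning tooling will dictate the real format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Step:
    screenshot_path: str  # where the screenshot for this step was saved
    action: dict          # the action taken, e.g. {"type": "left_click", ...}


def trajectory_to_jsonl(task: str, steps: list) -> str:
    """Serialize one recorded task as JSON Lines, one record per step."""
    lines = []
    for index, step in enumerate(steps):
        record = {"task": task, "step": index, **asdict(step)}
        lines.append(json.dumps(record))
    return "\n".join(lines)
```

Storing screenshot paths instead of inline image data keeps the JSONL file small and easy to inspect.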

How fast is it per action?

On an A100: ~1-2 seconds per action (screenshot → inference → action). On a 4090: ~2-3 seconds. On CPU: 10-15 seconds. A typical 10-step task therefore takes roughly 10-30 seconds on GPU, depending on the card.
