
How to Run Microsoft Fara-7B Locally: Complete Setup Guide


Fara-7B is Microsoft's open-source Computer Use Agent, a 7B model that can browse the web autonomously from screenshots. Here's how to run it on your own hardware.

Hardware requirements

| Setup | VRAM | RAM | Speed |
|---|---|---|---|
| bf16 (unquantized) | 16GB | 32GB | Fast |
| Q8 quantized | 10GB | 16GB | Fast |
| Q4 quantized | 6GB | 16GB | Medium |
| CPU-only (GGUF Q4) | None | 16GB | Slow |

Recommended: NVIDIA RTX 4090, A6000, or any GPU with 16GB+ VRAM. Apple Silicon Macs with 16GB+ unified memory also work via llama.cpp.
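The VRAM column follows from simple arithmetic. Weights take roughly parameters × bits per weight ÷ 8 bytes, and the KV cache adds a per-token cost that grows with context length. The sketch below assumes Fara-7B keeps the layout of its Qwen2.5-VL-7B base model (about 7.6B language-model parameters, 28 layers, 4 KV heads, head dimension 128). Those numbers are assumptions, not values from Fara's documentation, and the estimate leaves out the vision encoder and framework overhead.

```python
# ASSUMPTION: architecture numbers below are those of the Qwen2.5-VL-7B base
# model's language backbone; the vision encoder is not included.
LM_PARAMS = 7.6e9
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 28, 4, 128


def weights_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def kv_cache_gb(context_tokens: int, num_layers: int = NUM_LAYERS,
                num_kv_heads: int = NUM_KV_HEADS, head_dim: int = HEAD_DIM,
                bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB: a key and a value vector per layer,
    per KV head, per token, each value stored in bytes_per_value bytes."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9


# Q8_0 stores 8.5 bits per weight and Q4_K blocks 4.5 (block scales included)
for label, bits in [("bf16", 16), ("Q8_0", 8.5), ("Q4_K", 4.5)]:
    total = weights_gb(LM_PARAMS, bits) + kv_cache_gb(8_192)
    print(f"{label}: ~{total:.1f} GB with an 8K-token context")
```

Under these assumptions the bf16 weights alone come to about 15.2GB before counting the vision encoder, so a 16GB card is a tight fit, while a 24GB card such as the RTX 4090 leaves room for longer contexts.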

Method 1: Official setup (vLLM)

This is Microsoft's recommended approach. It requires a Linux machine with an NVIDIA GPU.

# Clone the Fara repository
git clone https://github.com/microsoft/fara.git
cd fara

# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install Fara and dependencies
pip install -e .
playwright install

# Install vLLM (quote the version spec so the shell doesn't treat ">" as a redirect)
pip install "vllm>=0.10.0"

Start the model server:

vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto

If you run out of VRAM on a single GPU:

# Use tensor parallelism across 2 GPUs
vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto --tensor-parallel-size 2
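Before running tasks, it helps to confirm the server is reachable. vLLM's OpenAI-compatible server lists the models it serves at /v1/models. The sketch below uses only the standard library and assumes the port 5000 from the commands above; the helper names are illustrative.

```python
import json
import urllib.request


def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def check_server(base_url: str = "http://localhost:5000") -> list[str]:
    """Query the server's /v1/models endpoint and return the served model ids."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return served_model_ids(json.load(resp))
```

Once the server has finished loading, `check_server()` should return a list containing "microsoft/Fara-7B"; a connection error means the server is still starting or listening on a different port.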

Run a task:

# Simple task
fara-cli --task "search for the cheapest flight from NYC to London next Tuesday"

# With a specific starting URL
fara-cli --task "find the return policy" --url "https://example-store.com"

The CLI opens a browser, takes screenshots, sends them to the model, and executes the predicted actions.

Method 2: Ollama (easiest)

If you just want to chat with the model or test it quickly:

# Pull the GGUF quantized version from Hugging Face (Ollama needs the hf.co/ prefix)
ollama pull hf.co/bartowski/microsoft_Fara-7B-GGUF

# Or a specific quantization
ollama pull hf.co/bartowski/microsoft_Fara-7B-GGUF:Q6_K_L

Note: Ollama gives you the raw model for text/image inference, but it doesn't include the browser automation framework. For actual computer use, you need the official setup (Method 1) or a custom integration.
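For a quick test through Ollama, its local REST API accepts base64-encoded images in the `images` field of a /api/chat message. A minimal standard-library sketch follows. It assumes Ollama's default port 11434 and that the pulled GGUF accepts image input; the function names are illustrative. It only returns text, so nothing here clicks or types.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local port


def build_chat_request(model: str, prompt: str, png_bytes: bytes) -> dict:
    """Build a non-streaming /api/chat body carrying one base64-encoded image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(png_bytes).decode("ascii")],
        }],
        "stream": False,  # ask for a single JSON object instead of a stream
    }


def ask_about_screenshot(model: str, prompt: str, png_path: str) -> str:
    """Send a screenshot plus a question to Ollama and return the reply text."""
    with open(png_path, "rb") as f:
        body = build_chat_request(model, prompt, f.read())
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["message"]["content"]
```

Pass the model name exactly as `ollama list` reports it, for example `ask_about_screenshot("hf.co/bartowski/microsoft_Fara-7B-GGUF:Q6_K_L", "What is on this page?", "page.png")`.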

Method 3: llama.cpp (Mac/CPU)

For Apple Silicon Macs or CPU-only setups:

# Download a GGUF quantization
# Q6_K_L (6.5GB): recommended quality/size balance
wget https://huggingface.co/bartowski/microsoft_Fara-7B-GGUF/resolve/main/Fara-7B-Q6_K_L.gguf

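# Run with llama.cpp (serves an OpenAI-compatible API on port 5000)
# Screenshot (image) input also needs the repo's multimodal projector (mmproj) GGUF, passed with --mmproj
./llama-server -m Fara-7B-Q6_K_L.gguf -c 4096 --port 5000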

Quantization options:

| File | Size | Quality | Use case |
|---|---|---|---|
| Q8_0 | 8.1GB | Near-perfect | If you have the VRAM |
| Q6_K_L | 6.5GB | Excellent | Recommended default |
| Q6_K | 6.3GB | Very good | Slightly smaller |
| Q4_K_M | 4.5GB | Good | 8GB VRAM/RAM constrained |
| Q4_K_S | 4.2GB | Acceptable | Minimum viable quality |
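The table can be turned into a quick decision rule: choose the largest file that still leaves memory for the context (KV cache) and compute buffers. This is a sketch; the 2GB headroom default is an assumed allowance, not a measured value, so increase it if you raise the context size.

```python
from typing import Optional

# File sizes (GB) from the quantization table, largest first
QUANT_SIZES_GB = [
    ("Q8_0", 8.1),
    ("Q6_K_L", 6.5),
    ("Q6_K", 6.3),
    ("Q4_K_M", 4.5),
    ("Q4_K_S", 4.2),
]


def pick_quant(memory_gb: float, headroom_gb: float = 2.0) -> Optional[str]:
    """Return the largest quantization whose file plus headroom fits in memory_gb."""
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb + headroom_gb <= memory_gb:
            return name
    return None  # nothing fits: free up memory or lower the headroom deliberately
```

With the default headroom, `pick_quant(8)` returns "Q4_K_M", matching the table's suggestion for 8GB machines, and `pick_quant(16)` returns "Q8_0".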

Method 4: Docker (sandboxed)

Microsoft provides a Docker-based setup via Magentic-UI for safe web execution:

# Clone and run with Docker
git clone https://github.com/microsoft/fara.git
cd fara

# Build the Docker image
docker build -t fara-agent .

# Run with GPU access
docker run --gpus all -p 5000:5000 fara-agent

This is the safest approach: the browser runs inside the container, isolated from your host system.

Connecting to a browser

Fara-7B needs a browser to interact with. The official setup uses Playwright. A simplified screenshot-and-act loop looks like this:

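import base64

import requests
from playwright.sync_api import sync_playwright

FARA_URL = "http://localhost:5000/v1/chat/completions"


def parse_action(completion: dict) -> dict:
    # Placeholder: convert the model reply into an action dict such as
    # {"type": "left_click", "x": 100, "y": 200}, following the output
    # format documented in the Fara repo.
    raise NotImplementedError


def run_task(task: str, start_url: str = "https://google.com", max_steps: int = 50):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # headless=False to watch
        # Fixed viewport so screenshots and click coordinates share one pixel grid
        page = browser.new_page(viewport={"width": 1280, "height": 720})
        page.goto(start_url)

        for _ in range(max_steps):  # step cap guards against infinite loops
            # Take a PNG screenshot and base64-encode it for the data URL
            screenshot_b64 = base64.b64encode(page.screenshot()).decode("ascii")

            # Send to Fara-7B for the next action (simplified prompt; see the repo for the exact template)
            response = requests.post(FARA_URL, json={
                "model": "microsoft/Fara-7B",
                "messages": [
                    {"role": "system", "content": f"Task: {task}"},
                    {"role": "user", "content": [
                        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}}
                    ]}
                ]
            }, timeout=120)
            response.raise_for_status()

            action = parse_action(response.json())

            if action["type"] == "terminate":
                break
            elif action["type"] == "left_click":
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            # ... handle other actions

        browser.close()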
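The loop depends on a `parse_action` helper that turns the model's reply into an action dict. The sketch below is one possible body for it, and its input format is an assumption: it expects a JSON tool call wrapped in `<tool_call>` tags, similar to Qwen2.5-VL-style computer-use agents, with the action name under "action" and click positions under "coordinate". Verify the real format against the Fara repo before relying on it.

```python
import json
import re

# ASSUMPTION: the reply contains a tool call such as
# <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_action(completion: dict) -> dict:
    """Turn an OpenAI-style chat completion into the action dict the loop expects."""
    text = completion["choices"][0]["message"]["content"]
    match = TOOL_CALL_RE.search(text)
    if match is None:
        raise ValueError(f"no tool call found in model output: {text[:200]!r}")
    args = json.loads(match.group(1)).get("arguments", {})

    action = {"type": args.get("action")}
    if "coordinate" in args:  # click position, assumed to be in screenshot pixels
        action["x"], action["y"] = args["coordinate"]
    if "text" in args:  # text to type
        action["text"] = args["text"]
    return action
```

Raising on a missing tool call, rather than guessing, makes prompt-format problems surface immediately instead of producing random clicks.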

Performance tips

  1. Use bf16 precision: Fara-7B was trained in bf16. Using fp16 or lower can degrade action accuracy.
  2. Keep context short: only send the last 3-5 screenshots as history. The full 128K context is rarely needed.
  3. Set a viewport size: 1280×720 or 1920×1080. A consistent resolution helps the model predict coordinates accurately.
  4. Use headless mode for speed: headless=True runs the browser without a visible window, which can make the loop faster.
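Tips 2 and 3 interact, because each screenshot costs visual tokens in proportion to its resolution. The sketch below estimates that cost and keeps a bounded history with `collections.deque`. It assumes Fara-7B inherits the image processing of its Qwen2.5-VL base (one visual token per 28×28-pixel block, images snapped to multiples of 28, and an assumed pixel cap), so treat the numbers as estimates.

```python
import math
from collections import deque

# ASSUMPTION: Qwen2.5-VL-style processing, one visual token per 28x28 block
PATCH = 28
MAX_PIXELS = 12_845_056  # assumed processor cap (16384 * 28 * 28)


def image_tokens(width: int, height: int) -> int:
    """Estimate visual tokens for one screenshot under the assumptions above."""
    w = max(PATCH, round(width / PATCH) * PATCH)
    h = max(PATCH, round(height / PATCH) * PATCH)
    if w * h > MAX_PIXELS:  # shrink proportionally to stay under the cap
        scale = math.sqrt(width * height / MAX_PIXELS)
        w = math.floor(width / scale / PATCH) * PATCH
        h = math.floor(height / scale / PATCH) * PATCH
    return (w // PATCH) * (h // PATCH)


# Keep only the most recent screenshots; older ones drop off automatically
history = deque(maxlen=5)
for step in range(8):
    history.append(f"screenshot-{step}.png")

print(len(history), history[0])
print(image_tokens(1280, 720), image_tokens(1920, 1080))
```

Under these assumptions a 1280×720 screenshot costs about 1,196 tokens and a 1920×1080 one about 2,691, so five 720p screenshots already need roughly 6,000 tokens, more than the `-c 4096` context in the llama.cpp command. Raise `-c` if you keep that much history.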

Safety considerations

Running an AI agent that can click, type, and navigate the web requires caution:

  • Sandbox it: use Docker or a VM. Don't run the agent in your main browser session.
  • URL allowlist: restrict which domains the agent can visit.
  • Watch for critical points: Fara-7B is trained to pause before purchases, logins, and form submissions. Respect these pauses.
  • Set a step limit: cap the number of actions (e.g., 50) to prevent infinite loops.
  • Don't store credentials: let the agent ask you to type passwords rather than providing them in the prompt.
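The allowlist and step-limit rules are straightforward to enforce inside the agent loop. A minimal sketch: the domain list is hypothetical and `guard_step` is an illustrative name. It accepts a listed domain and its subdomains and rejects look-alike hosts such as example-store.com.evil.net.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: replace with the domains your task actually needs
ALLOWED_DOMAINS = frozenset({"example-store.com", "wikipedia.org"})
MAX_STEPS = 50  # cap on actions per task


def is_allowed(url: str, allowed: frozenset[str] = ALLOWED_DOMAINS) -> bool:
    """True if url is http(s) and its host is an allowed domain or a subdomain of one."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed)


def guard_step(step: int, next_url: str) -> None:
    """Raise before the agent exceeds the step budget or leaves the allowlist."""
    if step >= MAX_STEPS:
        raise RuntimeError(f"step limit of {MAX_STEPS} reached")
    if not is_allowed(next_url):
        raise PermissionError(f"blocked navigation to {next_url}")
```

Call `guard_step(step, url)` before each navigation, and check the page's URL again after clicks, since a click can also navigate.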

Troubleshooting

“CUDA out of memory”: Use --tensor-parallel-size 2 for multi-GPU, or switch to a Q4 quantization.

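Actions are inaccurate (clicking wrong spots): Make sure your screenshot resolution matches what the model expects (1280×720 or 1920×1080) and matches the viewport you click in. OS display scaling or high-DPI screenshots (a device scale factor above 1) change the pixel grid, so clicks land in the wrong place.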
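When the image the model sees and the viewport you click in differ in size, for example a 2560×1440 screenshot taken from a 1280×720 viewport with a device scale factor of 2, the predicted coordinates have to be scaled back. A minimal sketch, assuming the model's coordinates are expressed in pixels of the image it received:

```python
def to_viewport(x: float, y: float, image_size: tuple[int, int],
                viewport_size: tuple[int, int]) -> tuple[int, int]:
    """Map a point from the image the model saw to CSS-pixel viewport coordinates."""
    img_w, img_h = image_size
    vp_w, vp_h = viewport_size
    # Scale each axis independently, then round to whole pixels for mouse.click
    return round(x * vp_w / img_w), round(y * vp_h / img_h)
```

Playwright can also avoid the mismatch at the source: `page.screenshot(scale="css")` captures one image pixel per CSS pixel instead of per device pixel.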

Model outputs gibberish instead of actions: Make sure you're using the correct prompt format (system message with task + image input). Check the GitHub repo for the exact template.

FAQ

Can I run Fara-7B on a Mac?

Yes, via llama.cpp with GGUF quantization. A 16GB M2/M3/M4 Mac runs the Q4 version comfortably. You'll need to build your own browser integration, since the official CLI targets Linux + NVIDIA.

Does it work with Firefox or just Chrome?

The official setup uses Chromium via Playwright, but since Fara-7B works from screenshots, it's browser-agnostic. You can use any browser: just capture screenshots and execute actions programmatically.

Can I fine-tune it for my specific website?

Yes. The MIT license allows fine-tuning. Capture trajectories of tasks on your site, format them as training data, and fine-tune with QLoRA. This can significantly improve accuracy for your specific UI.
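Formatting trajectories mostly means turning each recorded step into one input/target pair. The record layout below is hypothetical: it mirrors a common chat-style fine-tuning format, not a format published for Fara-7B, so adapt the field names to whatever your training framework expects. `Step` and the helper names are illustrative.

```python
import json
from dataclasses import dataclass


@dataclass
class Step:
    screenshot_path: str  # PNG captured just before the action
    action: dict          # e.g. {"type": "left_click", "x": 640, "y": 360}


def trajectory_to_records(task: str, steps: list[Step]) -> list[dict]:
    """Turn one recorded task into chat-style examples, one per step (hypothetical schema)."""
    records = []
    for step in steps:
        records.append({
            "messages": [
                {"role": "system", "content": f"Task: {task}"},
                {"role": "user", "content": [{"type": "image", "image": step.screenshot_path}]},
                {"role": "assistant", "content": json.dumps(step.action)},
            ]
        })
    return records


def write_jsonl(records: list[dict], path: str) -> None:
    """Write one JSON object per line, a common input format for fine-tuning tools."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
```

Recording the action exactly as the agent loop would execute it keeps training targets and runtime parsing in sync.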

How fast is it per action?

On an A100: ~1-2 seconds per action (screenshot → inference → action). On a 4090: ~2-3 seconds. On CPU: 10-15 seconds. A typical 10-step task takes 20-30 seconds on GPU.
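These figures depend on hardware, quantization, and screenshot size, so it is worth timing your own setup. A minimal sketch, assuming you wrap one screenshot → inference → action cycle in a function (`step_fn` is a hypothetical name):

```python
import statistics
import time
from typing import Callable


def time_actions(step_fn: Callable[[], None], n_steps: int) -> dict:
    """Run step_fn n_steps times and report per-step latency in seconds."""
    durations = []
    for _ in range(n_steps):
        start = time.perf_counter()
        step_fn()  # one full screenshot -> inference -> action cycle
        durations.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(durations),
        "max_s": max(durations),
        "total_s": sum(durations),
    }
```

At the quoted 2-3 seconds per action on GPU, a 10-step run should report a `total_s` of roughly 20-30 seconds; a much larger `max_s` than `mean_s` usually points to slow page loads rather than inference.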