Fara-7B is Microsoft's open-source Computer Use Agent, a 7B model that can browse the web autonomously from screenshots. Here's how to run it on your own hardware.
Hardware requirements
| Setup | VRAM | RAM | Speed |
|---|---|---|---|
| bf16 (unquantized) | 16GB | 32GB | Fast |
| Q8 quantized | 10GB | 16GB | Fast |
| Q4 quantized | 6GB | 16GB | Medium |
| CPU-only (GGUF Q4) | N/A | 16GB | Slow |
Recommended: NVIDIA RTX 4090, A6000, or any GPU with 16GB+ VRAM. Apple Silicon Macs with 16GB+ unified memory also work via llama.cpp.
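To see roughly where the VRAM figures in the table come from, here is a back-of-the-envelope sketch. It assumes about 7 billion parameters (taken from the model name) and counts the weights only; the KV cache, activations, and image inputs need extra memory on top, which is why the table leaves headroom. The function name is just for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    # billions of parameters * bits per weight / 8 bits per byte = GB
    return params_billion * bits_per_weight / 8


# Nominal bit widths; real GGUF quantizations use slightly more bits per weight.
for label, bits in [("bf16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label}: ~{weight_memory_gb(7, bits):.1f} GB of weights")
```

With these assumptions, bf16 needs about 14 GB for the weights, which lines up with the 16GB recommendation above.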
Method 1: Official setup (vLLM)
This is Microsoft's recommended approach. It requires a Linux machine with an NVIDIA GPU.
# Clone the Fara repository
git clone https://github.com/microsoft/fara.git
cd fara
# Create virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install Fara and dependencies
pip install -e .
playwright install
# Install vLLM
pip install "vllm>=0.10.0"
Start the model server:
vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto
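Before running a task, it can help to confirm the server is actually up. The sketch below assumes the server exposes the OpenAI-compatible `/v1/models` endpoint on port 5000, as in the command above. The helper names are hypothetical, not part of Fara or vLLM.

```python
import json
import urllib.request


def served_model_ids(payload: dict) -> list[str]:
    """Pull the model ids out of an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url: str = "http://localhost:5000") -> list[str]:
    """Ask the running server which models it serves (needs the server up)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return served_model_ids(json.load(resp))


# Usage, once the server is running:
#   print(list_models())
```

If the returned list does not include "microsoft/Fara-7B", the server is probably still loading or was started with a different model name.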
If you run out of VRAM on a single GPU:
# Use tensor parallelism across 2 GPUs
vllm serve "microsoft/Fara-7B" --port 5000 --dtype auto --tensor-parallel-size 2
Run a task:
# Simple task
fara-cli --task "search for the cheapest flight from NYC to London next Tuesday"
# With a specific starting URL
fara-cli --task "find the return policy" --url "https://example-store.com"
The CLI opens a browser, takes screenshots, sends them to the model, and executes the predicted actions.
Method 2: Ollama (easiest)
If you just want to chat with the model or test it quickly:
# Pull the GGUF quantized version from Hugging Face (note the hf.co/ prefix)
ollama pull hf.co/bartowski/microsoft_Fara-7B-GGUF
# Or a specific quantization
ollama pull hf.co/bartowski/microsoft_Fara-7B-GGUF:Q6_K_L
Note: Ollama gives you the raw model for text/image inference, but doesn't include the browser automation framework. For actual computer use, you need the official setup (Method 1) or a custom integration.
Method 3: llama.cpp (Mac/CPU)
For Apple Silicon Macs or CPU-only setups:
# Download a GGUF quantization
# Q6_K_L (6.5GB): recommended quality/size balance
wget https://huggingface.co/bartowski/microsoft_Fara-7B-GGUF/resolve/main/Fara-7B-Q6_K_L.gguf
# Run with llama.cpp
./llama-server -m Fara-7B-Q6_K_L.gguf -c 4096 --port 5000
Quantization options:
| File | Size | Quality | Use case |
|---|---|---|---|
| Q8_0 | 8.1GB | Near-perfect | If you have the VRAM |
| Q6_K_L | 6.5GB | Excellent | Recommended default |
| Q6_K | 6.3GB | Very good | Slightly smaller |
| Q4_K_M | 4.5GB | Good | 8GB VRAM/RAM constrained |
| Q4_K_S | 4.2GB | Acceptable | Minimum viable quality |
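If you want to pick a file programmatically, here is a small sketch built on the sizes in the table above. The 2 GB headroom for context and runtime overhead is an assumption, not a measured figure, and `pick_quant` is a hypothetical helper.

```python
# File sizes in GB, copied from the quantization table above.
QUANT_SIZES_GB = {
    "Q8_0": 8.1,
    "Q6_K_L": 6.5,
    "Q6_K": 6.3,
    "Q4_K_M": 4.5,
    "Q4_K_S": 4.2,
}


def pick_quant(memory_gb: float, headroom_gb: float = 2.0):
    """Return the largest quantization that fits, or None if nothing does.

    headroom_gb is an assumed allowance for the KV cache and runtime overhead.
    """
    budget = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None


print(pick_quant(8))   # 8GB of VRAM or RAM
print(pick_quant(16))  # 16GB of VRAM or RAM
```

Raise the headroom if you plan to send several screenshots per request, since images consume context.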
Method 4: Docker (sandboxed, recommended for safety)
Microsoft provides a Docker-based setup via Magentic-UI for safe web execution:
# Clone and run with Docker
git clone https://github.com/microsoft/fara.git
cd fara
# Build the Docker image
docker build -t fara-agent .
# Run with GPU access
docker run --gpus all -p 5000:5000 fara-agent
This is the safest approach: the browser runs inside the container, isolated from your host system.
Connecting to a browser
Fara-7B needs a browser to interact with. The official setup uses Playwright:
import base64

import requests
from playwright.sync_api import sync_playwright

MAX_STEPS = 50  # hard cap so a confused model cannot loop forever


def run_task(task: str, start_url: str = "https://google.com"):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)  # headless=False to watch
        page = browser.new_page()
        page.goto(start_url)
        for _ in range(MAX_STEPS):
            # Take a screenshot (PNG bytes) and base64-encode it for the data URL
            screenshot_b64 = base64.b64encode(page.screenshot()).decode("ascii")
            # Send to Fara-7B for the next action
            response = requests.post(
                "http://localhost:5000/v1/chat/completions",
                json={
                    "model": "microsoft/Fara-7B",
                    "messages": [
                        {"role": "system", "content": f"Task: {task}"},
                        {"role": "user", "content": [
                            {"type": "image_url",
                             "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
                        ]},
                    ],
                },
                timeout=120,
            )
            response.raise_for_status()
            # parse_action is a helper you provide: it must turn the model's reply
            # into a dict such as {"type": "left_click", "x": 640, "y": 360}
            action = parse_action(response.json())
            if action["type"] == "terminate":
                break
            elif action["type"] == "left_click":
                page.mouse.click(action["x"], action["y"])
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            # ... handle other actions
        browser.close()
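The loop above calls `parse_action` without defining it. Here is a minimal sketch of what it might look like. It assumes the server returns a standard OpenAI-style chat completion and that the model's text contains a JSON object shaped like `{"arguments": {"action": ..., "coordinate": [x, y], "text": ...}}`. That schema is an assumption for illustration, so check the GitHub repo for the real output format before relying on it.

```python
import json


def extract_first_json(text: str):
    """Return the first JSON object embedded in free text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = text.find("{", start + 1)
    return None


def parse_action(response: dict) -> dict:
    """Map a chat completion to {"type", "x", "y", "text"} (assumed schema)."""
    content = response["choices"][0]["message"]["content"]
    payload = extract_first_json(content)
    if payload is None:
        # Stop rather than guess when the reply has no parsable action.
        return {"type": "terminate", "reason": "no parsable action"}
    args = payload.get("arguments", payload)
    action = {"type": args.get("action", "terminate")}
    coord = args.get("coordinate")
    if isinstance(coord, (list, tuple)) and len(coord) == 2:
        action["x"], action["y"] = coord
    if "text" in args:
        action["text"] = args["text"]
    return action
```

Falling back to "terminate" on unparsable output is a deliberate choice here: an agent that stops is safer than one that clicks at a guessed position.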
Performance tips
- Use bf16 precision: Fara-7B was trained in bf16. Using fp16 or lower can degrade action accuracy.
- Keep context short: Only send the last 3-5 screenshots as history. The full 128K context is rarely needed.
- Set a viewport size: 1280×720 or 1920×1080. Consistent resolution helps the model predict coordinates accurately.
- Use headless mode for speed: headless=True avoids opening a visible browser window, making the loop faster.
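The "keep context short" tip can be sketched as a small message builder that only forwards the most recent screenshots. It reuses the message layout from the Playwright loop earlier in this article. The function name and the default of 3 images are illustrative choices, and a real agent would probably also interleave its previous actions, which this sketch leaves out.

```python
def build_messages(task: str, screenshots_b64: list[str], max_images: int = 3) -> list[dict]:
    """Build a chat request body that carries only the latest screenshots."""
    if max_images < 1:
        raise ValueError("max_images must be at least 1")
    recent = screenshots_b64[-max_images:]  # oldest screenshots are dropped
    content = [
        {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{shot}"}}
        for shot in recent
    ]
    return [
        {"role": "system", "content": f"Task: {task}"},
        {"role": "user", "content": content},
    ]
```

Append each new base64 screenshot to a list inside the loop and pass the whole list in; the builder takes care of the trimming.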
Safety considerations
Running an AI agent that can click, type, and navigate the web requires caution:
- Sandbox it: Use Docker or a VM. Don't run on your main browser session.
- URL allowlist: Restrict which domains the agent can visit.
- Watch for critical points: Fara-7B is trained to pause before purchases, logins, and form submissions. Respect these pauses.
- Set a step limit: Cap the number of actions (e.g., 50) to prevent infinite loops.
- Don't store credentials: Let the agent ask you to type passwords rather than providing them in the prompt.
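Two of the points above, the URL allowlist and the step limit, are easy to enforce in your own loop. Here is a minimal sketch using only the standard library. The helper names are hypothetical, and the check is meant to run before each action against the page's current URL.

```python
from urllib.parse import urlparse

MAX_STEPS = 50  # the cap suggested above


def is_allowed(url: str, allowed_domains: set[str]) -> bool:
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def should_continue(step: int, current_url: str, allowed_domains: set[str]) -> bool:
    """Stop when the step budget is spent or the agent has left the allowlist."""
    return step < MAX_STEPS and is_allowed(current_url, allowed_domains)
```

Matching on the parsed hostname rather than on a substring of the URL avoids lookalike hosts such as example.com.evil.net passing the check.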
Troubleshooting
"CUDA out of memory": Use --tensor-parallel-size 2 for multi-GPU, or switch to a Q4 quantization.
Actions are inaccurate (clicking wrong spots): Ensure your screenshot resolution matches what the model expects. Use 1280×720 or 1920×1080. Avoid scaling/DPI issues.
Model outputs gibberish instead of actions: Make sure you're using the correct prompt format (system message with task + image input). Check the GitHub repo for the exact template.
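For the wrong-click problem above, one common cause is that the image the model sees has a different pixel size than the browser viewport, for example after downscaling the screenshot or when the device scale factor is above 1. A minimal sketch of the correction, assuming the model returns coordinates in the pixel space of the image it was given (`scale_point` is a hypothetical helper):

```python
def scale_point(x: float, y: float, image_size: tuple, viewport_size: tuple) -> tuple:
    """Map a point from screenshot pixels to viewport (CSS) pixels."""
    img_w, img_h = image_size
    view_w, view_h = viewport_size
    return round(x * view_w / img_w), round(y * view_h / img_h)


# A 2560x1440 screenshot of a 1280x720 viewport: halve both coordinates.
print(scale_point(1280, 720, (2560, 1440), (1280, 720)))
```

Apply the conversion right before calling the click, and log both coordinate pairs while debugging so a mismatch is easy to spot.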
FAQ
Can I run Fara-7B on a Mac?
Yes, via llama.cpp with GGUF quantization. A 16GB M2/M3/M4 Mac runs the Q4 version comfortably. You'll need to build your own browser integration since the official CLI targets Linux + NVIDIA.
Does it work with Firefox or just Chrome?
The official setup uses Chromium via Playwright, but since Fara-7B works from screenshots, it's browser-agnostic. You can use any browser: just capture screenshots and execute actions programmatically.
Can I fine-tune it for my specific website?
Yes. The MIT license allows fine-tuning. Capture trajectories of tasks on your site, format them as training data, and fine-tune with QLoRA. This can significantly improve accuracy for your specific UI.
How fast is it per action?
On an A100: ~1-2 seconds per action (screenshot → inference → action). On a 4090: ~2-3 seconds. On CPU: 10-15 seconds. A typical 10-step task takes 20-30 seconds on GPU.