Fara-7B: Microsoft’s Compact AI That Controls Browsers Locally
Microsoft Research unveiled Fara-7B, a 7B-parameter multimodal agentic model that controls browsers locally from screenshots and text, executing grounded actions such as clicks and typing with lower latency and cost than server-side agents.
What Fara-7B does
Microsoft Research released Fara-7B, a 7-billion-parameter agentic small language model built for direct computer use. Unlike chat-oriented LLMs that return text, Fara-7B controls the browser or desktop UI from screenshots and text context. It predicts low-level actions such as clicks, typing, scrolling, URL visits and web searches, which lets it complete tasks like form filling, booking or price comparison while keeping data local and reducing latency.
How it interacts with web pages
The model consumes screenshots and the current text context, reasons about page layout, then emits a chain of thought followed by a tool call with grounded arguments. Coordinates are predicted as pixel positions on the screenshot, so the model can operate without access to the accessibility tree at inference time. The tool set maps to the Magentic-UI interface and includes actions like key, type, mouse_move, left_click, scroll, visit_url, web_search, history_back, pause_and_memorize_fact, wait and terminate.
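To make the idea of a grounded tool call concrete, here is a minimal sketch of how a harness might check one before executing it. The action names come from the list above; the `ToolCall` class, the `coordinate` argument name and the choice of which actions need a coordinate are assumptions for illustration, not the actual Magentic-UI schema.

```python
from dataclasses import dataclass, field

# Action names listed in the article; the argument schema is an assumption.
ACTIONS = {
    "key", "type", "mouse_move", "left_click", "scroll", "visit_url",
    "web_search", "history_back", "pause_and_memorize_fact", "wait", "terminate",
}
# Actions assumed to carry a pixel coordinate on the screenshot.
POINTER_ACTIONS = {"mouse_move", "left_click"}


@dataclass
class ToolCall:
    action: str
    arguments: dict = field(default_factory=dict)


def validate(call: ToolCall, width: int, height: int) -> ToolCall:
    """Reject unknown actions and coordinates outside the screenshot."""
    if call.action not in ACTIONS:
        raise ValueError(f"unknown action: {call.action}")
    if call.action in POINTER_ACTIONS:
        x, y = call.arguments["coordinate"]
        if not (0 <= x < width and 0 <= y < height):
            raise ValueError(f"coordinate {(x, y)} outside {width}x{height} screenshot")
    return call


# Example: a click in the middle of a 1280x720 screenshot passes the check.
click = validate(ToolCall("left_click", {"coordinate": (640, 360)}), 1280, 720)
```

Because the coordinates are plain pixel positions, a bounds check against the screenshot size is all the harness needs; no DOM or accessibility lookup is involved.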
FaraGen: synthetic training data at scale
A major bottleneck for computer use agents is the lack of high-quality, multi-step web interaction data. The Fara project addresses this with FaraGen, a synthetic data engine that generates and filters web trajectories on live sites. FaraGen uses a three-stage pipeline:
- Task Proposal: seed URLs from public corpora are classified by domain and converted by LLMs into realistic, achievable, verifiable tasks that avoid logins and paywalls.
- Task Solving: a multi-agent system based on Magentic-One and Magentic-UI combines an Orchestrator, a WebSurfer that emits Playwright actions, and a UserSimulator for clarification steps.
- Trajectory Verification: three LLM verifiers check alignment, score partial completion via rubrics, and inspect screenshots plus final answers to reduce hallucinations.
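The verification stage can be pictured as a filter that keeps a trajectory only when every verifier accepts it. The sketch below is a toy version under stated assumptions: the three functions are simple stand-ins for the LLM verifiers, the dictionary fields and the 0.8 rubric threshold are hypothetical, and requiring all three to pass is an assumption rather than a documented rule.

```python
from typing import Callable, Iterable

# A verifier takes a trajectory (here a plain dict) and returns True to keep it.
Verifier = Callable[[dict], bool]


def filter_trajectories(trajectories: Iterable[dict], verifiers: list[Verifier]) -> list[dict]:
    """Keep only trajectories that every verifier accepts."""
    return [t for t in trajectories if all(v(t) for v in verifiers)]


# Toy stand-ins for the three LLM verifiers; real ones would call a model.
def alignment_ok(t: dict) -> bool:
    return t["task"] in t["summary"]


def rubric_ok(t: dict) -> bool:
    return t["rubric_score"] >= 0.8  # threshold is an assumption


def answer_grounded(t: dict) -> bool:
    return t["answer"] in t["screenshot_text"]


candidates = [
    {"task": "find price", "summary": "find price of item", "rubric_score": 0.9,
     "answer": "$12", "screenshot_text": "Total: $12"},
    {"task": "find price", "summary": "find price of item", "rubric_score": 0.9,
     "answer": "$15", "screenshot_text": "Total: $12"},  # answer not on screen
]
kept = filter_trajectories(candidates, [alignment_ok, rubric_ok, answer_grounded])
```

Here the second candidate is dropped because its final answer does not appear in the screenshot text, which is the kind of hallucination the third verifier is meant to catch.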
After filtering, FaraGen produced 145,603 trajectories with 1,010,797 steps across 70,117 unique domains. Trajectories range from 3 to 84 steps, with an average of 6.9 steps, and the cost of generating a verified trajectory with premium models was estimated at roughly $1.
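The dataset figures can be cross-checked with a few lines of arithmetic. This is only a sanity check on the reported numbers; the total cost line assumes the rough $1 estimate applies uniformly to every retained trajectory.

```python
trajectories = 145_603
steps = 1_010_797
domains = 70_117

avg_steps = steps / trajectories       # mean steps per trajectory
per_domain = trajectories / domains    # mean trajectories per unique domain
est_cost_usd = trajectories * 1.0      # assumes roughly $1 per trajectory

print(f"{avg_steps:.2f} steps per trajectory")
print(f"{per_domain:.2f} trajectories per domain")
print(f"about ${est_cost_usd:,.0f} for the retained set")
```

The mean works out to roughly 6.9 steps, matching the reported average, and each domain contributes about two trajectories on average.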
Model architecture and training
Fara-7B is a multimodal decoder-only model built on Qwen2.5-VL-7B. It accepts a user goal, recent browser screenshots and the full history of previous thoughts and actions, operating with a context window of 128,000 tokens. At each step the model generates chain-of-thought text and then issues a tool call with grounded arguments such as pixel coordinates, text, or URLs.
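The input described here, with full text history but only recent screenshots, can be sketched as a small context builder. The `Step` class, the dictionary layout and the default of three screenshots are assumptions for illustration; the article does not say how many screenshots count as recent.

```python
from dataclasses import dataclass


@dataclass
class Step:
    screenshot: bytes   # raw image bytes captured before the action
    thought: str        # chain-of-thought text the model produced
    action: str         # serialized tool call, e.g. "left_click(412, 230)"


def build_context(goal: str, history: list[Step], max_screenshots: int = 3) -> dict:
    """Keep every past thought and action, but only the newest screenshots."""
    recent = history[-max_screenshots:] if max_screenshots > 0 else []
    return {
        "goal": goal,
        "thoughts_and_actions": [(s.thought, s.action) for s in history],
        "screenshots": [s.screenshot for s in recent],
    }


# Example: five past steps yield five text entries but only three images.
past = [Step(bytes([i]), f"thought {i}", f"action {i}") for i in range(5)]
context = build_context("book a table for two", past)
```

Dropping older screenshots while keeping the text trail is one plausible way to stay inside the token budget, since images are far more expensive than short thought and action strings.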
Training relied on supervised fine-tuning over about 1.8 million samples. The dataset mixes FaraGen trajectories split into observe-think-act steps, grounding and UI localization tasks, screenshot-based VQA and captioning, plus safety and refusal examples.
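Splitting a trajectory into observe-think-act steps means each step becomes its own training sample. The function below is a minimal sketch of that idea; the field names and the sample layout are hypothetical and not taken from the Fara training code.

```python
def to_step_samples(goal: str, steps: list[dict]) -> list[dict]:
    """Turn one trajectory into one training sample per step.

    Each sample pairs what the agent had seen so far (input) with the
    thought and action it produced next (target).
    """
    samples = []
    for i, step in enumerate(steps):
        samples.append({
            "input": {
                "goal": goal,
                "observation": step["screenshot"],
                "history": [(s["thought"], s["action"]) for s in steps[:i]],
            },
            "target": {"thought": step["thought"], "action": step["action"]},
        })
    return samples


trajectory = [
    {"screenshot": "s0.png", "thought": "open the site", "action": "visit_url"},
    {"screenshot": "s1.png", "thought": "search the item", "action": "type"},
    {"screenshot": "s2.png", "thought": "done", "action": "terminate"},
]
samples = to_step_samples("compare prices", trajectory)
```

A three-step trajectory yields three samples, which is how a corpus of about a million steps can supply a large share of the fine-tuning set.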
Benchmarks and efficiency
Microsoft evaluated Fara-7B on four live web benchmarks. Reported success rates are 73.5% on WebVoyager, 34.1% on Online-Mind2Web, 26.2% on DeepShop, and 38.4% on WebTailBench. These results surpass the UI-TARS-1.5-7B baseline on all four benchmarks and compare favorably with larger Set-of-Marks (SoM) systems built on GPT-4o and other proprietary models.
On WebVoyager, Fara-7B used roughly 124,000 input tokens and 1,100 output tokens per task, with about 16.5 actions on average. The team estimates an average inference cost of about $0.025 per task at market token prices, and the model consumes substantially fewer output tokens than SoM agents backed by GPT-5 class models.
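Breaking the WebVoyager averages down per action gives a feel for the scale involved. The snippet below only divides the reported figures; it assumes tokens and cost are spread evenly across actions, which is a simplification because context grows as a task proceeds.

```python
input_tokens = 124_000   # average input tokens per task
output_tokens = 1_100    # average output tokens per task
actions = 16.5           # average actions per task
cost_usd = 0.025         # estimated average cost per task

input_per_action = input_tokens / actions
output_per_action = output_tokens / actions
cost_per_action = cost_usd / actions
tasks_per_dollar = 1 / cost_usd

print(f"{input_per_action:,.0f} input tokens per action")
print(f"{output_per_action:.0f} output tokens per action")
print(f"${cost_per_action:.4f} per action")
print(f"{tasks_per_dollar:.0f} tasks per dollar")
```

On these numbers each action reads roughly 7,500 tokens and writes fewer than 70, so input tokens dominate the token count, and one dollar covers about 40 tasks.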
Why this matters
Fara-7B demonstrates a path from multi-agent data generation to a single compact model that runs on local hardware, lowers inference cost, and helps preserve browsing privacy. By compressing multi-agent behavior into one multimodal model and training on verified synthetic trajectories, Microsoft shows a practical approach to building agentic systems that act directly on user devices with reasonable accuracy and cost.