Fara1.5-9B

Multimodal web agent for browser-based task automation.
Microsoft
Version: 1
Fara1.5-9B is a multimodal web agent developed by Microsoft Research AI Frontiers. It observes the browser via screenshots and acts on behalf of the user by emitting structured tool calls (click, type, scroll, web_search, visit_url, and others) to complete multi-step web tasks end-to-end. Primary use cases include filling forms, shopping, booking travel and restaurants, information seeking, and account-driven workflows.

Fara1.5-9B is trained to recognize "critical points" in a task (situations involving missing user information, ambiguous instructions, or irreversible actions such as completing a purchase or sending a message) and to pause for user confirmation rather than proceed unilaterally. It is intended for human-in-the-loop, sandboxed deployment; the recommended deployment vehicle is MagenticLite, which provides allow-listed navigation, watch-mode action monitoring, an immediate pause control, and a Docker-isolated browser environment.

Fara1.5-9B is a 9B-parameter multimodal decoder-only language model fine-tuned from the Qwen 3.5 9B base. It accepts the user's textual goal, current browser screenshot(s), and a textual history of prior thoughts and actions, and emits a chain-of-thought block followed by a structured tool-call block. The model is vision-only, perceiving the browser exclusively through UI screenshots, and supports up to a 262K-token context window, which enables long multi-step trajectories with substantial screenshot and action-history accumulation.
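To make the output shape described above concrete, the following is a minimal sketch of how a host application might split one model response into its reasoning text and its tool call, then apply a simple allow-list check before executing it. Everything specific here is an assumption for illustration: the `<tool_call>` wrapper and the JSON keys `name` and `arguments` are a guessed wire format that this card does not specify, and `ALLOWED_HOSTS`, `parse_step`, and `review` are hypothetical names. This is not MagenticLite's implementation, and it does not replace the model's own critical-point behavior.

```python
import json
import re
from dataclasses import dataclass
from urllib.parse import urlparse

# ASSUMPTION: the tool call is JSON wrapped in <tool_call> tags. The real
# output format is not specified in this card.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# Hypothetical policy values, chosen for illustration only.
ALLOWED_HOSTS = {"example.com"}
# Action names listed in the card; the card says there are others as well.
KNOWN_ACTIONS = {"click", "type", "scroll", "web_search", "visit_url"}


@dataclass
class Step:
    thought: str      # free-text reasoning that precedes the tool call
    name: str         # action name, e.g. "click"
    arguments: dict   # action parameters (keys are assumed, not documented)


def parse_step(raw: str) -> Step:
    """Split raw model text into its reasoning text and one tool call."""
    match = TOOL_CALL_RE.search(raw)
    if match is None:
        raise ValueError("no tool call found in model output")
    call = json.loads(match.group(1))
    return Step(
        thought=raw[: match.start()].strip(),
        name=call["name"],
        arguments=call.get("arguments", {}),
    )


def review(step: Step) -> str:
    """Return 'run', 'block', or 'ask_user' for a parsed step."""
    if step.name not in KNOWN_ACTIONS:
        # Conservative default: hand unrecognized actions to the human.
        return "ask_user"
    if step.name == "visit_url":
        host = urlparse(step.arguments.get("url", "")).hostname or ""
        if host not in ALLOWED_HOSTS:
            return "block"
    return "run"
```

In this sketch a host would call `parse_step` on each response and execute the action only when `review` returns `"run"`. Defaulting unknown actions to `"ask_user"` is a deliberately conservative choice that fits the human-in-the-loop deployment described above.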

Quick facts

Model provider: Microsoft
Type: Web agent tasks, GUI grounding, Image to text
Lifecycle: Generally available (GA)
Input type: image, text
Output type: text