Fara-7B
Fara is a multimodal web agent model that observes the browser and acts on the user's behalf by emitting tool calls (e.g., click(x,y), type, scroll, select) to complete web tasks end to end. It is trained on data generated by a scalable multi-agent pipeline that synthesizes diverse web tasks, executes trajectories to solve them, and verifies those trajectories. The resulting supervised fine-tuning (SFT) recipes target task completion, action grounding, and safe behavior.
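To make the tool-call interface concrete, here is a minimal sketch of how a harness might parse and dispatch the actions the model emits. The JSON shape, the ToolCall class, the RecordingBrowser stand-in, and the scroll argument name dy are all assumptions made for illustration; they are not Fara's documented output format or any official client API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One action emitted by the agent (hypothetical schema, not Fara's real format)."""
    name: str
    arguments: dict = field(default_factory=dict)


class RecordingBrowser:
    """Stand-in for a real browser driver; it only logs what it is asked to do."""

    def __init__(self):
        self.log = []

    def click(self, x, y):
        self.log.append(f"click at ({x}, {y})")

    def type(self, text):
        self.log.append(f"type {text!r}")

    def scroll(self, dy):
        self.log.append(f"scroll by {dy}")


# Only dispatch actions that are explicitly allowed.
ALLOWED_ACTIONS = {"click", "type", "scroll"}


def parse_tool_call(raw: str) -> ToolCall:
    """Parse a JSON-encoded action and reject unknown action names."""
    data = json.loads(raw)
    name = data["name"]
    if name not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {name}")
    return ToolCall(name=name, arguments=data.get("arguments", {}))


def execute(call: ToolCall, browser: RecordingBrowser) -> None:
    """Route a parsed action to the browser method with the same name."""
    getattr(browser, call.name)(**call.arguments)


browser = RecordingBrowser()
emitted = [
    '{"name": "click", "arguments": {"x": 120, "y": 340}}',
    '{"name": "type", "arguments": {"text": "noise-cancelling headphones"}}',
]
for raw in emitted:
    execute(parse_tool_call(raw), browser)
print(browser.log)
```

Checking each action name against an allow-list before dispatch keeps an unexpected model output from reaching arbitrary methods on the browser object, which matters for an agent that acts on a user's behalf.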
Fara supports automating web tasks such as shopping, booking travel, making restaurant reservations, seeking information, and completing account workflows. Fara has a context length of 128k tokens.
Our training datasets are sourced from multiple pipelines. Data generation starts bottom-up from seed sites and task proposals; multi-agent solvers and verifiers then produce validated trajectories. Grounding relies on curated datasets for predicting actions and on-screen coordinates. UI understanding is built through visual question answering, captioning, and OCR on web page screenshots collected during data generation. Finally, safety and instruction following are reinforced with refusal and harm-prevention datasets, along with instruction-following data that helps the model decide when to act and when to terminate.
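The propose, solve, and verify flow described above can be sketched as a simple filter loop. This is an illustrative toy only: Trajectory, build_dataset, and the three stand-in functions are hypothetical names, and the real pipeline uses model-based agents rather than the hard-coded rules shown here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """A recorded attempt at a task (hypothetical structure)."""
    task: str
    actions: List[str]
    final_answer: str


def build_dataset(
    seed_sites: List[str],
    propose: Callable[[str], List[str]],
    solve: Callable[[str], Trajectory],
    verify: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Keep only the trajectories that the verifier accepts."""
    kept = []
    for site in seed_sites:
        for task in propose(site):
            trajectory = solve(task)
            if verify(trajectory):
                kept.append(trajectory)
    return kept


# Toy stand-ins so the sketch runs end to end.
def toy_propose(site: str) -> List[str]:
    return [f"Find the contact page on {site}", f"Buy an item on {site}"]


def toy_solve(task: str) -> Trajectory:
    # Pretend the solver gives up on purchase tasks and returns no answer.
    if task.startswith("Buy"):
        return Trajectory(task=task, actions=["click"], final_answer="")
    return Trajectory(task=task, actions=["scroll", "click"], final_answer="found")


def toy_verify(trajectory: Trajectory) -> bool:
    # Accept only trajectories that took actions and produced an answer.
    return bool(trajectory.actions) and bool(trajectory.final_answer)


dataset = build_dataset(["example.com", "example.org"], toy_propose, toy_solve, toy_verify)
print(len(dataset), "of 4 proposed tasks kept")
```

Separating the verifier from the solver means failed or incomplete attempts are dropped before they reach training data, which is the point of verifying trajectories in the first place.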
Quick facts
Model provider: Microsoft
Type: Image to text
Lifecycle: Generally available (GA)
Input type: image, text
Output type: text