Phi-3.5-vision-instruct

Refresh of Phi-3-vision model.
Microsoft
Version: 2
Phi-3.5-vision is a lightweight, state-of-the-art open multimodal model. It is built on datasets that include synthetic data and filtered publicly available websites, with a focus on very high-quality, reasoning-dense data for both text and vision. The model belongs to the Phi-3 model family, and the multimodal version supports a context length of 128K tokens. The model underwent a rigorous enhancement process, incorporating both supervised fine-tuning and direct preference optimization, to ensure precise instruction adherence and robust safety measures.

Resources

🏡 Phi-3 Portal

📰 Phi-3 Microsoft Blog

📖 Phi-3 Technical Report

👩‍🍳 Phi-3 Cookbook

Model Summary

Architecture: Phi-3.5-vision has 4.2B parameters and contains an image encoder, a connector, a projector, and the Phi-3 Mini language model.
Inputs: Text and image. It’s best suited for prompts using the chat format.
Context length: 128K tokens
GPUs: 256 A100-80G
Training time: 6 days
Training data: 500B tokens (vision tokens + text tokens)
Outputs: Generated text in response to the input
Dates: Trained between July and August 2024
Status: This is a static model trained on an offline text dataset with cutoff date March 15, 2024. Future versions of the tuned models may be released as we improve models.
Release date: August 20, 2024
License: MIT
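
As a rough, back-of-the-envelope reading of the training figures above, the sketch below converts 256 GPUs over 6 days into GPU-hours and an average token throughput. It assumes every GPU was busy for the full 6 days of wall-clock time, which the summary does not state, and the variable names are purely illustrative.

```python
# Back-of-the-envelope training compute, using only figures from the summary.
# Assumption (not stated in the summary): all GPUs ran for the full 6 days.
NUM_GPUS = 256           # A100-80G
TRAINING_DAYS = 6
TRAINING_TOKENS = 500e9  # vision tokens + text tokens

gpu_hours = NUM_GPUS * TRAINING_DAYS * 24
tokens_per_gpu_hour = TRAINING_TOKENS / gpu_hours

print(f"GPU-hours: {gpu_hours:,}")
print(f"Average tokens per GPU-hour: {tokens_per_gpu_hour:,.0f}")
```

These are upper-bound style estimates: any downtime or partial utilization would lower the GPU-hours actually consumed and raise the effective per-GPU throughput.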

Quick facts

Model provider: Microsoft
Type: Chat completion
Lifecycle: Generally available (GA)
Input type: text, image
Context window: 131,072 tokens
Token limits: 4,096 output tokens
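
The context window and output limit together bound how large a prompt can be. The sketch below works through that arithmetic; it assumes the prompt (text plus image tokens) and the generated reply share one context window, which this page does not confirm, and `max_prompt_tokens` is a hypothetical helper, not part of any SDK.

```python
# Token-budget arithmetic from the Quick facts figures.
CONTEXT_WINDOW = 128 * 1024  # "128K" context length = 131,072 tokens
MAX_OUTPUT_TOKENS = 4096     # output token limit


def max_prompt_tokens(reserved_output: int = MAX_OUTPUT_TOKENS) -> int:
    """Return the largest prompt size that still leaves room for the reply.

    Assumption: prompt tokens and output tokens count against the same window.
    """
    if not 0 <= reserved_output <= MAX_OUTPUT_TOKENS:
        raise ValueError("reserved_output must be between 0 and 4096")
    return CONTEXT_WINDOW - reserved_output


print(CONTEXT_WINDOW)       # 131072
print(max_prompt_tokens())  # 126976
```

Reserving fewer output tokens, for example `max_prompt_tokens(512)`, leaves more of the window for long documents or multiple images.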