deci-decidiffusion-v1-0
Version: 7
DeciDiffusion 1.0 is an 820 million parameter latent diffusion model designed for text-to-image generation. The model was initially trained on the LAION-v2 dataset and fine-tuned on the LAION-ART dataset; training incorporated advanced techniques to improve speed, training performance, and inference quality.
DeciDiffusion 1.0 retains key elements from Stable Diffusion, such as the Variational Autoencoder (VAE) and CLIP's pre-trained text encoder, while introducing notable improvements. The U-Net is replaced with U-Net-NAS, a more efficient architecture developed by Deci. This novel component streamlines the model by reducing parameters, resulting in enhanced computational efficiency.
For more details, see the blog.
Training Details
Training Procedure
This model was trained in 4 phases:
- It was trained from scratch for 1.28 million steps at a resolution of 256x256 using 320 million samples from LAION-v2.
- The model was trained for 870k steps at a higher resolution of 512x512 on the same dataset to capture more fine-detailed information.
- It was trained for 65k steps with EMA, a different learning rate scheduler, and higher-quality data.
- Finally, the model was fine-tuned on a 2 million sample subset of the LAION-ART dataset.
Limitations and Biases
Limitations
The model has limitations and may not perform optimally in various scenarios. It doesn't generate entirely photorealistic images. Rendering legible text is beyond its capability. The generation of faces and human figures may lack precision. The model is primarily optimized for English captions and may not be as effective with other languages. The auto-encoding component of the model is lossy.
Biases
DeciDiffusion was primarily trained on subsets of LAION-v2 with a focus on English descriptions. As a result, non-English communities and cultures may be underrepresented, potentially introducing bias towards white and Western norms. Outputs from non-English prompts are notably less accurate. Considering these biases, users are advised to exercise caution when using DeciDiffusion, irrespective of the input provided.
License
creativeml-openrail++-m
Inference Samples
Inference type | Python sample (Notebook) | CLI with YAML |
---|---|---|
Real time | text-to-image-online-endpoint.ipynb | text-to-image-online-endpoint.sh |
Batch | text-to-image-batch-endpoint.ipynb | text-to-image-batch-endpoint.sh |
Inference with Azure AI Content Safety (AACS) Samples
Inference type | Python sample (Notebook) |
---|---|
Real time | safe-text-to-image-online-deployment.ipynb |
Batch | safe-text-to-image-batch-endpoint.ipynb |
Sample input and output
Sample input
{
"input_data": {
"columns": ["prompt"],
"data": ["A photo of an astronaut riding a horse on Mars"],
"index": [0]
}
}
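The request payload above can be assembled programmatically. The sketch below is illustrative only: `build_payload` is a hypothetical helper, and sending the JSON to a deployed endpoint would additionally require a scoring URI and key, as shown in the sample notebooks.

```python
import json

def build_payload(prompt: str) -> str:
    # Build the request body in the shape the endpoint expects:
    # an "input_data" object with columns, data rows, and an index.
    body = {
        "input_data": {
            "columns": ["prompt"],
            "data": [prompt],
            "index": [0],
        }
    }
    return json.dumps(body)

payload = build_payload("A photo of an astronaut riding a horse on Mars")
print(payload)
```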
Sample output
[
{
"prompt": "A photo of an astronaut riding a horse on Mars",
"generated_image": "image",
"nsfw_content_detected": null
}
]
Note:
- The "generated_image" string is base64-encoded.
- The deci-decidiffusion-v1-0 model checks for NSFW content in the generated image. We highly recommend using the model with Azure AI Content Safety (AACS). Please refer to the sample online and batch notebooks for AACS-integrated deployments.
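Since the "generated_image" field is base64-encoded, it can be decoded back into image bytes with Python's standard library. A minimal sketch, assuming the string came from an endpoint response (the bytes below are a stand-in, not a real generated image):

```python
import base64

def save_generated_image(b64_image: str, path: str) -> None:
    # Decode the base64 payload back into raw image bytes and write them out.
    with open(path, "wb") as f:
        f.write(base64.b64decode(b64_image))

# Illustrative only: a real response would carry a full base64-encoded PNG.
fake_png_bytes = b"\x89PNG\r\n\x1a\n"  # the 8-byte PNG file signature
b64 = base64.b64encode(fake_png_bytes).decode("ascii")
save_generated_image(b64, "astronaut.png")
```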
Visualization of inference result for a sample prompt - "a photograph of an astronaut riding a horse"
Model Specifications
License: creativeml-openrail++-m
Last Updated: June 2024
Publisher: Deci AI