Muse
Version: 3
Provider: Microsoft
Last updated: November 2025
Muse is a World and Human Action Model (WHAM), a generative model of gameplay (visuals and/or controller actions).

Key capabilities

About this model

Muse is an autoregressive model that has been trained to predict (tokenized) game visuals and controller actions given a prompt. The resulting model can generate consistent game sequences, and shows evidence of capturing the 3D structure of the game environment, the effects of controller actions, and the temporal structure of the game (up to the model's context length).

Key model capabilities

This allows the user to run the model in (a) world modelling mode (generate visuals given controller actions), (b) behaviour policy mode (generate controller actions given past visuals), or (c) full generation mode (generate both visuals and controller actions). Muse can be used in multiple scenarios. The following list illustrates the types of tasks that Muse can be used for:
  • World Model: Visuals are predicted, given a real starting state and action sequence.
  • Behaviour Policy: Given visuals, the model predicts the next controller action.
  • Full Generation: The model generates both the visuals and the controller actions a human player might take in the game.
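The three modes above differ only in which parts of the interleaved (observation, action) sequence are supplied and which are predicted. A minimal sketch, assuming a generic autoregressive model over such a sequence (the names `rollout` and `dummy` are illustrative, not the real Muse interface):

```python
# Sketch of the three usage modes over an interleaved sequence
# [obs_0, act_0, obs_1, act_1, ...]. `model` is a stand-in callable.

def rollout(model, prompt, steps, mode):
    """Extend `prompt` by `steps` (observation, action) pairs."""
    seq = list(prompt)
    for t in range(steps):
        # Observation: real in behaviour-policy mode, predicted otherwise.
        if mode == "behaviour_policy":
            seq.append(("obs", f"real_{t}"))
        else:
            seq.append(model(seq, "obs"))
        # Action: given in world-model mode, predicted otherwise.
        if mode == "world_model":
            seq.append(("act", f"given_{t}"))
        else:
            seq.append(model(seq, "act"))
    return seq

# Dummy "model" that just labels what it would have predicted.
dummy = lambda seq, kind: (kind, "predicted")

world = rollout(dummy, [("obs", "start")], steps=2, mode="world_model")
# world-model mode: visuals predicted, actions supplied by the caller
```

In full generation mode neither branch takes the "supplied" path, so every observation and action after the prompt is sampled from the model.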

Use cases

See Responsible AI for additional considerations for responsible use.

Key use cases

This model and accompanying code are intended for academic research purposes only. Muse has been trained on gameplay data from a single game, Bleeding Edge, and is intended to be used to generate plausible gameplay sequences resembling this game.

Out of scope use cases

The model is not intended to be used to generate imagery outside of the game Bleeding Edge. Generated images include watermark and provenance metadata. Do not remove the watermark or provenance metadata.

Pricing

Pricing is based on a number of factors, including deployment type and tokens used. See pricing details here.

Technical specs

  • Architecture: A decoder-only transformer that predicts the next token corresponding to an interleaved sequence of observations and actions. The image tokenizer is a VQ-GAN.
  • Context length: 10 (observation, action) pairs / 5560 tokens
  • GPUs: 98× H100
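The stated context length implies a fixed per-pair token budget; a back-of-the-envelope check (the split between image tokens and action tokens within each pair is not stated here, so only the total is used):

```python
# Token budget implied by the specs: 10 (observation, action) pairs
# span 5560 tokens, i.e. 556 tokens per pair on average.
context_pairs = 10
context_tokens = 5560

tokens_per_pair = context_tokens // context_pairs
print(tokens_per_pair)  # 556
```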

Training cut-off date

The provider has not supplied this information.

Training time

Training time: 5 days

Input formats

Prompts can consist of visuals (one or more initial game frames), controller actions, or both.

Output formats

The provider has not supplied this information.

Supported languages

The provider has not supplied this information.

Sample JSON response

The provider has not supplied this information.

Model architecture

Muse consists of two components: an encoder-decoder VQ-GAN trained to encode game visuals into a discrete representation, and a transformer backbone trained to perform next-token prediction. Both components are trained from scratch.
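The step that turns game visuals into discrete tokens for the transformer is vector quantization: each encoder output vector is replaced by the index of its nearest codebook entry. A minimal sketch of that step (codebook size and dimensions here are made up for illustration, not Muse's actual values):

```python
# Nearest-codebook lookup at the heart of a VQ-GAN encoder.
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each latent vector (N, D) to its nearest codebook index (N,)."""
    # Squared Euclidean distance between every latent and every code.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # discrete token ids for the transformer

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))  # 16 codes, 8-dim (illustrative)
# Latents that sit close to codes 3, 7 and 0:
latents = codebook[[3, 7, 0]] + 0.01 * rng.normal(size=(3, 8))
print(quantize(latents, codebook))   # -> [3 7 0]
```

The resulting index sequence, interleaved with tokenized controller actions, is what the decoder-only transformer is trained to predict.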

Long context

  • Limited context length (10s)

Optimizing model performance

The provider has not supplied this information.

Additional assets

One of the ways to interact with the model is through the WHAM Demonstrator (download link here). To use it, ensure that you have the required .NET Core Runtime. If it is not yet installed, an error message will appear with a link to download and install the package. Refer to the instructions provided in the zip file for more information.

Training disclosure

Training, testing and validation

Muse was trained on human gameplay data to predict game visuals and players' controller actions. We worked with the game studio Ninja Theory and their game Bleeding Edge, a 3D, 4v4 multiplayer video game. From the resulting data we extracted one year's worth of anonymized gameplay from 27,990 players, capturing a wide range of behaviors and interactions. The model was trained on data from approximately 500,000 Bleeding Edge games across all seven game maps (over 1 billion (observation, action) pairs at 10 Hz, equivalent to over 7 years of continuous human gameplay).

Distribution

Distribution channels

The provider has not supplied this information.

More information

This work has been funded by Microsoft Research.

Responsible AI considerations

Safety techniques

Generated images include watermark and provenance metadata. Do not remove the watermark or provenance metadata. Muse has been tested with out-of-context prompt images to evaluate the risk of outputting harmful or nonsensical images. The generated image sequences did not retain the initial image, but rather dissolved into unrecognizable blobs or into scenes resembling the training environment.

Safety evaluations

The provider has not supplied this information.

Known limitations

Models trained on game data may behave in ways that are unfair, unreliable, or offensive, in turn causing harms. These types of harms are not mutually exclusive: a single model can exhibit more than one type of harm, potentially relating to multiple different groups of people. For example, the output of the model can be nonsensical, or can look reasonable while being inaccurate with respect to external validation sources. The training data represents gameplay recordings from players of varying skill levels and diverse demographic characteristics; not all possible player characteristics are represented, so model performance may vary. The model, as released, can only be used to generate visuals and controller inputs. Users should not manipulate images in an attempt to generate offensive scenes.
Model:
  • Trained on a single game; highly specialized and not intended for image prompts that are out of context or from other domains.
  • Limited context length (10s).
  • Limited image resolution (300 × 180 px); the model can only generate images at this fixed resolution.
  • Generated images and controls can be incorrect or unrecognizable.
  • Inference time is currently too slow for real-time use.
WHAM Demonstrator:
  • Developed as a way to explore potential interactions. This is not intended as a fully-fledged user experience or demo.
Although users can input any image as a starting point, the model is only trained to generate images and controller actions based on the structure of the Bleeding Edge game environment that it has learned from the training data. Out-of-domain inputs lead to unpredictable results; for example, a sequence of images may dissolve into unrecognizable blobs. When "out of scope" image elements are introduced, model generations will either:
  • Dissolve into unrecognizable blobs of color, or
  • Morph into game-relevant items such as game characters.

Acceptable use

Acceptable use policy

This model and accompanying code are intended for academic research purposes only. Muse has been trained on gameplay data from a single game, Bleeding Edge, and is intended to be used to generate plausible gameplay sequences resembling this game. Muse can be used in multiple scenarios. The following list illustrates the types of tasks that Muse can be used for:
  • World Model: Visuals are predicted, given a real starting state and action sequence.
  • Behaviour Policy: Given visuals, the model predicts the next controller action.
  • Full Generation: The model generates both the visuals and the controller actions a human player might take in the game.
The model is not intended to be used to generate imagery outside of the game Bleeding Edge. Generated images include watermark and provenance metadata. Do not remove the watermark or provenance metadata.

Quality and performance evaluations

Source: Microsoft

The provider has not supplied this information.

Benchmarking methodology

Muse has been tested with out-of-context prompt images to evaluate the risk of outputting harmful or nonsensical images. The generated image sequences did not retain the initial image, but rather dissolved into unrecognizable blobs or into scenes resembling the training environment.

Public data summary

The provider has not supplied this information.
Model Specifications
  • License: Apache-2.0
  • Last Updated: November 2025
  • Input Type: Image
  • Output Type: Image
  • Provider: Microsoft
  • Languages: 1 Language