Muse
Version: 3
Provider: Microsoft
Last updated: November 2025
Muse is a World and Human Action Model (WHAM), a generative model of gameplay (visuals and/or controller actions).

Key capabilities

About this model

Muse is an autoregressive model that has been trained to predict (tokenized) game visuals and controller actions given a prompt. The resulting model can generate consistent game sequences, and shows evidence of capturing the 3D structure of the game environment, the effects of controller actions, and the temporal structure of the game (up to the model's context length).

Key model capabilities

This allows the user to run the model in (a) world modelling mode (generate visuals given controller actions), (b) behaviour policy mode (generate controller actions given past visuals), or (c) full generation mode (generate both visuals and controller actions). Muse can be used in multiple scenarios. The following list illustrates the types of tasks that Muse can be used for:
  • World Model: Visuals are predicted, given a real starting state and action sequence.
  • Behaviour Policy: Given visuals, the model predicts the next controller action.
  • Full Generation: The model generates both the visuals and the controller actions a human player might take in the game.
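The three modes above differ only in which parts of the interleaved (observation, action) sequence are supplied and which are predicted. A minimal sketch, assuming a generic autoregressive model over such a sequence (the names `rollout` and `dummy` are illustrative, not the real Muse interface):

```python
# Sketch of the three usage modes over an interleaved sequence
# [obs_0, act_0, obs_1, act_1, ...]. `model` is a stand-in callable.

def rollout(model, prompt, steps, mode):
    """Extend `prompt` by `steps` (observation, action) pairs."""
    seq = list(prompt)
    for t in range(steps):
        # Observation: real in behaviour-policy mode, predicted otherwise.
        if mode == "behaviour_policy":
            seq.append(("obs", f"real_{t}"))
        else:
            seq.append(model(seq, "obs"))
        # Action: given in world-model mode, predicted otherwise.
        if mode == "world_model":
            seq.append(("act", f"given_{t}"))
        else:
            seq.append(model(seq, "act"))
    return seq

# Dummy "model" that just labels what it would have predicted.
dummy = lambda seq, kind: (kind, "predicted")

world = rollout(dummy, [("obs", "start")], steps=2, mode="world_model")
# world-model mode: visuals predicted, actions supplied by the caller
```

In full generation mode neither branch takes the "supplied" path, so every observation and action after the prompt is sampled from the model.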

Use cases

See Responsible AI for additional considerations for responsible use.

Key use cases

This model and accompanying code are intended for academic research purposes only. Muse has been trained on gameplay data from a single game, Bleeding Edge, and is intended to be used to generate plausible gameplay sequences resembling this game.

Out of scope use cases

The model is not intended to be used to generate imagery outside of the game Bleeding Edge. Generated images include watermark and provenance metadata. Do not remove the watermark or provenance metadata.

Pricing

Pricing is based on a number of factors, including deployment type and tokens used. See pricing details here.

Technical specs

  • Architecture: A decoder-only transformer that predicts the next token corresponding to an interleaved sequence of observations and actions. The image tokenizer is a VQ-GAN.
  • Context length: 10 (observation, action) pairs / 5560 tokens
  • GPUs: 98× H100
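The stated context length implies a fixed per-pair token budget; a back-of-the-envelope check (the split between image tokens and action tokens within each pair is not stated here, so only the total is used):

```python
# Token budget implied by the specs: 10 (observation, action) pairs
# span 5560 tokens, i.e. 556 tokens per pair on average.
context_pairs = 10
context_tokens = 5560

tokens_per_pair = context_tokens // context_pairs
print(tokens_per_pair)  # 556
```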

Training cut-off date

The provider has not supplied this information.

Training time

Training time: 5 days

Input formats

Prompts can consist of visuals (one or more initial game frames), controller actions, or both.

Output formats

The provider has not supplied this information.

Supported languages

The provider has not supplied this information.

Sample JSON response

The provider has not supplied this information.

Model architecture

Muse consists of two components: an encoder-decoder VQ-GAN trained to encode game visuals into a discrete representation, and a transformer backbone trained to perform next-token prediction. Both components are trained from scratch.
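The step that turns game visuals into discrete tokens for the transformer is vector quantization: each encoder output vector is replaced by the index of its nearest codebook entry. A minimal sketch of that step (codebook size and dimensions here are made up for illustration, not Muse's actual values):

```python
# Nearest-codebook lookup at the heart of a VQ-GAN encoder.
import numpy as np

def quantize(latents: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each latent vector (N, D) to its nearest codebook index (N,)."""
    # Squared Euclidean distance between every latent and every code.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # discrete token ids for the transformer

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))  # 16 codes, 8-dim (illustrative)
# Latents that sit close to codes 3, 7 and 0:
latents = codebook[[3, 7, 0]] + 0.01 * rng.normal(size=(3, 8))
print(quantize(latents, codebook))   # -> [3 7 0]
```

The resulting index sequence, interleaved with tokenized controller actions, is what the decoder-only transformer is trained to predict.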

Long context

  • Limited context length (10s)

Optimizing model performance

The provider has not supplied this information.

Additional assets

One of the ways to interact with the model is through the WHAM Demonstrator (download link here). To use it, ensure that you have the required .NET Core Runtime. If it is not yet installed, an error message will appear with a link to download and install the package. Refer to the instructions provided in the zip file for more information.

Training disclosure

Training, testing and validation

Muse was trained on human gameplay data to predict game visuals and players' controller actions. We worked with the game studio Ninja Theory and their game Bleeding Edge, a 3D, 4v4 multiplayer video game. From the resulting data we extracted one year's worth of anonymized gameplay from 27,990 players, capturing a wide range of behaviors and interactions. The model was trained on data from approximately 500,000 Bleeding Edge games across all seven game maps (over 1 billion (observation, action) pairs at 10 Hz, equivalent to over 7 years of continuous human gameplay).

Distribution

Distribution channels

The provider has not supplied this information.

More information

This work has been funded by Microsoft Research.

Responsible AI considerations

Safety techniques

Generated images include watermark and provenance metadata. Do not remove the watermark or provenance metadata. Muse has been tested with out-of-context prompt images to evaluate the risk of outputting harmful or nonsensical images. The generated image sequences did not retain the initial image, but rather dissolved into unrecognizable blobs or into scenes resembling the training environment.

Safety evaluations

The provider has not supplied this information.

Known limitations

Models trained on game data may behave in ways that are unfair, unreliable, or offensive, in turn causing harms. These types of harms are not mutually exclusive: a single model can exhibit more than one type of harm, potentially relating to multiple different groups of people. For example, the output of the model can be nonsensical, or can look reasonable while being inaccurate with respect to external validation sources. The training data represents gameplay recordings from players of varying skill levels and diverse demographic characteristics; not all possible player characteristics are represented, so model performance may vary. The model, as released, can only be used to generate visuals and controller inputs. Users should not manipulate images in an attempt to generate offensive scenes.
Model:
  • Trained on a single game; highly specialized and not intended for image prompts that are out of context or from other domains.
  • Limited context length (10s).
  • Limited image resolution (300 × 180 px); the model can only generate images at this fixed resolution.
  • Generated images and controls can be incorrect or unrecognizable.
  • Inference time is currently too slow for real-time use.
WHAM Demonstrator:
  • Developed as a way to explore potential interactions. This is not intended as a fully-fledged user experience or demo.
Although users can input any image as a starting point, the model is only trained to generate images and controller actions based on the structure of the Bleeding Edge game environment that it has learned from the training data. Out-of-domain inputs lead to unpredictable results; for example, a sequence of images may dissolve into unrecognizable blobs. When "out of scope" image elements are introduced, model generations will either:
  • Dissolve into unrecognizable blobs of color, or
  • Morph into game-relevant items such as game characters.

Acceptable use

Acceptable use policy

This model and accompanying code are intended for academic research purposes only. Muse has been trained on gameplay data from a single game, Bleeding Edge, and is intended to be used to generate plausible gameplay sequences resembling this game. Muse can be used in multiple scenarios. The following list illustrates the types of tasks that Muse can be used for:
  • World Model: Visuals are predicted, given a real starting state and action sequence.
  • Behaviour Policy: Given visuals, the model predicts the next controller action.
  • Full Generation: The model generates both the visuals and the controller actions a human player might take in the game.
The model is not intended to be used to generate imagery outside of the game Bleeding Edge. Generated images include watermark and provenance metadata. Do not remove the watermark or provenance metadata.

Quality and performance evaluations

Source: Microsoft

The provider has not supplied this information.

Benchmarking methodology

Muse has been tested with out-of-context prompt images to evaluate the risk of outputting harmful or nonsensical images. The generated image sequences did not retain the initial image, but rather dissolved into unrecognizable blobs or into scenes resembling the training environment.

Public data summary

The provider has not supplied this information.
Model Specifications
  • License: Apache-2.0
  • Last Updated: November 2025
  • Input Type: Image
  • Output Type: Image
  • Provider: Microsoft
  • Languages: 1 Language