microsoft-magma-8b
Version: 7
HuggingFace · Last updated June 2025
microsoft/Magma-8B powered by Hugging Face Inference Toolkit

Send Request

You can use cURL or any REST client to send requests to the AzureML endpoint, authenticating with your AzureML token.
curl <AZUREML_ENDPOINT_URL>/v1/chat/completions \
    -X POST \
    -d '{"model":"microsoft/Magma-8B","messages":[{"role":"system","content":"You are agent that can see, talk and act."},{"role":"user","content":[{"type":"text","text":"What is in this image?"},{"type":"image_url","image_url":{"url":"https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"}}]}],"temperature":1.0,"stream":true}' \
    -H "Authorization: Bearer <AZUREML_TOKEN>" \
    -H "Content-Type: application/json"
For more information on the supported parameters, refer to the OpenAI documentation for chat completions. Additionally, if you plan to use this model for Visual-Language-Action (VLA) robotic use cases, you can also obtain the action token IDs from the generated tokens with a request like the following (note that streaming is set to false because you need all the generated tokens to obtain the action token IDs, the temperature is set to 0.1 for near-deterministic results, and the prompt ends with "\nProvide the action tokens for the robot." for better results):
curl <AZUREML_ENDPOINT_URL>/v1/chat/completions \
    -X POST \
    -d '{"model":"microsoft/Magma-8B","messages":[{"role":"user","content":[{"type":"text","text":"What is the next action token for the robot in the image to move its end-effector straight up in vertical position?\nProvide the action tokens for the robot."},{"type":"image_url","image_url":{"url":"https://rhinotoolhouse.com/wp-content/uploads/60107346_m_normal_none.jpg"}}]}],"temperature":0.1,"stream":false}' \
    -H "Authorization: Bearer <AZUREML_TOKEN>" \
    -H "Content-Type: application/json"
You can then call the /v1/action-tokens endpoint, which converts the generated output tokens into extracted actions for the given robot (either bridge_orig or google_robot, or with custom normalization stats provided via normalization_stats instead), as follows:
curl <AZUREML_ENDPOINT_URL>/v1/action-tokens \
    -X POST \
    -d '{"text":"立刻ẳn сирiв衝ційнаiв","robot_type":"bridge_orig"}' \
    -H "Authorization: Bearer <AZUREML_TOKEN>" \
    -H "Content-Type: application/json"

I/O Example

Visual Language Models (VLMs)

The following input payload:
{
    "model": "microsoft/Magma-8B",
    "messages": [
        {"role": "system", "content": "You are agent that can see, talk and act."},
        {"role": "user", "content": [
            {"type": "text", "text":"What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://assets-c4akfrf5b4d3f4b7.z01.azurefd.net/assets/2024/04/BMDataViz_661fb89f3845e.png"}}
        ]}
    ],
    "temperature": 1.0,
    "stream": false
}
Would produce a response like the following:
{
  "id": "chatcmpl-4f37b5c82d",
  "object": "chat.completion",
  "created": 1747162017,
  "model": "microsoft/Magma-8B",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "The image presents a bar graph titled "Percentage of respondents who agree". The graph is divided into four sections, each representing a different level of agreement with a statement. The statements are "Having clear and pre-defined goals for meetings", "Knowing where to find the information I need for a meeting", "Understanding my exact role and responsibilities", and "Having more focus time to prepare for meetings". 

The bars for the first two statements are green, indicating a high level of agreement. The bars for the third and fourth statements are yellow, suggesting a lower level of agreement. The bars for the first and second statements are slightly taller than those for the third and fourth statements, indicating that more respondents agreed with the first two statements than the third and fourth. 

The graph does not provide any additional text or context beyond the title and the statements. The relative positions of the bars suggest a comparison between the levels of agreement for the different statements. However, without additional information, it's difficult to determine the exact significance of these results.",
        "refusal": null,
        "role": "assistant",
        "annotations": []
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 37,
    "completion_tokens": 206,
    "reasoning_tokens": 0,
    "total_tokens": 243,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": "43d325a5-34c9-4bd4-a2e9-e2af6330d1e7"
}

Visual-Language-Action (VLA)

First, you send a request as follows:
curl <AZUREML_ENDPOINT_URL>/v1/chat/completions \
    -X POST \
    -d '{"model":"microsoft/Magma-8B","messages":[{"role":"user","content":[{"type":"text","text":"What is the next action token for the robot in the image to move its end-effector straight up in vertical position?\nProvide the action tokens for the robot."},{"type":"image_url","image_url":{"url":"https://rhinotoolhouse.com/wp-content/uploads/60107346_m_normal_none.jpg"}}]}],"temperature":0.1,"stream":false}' \
    -H "Authorization: Bearer <AZUREML_TOKEN>" \
    -H "Content-Type: application/json"
Producing a response like the following:
{
  "id": "chatcmpl-e3317d6d9b",
  "object": "chat.completion",
  "created": 1750069687,
  "model": "microsoft/Magma-8B",
  "choices": [
    {
      "index": 0,
      "message": {
        "content": "立刻ẳn сирiв衝ційнаiв",
        "refusal": null,
        "role": "assistant",
        "annotations": []
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 47,
    "completion_tokens": 8,
    "reasoning_tokens": 0,
    "total_tokens": 55,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": "38b490d1-fadc-4db9-b423-e3e998452fbf"
}
Then you need to grab the ".choices[0].message.content" value and send a request like the following:
curl <AZUREML_ENDPOINT_URL>/v1/action-tokens \
    -X POST \
    -d '{"text":"立刻ẳn сирiв衝ційнаiв","robot_type":"bridge_orig"}' \
    -H "Authorization: Bearer <AZUREML_TOKEN>" \
    -H "Content-Type: application/json"
Finally, you'll get the actions as follows:
{
  "x": 0.006277765673749571,
  "y": -0.0017191617944661772,
  "z": -0.004658273500116439,
  "roll": 0.0004997134208677839,
  "pitch": 0.00591804922503579,
  "yaw": -0.004902105561658515,
  "gripper": 0
}
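How these values are applied is up to your robot control stack; the endpoint only returns the de-normalized action components. As an illustration only, a small Python sketch that packs them into a 7-dimensional action vector (the ordering x, y, z, roll, pitch, yaw, gripper is an assumption, not enforced by the endpoint) could look like this:
import numpy as np

# Assumed ordering for illustration; match it to your controller's convention.
ACTION_KEYS = ("x", "y", "z", "roll", "pitch", "yaw", "gripper")

def to_action_vector(action: dict) -> np.ndarray:
    """Pack the /v1/action-tokens response into a 7-D action vector."""
    return np.array([action[key] for key in ACTION_KEYS], dtype=np.float64)

# Example using the response shown above.
action = {
    "x": 0.006277765673749571,
    "y": -0.0017191617944661772,
    "z": -0.004658273500116439,
    "roll": 0.0004997134208677839,
    "pitch": 0.00591804922503579,
    "yaw": -0.004902105561658515,
    "gripper": 0,
}
print(to_action_vector(action))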
Note that you can also send custom normalization stats via the normalization_stats parameter instead of relying on the default normalization stats selected by the robot_type parameter. The provided normalization_stats should contain mask (a list of booleans) and both q01 and q99 (lists of floats), each list with exactly 7 items (one per action dimension).
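For illustration, a request with custom normalization stats might look as follows in Python; the numbers below are made up and only show the expected shape of the payload (7 items per list):
import os

import requests

ENDPOINT_URL = os.environ["AZUREML_ENDPOINT_URL"]
HEADERS = {"Authorization": f"Bearer {os.environ['AZUREML_TOKEN']}"}

# Hypothetical normalization stats: mask is a list of 7 booleans, q01 and q99
# are lists of 7 floats (presumably the per-dimension percentile bounds used
# for de-normalization).
normalization_stats = {
    "mask": [True, True, True, True, True, True, False],
    "q01": [-0.05, -0.05, -0.05, -0.1, -0.1, -0.1, 0.0],
    "q99": [0.05, 0.05, 0.05, 0.1, 0.1, 0.1, 1.0],
}

response = requests.post(
    f"{ENDPOINT_URL}/v1/action-tokens",
    json={
        # Action tokens generated by the model, as in the example above.
        "text": "立刻ẳn сирiв衝ційнаiв",
        # Used instead of the defaults selected via robot_type.
        "normalization_stats": normalization_stats,
    },
    headers=HEADERS,
    timeout=60,
)
response.raise_for_status()
print(response.json())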
Model Specifications
License: MIT
Last Updated: June 2025
Publisher: HuggingFace