unsloth-glm-4.7-flash-gguf
Version: 2
unsloth/GLM-4.7-Flash-GGUF with Q4_K_XL quantization, powered by llama.cpp
GGUF
GGUF is a binary file format optimized for fast loading and saving of model weights, making it highly efficient for inference. It supports multiple quantization schemes, allowing users to choose the best trade-off between performance and resource usage. llama.cpp has native support for GGUF checkpoints, enabling seamless loading and serving of models stored in this format. This model's weights ship in GGUF format with Q4_K_XL quantization. For more information on GGUF, see the Hugging Face docs and the official GGUF documentation.
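As an illustration (not part of this endpoint's documentation), a quantized GGUF file like this one can also be loaded locally. The sketch below uses the llama-cpp-python bindings; the local filename is hypothetical and assumes you have already downloaded the Q4_K_XL weights.

# Minimal local-inference sketch with llama-cpp-python (pip install llama-cpp-python).
# The model path is a hypothetical local copy of the Q4_K_XL GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="GLM-4.7-Flash-Q4_K_XL.gguf",  # assumed local filename
    n_ctx=4096,  # context window; adjust to your memory budget
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is Deep Learning?"}],
    max_tokens=256,
    temperature=0.6,
)
print(result["choices"][0]["message"]["content"])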
Chat Completions API
Send Request
You can use cURL or any REST client to send a request to the Azure ML endpoint with your Azure ML token.
curl <AZUREML_ENDPOINT_URL> \
-X POST \
-H "Authorization: Bearer <AZUREML_TOKEN>" \
-H "Content-Type: application/json" \
-d '{"model":"unsloth/GLM-4.7-Flash-GGUF","messages":[{"role":"user","content":"What is Deep Learning?"}]}'
Supported Parameters
The following are the only mandatory parameters to send in the HTTP POST request to v1/chat/completions.
- model (string): Model ID used to generate the response. Since only a single model is deployed on this endpoint, you can either set it to unsloth/GLM-4.7-Flash-GGUF or leave it blank.
- messages (array): A list of messages comprising the conversation so far. Depending on the model you use, different message types (modalities) are supported, like text, images, and audio.
The full list of supported parameters can be reviewed in the OpenAPI specification served at /openapi.json for the current Azure ML endpoint.
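To inspect that schema programmatically, a sketch like the following could fetch the spec. This assumes /openapi.json is served under the same base URL and accepts the same bearer token; neither detail is confirmed by this page.

# Hypothetical sketch: list the routes exposed by the endpoint's OpenAPI spec.
import requests

spec = requests.get(
    "<AZUREML_ENDPOINT_URL>/openapi.json",  # assumed path under the endpoint base URL
    headers={"Authorization": "Bearer <AZUREML_TOKEN>"},  # assumed to require the same token
).json()
print(list(spec["paths"]))  # available routes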
Example payload
{
  "model": "unsloth/GLM-4.7-Flash-GGUF",
  "messages": [
    {"role": "user", "content": "What is Deep Learning?"}
  ],
  "max_completion_tokens": 256,
  "temperature": 0.6
}
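Sending this exact payload and extracting just the generated text can be done as below. This is a sketch that assumes the endpoint returns the OpenAI-compatible response shape, with the reply at choices[0].message.content.

# Sketch: send the example payload above and print only the generated text.
import requests

payload = {
    "model": "unsloth/GLM-4.7-Flash-GGUF",
    "messages": [{"role": "user", "content": "What is Deep Learning?"}],
    "max_completion_tokens": 256,
    "temperature": 0.6,
}
response = requests.post(
    "<AZUREML_ENDPOINT_URL>",
    headers={"Authorization": "Bearer <AZUREML_TOKEN>"},
    json=payload,  # requests sets Content-Type: application/json automatically
)
# Assumes an OpenAI-compatible response carrying the text at choices[0].message.content.
print(response.json()["choices"][0]["message"]["content"])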
Model Specifications
License: MIT
Last Updated: January 2026
Provider: Hugging Face