Gretel
Version: 1
MODEL METADATA
- Model Provider Name: Gretel
- Model provider logo:
- Model Provider contact info: support@gretel.ai
- Model Name: Gretel-Navigator-Tabular
- Model Inference Task: Chat completion for data generation
- Model Fine-tune support: N
- Model Finetune Task: N/A
- Supported data type: JSON, CSV
- Supported language: en
- Model License: Llama 3.1 Community License
- Training cutoff date: December 2023
- Training data: Public Sources and Synthetic Data
- Context window: 128k tokens
- Sample notebook links: https://github.com/gretelai/gretel-blueprints/tree/main/sdk_blueprints
Description
Gretel Navigator generates production-quality synthetic data, optimized for AI and machine learning development, from prompts, schema definitions, or seed examples. Unlike single-LLM approaches to data generation, Navigator employs a compound AI architecture specifically engineered for synthetic data, combining top open-source small language models (SLMs) fine-tuned across 10+ industry domains. This purpose-built system creates diverse, domain-specific datasets at scales of hundreds to millions of examples while preserving complex statistical relationships, and it is faster and more accurate than manual data creation.
Top use cases:
- Creating synthetic data for LLM training and fine-tuning
- Generating evaluation datasets for AI models and RAG systems
- Augmenting limited training data with diverse synthetic samples
- Creating realistic PII/PHI data for model testing
Documentation and Resources
Input Examples
Natural Language Prompts
Generate customer bank transaction data with the following columns:
- customer_name: Full names in Western format
- transaction_date: Dates within the last 30 days
- transaction_amount: Dollar amounts between $1-$10,000
- transaction_type: Either 'debit' or 'credit'
- transaction_category: Common banking categories like 'dining', 'retail', 'utilities'
- account_balance: Running balance after each transaction
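A prompt like the one above can also be assembled programmatically from a column specification. The following is a minimal sketch, not part of the Gretel SDK; the helper name `build_prompt` and its arguments are hypothetical and exist only for illustration.

```python
def build_prompt(subject: str, columns: dict[str, str]) -> str:
    """Render a column specification as a natural-language prompt.

    Hypothetical helper for illustration; it is not a Gretel API.
    """
    lines = [f"Generate {subject} with the following columns:"]
    lines += [f"- {name}: {description}" for name, description in columns.items()]
    return "\n".join(lines)


columns = {
    "customer_name": "Full names in Western format",
    "transaction_date": "Dates within the last 30 days",
    "transaction_amount": "Dollar amounts between $1-$10,000",
    "transaction_type": "Either 'debit' or 'credit'",
}
prompt = build_prompt("customer bank transaction data", columns)
print(prompt)
```

Keeping the column descriptions in a dictionary makes it easier to reuse the same specification when checking the generated records later.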
Schema-Based Input
CREATE TABLE transactions (
customer_name VARCHAR(100),
customer_id CHAR(8),
transaction_date DATE,
transaction_amount DECIMAL(10,2),
transaction_type VARCHAR(6),
transaction_category VARCHAR(50),
account_balance DECIMAL(10,2)
);
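One way to use a schema like this is to send the DDL as the user message and then check that each generated record contains the expected columns. The sketch below assumes the DDL is passed as plain prompt text; the wrapper sentence in `messages` and the helpers `column_names` and `missing_columns` are hypothetical and are not part of any Gretel API.

```python
import re

DDL = """CREATE TABLE transactions (
    customer_name VARCHAR(100),
    customer_id CHAR(8),
    transaction_date DATE,
    transaction_amount DECIMAL(10,2),
    transaction_type VARCHAR(6),
    transaction_category VARCHAR(50),
    account_balance DECIMAL(10,2)
);"""


def column_names(ddl: str) -> list[str]:
    """Extract column names from a simple, single-table CREATE TABLE statement."""
    body = ddl[ddl.index("(") + 1 : ddl.rindex(")")]
    # Split on commas that are not inside parentheses, e.g. DECIMAL(10,2).
    parts = re.split(r",(?![^()]*\))", body)
    return [part.split()[0] for part in parts if part.strip()]


def missing_columns(row: dict, expected: list[str]) -> list[str]:
    """Return the expected column names that are absent from a generated row."""
    return [name for name in expected if name not in row]


# Assumption: the schema is sent as ordinary prompt text in the user message.
messages = [{"role": "user", "content": f"Generate data matching this schema:\n{DDL}"}]

expected = column_names(DDL)
print(expected)
print(missing_columns({"customer_name": "Ada Lovelace"}, expected))
```

The parser only handles flat column definitions like the ones shown; constraints such as PRIMARY KEY clauses would need a real SQL parser.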
OpenAI Example
# !pip install -Uqq openai==1.52.0
import json
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://<url>/v1/inference/oai/v1",
    api_key="XXX",
)
model = "gretelai/auto"
message = """Generate a mock dataset for users from the Foo company based in France.
Each user should have the following columns:
* first_name: traditional French first names.
* last_name: traditional French surnames.
* email: formatted as the first letter of their first name followed by their last name @foo.io (e.g., jdupont@foo.io)
* gender: Male/Female
* city: a city in France
* country: always 'France'.
"""
chat_completion_stream = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": message,
        }
    ],
    stream=True,
    n=10,
    model=model,
)
for chunk in chat_completion_stream:
    # Guard against chunks that carry no choices (for example, a usage-only chunk).
    if chunk.choices and (content := chunk.choices[0].delta.content):
        # each chunk streams back as a json row
        print(json.loads(content)["table_data"])
    if usage := chunk.usage:
        print(usage.model_dump())
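Instead of printing each streamed row, the rows can be accumulated and written out as CSV, one of the supported data types listed above. This is a minimal sketch using only the standard library. It assumes that each chunk's content is a JSON object whose `table_data` value is either a single row (a dict) or a list of rows; the exact shape is not confirmed by this document. The `fake_chunks` list stands in for the `chunk.choices[0].delta.content` strings from the streaming loop above.

```python
import csv
import io
import json


def rows_from_chunk(content: str) -> list[dict]:
    """Parse one streamed chunk payload into a list of row dicts."""
    table_data = json.loads(content)["table_data"]
    # Assumption: "table_data" is either a single row (dict) or a list of rows.
    return [table_data] if isinstance(table_data, dict) else list(table_data)


def to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in payloads; in real use these come from the streaming response.
fake_chunks = [
    '{"table_data": {"first_name": "Lea", "city": "Lyon"}}',
    '{"table_data": [{"first_name": "Hugo", "city": "Nantes"}]}',
]
rows = [row for content in fake_chunks for row in rows_from_chunk(content)]
csv_text = to_csv(rows)
print(csv_text)
```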
Gretel Client Example
# !pip install gretel-client
import pandas as pd
from gretel_client import Gretel
PROMPT = """Generate a mock dataset for users from the Foo company based in France.
Each user should have the following columns:
* first_name: traditional French first names.
* last_name: traditional French surnames.
* email: formatted as the first letter of their first name followed by their last name @foo.io (e.g., jdupont@foo.io)
* gender: Male/Female
* city: a city in France
* country: always 'France'.
"""
EDIT_PROMPT = """Edit the table and add the following columns:
* occupation: a random occupation
* education level: make it relevant to the occupation
"""
table_headers = ["first_name", "last_name", "email", "gender", "city", "country"]
table_data = [
    {
        "first_name": "Lea",
        "last_name": "Martin",
        "email": "lmartin@foo.io",
        "gender": "Female",
        "city": "Lyon",
        "country": "France",
    }
]
SAMPLE_DATA = pd.DataFrame(table_data, columns=table_headers)
STREAM = True
from openai import OpenAI
# The plain OpenAI client below is used for local testing, since the AzureOpenAI class
# requires an actual deployed Azure OpenAI endpoint. The AzureOpenAI equivalent would be:
# client = AzureOpenAI(azure_endpoint="http://localhost:8000/v1/inference/oai/v1", api_key="abc", api_version="foo")
client = OpenAI(
    base_url="https://<endpoint>/v1/inference/oai/v1",
    api_key="abc",
)
MODEL = "gretelai-azure/gpt-3.5-turbo"
azure_open_ai = Gretel.create_navigator_azure_oai_adapter(client)
usage, generated_df = azure_open_ai.generate(
    MODEL,
    PROMPT,
    num_records=10,
    sample_data=SAMPLE_DATA,
    stream=STREAM,
)
print(generated_df)
print("*****")
print(usage)
# Now edit the generated table, using the EDIT_PROMPT defined above
usage, edited_df = azure_open_ai.edit(
    MODEL,
    EDIT_PROMPT,
    seed_data=generated_df,
    stream=STREAM,
)
print(edited_df)
print("*****")
print(usage)
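Because generated data can deviate from the prompt, it can help to check each record against the rules the prompt states. The sketch below validates rows against the example prompt above (country, gender, and email format). The helpers `ascii_slug`, `expected_email`, and `validate_row` are hypothetical and not part of `gretel_client`; stripping accents before building the expected email is an assumption about how names such as "Léa" should map to an address.

```python
import unicodedata


def ascii_slug(text: str) -> str:
    """Lowercase the text and drop accents, spaces, and punctuation."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if ch.isascii() and ch.isalnum()).lower()


def expected_email(first_name: str, last_name: str) -> str:
    """First letter of the first name followed by the last name, @foo.io."""
    return f"{ascii_slug(first_name)[:1]}{ascii_slug(last_name)}@foo.io"


def validate_row(row: dict) -> list[str]:
    """Return a list of rule violations for one generated record."""
    problems = []
    if row.get("country") != "France":
        problems.append("country must be 'France'")
    if row.get("gender") not in {"Male", "Female"}:
        problems.append("gender must be Male or Female")
    expected = expected_email(row.get("first_name", ""), row.get("last_name", ""))
    if row.get("email") != expected:
        problems.append(f"email should be {expected}")
    return problems


good = {"first_name": "Lea", "last_name": "Martin", "email": "lmartin@foo.io",
        "gender": "Female", "city": "Lyon", "country": "France"}
bad = {"first_name": "Léa", "last_name": "Martin", "email": "lea@foo.io",
       "gender": "Female", "city": "Lyon", "country": "France"}
print(validate_row(good))
print(validate_row(bad))
```

With a pandas DataFrame such as `generated_df`, the rows can be obtained with `generated_df.to_dict("records")` and passed to `validate_row` one at a time.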
Transparency/Usage Guidance
Data Generation Architecture
- Agentic workflow system for synthetic data generation
- Multi-modal support (tabular, text)
- Scalable generation (up to millions of records)
- Underlying LLMs fine-tuned by Gretel on data and formats from 10+ industries, including healthcare, life sciences, financial services, manufacturing, and retail
Key Features
- Natural language interface to specify data requirements
- Schema-based data generation
- Real-time and streaming data generation
- Dataset augmentation and modification
- Structured data supported as LLM inputs and outputs
Example Open Datasets
High-quality open synthetic datasets created using Navigator are available on Hugging Face:
- Text-to-SQL Dataset: Large-scale dataset for SQL generation
- GSM8K Math Problem Solving Dataset: AI reasoning dataset
- Multilingual Financial PII Dataset: Financial services training data
Coming Soon: AI Data Designer
The AI Data Designer functionality, coming soon to Model-as-a-Service (MaaS), will provide an end-to-end synthetic data pipeline using Navigator, enabling iterative improvements, validation, and evaluation of generated datasets directly through the Navigator SDK.
Service Limitations
- Fine-tuning capability is not yet available in MaaS.
- See Navigator Fine-tuning (for tabular data) in Gretel Cloud.
- See Gretel GPT Fine-tuning (for natural language) in Gretel Cloud.
- The Batch SDK for large-scale generation is currently limited to Gretel Cloud.
Responsible AI Considerations
Navigator is designed to democratize synthetic data generation while upholding high standards of responsible AI development. The system incorporates automated alignment checks to detect the generation of harmful or discriminatory data while respecting legitimate use cases across industries. Navigator is trained exclusively on high-quality, license-compliant datasets spanning 10+ sectors, ensuring both legal compliance and output quality. However, like any advanced AI system, Navigator may occasionally produce unexpected or biased outputs. We therefore recommend that users conduct appropriate testing and validation for their specific use cases. Gretel’s governance framework includes privacy-preserving architecture, regular security audits, and continuous monitoring for bias and quality control. Through ongoing model updates and strict access controls, we maintain alignment with responsible AI principles while protecting against potential misuse. Users are encouraged to review our Responsible Use Guidelines and implement appropriate safety measures based on their specific applications and industry requirements.
Model Specifications
- Last Updated: November 2024
- Input Type: Text
- Output Type: Text
- Provider: Gretel
- Languages: 1 Language