Gretel
Version: 1
Last updated: November 2024

MODEL METADATA

  • Model Provider Name: Gretel
  • Model Provider contact info: support@gretel.ai
  • Model Name: Gretel-Navigator-Tabular
  • Model Inference Task: Chat completion for data generation
  • Model Fine-tune support: No
  • Model Finetune Task: N/A
  • Supported data type: JSON, CSV
  • Supported language: en
  • Model License: Llama 3.1 Community License
  • Training cutoff date: December 2023
  • Training data: Public Sources and Synthetic Data
  • Context window: 128k tokens
  • Sample notebook links: https://github.com/gretelai/gretel-blueprints/tree/main/sdk_blueprints

Description

Gretel Navigator generates production-quality synthetic data optimized for AI and machine learning development from prompts, schema definitions, or seed examples. Unlike single-LLM approaches to data generation, Navigator employs a compound AI architecture specifically engineered for synthetic data, combining top open-source small language models (SLMs) fine-tuned across 10+ industry domains. This purpose-built system creates diverse, domain-specific datasets at scales of hundreds to millions of examples while preserving complex statistical relationships, and offers greater speed and accuracy than manual data creation. Top use cases:
  • Creating synthetic data for LLM training and fine-tuning
  • Generating evaluation datasets for AI models and RAG systems
  • Augmenting limited training data with diverse synthetic samples
  • Creating realistic PII/PHI data for model testing

Documentation and Resources

Input Examples

Natural Language Prompts

Generate customer bank transaction data with the following columns:
- customer_name: Full names in Western format
- transaction_date: Dates within the last 30 days
- transaction_amount: Dollar amounts between $1-$10,000
- transaction_type: Either 'debit' or 'credit'
- transaction_category: Common banking categories like 'dining', 'retail', 'utilities'
- account_balance: Running balance after each transaction

Schema-Based Input

CREATE TABLE transactions (
    customer_name VARCHAR(100),
    customer_id CHAR(8),
    transaction_date DATE,
    transaction_amount DECIMAL(10,2),
    transaction_type VARCHAR(6),
    transaction_category VARCHAR(50),
    account_balance DECIMAL(10,2)
);
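
Either input style above can be submitted as the message content of a chat completion against the OpenAI-compatible endpoint (shown in full in the examples below). A minimal sketch, assuming a placeholder endpoint URL and API key, and assuming each streamed chunk carries a generated row under a "table_data" key:

# Minimal sketch: submitting the schema above as a generation prompt
# (the endpoint URL and API key below are placeholders)
import json

from openai import OpenAI

client = OpenAI(
    base_url="https://<url>/v1/inference/oai/v1",
    api_key="XXX",
)

schema_prompt = """Generate transaction records matching this schema:
CREATE TABLE transactions (
    customer_name VARCHAR(100),
    customer_id CHAR(8),
    transaction_date DATE,
    transaction_amount DECIMAL(10,2),
    transaction_type VARCHAR(6),
    transaction_category VARCHAR(50),
    account_balance DECIMAL(10,2)
);"""

stream = client.chat.completions.create(
    model="gretelai/auto",
    messages=[{"role": "user", "content": schema_prompt}],
    stream=True,
    n=10,  # number of records to generate
)
for chunk in stream:
    if content := chunk.choices[0].delta.content:
        print(json.loads(content)["table_data"])  # one generated row per chunk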

OpenAI Example

# !pip install -Uqq openai==1.52.0
import json
import os

from openai import OpenAI

# Point the client at the Gretel OpenAI-compatible inference endpoint
client = OpenAI(
    base_url="https://<url>/v1/inference/oai/v1",
    api_key="XXX",
)

model = "gretelai/auto"
message = """Generate a mock dataset for users from the Foo company based in France.
Each user should have the following columns:
* first_name: traditional French first names.
* last_name: traditional French surnames. 
* email: formatted as the first letter of their first name followed by their last name @foo.io (e.g., jdupont@foo.io)
* gender: Male/Female
* city: a city in France
* country: always 'France'.
"""

chat_completion_stream = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": message,
        }
    ],
    stream=True,
    n=10,  # number of records to generate
    model=model,
)
for chunk in chat_completion_stream:
    if content := chunk.choices[0].delta.content:
        # each chunk streams back as a json row
        print(json.loads(content)["table_data"])

    if usage := chunk.usage:
        print(usage.model_dump())
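
Because each chunk's content is a JSON payload carrying the generated data under "table_data", the stream can also be collected into a pandas DataFrame. A minimal sketch that reuses the client, model, and message defined above; the accumulation pattern is illustrative rather than part of the Gretel API:

import pandas as pd

rows = []
chat_completion_stream = client.chat.completions.create(
    messages=[{"role": "user", "content": message}],
    stream=True,
    n=10,
    model=model,
)
for chunk in chat_completion_stream:
    if content := chunk.choices[0].delta.content:
        row = json.loads(content)["table_data"]
        # handle either a single row dict or a list of rows per chunk
        rows.extend(row if isinstance(row, list) else [row])

df = pd.DataFrame(rows)
print(df.head())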

Gretel Client Example

# !pip install gretel-client
import pandas as pd

from gretel_client import Gretel
from openai import OpenAI

PROMPT = """Generate a mock dataset for users from the Foo company based in France.
Each user should have the following columns:
* first_name: traditional French first names.
* last_name: traditional French surnames. 
* email: formatted as the first letter of their first name followed by their last name @foo.io (e.g., jdupont@foo.io)
* gender: Male/Female
* city: a city in France
* country: always 'France'.
"""

EDIT_PROMPT = """Edit the table and add the following columns:
* occupation: a random occupation
* education level: make it relevant to the occupation
"""

table_headers = ["first_name", "last_name", "email", "gender", "city", "country"]
table_data = [
    {
        "first_name": "Lea",
        "last_name": "Martin",
        "email": "lmartin@foo.io",
        "gender": "Female",
        "city": "Lyon",
        "country": "France",
    }
]

SAMPLE_DATA = pd.DataFrame(table_data, columns=table_headers)

STREAM = True

# client = AzureOpenAI(azure_endpoint="http://localhost:8000/v1/inference/oai/v1", api_key="abc", api_version="foo")

# The plain OpenAI client is used here for local testing, since the AzureOpenAI class requires an actual deployed Azure OpenAI endpoint
client = OpenAI(
    base_url="https://<endpoint>/v1/inference/oai/v1",
    api_key="abc"
)

MODEL = "gretelai-azure/gpt-3.5-turbo"

# Wrap the OpenAI-compatible client in Gretel's Navigator adapter
azure_open_ai = Gretel.create_navigator_azure_oai_adapter(client)

# Generate 10 records from the prompt, conditioned on the sample row above
usage, generated_df = azure_open_ai.generate(
    MODEL,
    PROMPT,
    num_records=10,
    sample_data=SAMPLE_DATA,
    stream=STREAM,
)

print(generated_df)
print("*****")
print(usage)

# Let's edit this now

EDIT_PROMPT = """Edit the table and add the following columns:
* Occupation: a random occupation
* Education level: make it relevant to the occupation
"""

# Add the requested columns to the previously generated table
usage, edited_df = azure_open_ai.edit(
    MODEL,
    EDIT_PROMPT,
    seed_data=generated_df,
    stream=STREAM,
)
print(edited_df)
print("*****")
print(usage)
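
The generated and edited tables are plain pandas DataFrames, so they can be written out in the supported data formats (CSV and JSON). The file names below are illustrative:

# Persist the edited table in the supported formats
edited_df.to_csv("foo_users.csv", index=False)
edited_df.to_json("foo_users.json", orient="records", indent=2)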

Transparency/Usage Guidance

Data Generation Architecture

  • Agentic workflow system for synthetic data generation
  • Multi-modal support (tabular, text)
  • Scalable generation (up to millions of records)
  • Underlying LLMs fine-tuned by Gretel on data and formats from 10 different industries, including healthcare, life sciences, financial services, manufacturing, and retail

Key Features

  • Natural language interface to specify data requirements
  • Schema-based data generation
  • Real-time and streaming data generation
  • Dataset augmentation and modification
  • Structured data supported as LLM inputs and outputs

Example Open Datasets

High-quality open synthetic datasets created using Navigator are available on Hugging Face:

Coming Soon: AI Data Designer

The AI Data Designer functionality, coming soon to Model-as-a-Service (MaaS), will provide an end-to-end synthetic data pipeline using Navigator, enabling iterative improvements, validation, and evaluation of generated datasets directly through the Navigator SDK.

Service Limitations

Responsible AI Considerations

Navigator is designed to democratize synthetic data generation while upholding high standards of responsible AI development. The system incorporates automated alignment checks to detect the generation of harmful or discriminatory data while respecting legitimate use cases across industries. Navigator is trained exclusively on high-quality, license-compliant datasets spanning 10+ sectors, ensuring both legal compliance and output quality. However, like any advanced AI system, Navigator may occasionally produce unexpected or biased outputs. We therefore recommend that users conduct appropriate testing and validation for their specific use cases. Gretel’s governance framework includes privacy-preserving architecture, regular security audits, and continuous monitoring for bias and quality control. Through ongoing model updates and strict access controls, we maintain alignment with responsible AI principles while protecting against potential misuse. Users are encouraged to review our Responsible Use Guidelines and implement appropriate safety measures based on their specific applications and industry requirements.
Model Specifications

  • Last Updated: November 2024
  • Input Type: Text
  • Output Type: Text
  • Provider: Gretel
  • Languages: 1 language (English)