Gretel
Version: 1
Last updated: November 2024

MODEL METADATA

  • Model Provider Name: Gretel
  • Model Provider contact info: support@gretel.ai
  • Model Name: Gretel-Navigator-Tabular
  • Model Inference Task: Chat completion for data generation
  • Model Fine-tune support: No
  • Model Finetune Task: N/A
  • Supported data type: JSON, CSV
  • Supported language: en
  • Model License: Llama 3.1 Community License
  • Training cutoff date: December 2023
  • Training data: Public Sources and Synthetic Data
  • Context window: 128k tokens
  • Sample notebook links: https://github.com/gretelai/gretel-blueprints/tree/main/sdk_blueprints

Description

Gretel Navigator generates production-quality synthetic data optimized for AI and machine learning development from prompts, schema definitions, or seed examples. Unlike single-LLM approaches to data generation, Navigator employs a compound AI architecture specifically engineered for synthetic data, combining top open-source small language models (SLMs) fine-tuned across 10+ industry domains. This purpose-built system creates diverse, domain-specific datasets at scales of hundreds to millions of examples while preserving complex statistical relationships, and offers greater speed and accuracy than manual data creation. Top use cases:
  • Creating synthetic data for LLM training and fine-tuning
  • Generating evaluation datasets for AI models and RAG systems
  • Augmenting limited training data with diverse synthetic samples
  • Creating realistic PII/PHI data for model testing

Documentation and Resources

Input Examples

Natural Language Prompts

Generate customer bank transaction data with the following columns:
- customer_name: Full names in Western format
- transaction_date: Dates within the last 30 days
- transaction_amount: Dollar amounts between $1-$10,000
- transaction_type: Either 'debit' or 'credit'
- transaction_category: Common banking categories like 'dining', 'retail', 'utilities'
- account_balance: Running balance after each transaction

Schema-Based Input

CREATE TABLE transactions (
    customer_name VARCHAR(100),
    customer_id CHAR(8),
    transaction_date DATE,
    transaction_amount DECIMAL(10,2),
    transaction_type VARCHAR(6),
    transaction_category VARCHAR(50),
    account_balance DECIMAL(10,2)
);
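
Either input style above can be submitted as the message content of a chat completion against the OpenAI-compatible endpoint (shown in full in the examples below). A minimal sketch, assuming a placeholder endpoint URL and API key, and assuming each streamed chunk carries a generated row under a "table_data" key:

# Minimal sketch: submitting the schema above as a generation prompt
# (the endpoint URL and API key below are placeholders)
import json

from openai import OpenAI

client = OpenAI(
    base_url="https://<url>/v1/inference/oai/v1",
    api_key="XXX",
)

schema_prompt = """Generate transaction records matching this schema:
CREATE TABLE transactions (
    customer_name VARCHAR(100),
    customer_id CHAR(8),
    transaction_date DATE,
    transaction_amount DECIMAL(10,2),
    transaction_type VARCHAR(6),
    transaction_category VARCHAR(50),
    account_balance DECIMAL(10,2)
);"""

stream = client.chat.completions.create(
    model="gretelai/auto",
    messages=[{"role": "user", "content": schema_prompt}],
    stream=True,
    n=10,  # number of records to generate
)
for chunk in stream:
    if content := chunk.choices[0].delta.content:
        print(json.loads(content)["table_data"])  # one generated row per chunk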

OpenAI Example

# !pip install -Uqq openai==1.52.0
import json
import os

from openai import OpenAI

# Point the client at the Gretel OpenAI-compatible inference endpoint
client = OpenAI(
    base_url="https://<url>/v1/inference/oai/v1",
    api_key="XXX",
)

model = "gretelai/auto"
message = """Generate a mock dataset for users from the Foo company based in France.
Each user should have the following columns:
* first_name: traditional French first names.
* last_name: traditional French surnames. 
* email: formatted as the first letter of their first name followed by their last name @foo.io (e.g., jdupont@foo.io)
* gender: Male/Female
* city: a city in France
* country: always 'France'.
"""

chat_completion_stream = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": message,
        }
    ],
    stream=True,
    n=10,  # number of records to generate
    model=model,
)
for chunk in chat_completion_stream:
    if content := chunk.choices[0].delta.content:
        # each chunk streams back as a json row
        print(json.loads(content)["table_data"])

    if usage := chunk.usage:
        print(usage.model_dump())
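
Because each chunk's content is a JSON payload carrying the generated data under "table_data", the stream can also be collected into a pandas DataFrame. A minimal sketch that reuses the client, model, and message defined above; the accumulation pattern is illustrative rather than part of the Gretel API:

import pandas as pd

rows = []
chat_completion_stream = client.chat.completions.create(
    messages=[{"role": "user", "content": message}],
    stream=True,
    n=10,
    model=model,
)
for chunk in chat_completion_stream:
    if content := chunk.choices[0].delta.content:
        row = json.loads(content)["table_data"]
        # handle either a single row dict or a list of rows per chunk
        rows.extend(row if isinstance(row, list) else [row])

df = pd.DataFrame(rows)
print(df.head())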

Gretel Client Example

# !pip install gretel-client
import pandas as pd

from gretel_client import Gretel
from openai import OpenAI

PROMPT = """Generate a mock dataset for users from the Foo company based in France.
Each user should have the following columns:
* first_name: traditional French first names.
* last_name: traditional French surnames. 
* email: formatted as the first letter of their first name followed by their last name @foo.io (e.g., jdupont@foo.io)
* gender: Male/Female
* city: a city in France
* country: always 'France'.
"""

EDIT_PROMPT = """Edit the table and add the following columns:
* occupation: a random occupation
* education level: make it relevant to the occupation
"""

table_headers = ["first_name", "last_name", "email", "gender", "city", "country"]
table_data = [
    {
        "first_name": "Lea",
        "last_name": "Martin",
        "email": "lmartin@foo.io",
        "gender": "Female",
        "city": "Lyon",
        "country": "France",
    }
]

SAMPLE_DATA = pd.DataFrame(table_data, columns=table_headers)

STREAM = True

# client = AzureOpenAI(azure_endpoint="http://localhost:8000/v1/inference/oai/v1", api_key="abc", api_version="foo")

# The plain OpenAI client is used here for local testing, since the AzureOpenAI class requires an actual deployed Azure OpenAI endpoint
client = OpenAI(
    base_url="https://<endpoint>/v1/inference/oai/v1",
    api_key="abc"
)

MODEL = "gretelai-azure/gpt-3.5-turbo"

# Wrap the OpenAI-compatible client in Gretel's Navigator adapter
azure_open_ai = Gretel.create_navigator_azure_oai_adapter(client)

# Generate 10 records from the prompt, conditioned on the sample row above
usage, generated_df = azure_open_ai.generate(
    MODEL,
    PROMPT,
    num_records=10,
    sample_data=SAMPLE_DATA,
    stream=STREAM,
)

print(generated_df)
print("*****")
print(usage)

# Let's edit this now

EDIT_PROMPT = """Edit the table and add the following columns:
* Occupation: a random occupation
* Education level: make it relevant to the occupation
"""

# Add the requested columns to the previously generated table
usage, edited_df = azure_open_ai.edit(
    MODEL,
    EDIT_PROMPT,
    seed_data=generated_df,
    stream=STREAM,
)
print(edited_df)
print("*****")
print(usage)
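
The generated and edited tables are plain pandas DataFrames, so they can be written out in the supported data formats (CSV and JSON). The file names below are illustrative:

# Persist the edited table in the supported formats
edited_df.to_csv("foo_users.csv", index=False)
edited_df.to_json("foo_users.json", orient="records", indent=2)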

Transparency/Usage Guidance

Data Generation Architecture

  • Agentic workflow system for synthetic data generation
  • Multi-modal support (tabular, text)
  • Scalable generation (up to millions of records)
  • Underlying LLMs fine-tuned by Gretel on data and formats from 10 different industries, including healthcare, life sciences, financial services, manufacturing, and retail

Key Features

  • Natural language interface to specify data requirements
  • Schema-based data generation
  • Real-time and streaming data generation
  • Dataset augmentation and modification
  • Structured data supported as LLM inputs and outputs

Example Open Datasets

High-quality open synthetic datasets created using Navigator are available on Hugging Face:

Coming Soon: AI Data Designer

The AI Data Designer functionality, coming soon to Model-as-a-Service (MaaS), will provide an end-to-end synthetic data pipeline using Navigator, enabling iterative improvements, validation, and evaluation of generated datasets directly through the Navigator SDK.

Service Limitations

Responsible AI Considerations

Navigator is designed to democratize synthetic data generation while upholding high standards of responsible AI development. The system incorporates automated alignment checks to detect the generation of harmful or discriminatory data while respecting legitimate use cases across industries. Navigator is trained exclusively on high-quality, license-compliant datasets spanning 10+ sectors, ensuring both legal compliance and output quality. However, like any advanced AI system, Navigator may occasionally produce unexpected or biased outputs. We therefore recommend that users conduct appropriate testing and validation for their specific use cases. Gretel’s governance framework includes privacy-preserving architecture, regular security audits, and continuous monitoring for bias and quality control. Through ongoing model updates and strict access controls, we maintain alignment with responsible AI principles while protecting against potential misuse. Users are encouraged to review our Responsible Use Guidelines and implement appropriate safety measures based on their specific applications and industry requirements.
Model Specifications

  • Last Updated: November 2024
  • Input Type: Text
  • Output Type: Text
  • Provider: Gretel
  • Languages: 1 language (English)