# HuggingFace Integration

Synkro integrates directly with HuggingFace for easy dataset sharing and fine-tuning.

## Prerequisites

```bash
pip install datasets huggingface_hub

# Login to HuggingFace
huggingface-cli login
```

## Quick Push

```python
import synkro

dataset = synkro.generate(policy, traces=100)

# Push directly to Hub
url = dataset.push_to_hub("my-org/policy-dataset")
print(f"Dataset: {url}")
```

## Push Options

```python
dataset.push_to_hub(
    repo_id="my-org/policy-dataset",  # Required: HuggingFace repo ID
    format="messages",                # Export format
    private=True,                     # Make the repo private
    split="train",                    # Dataset split name
    token="hf_xxx",                   # HF token (uses the cached login if None)
)
```

## Train/Test Split

```python
import synkro

dataset = synkro.generate(policy, traces=500)

# Convert to HuggingFace Dataset
hf_dataset = dataset.to_hf_dataset()

# Create train/test split
split = hf_dataset.train_test_split(test_size=0.1)

# Push both splits
split.push_to_hub("my-org/policy-dataset")
# Creates: train (450) and test (50) splits
```
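The 450/50 counts come from `test_size=0.1` applied to 500 traces. A minimal sketch of that arithmetic in plain Python (one plausible rounding rule; the library's exact rounding may differ at the margins):

```python
def split_sizes(n_examples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size,
    mirroring how a 0.1 test share of 500 examples gives 450/50."""
    n_test = round(n_examples * test_size)
    return n_examples - n_test, n_test

print(split_sizes(500, 0.1))  # (450, 50)
```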

## Multiple Formats

Upload different formats for different use cases:

```python
import synkro

dataset = synkro.generate(policy, traces=500)

# SFT format for fine-tuning
sft_dataset = dataset.to_hf_dataset(format="messages")
sft_dataset.push_to_hub("my-org/policy-sft-data")

# Eval format for testing
eval_dataset = dataset.to_hf_dataset(format="qa")
eval_dataset.push_to_hub("my-org/policy-eval-data")

# BERT format for classifiers
bert_dataset = dataset.to_hf_dataset(format="bert:classification")
bert_dataset.push_to_hub("my-org/policy-bert-classifier")
```
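The three pushes above differ only in format and target repo, so a loop avoids the repetition. A self-contained sketch using a stand-in dataset object (`FakeDataset` is ours, for illustration only; a real Synkro dataset would be used in its place):

```python
class FakeDataset:
    """Stand-in for a Synkro dataset, for illustration only."""
    def to_hf_dataset(self, format="messages"):
        return self  # a real call returns a datasets.Dataset
    def push_to_hub(self, repo_id):
        return f"https://huggingface.co/datasets/{repo_id}"

# Same illustrative format -> repo mapping as the examples above
FORMAT_REPOS = {
    "messages": "my-org/policy-sft-data",
    "qa": "my-org/policy-eval-data",
    "bert:classification": "my-org/policy-bert-classifier",
}

dataset = FakeDataset()
urls = {fmt: dataset.to_hf_dataset(format=fmt).push_to_hub(repo)
        for fmt, repo in FORMAT_REPOS.items()}
```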

## Custom Processing

```python
import synkro

dataset = synkro.generate(policy, traces=500)

# Convert to HuggingFace Dataset
hf_dataset = dataset.to_hf_dataset()

# Add custom columns
def add_metadata(example):
    example["source"] = "synkro"
    example["version"] = "1.0"
    return example

hf_dataset = hf_dataset.map(add_metadata)

# Drop traces with fewer than two messages
hf_dataset = hf_dataset.filter(lambda x: len(x["messages"]) >= 2)

# Push
hf_dataset.push_to_hub("my-org/policy-dataset")
```
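The `filter` above drops truncated traces, e.g. a user turn with no assistant reply. The same predicate applied to plain dicts, as a quick sanity check:

```python
examples = [
    {"messages": [{"role": "user", "content": "hi"}]},  # truncated, dropped
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}]},  # kept
]

def is_complete(example) -> bool:
    """True when the trace has at least a user turn and a reply."""
    return len(example["messages"]) >= 2

kept = [e for e in examples if is_complete(e)]
print(len(kept))  # 1
```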

## Load from Hub

Load Synkro-generated datasets for fine-tuning:

```python
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

# Load your dataset
dataset = load_dataset("my-org/policy-dataset")

# Use with transformers
trainer = Trainer(
    model=model,
    args=TrainingArguments(...),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```

## Dataset Card

Synkro datasets work well with HuggingFace Dataset Cards:
````markdown
---
language:
- en
license: mit
task_categories:
- text-generation
- conversational
tags:
- synkro
- synthetic
- policy-compliance
---

# Policy Compliance Dataset

Generated with [Synkro](https://github.com/velocitybolt/synkro).

## Dataset Description

Training data for policy-compliant customer service agents.

## Generation Config

- Traces: 500
- Model: gpt-5-mini
- Grading Model: gpt-5.2
- Pass Rate: 95%

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("my-org/policy-dataset")
```
````

## Organization Datasets

Push to an organization:

```python
# Push to org
dataset.push_to_hub("my-company/customer-service-data", private=True)

# Push to personal account
dataset.push_to_hub("username/policy-data")
```

## Large Datasets

For large datasets, generate with checkpointing and push as usual:

```python
import synkro
from synkro import create_pipeline, SilentReporter

# Generate large dataset with checkpointing
pipeline = create_pipeline(
    checkpoint_dir="./checkpoints",
    reporter=SilentReporter(),
    enable_hitl=False,
)

dataset = pipeline.generate(policy, traces=10000)

# Push (automatically handles large files)
dataset.push_to_hub("my-org/large-policy-dataset")
```

## Versioning

HuggingFace Hub handles versioning automatically; every push creates a new commit:

```python
import synkro
from datasets import load_dataset

# Initial upload
dataset.push_to_hub("my-org/policy-dataset")

# Update with new data (creates a new commit)
new_dataset = synkro.generate(updated_policy, traces=200)
new_dataset.push_to_hub("my-org/policy-dataset")

# Load a specific version
dataset = load_dataset("my-org/policy-dataset", revision="v1.0")
```

## Best Practices

  1. Use descriptive repo names: `customer-service-policy-sft` beats `dataset1`
  2. Include metadata: add a dataset card with the generation config
  3. Version major changes: tag releases for production datasets
  4. Use private repos: for proprietary policies, pass `private=True`
  5. Split appropriately: a 90/10 train/test split is common for SFT
  6. Document the policy: include the policy text in the dataset card
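Practices 2 and 6 amount to generating a card alongside the data. A minimal stdlib sketch of a front-matter builder (the helper and its fields are ours for illustration, not a Synkro API):

```python
def build_card(repo_id: str, traces: int, model: str, tags: list[str]) -> str:
    """Assemble a README.md body with YAML front matter for a dataset repo."""
    front = "\n".join(["---", "tags:"] + [f"- {t}" for t in tags] + ["---"])
    title = repo_id.split("/")[-1]
    body = (
        f"# {title}\n\n"
        f"Generated with Synkro: {traces} traces, model {model}."
    )
    return front + "\n\n" + body

card = build_card("my-org/policy-dataset", 500, "gpt-5-mini",
                  ["synkro", "synthetic"])
```

The resulting string can be written to the repo's `README.md`, which is where the Hub reads card metadata from.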

## Troubleshooting

### Authentication Error

```bash
# Re-login
huggingface-cli login
```

```python
# Or pass the token explicitly in code
dataset.push_to_hub("repo", token="hf_xxx")
```
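Before re-logging in, it can help to check where a token would come from. A stdlib sketch (the helper name is ours, and the resolution order shown is an assumption; the hub library's actual lookup may differ):

```python
import os
from pathlib import Path

def token_source() -> str:
    """Report where a HuggingFace token would likely be found.
    Illustrative only; not the hub's actual resolution logic."""
    if os.environ.get("HF_TOKEN"):
        return "env:HF_TOKEN"
    if (Path.home() / ".cache" / "huggingface" / "token").exists():
        return "cached login"
    return "none found"
```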

### Missing Dependencies

```bash
pip install datasets huggingface_hub
```

### Large File Handling

HuggingFace Hub uses Git LFS for large files. For very large datasets, save to disk in shards and upload the folder:

```python
from huggingface_hub import upload_folder

# Save as Arrow shards on disk
hf_dataset = dataset.to_hf_dataset()
hf_dataset.save_to_disk("./dataset_shards")

# Then upload the directory as a dataset repo
upload_folder(
    folder_path="./dataset_shards",
    repo_id="my-org/large-dataset",
    repo_type="dataset",
)
```