# HuggingFace Integration

Synkro integrates directly with HuggingFace for easy dataset sharing and fine-tuning.

## Prerequisites

```bash
pip install datasets huggingface_hub

# Login to HuggingFace
huggingface-cli login
```

## Quick Push

```python
import synkro

dataset = synkro.generate(policy, traces=100)

# Push directly to Hub
url = dataset.push_to_hub("my-org/policy-dataset")
print(f"Dataset: {url}")
```

## Push Options

```python
dataset.push_to_hub(
    repo_id="my-org/policy-dataset",  # Required: HuggingFace repo ID
    format="messages",                # Export format
    private=True,                     # Make the repo private
    split="train",                    # Dataset split name
    token="hf_xxx",                   # HF token (uses the cached login if None)
)
```

## Train/Test Split

```python
import synkro

dataset = synkro.generate(policy, traces=500)

# Convert to HuggingFace Dataset
hf_dataset = dataset.to_hf_dataset()

# Create train/test split
split = hf_dataset.train_test_split(test_size=0.1)

# Push both splits
split.push_to_hub("my-org/policy-dataset")
# Creates: train (450) and test (50) splits
```
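The 450/50 counts come from `test_size=0.1` applied to 500 traces. A minimal sketch of that arithmetic in plain Python (one plausible rounding rule; the library's exact rounding may differ at the margins):

```python
def split_sizes(n_examples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test_size,
    mirroring how a 0.1 test share of 500 examples gives 450/50."""
    n_test = round(n_examples * test_size)
    return n_examples - n_test, n_test

print(split_sizes(500, 0.1))  # (450, 50)
```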

## Multiple Formats

Upload different formats for different use cases:

```python
import synkro

dataset = synkro.generate(policy, traces=500)

# SFT format for fine-tuning
sft_dataset = dataset.to_hf_dataset(format="messages")
sft_dataset.push_to_hub("my-org/policy-sft-data")

# Eval format for testing
eval_dataset = dataset.to_hf_dataset(format="qa")
eval_dataset.push_to_hub("my-org/policy-eval-data")

# BERT format for classifiers
bert_dataset = dataset.to_hf_dataset(format="bert:classification")
bert_dataset.push_to_hub("my-org/policy-bert-classifier")
```
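The three pushes above differ only in format and target repo, so a loop avoids the repetition. A self-contained sketch using a stand-in dataset object (`FakeDataset` is ours, for illustration only; a real Synkro dataset would be used in its place):

```python
class FakeDataset:
    """Stand-in for a Synkro dataset, for illustration only."""
    def to_hf_dataset(self, format="messages"):
        return self  # a real call returns a datasets.Dataset
    def push_to_hub(self, repo_id):
        return f"https://huggingface.co/datasets/{repo_id}"

# Same illustrative format -> repo mapping as the examples above
FORMAT_REPOS = {
    "messages": "my-org/policy-sft-data",
    "qa": "my-org/policy-eval-data",
    "bert:classification": "my-org/policy-bert-classifier",
}

dataset = FakeDataset()
urls = {fmt: dataset.to_hf_dataset(format=fmt).push_to_hub(repo)
        for fmt, repo in FORMAT_REPOS.items()}
```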

## Custom Processing

```python
import synkro

dataset = synkro.generate(policy, traces=500)

# Convert to HuggingFace Dataset
hf_dataset = dataset.to_hf_dataset()

# Add custom columns
def add_metadata(example):
    example["source"] = "synkro"
    example["version"] = "1.0"
    return example

hf_dataset = hf_dataset.map(add_metadata)

# Drop traces with fewer than two messages
hf_dataset = hf_dataset.filter(lambda x: len(x["messages"]) >= 2)

# Push
hf_dataset.push_to_hub("my-org/policy-dataset")
```
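The `filter` above drops truncated traces, e.g. a user turn with no assistant reply. The same predicate applied to plain dicts, as a quick sanity check:

```python
examples = [
    {"messages": [{"role": "user", "content": "hi"}]},  # truncated, dropped
    {"messages": [{"role": "user", "content": "hi"},
                  {"role": "assistant", "content": "hello"}]},  # kept
]

def is_complete(example) -> bool:
    """True when the trace has at least a user turn and a reply."""
    return len(example["messages"]) >= 2

kept = [e for e in examples if is_complete(e)]
print(len(kept))  # 1
```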

## Load from Hub

Load Synkro-generated datasets for fine-tuning:

```python
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

# Load your dataset
dataset = load_dataset("my-org/policy-dataset")

# Use with transformers
trainer = Trainer(
    model=model,
    args=TrainingArguments(...),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```

## Dataset Card

Synkro datasets work well with HuggingFace Dataset Cards:
````markdown
---
language:
- en
license: mit
task_categories:
- text-generation
- conversational
tags:
- synkro
- synthetic
- policy-compliance
---

# Policy Compliance Dataset

Generated with [Synkro](https://github.com/velocitybolt/synkro).

## Dataset Description

Training data for policy-compliant customer service agents.

## Generation Config

- Traces: 500
- Model: gpt-5-mini
- Grading Model: gpt-5.2
- Pass Rate: 95%

## Usage

```python
from datasets import load_dataset

dataset = load_dataset("my-org/policy-dataset")
```
````

## Organization Datasets

Push to an organization:

```python
# Push to org
dataset.push_to_hub("my-company/customer-service-data", private=True)

# Push to personal account
dataset.push_to_hub("username/policy-data")
```

## Large Datasets

For large datasets, generate with checkpointing and push as usual:

```python
import synkro
from synkro import create_pipeline, SilentReporter

# Generate large dataset with checkpointing
pipeline = create_pipeline(
    checkpoint_dir="./checkpoints",
    reporter=SilentReporter(),
    enable_hitl=False,
)

dataset = pipeline.generate(policy, traces=10000)

# Push (automatically handles large files)
dataset.push_to_hub("my-org/large-policy-dataset")
```

## Versioning

HuggingFace Hub handles versioning automatically; every push creates a new commit:

```python
import synkro
from datasets import load_dataset

# Initial upload
dataset.push_to_hub("my-org/policy-dataset")

# Update with new data (creates a new commit)
new_dataset = synkro.generate(updated_policy, traces=200)
new_dataset.push_to_hub("my-org/policy-dataset")

# Load a specific version
dataset = load_dataset("my-org/policy-dataset", revision="v1.0")
```

## Best Practices

  1. Use descriptive repo names: `customer-service-policy-sft` beats `dataset1`
  2. Include metadata: add a dataset card with the generation config
  3. Version major changes: tag releases for production datasets
  4. Use private repos: for proprietary policies, pass `private=True`
  5. Split appropriately: a 90/10 train/test split is common for SFT
  6. Document the policy: include the policy text in the dataset card
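Practices 2 and 6 amount to generating a card alongside the data. A minimal stdlib sketch of a front-matter builder (the helper and its fields are ours for illustration, not a Synkro API):

```python
def build_card(repo_id: str, traces: int, model: str, tags: list[str]) -> str:
    """Assemble a README.md body with YAML front matter for a dataset repo."""
    front = "\n".join(["---", "tags:"] + [f"- {t}" for t in tags] + ["---"])
    title = repo_id.split("/")[-1]
    body = (
        f"# {title}\n\n"
        f"Generated with Synkro: {traces} traces, model {model}."
    )
    return front + "\n\n" + body

card = build_card("my-org/policy-dataset", 500, "gpt-5-mini",
                  ["synkro", "synthetic"])
```

The resulting string can be written to the repo's `README.md`, which is where the Hub reads card metadata from.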

## Troubleshooting

### Authentication Error

```bash
# Re-login
huggingface-cli login
```

```python
# Or pass the token explicitly in code
dataset.push_to_hub("repo", token="hf_xxx")
```
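Before re-logging in, it can help to check where a token would come from. A stdlib sketch (the helper name is ours, and the resolution order shown is an assumption; the hub library's actual lookup may differ):

```python
import os
from pathlib import Path

def token_source() -> str:
    """Report where a HuggingFace token would likely be found.
    Illustrative only; not the hub's actual resolution logic."""
    if os.environ.get("HF_TOKEN"):
        return "env:HF_TOKEN"
    if (Path.home() / ".cache" / "huggingface" / "token").exists():
        return "cached login"
    return "none found"
```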

### Missing Dependencies

```bash
pip install datasets huggingface_hub
```

### Large File Handling

HuggingFace Hub uses Git LFS for large files. For very large datasets, save to disk in shards and upload the folder:

```python
from huggingface_hub import upload_folder

# Save as Arrow shards on disk
hf_dataset = dataset.to_hf_dataset()
hf_dataset.save_to_disk("./dataset_shards")

# Then upload the directory as a dataset repo
upload_folder(
    folder_path="./dataset_shards",
    repo_id="my-org/large-dataset",
    repo_type="dataset",
)
```