The Dataset class is a collection of generated training traces. It provides methods for filtering, saving, and exporting traces in various formats.

Import

from synkro import Dataset

Properties

Property       Type           Description
traces         list[Trace]    The list of generated traces
passing_rate   float          Fraction of traces that passed grading (0.0 to 1.0)
categories     list[str]      Unique categories in the dataset
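
As a rough illustration of the passing_rate semantics (plain dicts stand in for Trace objects here; this is not the library's implementation):
```python
# Sketch: passing_rate is the fraction of traces whose grade passed.
traces = [
    {"user": "Can I return this?", "passed": True},
    {"user": "Where is my order?", "passed": True},
    {"user": "asdf", "passed": False},
]
passing_rate = sum(t["passed"] for t in traces) / len(traces)
print(round(passing_rate, 2))  # 0.67
```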

Methods

save()

Save dataset to a JSONL file.
dataset.save(
    path: str | Path | None = None,
    format: str = "messages",
    pretty_print: bool = False,
) -> Dataset
path (str | Path | None, default: None)
    Output file path. If None, auto-generates a timestamped filename.
format (str, default: "messages")
    Output format: "messages", "qa", "langsmith", "langfuse", "tool_call", "chatml", or "bert" / "bert:<task>"
pretty_print (bool, default: False)
    If True, format JSON with indentation (multi-line, human-readable)
Returns: Self (for method chaining)
# Auto-named file
dataset.save()  # synkro_messages_2024-01-15_1430.jsonl

# Custom path
dataset.save("training.jsonl")

# Different formats
dataset.save("eval.jsonl", format="qa")
dataset.save("langsmith.jsonl", format="langsmith")
dataset.save("bert.jsonl", format="bert:classification")

# Human-readable
dataset.save("readable.jsonl", pretty_print=True)
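
pretty_print only changes JSON whitespace. The difference can be seen with the standard json module (the record below is illustrative):
```python
import json

record = {"messages": [{"role": "user", "content": "Hi"}]}

compact = json.dumps(record)           # one record per line (standard JSONL)
pretty = json.dumps(record, indent=2)  # multi-line, human-readable

print(compact)
print(pretty)
```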

to_jsonl()

Convert dataset to JSONL string.
dataset.to_jsonl(
    format: str = "messages",
    pretty_print: bool = False,
) -> str
format (str, default: "messages")
    Output format (same options as save())
pretty_print (bool, default: False)
    Format with indentation
Returns: JSONL formatted string
jsonl_str = dataset.to_jsonl()
jsonl_str = dataset.to_jsonl(format="chatml")

filter()

Filter traces by criteria. Returns a new Dataset with filtered traces.
dataset.filter(
    passed: bool | None = None,
    category: str | None = None,
    min_length: int | None = None,
) -> Dataset
passed (bool | None, default: None)
    Filter by grade pass/fail status
category (str | None, default: None)
    Filter by scenario category
min_length (int | None, default: None)
    Minimum response length in characters
Returns: New Dataset with filtered traces
# Filter to passing traces only
passing = dataset.filter(passed=True)

# Filter by category
refunds = dataset.filter(category="Refund Policy")

# Filter by response length
long_responses = dataset.filter(min_length=500)

# Chain filters
high_quality = dataset.filter(passed=True).filter(min_length=200)

dedupe()

Remove duplicate or near-duplicate traces.
dataset.dedupe(
    threshold: float = 0.85,
    method: str = "semantic",
    field: str = "user",
) -> Dataset
threshold (float, default: 0.85)
    Similarity threshold (0 to 1). Higher = stricter dedup. Used only by the "semantic" method.
method (str, default: "semantic")
    Deduplication method:
      • "exact": remove exact text duplicates (fast)
      • "semantic": remove semantically similar traces (requires sentence-transformers)
field (str, default: "user")
    Which field to dedupe on: "user", "assistant", or "both"
Returns: New Dataset with duplicates removed
# Remove exact duplicates (fast)
deduped = dataset.dedupe(method="exact")

# Remove semantically similar (stricter)
deduped = dataset.dedupe(threshold=0.9, method="semantic")

# Dedupe based on assistant responses
deduped = dataset.dedupe(field="assistant")
Semantic deduplication requires the sentence-transformers package:
pip install sentence-transformers
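
The threshold semantics can be sketched with a toy similarity function. This illustrates the idea only; it is not synkro's implementation, which uses real sentence embeddings:
```python
import math

# Toy sketch of threshold-based semantic dedup: embed each text, keep it
# only if no already-kept text is too similar. The character-count
# "embedding" below is a stand-in for real sentence embeddings.

def embed(text: str) -> dict:
    vec: dict = {}
    for ch in text.lower():
        vec[ch] = vec.get(ch, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def toy_dedupe(texts: list, threshold: float = 0.85) -> list:
    kept, vecs = [], []
    for t in texts:
        v = embed(t)
        if all(cosine(v, kv) < threshold for kv in vecs):
            kept.append(t)
            vecs.append(v)
    return kept

print(toy_dedupe(["hello", "hello!", "goodbye"]))  # ['hello', 'goodbye']
```
A higher threshold keeps more traces (only near-identical pairs collapse); a lower threshold is more aggressive.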

to_hf_dataset()

Convert to HuggingFace Dataset.
dataset.to_hf_dataset(format: str = "messages") -> datasets.Dataset
format (str, default: "messages")
    Output format (same options as save())
Returns: HuggingFace datasets.Dataset object
hf_dataset = dataset.to_hf_dataset()

# Push to Hub
hf_dataset.push_to_hub("my-org/policy-traces")

# Train/test split
split = hf_dataset.train_test_split(test_size=0.1)
split.push_to_hub("my-org/policy-traces")

# BERT format for encoder models
hf_dataset = dataset.to_hf_dataset(format="bert:classification")
Requires the datasets package:
pip install datasets

push_to_hub()

Push dataset directly to HuggingFace Hub.
dataset.push_to_hub(
    repo_id: str,
    format: str = "messages",
    private: bool = False,
    split: str = "train",
    token: str | None = None,
) -> str
repo_id (str, required)
    HuggingFace repo ID (e.g., "my-org/policy-data")
format (str, default: "messages")
    Output format (same options as save())
private (bool, default: False)
    Whether the repo should be private
split (str, default: "train")
    Dataset split name
token (str | None, default: None)
    HuggingFace token (uses cached token if not provided)
Returns: URL of the uploaded dataset
url = dataset.push_to_hub("my-org/policy-data")
url = dataset.push_to_hub("my-org/policy-data", private=True)

to_dict()

Convert dataset to a dictionary.
dataset.to_dict() -> dict
Returns: Dictionary with trace data and stats
d = dataset.to_dict()
# {
#   "traces": [...],
#   "stats": {
#     "total": 100,
#     "passing_rate": 0.95,
#     "categories": ["Refunds", "Returns"]
#   }
# }

summary()

Get a human-readable summary of the dataset.
dataset.summary() -> str
Returns: Summary string
print(dataset.summary())
# Dataset Summary
# ===============
# Total traces: 100
# Passing rate: 95.0%
# Categories: 5
#
# By category:
#   - Refunds: 25
#   - Returns: 20
#   ...

Container Protocol

Dataset supports standard Python container operations:
# Length
len(dataset)  # 100

# Iteration
for trace in dataset:
    print(trace.user_message)

# Indexing
first_trace = dataset[0]
last_trace = dataset[-1]

Export Format Reference

Format      Description               Use Case
messages    OpenAI messages format    Fine-tuning GPT models
chatml      ChatML format             Alternative chat format
qa          Q&A with ground truth     Evaluation datasets
langsmith   LangSmith format          LangSmith integration
langfuse    Langfuse format           Langfuse integration
tool_call   Tool calling format       Function calling datasets
bert        BERT classification       Encoder models
bert:qa     BERT extractive QA        Question answering
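
For reference, one line of "messages"-format JSONL follows the OpenAI chat messages convention. The record below is an assumed illustrative shape; the exact fields synkro emits may differ:
```python
import json

# One JSONL line in "messages" format (assumed shape).
line = json.dumps({
    "messages": [
        {"role": "user", "content": "Can I get a refund?"},
        {"role": "assistant", "content": "Yes, within 30 days of purchase."},
    ]
})
record = json.loads(line)
print(record["messages"][0]["role"])  # user
```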