The Dataset class is a collection of generated training traces. It provides methods for filtering, saving, and exporting traces in various formats.

Import

from synkro import Dataset

Properties

Property       Type           Description
traces         list[Trace]    The list of generated traces
passing_rate   float          Fraction of traces that passed grading (0.0 to 1.0)
categories     list[str]      Unique categories in the dataset
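
As a rough illustration of the passing_rate semantics (plain dicts stand in for Trace objects here; this is not the library's implementation):
```python
# Sketch: passing_rate is the fraction of traces whose grade passed.
traces = [
    {"user": "Can I return this?", "passed": True},
    {"user": "Where is my order?", "passed": True},
    {"user": "asdf", "passed": False},
]
passing_rate = sum(t["passed"] for t in traces) / len(traces)
print(round(passing_rate, 2))  # 0.67
```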

Methods

save()

Save dataset to a JSONL file.
dataset.save(
    path: str | Path | None = None,
    format: str = "messages",
    pretty_print: bool = False,
) -> Dataset
path (str | Path | None, default: None)
    Output file path. If None, auto-generates a timestamped filename.
format (str, default: "messages")
    Output format: "messages", "qa", "langsmith", "langfuse", "tool_call", "chatml", or "bert" / "bert:<task>"
pretty_print (bool, default: False)
    If True, format JSON with indentation (multi-line, human-readable)
Returns: Self (for method chaining)
# Auto-named file
dataset.save()  # synkro_messages_2024-01-15_1430.jsonl

# Custom path
dataset.save("training.jsonl")

# Different formats
dataset.save("eval.jsonl", format="qa")
dataset.save("langsmith.jsonl", format="langsmith")
dataset.save("bert.jsonl", format="bert:classification")

# Human-readable
dataset.save("readable.jsonl", pretty_print=True)
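
pretty_print only changes JSON whitespace. The difference can be seen with the standard json module (the record below is illustrative):
```python
import json

record = {"messages": [{"role": "user", "content": "Hi"}]}

compact = json.dumps(record)           # one record per line (standard JSONL)
pretty = json.dumps(record, indent=2)  # multi-line, human-readable

print(compact)
print(pretty)
```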

to_jsonl()

Convert dataset to JSONL string.
dataset.to_jsonl(
    format: str = "messages",
    pretty_print: bool = False,
) -> str
format (str, default: "messages")
    Output format (same options as save())
pretty_print (bool, default: False)
    Format with indentation
Returns: JSONL formatted string
jsonl_str = dataset.to_jsonl()
jsonl_str = dataset.to_jsonl(format="chatml")

filter()

Filter traces by criteria. Returns a new Dataset with filtered traces.
dataset.filter(
    passed: bool | None = None,
    category: str | None = None,
    min_length: int | None = None,
) -> Dataset
passed (bool | None, default: None)
    Filter by grade pass/fail status
category (str | None, default: None)
    Filter by scenario category
min_length (int | None, default: None)
    Minimum response length in characters
Returns: New Dataset with filtered traces
# Filter to passing traces only
passing = dataset.filter(passed=True)

# Filter by category
refunds = dataset.filter(category="Refund Policy")

# Filter by response length
long_responses = dataset.filter(min_length=500)

# Chain filters
high_quality = dataset.filter(passed=True).filter(min_length=200)

dedupe()

Remove duplicate or near-duplicate traces.
dataset.dedupe(
    threshold: float = 0.85,
    method: str = "semantic",
    field: str = "user",
) -> Dataset
threshold (float, default: 0.85)
    Similarity threshold (0 to 1). Higher = stricter dedup. Used only by the "semantic" method.
method (str, default: "semantic")
    Deduplication method:
      • "exact": remove exact text duplicates (fast)
      • "semantic": remove semantically similar traces (requires sentence-transformers)
field (str, default: "user")
    Which field to dedupe on: "user", "assistant", or "both"
Returns: New Dataset with duplicates removed
# Remove exact duplicates (fast)
deduped = dataset.dedupe(method="exact")

# Remove semantically similar (stricter)
deduped = dataset.dedupe(threshold=0.9, method="semantic")

# Dedupe based on assistant responses
deduped = dataset.dedupe(field="assistant")
Semantic deduplication requires the sentence-transformers package:
pip install sentence-transformers
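
The threshold semantics can be sketched with a toy similarity function. This illustrates the idea only; it is not synkro's implementation, which uses real sentence embeddings:
```python
import math

# Toy sketch of threshold-based semantic dedup: embed each text, keep it
# only if no already-kept text is too similar. The character-count
# "embedding" below is a stand-in for real sentence embeddings.

def embed(text: str) -> dict:
    vec: dict = {}
    for ch in text.lower():
        vec[ch] = vec.get(ch, 0) + 1
    return vec

def cosine(a: dict, b: dict) -> float:
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def toy_dedupe(texts: list, threshold: float = 0.85) -> list:
    kept, vecs = [], []
    for t in texts:
        v = embed(t)
        if all(cosine(v, kv) < threshold for kv in vecs):
            kept.append(t)
            vecs.append(v)
    return kept

print(toy_dedupe(["hello", "hello!", "goodbye"]))  # ['hello', 'goodbye']
```
A higher threshold keeps more traces (only near-identical pairs collapse); a lower threshold is more aggressive.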

to_hf_dataset()

Convert to HuggingFace Dataset.
dataset.to_hf_dataset(format: str = "messages") -> datasets.Dataset
format (str, default: "messages")
    Output format (same options as save())
Returns: HuggingFace datasets.Dataset object
hf_dataset = dataset.to_hf_dataset()

# Push to Hub
hf_dataset.push_to_hub("my-org/policy-traces")

# Train/test split
split = hf_dataset.train_test_split(test_size=0.1)
split.push_to_hub("my-org/policy-traces")

# BERT format for encoder models
hf_dataset = dataset.to_hf_dataset(format="bert:classification")
Requires the datasets package:
pip install datasets

push_to_hub()

Push dataset directly to HuggingFace Hub.
dataset.push_to_hub(
    repo_id: str,
    format: str = "messages",
    private: bool = False,
    split: str = "train",
    token: str | None = None,
) -> str
repo_id (str, required)
    HuggingFace repo ID (e.g., "my-org/policy-data")
format (str, default: "messages")
    Output format (same options as save())
private (bool, default: False)
    Whether the repo should be private
split (str, default: "train")
    Dataset split name
token (str | None, default: None)
    HuggingFace token (uses cached token if not provided)
Returns: URL of the uploaded dataset
url = dataset.push_to_hub("my-org/policy-data")
url = dataset.push_to_hub("my-org/policy-data", private=True)

to_dict()

Convert dataset to a dictionary.
dataset.to_dict() -> dict
Returns: Dictionary with trace data and stats
d = dataset.to_dict()
# {
#   "traces": [...],
#   "stats": {
#     "total": 100,
#     "passing_rate": 0.95,
#     "categories": ["Refunds", "Returns"]
#   }
# }

summary()

Get a human-readable summary of the dataset.
dataset.summary() -> str
Returns: Summary string
print(dataset.summary())
# Dataset Summary
# ===============
# Total traces: 100
# Passing rate: 95.0%
# Categories: 5
#
# By category:
#   - Refunds: 25
#   - Returns: 20
#   ...

Container Protocol

Dataset supports standard Python container operations:
# Length
len(dataset)  # 100

# Iteration
for trace in dataset:
    print(trace.user_message)

# Indexing
first_trace = dataset[0]
last_trace = dataset[-1]

Export Format Reference

Format      Description               Use Case
messages    OpenAI messages format    Fine-tuning GPT models
chatml      ChatML format             Alternative chat format
qa          Q&A with ground truth     Evaluation datasets
langsmith   LangSmith format          LangSmith integration
langfuse    Langfuse format           Langfuse integration
tool_call   Tool calling format       Function calling datasets
bert        BERT classification       Encoder models
bert:qa     BERT extractive QA        Question answering
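
For reference, one line of "messages"-format JSONL follows the OpenAI chat messages convention. The record below is an assumed illustrative shape; the exact fields synkro emits may differ:
```python
import json

# One JSONL line in "messages" format (assumed shape).
line = json.dumps({
    "messages": [
        {"role": "user", "content": "Can I get a refund?"},
        {"role": "assistant", "content": "Yes, within 30 days of purchase."},
    ]
})
record = json.loads(line)
print(record["messages"][0]["role"])  # user
```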