## Overview
The `EVALUATION` dataset type generates test scenarios with ground-truth labels for evaluating your models. Best for: benchmarking, regression testing, and eval platforms (LangSmith, Langfuse).

## Usage
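Exact invocation depends on your setup. As a conceptual sketch only, here is a toy generator that turns policy rules into labeled test scenarios; the real generator is model-driven, and every name here (`Rule`, `make_scenarios`) is hypothetical:

```python
# Conceptual sketch only: a toy generator that derives labeled test
# scenarios from policy rules. All names here are hypothetical, not
# part of any real API.

from dataclasses import dataclass

@dataclass
class Rule:
    rule_id: str
    category: str
    text: str

def make_scenarios(rule: Rule) -> list[dict]:
    # One compliant and one violating probe per rule; a real generator
    # would also produce edge cases and varied phrasings.
    return [
        {
            "user_message": f"(request that complies with: {rule.text})",
            "expected_outcome": "Model follows the rule",
            "target_rule_ids": [rule.rule_id],
            "scenario_type": "positive",
            "category": rule.category,
        },
        {
            "user_message": f"(request that violates: {rule.text})",
            "expected_outcome": "Model refuses, citing the rule",
            "target_rule_ids": [rule.rule_id],
            "scenario_type": "negative",
            "category": rule.category,
        },
    ]

rule = Rule("refund-30d", "refunds", "Refunds only within 30 days")
dataset = make_scenarios(rule)
```

Each generated scenario carries the ground-truth fields alongside the test input, so downstream graders never have to infer the expected behavior.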
## Output Formats

- Generic Q&A
- LangSmith
- Langfuse
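These formats differ mainly in how the ground-truth fields are wrapped. As a sketch, a generic scenario might map to a LangSmith-style example (an `inputs` dict plus an `outputs` dict holding the ground truth); the exact export schema used here is an assumption:

```python
# Sketch: mapping one generic Q&A scenario into a LangSmith-style
# example. The inputs/outputs split mirrors LangSmith dataset
# examples; the specific key names are assumptions.

def to_langsmith_example(scenario: dict) -> dict:
    """Wrap a generic scenario dict as an inputs/outputs pair."""
    return {
        "inputs": {"question": scenario["user_message"]},
        "outputs": {
            "expected_outcome": scenario["expected_outcome"],
            "scenario_type": scenario["scenario_type"],
        },
    }

scenario = {
    "user_message": "Can I get a refund after 90 days?",
    "expected_outcome": "Politely refuse per the 30-day refund policy",
    "target_rule_ids": ["refund-30d"],
    "scenario_type": "negative",
    "category": "refunds",
}
example = to_langsmith_example(scenario)
```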
### Eval API

Generate scenarios without synthetic responses, then grade your own model:

## Ground Truth Fields
Each scenario includes:

| Field | Description |
|---|---|
| `user_message` | The test input |
| `expected_outcome` | What should happen |
| `target_rule_ids` | The rules being tested |
| `scenario_type` | `positive`, `negative`, or `edge_case` |
| `category` | Policy category |
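Putting the fields together, the Eval API flow amounts to: call your own model on each `user_message`, then grade the response against `expected_outcome`. A minimal sketch, where `call_model` and the keyword-overlap grader are stand-ins (swap in your real model and an LLM judge or exact-match check as appropriate):

```python
# Sketch of the Eval API flow: scenarios ship with ground truth but no
# synthetic responses, so you supply the model call and the grading
# rule. `call_model` and `grade` below are illustrative stand-ins.

scenarios = [
    {
        "user_message": "Can I get a refund after 90 days?",
        "expected_outcome": "refuse per the 30-day refund policy",
        "target_rule_ids": ["refund-30d"],
        "scenario_type": "negative",
        "category": "refunds",
    },
]

def call_model(user_message: str) -> str:
    # Stand-in for your actual model/agent invocation.
    return "Sorry, per the 30-day refund policy I must refuse"

def grade(response: str, expected_outcome: str) -> bool:
    # Toy heuristic: require keyword overlap with the expectation.
    expected_terms = set(expected_outcome.lower().split())
    return len(expected_terms & set(response.lower().split())) >= 2

results = [
    grade(call_model(s["user_message"]), s["expected_outcome"])
    for s in scenarios
]
accuracy = sum(results) / len(results)
```

Because `scenario_type` and `category` travel with each scenario, accuracy can also be broken down per rule, per category, or per scenario type for regression tracking.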