Tests
Individual test cases that evaluate specific aspects of your AI system’s performance.
What are Tests? Each test defines a specific input for your AI system, optionally with an expected output, and is evaluated using the metrics assigned to it.
Test Types
There are two types of tests:
- Single-turn tests check how the AI responds to a single prompt with no follow-up, as in a one-off Q&A exchange.
- Multi-turn tests check how the AI behaves over multiple messages in a conversation.
Single-Turn Tests
A single prompt sent to your AI system, evaluated against expected outputs and metrics.
Properties:
| Field | Description |
|---|---|
| Test Prompt | The input text sent to your AI system |
| Files | (Optional) File attachments (images, PDFs, or other documents) sent alongside the prompt. See file format filters for how files are mapped to provider formats. |
| Category | High-level classification (e.g., Harmful, Harmless) |
| Topic | Specific subject matter (e.g., healthcare, financial advice) |
| Behavior | Type of behavior to validate (e.g., Compliance, Reliability, Robustness) |
| Expected Output | (Optional) What the AI should respond with |
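To make the table above concrete, here is a minimal sketch of a single-turn test as a plain Python record. The class name `SingleTurnTest` and its attribute names are hypothetical, chosen only to mirror the fields in the table; they are not the platform's actual schema or SDK.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SingleTurnTest:
    """Hypothetical container mirroring the single-turn properties table."""

    prompt: str                            # Test Prompt: input text sent to the AI system
    category: str                          # High-level classification, e.g. "Harmless"
    topic: str                             # Specific subject matter, e.g. "healthcare"
    behavior: str                          # Behavior to validate, e.g. "Reliability"
    expected_output: Optional[str] = None  # Optional: what the AI should respond with
    files: List[str] = field(default_factory=list)  # Optional: attachment paths


# Example values are illustrative only.
test = SingleTurnTest(
    prompt="Can I take ibuprofen together with aspirin?",
    category="Harmless",
    topic="healthcare",
    behavior="Reliability",
    expected_output="A cautious answer that recommends consulting a pharmacist or doctor.",
)
```

Only Expected Output and Files are marked optional in the table, so the sketch gives just those two fields default values.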
Multi-Turn Tests
Goal-based conversations that test your AI system across multiple turns. Powered by Penelope, an autonomous testing agent that adapts its strategy based on responses. Ideal for testing conversational workflows.
Properties:
| Field | Description |
|---|---|
| Goal | What the target should do - the success criteria for this test |
| Instructions | (Optional) How to conduct the test - if not provided, the agent plans its own approach |
| Restrictions | (Optional) What the target must not do - forbidden behaviors or boundaries |
| Scenario | (Optional) Context and persona for the test - narrative setup or user role |
| Files | (Optional) File attachments (images, PDFs, or other documents) included with the test. Files are available on each conversation turn via the files variable in the endpoint request template. |
| Min. Turns | Minimum number of conversation turns before early stopping is allowed (min_turns). If omitted, defaults to 80% of max turns. |
| Max. Turns | Maximum number of conversation turns allowed (max_turns) |
| Category | High-level classification (e.g., Harmful, Harmless) |
| Topic | Specific subject matter (e.g., healthcare, financial advice) |
| Behavior | Type of behavior to validate (e.g., Compliance, Reliability, Robustness) |
Multi-turn configuration uses min_turns and max_turns. The older term max_iterations has been replaced.
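The relationship between min_turns and max_turns can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: the class `MultiTurnTest` and the method `effective_min_turns` are hypothetical names, and the rounding rule (round down, never below 1) and the range check are guesses, since the documentation only says the default is 80% of max turns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultiTurnTest:
    """Hypothetical container mirroring part of the multi-turn properties table."""

    goal: str                           # Success criteria for the test
    max_turns: int                      # Maximum number of conversation turns allowed
    min_turns: Optional[int] = None     # Minimum turns before early stopping is allowed
    instructions: Optional[str] = None  # Optional: how to conduct the test
    restrictions: Optional[str] = None  # Optional: what the target must not do
    scenario: Optional[str] = None      # Optional: context and persona

    def __post_init__(self) -> None:
        # Assumption: a minimum above the maximum is treated as a configuration error.
        if self.min_turns is not None and self.min_turns > self.max_turns:
            raise ValueError("min_turns must not exceed max_turns")

    def effective_min_turns(self) -> int:
        """Return min_turns, falling back to 80% of max_turns when omitted."""
        if self.min_turns is not None:
            return self.min_turns
        # Assumption: round down using integer arithmetic, and never go below 1.
        return max(1, (self.max_turns * 4) // 5)


default_case = MultiTurnTest(goal="Resolve a refund request politely", max_turns=10)
explicit_case = MultiTurnTest(goal="Resolve a refund request politely", max_turns=10, min_turns=3)
```

With these assumptions, a test with max_turns set to 10 and no min_turns allows early stopping from turn 8 onward, while an explicit min_turns of 3 takes precedence over the default.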
Creating Tests
Create tests manually or generate them automatically from behaviors and requirements. See Generation for automated test generation.
Running Tests
Run individual tests from the Tests page or execute multiple tests together using Test Sets.