Tests

Individual test cases that evaluate specific aspects of your AI system’s performance.

What are Tests? Tests are individual test cases that validate specific inputs and expected outputs for your AI system, evaluated using assigned metrics.

Test Types

There are two types of tests:

Single-turn tests check how the AI responds to a single prompt with no follow-up (Q&A).
Multi-turn tests check how the AI behaves over multiple messages in a conversation.

Single-Turn Tests

A single prompt sent to your AI system, evaluated against expected outputs and metrics.

Properties:

Field	Description
Test Prompt	The input text sent to your AI system
Files	(Optional) File attachments (images, PDFs, or other documents) sent alongside the prompt. See file format filters for how files are mapped to provider formats.
Category	High-level classification (e.g., Harmful, Harmless)
Topic	Specific subject matter (e.g., healthcare, financial advice)
Behavior	Type of behavior to validate (e.g., Compliance, Reliability, Robustness)
Expected Output	(Optional) What the AI should respond with
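
To make the fields above concrete, here is a minimal sketch of how a single-turn test could be modeled in Python. The `SingleTurnTest` class, its attribute names, and the empty-prompt check are all hypothetical and chosen for illustration. They are not the platform's actual schema or SDK. The sketch treats only Files and Expected Output as optional, following the table.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SingleTurnTest:
    """Hypothetical container mirroring the single-turn fields in the table."""

    prompt: str                              # Test Prompt: input text sent to the AI system
    category: str                            # e.g. "Harmful" or "Harmless"
    topic: str                               # e.g. "healthcare"
    behavior: str                            # e.g. "Compliance"
    expected_output: Optional[str] = None    # optional: what the AI should respond with
    files: List[str] = field(default_factory=list)  # optional: attachment paths

    def validate(self) -> None:
        # Assumption for this sketch: a test without a prompt is not meaningful.
        if not self.prompt.strip():
            raise ValueError("Test Prompt must not be empty")


# Example: a harmless healthcare question with no attachments.
example_test = SingleTurnTest(
    prompt="What is a normal resting heart rate for adults?",
    category="Harmless",
    topic="healthcare",
    behavior="Reliability",
)
example_test.validate()
```

Leaving `expected_output` unset reflects that the field is optional; in that case evaluation would rest on the assigned metrics alone.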

Multi-Turn Tests

Goal-based conversations that test your AI system across multiple turns. Multi-turn tests are powered by Penelope, an autonomous testing agent that adapts its strategy based on the target's responses, which makes them well suited to testing conversational workflows.

Properties:

Field	Description
Goal	What the target should do: the success criteria for this test
Instructions	(Optional) How to conduct the test. If not provided, the agent plans its own approach.
Restrictions	(Optional) What the target must not do: forbidden behaviors or boundaries
Scenario	(Optional) Context and persona for the test: narrative setup or user role
Files	(Optional) File attachments (images, PDFs, or other documents) included with the test. Files are available on each conversation turn via the `files` variable in the endpoint request template.
Min. Turns	Minimum number of conversation turns before early stopping is allowed (`min_turns`). If omitted, defaults to 80% of max turns.
Max. Turns	Maximum number of conversation turns allowed (`max_turns`)
Category	High-level classification (e.g., Harmful, Harmless)
Topic	Specific subject matter (e.g., healthcare, financial advice)
Behavior	Type of behavior to validate (e.g., Compliance, Reliability, Robustness)

Multi-turn configuration uses `min_turns` and `max_turns`. The older term `max_iterations` has been replaced.
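
The 80% default for `min_turns` can be illustrated with a short sketch. The `MultiTurnConfig` class and its `effective_min_turns` method are hypothetical names used only for this example. Rounding down and enforcing a floor of one turn are assumptions, since the exact rounding rule is not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MultiTurnConfig:
    """Hypothetical container for a subset of the multi-turn fields."""

    goal: str                            # success criteria for the test
    max_turns: int                       # maximum number of conversation turns
    min_turns: Optional[int] = None      # turns required before early stopping
    instructions: Optional[str] = None   # optional: how to conduct the test
    restrictions: Optional[str] = None   # optional: what the target must not do
    scenario: Optional[str] = None       # optional: context and persona

    def effective_min_turns(self) -> int:
        """Return min_turns, falling back to 80% of max_turns when omitted."""
        if self.min_turns is not None:
            return self.min_turns
        # Integer arithmetic avoids floating-point surprises.
        # Assumption: round down, but never below one turn.
        return max(1, self.max_turns * 4 // 5)


config = MultiTurnConfig(goal="Agent books a flight end to end", max_turns=10)
print(config.effective_min_turns())  # prints 8
```

With `max_turns=10` and no explicit `min_turns`, early stopping would be allowed from turn 8 onward under these assumptions. Setting `min_turns` explicitly overrides the default.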

Creating Tests

Create tests manually or generate them automatically from behaviors and requirements. See Generation for automated test generation.

Running Tests

Run individual tests from the Tests page or execute multiple tests together using Test Sets.

Next Steps

- Organize tests into Test Sets
- Generate tests from Knowledge
- View execution results in Test Runs