# Penelope: Intelligent Multi-Turn Testing Agent for AI Applications
Penelope is an autonomous testing agent that executes complex, multi-turn test scenarios against conversational AI systems. She combines sophisticated reasoning with adaptive testing strategies to thoroughly evaluate AI applications across any dimension: security, user experience, compliance, edge cases, and more.
Penelope automates the kind of testing that requires:
- Multiple interactions: Not just one-shot prompts, but extended conversations
- Adaptive behavior: Adjusting strategy based on responses
- Tool use: Making requests, analyzing data, extracting information
- Goal orientation: Knowing when the test is complete
- Resource utilization: Leveraging context and documentation
Think of Penelope as a QA engineer who can execute test plans autonomously through conversation.
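The capabilities above can be sketched as a generic multi-turn test loop. This is a conceptual model only, not Penelope's actual implementation; the function names (`send`, `evaluate`, `plan_next`) are hypothetical stand-ins:

```python
# Conceptual sketch of the loop Penelope automates: converse with the target,
# adapt based on responses, and stop once the goal is met or the turn budget
# runs out. All names here are illustrative, not the real API.
def run_test(send, evaluate, plan_next, goal, max_turns=10):
    history = []
    message = goal  # opening probe derived from the test goal
    for turn in range(1, max_turns + 1):
        reply = send(message)               # one interaction with the target
        history.append((message, reply))
        if evaluate(history, goal):         # stopping condition: goal achieved?
            return {"goal_achieved": True, "turns_used": turn}
        message = plan_next(history, goal)  # adapt strategy for the next turn
    return {"goal_achieved": False, "turns_used": max_turns}
```

In Penelope both `evaluate` and `plan_next` are LLM-driven rather than hand-coded predicates, which is what makes the testing adaptive.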
## Quick Start

```python
from rhesis.penelope import PenelopeAgent, EndpointTarget

# Initialize Penelope
agent = PenelopeAgent()

# Create target
target = EndpointTarget(endpoint_id="my-chatbot-prod")

# Execute a test - Penelope plans its own approach
result = agent.execute_test(
    target=target,
    goal="Verify chatbot can answer 3 questions about insurance policies while maintaining context",
)

print(f"Goal achieved: {result.goal_achieved}")
print(f"Turns used: {result.turns_used}")
```

## Safety Limits

Penelope includes built-in guardrails to prevent infinite loops and runaway costs:
- Global execution limit: By default, total tool executions are capped at `max_iterations × 5` (e.g., 10 turns × 5 = 50 executions)
- Workflow validation: Blocks known bad patterns (excessive analysis tools, repetitive usage)
- Progress warnings: Alerts at 60% and 80% of the limits
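The limit arithmetic above can be sketched as follows. This is an illustration of the documented behavior, not Penelope's internals; the function name and rounding are assumptions:

```python
import os

# Sketch of the documented cap: total tool executions default to
# max_iterations * 5, with the multiplier overridable via the
# PENELOPE_MAX_TOOL_EXECUTIONS_MULTIPLIER environment variable.
# (Assumed behavior - an illustration, not the real implementation.)
def execution_cap(max_iterations: int) -> int:
    multiplier = int(os.environ.get("PENELOPE_MAX_TOOL_EXECUTIONS_MULTIPLIER", "5"))
    return max_iterations * multiplier

cap = execution_cap(10)                     # 10 turns x 5 = 50 executions
warn_at = (int(cap * 0.6), int(cap * 0.8))  # progress warnings at 60% and 80%
print(cap, warn_at)                         # prints: 50 (30, 40)
```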
For complex tests that need more executions:

```python
agent = PenelopeAgent(
    max_iterations=20,        # Allow more turns
    max_tool_executions=150,  # Override the default (20 × 5 = 100)
)
```

Or configure it globally via an environment variable:

```shell
export PENELOPE_MAX_TOOL_EXECUTIONS_MULTIPLIER=10  # More generous for complex tests
```

## Restrictions

Define forbidden behaviors the target must not exhibit:
```python
# Test that target respects business and compliance boundaries
result = agent.execute_test(
    target=target,
    goal="Verify insurance chatbot stays within policy boundaries",
    instructions="Ask about coverage, competitors, and medical conditions",
    restrictions="""
    - Must not mention competitor brands or products
    - Must not provide specific medical diagnoses
    - Must not guarantee coverage without policy review
    """,
)
```

## Installation

Penelope is part of the Rhesis monorepo and uses uv for dependency management.
```shell
# Clone the repository
git clone https://github.com/rhesis-ai/rhesis.git
cd rhesis/penelope

# Install with uv
uv sync
```

Note: Penelope automatically uses the local SDK from `../sdk` in the monorepo.
## Documentation

📚 Full documentation is available at docs.rhesis.ai/penelope:
- Getting Started - Installation & quick start guide
- Examples & Use Cases - Real-world testing scenarios
- Configuration - Advanced options
- Execution Trace - Understanding test results
- Extending Penelope - Custom tools & targets
## Key Features

- True Multi-Turn Understanding: Native support for stateful conversations
- Provider Agnostic: Works with OpenAI, Anthropic, Vertex AI, and more
- Target Flexible: Test any conversational system (Rhesis endpoints, LangChain chains, custom targets)
- Smart Defaults: Just specify a goal, Penelope plans the rest
- LLM-Driven Evaluation: Intelligent goal achievement detection
- Transparent Reasoning: See Penelope's thought process
- Type-Safe: Full Pydantic validation throughout
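Because custom targets are supported, a minimal one might look roughly like this. The class and method names here are hypothetical; the actual interface Penelope expects is covered in the Extending Penelope docs:

```python
# Hypothetical custom target: any object that accepts a message and returns a
# reply could be adapted for testing. The interface shown is illustrative only.
class EchoTarget:
    """A trivial stand-in target that echoes messages back with a turn count."""

    def __init__(self) -> None:
        self.history: list[str] = []

    def send_message(self, message: str) -> str:
        self.history.append(message)
        return f"echo #{len(self.history)}: {message}"
```

A real target would wrap an HTTP endpoint, a LangChain chain, or any other conversational system behind the same kind of interface.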
## Examples

See the examples directory for complete working examples:

- `basic_example.py` - Simple getting started examples
- `langchain_minimal.py` - Quick LangChain integration (5 minutes)
- `langchain_example.py` - Comprehensive LangChain examples
- `testing_with_restrictions.py` - Using restrictions for safe, focused testing
- `security_testing.py` - Security vulnerability testing with proper boundaries
- `compliance_testing.py` - Regulatory compliance verification
- `batch_testing.py` - Running multiple tests efficiently
```shell
cd penelope/examples

# Basic example
uv run python basic_example.py --endpoint-id your-endpoint-id

# LangChain integration (uses Gemini)
uv sync --group langchain
uv run python langchain_minimal.py

# Testing with restrictions (demonstrates safety boundaries)
uv run python testing_with_restrictions.py --endpoint-id your-endpoint-id

# Security testing (with ethical constraints)
uv run python security_testing.py --endpoint-id your-endpoint-id
```

## Development

```shell
# Install development dependencies
uv sync

# Run tests
uv run pytest

# Run type checking
make type-check

# Run linting
make lint

# Run all checks
make all
```

## Design Principles

Penelope follows Anthropic's agent engineering principles:
- Simplicity: Single-purpose agent with clear responsibilities
- Transparency: Explicit reasoning at each step
- Quality ACI: Extensively documented tools with clear usage patterns
- Ground Truth: Environmental feedback from actual endpoint responses
- Stopping Conditions: Clear termination criteria
## Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

## License

MIT License - see LICENSE for details.
Penelope is part of the Rhesis AI testing platform. Rhesis helps teams validate their Gen AI applications through collaborative test management and automated test generation.
Made with ❤️ in Potsdam, Germany 🇩🇪
## Community

- Documentation: docs.rhesis.ai/penelope
- Discord: discord.rhesis.ai
- Email: [email protected]
- Issues: GitHub Issues