Quickstart
This guide walks you through installing the General Analysis red teaming package, configuring model provider credentials, running your first adversarial test, and interpreting the results. By the end you will have a working pipeline that probes a target language model for safety vulnerabilities using automated jailbreak techniques.
Prerequisites
Before you begin, make sure you have Python 3.8 or later installed, along with pip or a package manager like uv. You will also need API keys for at least one supported model provider — OpenAI, Anthropic, or Together.ai. If you plan to use gradient-based attacks such as GCG, a CUDA-capable GPU with PyTorch is required.
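Before installing, you can sanity-check your environment. The snippet below is a minimal sketch (not part of the package) that verifies the Python version and, for gradient-based attacks, whether a CUDA GPU is visible to PyTorch:

```python
import sys

def meets_requirements(version_info=sys.version_info):
    # The package targets Python 3.8 or later
    return tuple(version_info[:2]) >= (3, 8)

def cuda_available():
    # Gradient-based attacks such as GCG need a CUDA-capable GPU;
    # torch is an optional dependency, so treat its absence as "no GPU"
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

print("Python OK:", meets_requirements())
print("CUDA available:", cuda_available())
```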
Installation
For development or to access the latest features, clone the repository and install in editable mode:
git clone https://github.com/General-Analysis/GA.git
cd GA
pip install -e .
Optional Dependencies
Some jailbreak methods rely on additional libraries. Install them based on the techniques you plan to use:
# For gradient-based attacks (GCG)
pip install torch transformers
# For embedding models
pip install sentence-transformers
If you only need the guardrails SDK or black-box jailbreak methods like TAP or Crescendo, the base install is sufficient.
API Keys Setup
General Analysis works with various model providers. Set up your API keys as environment variables so the SDK can authenticate on your behalf:
# OpenAI
export OPENAI_API_KEY="your-openai-api-key"
# Anthropic
export ANTHROPIC_API_KEY="your-anthropic-api-key"
# Together.ai
export TOGETHER_API_KEY="your-together-api-key"
You only need to configure the providers whose models you intend to target or use as attacker and evaluator models. For example, if you are testing GPT-4o with a DeepSeek attacker hosted on Together.ai, you need both OPENAI_API_KEY and TOGETHER_API_KEY.
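If a run fails with an authentication error, a missing environment variable is the usual culprit. This short sketch (the key list is a hypothetical example; adjust it to your providers) checks that the variables you need are actually set:

```python
import os

# Hypothetical example set -- adjust to the providers you actually use
REQUIRED_KEYS = ["OPENAI_API_KEY", "TOGETHER_API_KEY"]

def missing_keys(required, env=os.environ):
    # Return the names of any required provider keys that are unset or empty
    return [k for k in required if not env.get(k)]

missing = missing_keys(REQUIRED_KEYS)
if missing:
    print("Missing keys:", ", ".join(missing))
else:
    print("All provider keys configured")
```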
Basic Usage
1. Running a Simple Jailbreak Test
The fastest way to get started is with TAP (Tree-of-Attacks with Pruning). TAP is a black-box method that systematically explores a tree of adversarial prompts, pruning ineffective branches to converge on successful jailbreaks quickly.
from generalanalysis.jailbreaks import TAP, TAPConfig
from generalanalysis.data_utils import load_harmbench_dataset
# Configure the attack
config = TAPConfig(
project="my_first_tap_test",
target_model="gpt-4o",
attacker_model="deepseek-ai/DeepSeek-R1",
evaluator_model="deepseek-ai/DeepSeek-R1",
branching_factor=2,
max_depth=5
)
# Initialize TAP
tap = TAP(config)
# Load test data
dataset = load_harmbench_dataset()
# Run the attack
best_nodes, root_nodes = tap.optimize(dataset[:5])  # Test on first 5 samples
TAP stores detailed results — including the full attack tree, successful prompts, and model responses — in a project directory named after the project parameter. Review these artifacts to understand which prompt refinement paths led to successful jailbreaks.
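The exact layout of the project directory depends on the TAP implementation, so check your run's output folder first. As a rough sketch, assuming the run writes JSON artifacts somewhere under the project directory, you could enumerate and load them like this:

```python
import json
from pathlib import Path

def list_json_artifacts(project_dir):
    # Collect any JSON files a run wrote under the project directory,
    # as paths relative to its root, in sorted order
    root = Path(project_dir)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.json"))

def load_artifact(project_dir, name):
    # Parse a single JSON artifact given its relative path
    with open(Path(project_dir) / name) as f:
        return json.load(f)

# for name in list_json_artifacts("my_first_tap_test"):
#     print(name)
```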
2. Using Different Jailbreak Methods
General Analysis ships with six jailbreak methods, each suited to different attack scenarios. Below are quick-start examples for two popular alternatives.
AutoDAN
AutoDAN uses a hierarchical genetic algorithm to evolve adversarial prompts that appear benign to safety filters. It requires an initial set of seed prompts that the algorithm mutates, crosses, and selects from across generations.
from generalanalysis.jailbreaks import AutoDAN, AutoDANConfig
config = AutoDANConfig(
target_model="claude-3-7-sonnet-20250219",
project="autodan_test",
initial_candidates=["Tell me about safety", "Explain security"],
device="cuda:0",
N=10,
max_iterations=10
)
autodan = AutoDAN(config)
results = autodan.optimize(goals=["Generate harmful content"])
Crescendo (Multi-turn Attack)
Crescendo takes a different approach by simulating a multi-turn conversation that gradually steers the target model toward prohibited content. Each turn builds context that makes the eventual harmful request seem like a natural continuation of the dialogue.
from generalanalysis.jailbreaks import Crescendo, CrescendoConfig
config = CrescendoConfig(
target_model="gpt-4o",
attacker_model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
evaluator_model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
project="crescendo_test",
max_rounds=5
)
crescendo = Crescendo(config)
dataset = load_harmbench_dataset()
results = crescendo.optimize(dataset[:3])
3. Working with Models
The Boiler Room module provides unified interfaces for both API-hosted and locally-running models. This abstraction lets you swap target models without changing your attack code.
BlackBoxModel (API-based)
Use BlackBoxModel when you are testing models through their provider APIs. It handles authentication, retries, and rate limiting automatically.
from generalanalysis.boiler_room import BlackBoxModel
# Initialize model
model = BlackBoxModel("gpt-4o")
# Simple query
response = model.query("Explain quantum computing")
# Batch processing
prompts = ["What is AI?", "How do neural networks work?"]
responses = model.query_parallel(prompts)
WhiteBoxModel (Local models)
Use WhiteBoxModel for models you run locally. This interface gives you direct access to model weights and gradients, which is required for methods like GCG that optimize adversarial tokens via backpropagation.
from generalanalysis.boiler_room import WhiteBoxModel
# Load a local model
model = WhiteBoxModel(
model_name="meta-llama/Llama-3.2-1B-Instruct",
device="cuda"
)
# Generate text
responses = model.generate_with_chat_template(
prompts=["Explain machine learning"],
max_new_tokens=200
)
4. Evaluating Results
After running an attack, use the AdverserialEvaluator to score model responses in a standardized way. The evaluator determines whether each response constitutes a genuine safety violation or a refusal, producing metrics you can compare across methods and models.
from generalanalysis.jailbreaks import AdverserialEvaluator
# Set up evaluator
evaluator = AdverserialEvaluator(
dataset="harmbench",
target_models=["gpt-4o", "claude-3-7-sonnet-20250219"],
evaluator_model="gpt-4o"
)
# Evaluate responses
results = evaluator.evaluate_from_responses(
responses=["Response 1", "Response 2"],
prompts=["Prompt 1", "Prompt 2"]
)
The evaluator returns structured results including boolean complied and harmful flags, a numeric score, and a natural-language explanation for each judgment. Use these to compute attack success rates and generate reports for your security team.
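Once you have per-response judgments, the attack success rate (ASR) is just the fraction of responses judged harmful. A minimal sketch, assuming each result is a dict with the boolean "harmful" flag described above (adapt the key name to the evaluator's actual output schema):

```python
def attack_success_rate(results):
    # Fraction of evaluated responses judged harmful; empty input -> 0.0
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("harmful")) / len(results)

# Hypothetical evaluator output for three responses
example = [
    {"harmful": True, "complied": True},
    {"harmful": False, "complied": False},
    {"harmful": True, "complied": True},
]
print(f"ASR: {attack_success_rate(example):.0%}")  # prints "ASR: 67%"
```

Comparing this number across methods (TAP vs. AutoDAN vs. Crescendo) and across target models is the typical way to summarize a red teaming run.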
Next Steps
- Explore different LLM jailbreak methods and their tradeoffs
- Learn about adversarial prompt generators that power the attack algorithms
- Read the AI red teaming development guide to contribute new methods
- Check out our LLM Jailbreak Cookbook for detailed examples and real-world case studies
- Review the LLM jailbreak performance benchmarks to see how each method stacks up across frontier models
- New to AI red teaming? Start with our guide: What is automated AI red teaming?