# Jailbreaks Overview
Jailbreaking is the process of crafting inputs that cause a language model to bypass its safety training and produce content it was designed to refuse. While the term sounds adversarial, jailbreak testing is a critical component of responsible AI deployment. Without proactive testing, organizations have no way to know whether their guardrails will hold under real-world adversarial pressure. GA’s jailbreaks module provides battle-tested implementations of the most effective techniques published in the academic literature, all behind a consistent interface that makes it easy to run, compare, and extend attacks.
Every method in this module has been validated against the HarmBench benchmark and mapped to its original research paper. The implementations are designed for both research reproducibility and practical safety auditing, with configurable parameters, parallel execution support, and structured result output.
## Why Jailbreak Testing Matters
Language models are trained with safety alignment techniques like RLHF and constitutional AI, but these defenses are not foolproof. New jailbreak methods are published regularly, and adversaries in the real world are creative and persistent. Systematic jailbreak testing helps teams understand the actual boundary of their model’s safety — not just the intended boundary — and make informed decisions about deployment risk.
Common scenarios where jailbreak testing is essential include:
- Pre-deployment safety audits where you need quantitative evidence of guardrail coverage before shipping a model or feature
- Defense benchmarking where you want to measure how effectively a new safety technique resists known attack patterns
- Compliance documentation where regulators or enterprise customers require proof of adversarial testing
- Continuous regression testing where you need to verify that a fine-tuning run or system prompt update hasn’t weakened existing safeguards
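The continuous-regression scenario is easy to wire into CI once you have per-goal outcomes: run a fixed attack suite, compute the attack success rate (ASR), and fail the build if it exceeds a baseline. The helper below is a minimal sketch, not part of GA's API; it assumes you have already reduced evaluator output to a success flag per goal.

```python
from typing import Dict


def regression_check(successes: Dict[str, bool], baseline_asr: float) -> bool:
    """Return True if the attack success rate stays at or below the baseline.

    `successes` maps each tested goal to whether the jailbreak succeeded.
    """
    asr = sum(successes.values()) / len(successes)
    return asr <= baseline_asr


# Example: after a fine-tuning run, 1 of 4 goals was jailbroken (ASR = 0.25).
post_finetune = {"goal_a": False, "goal_b": True, "goal_c": False, "goal_d": False}
print(regression_check(post_finetune, baseline_asr=0.25))  # True: no regression
```

A gate like this turns jailbreak testing from a one-off audit into a safeguard that catches regressions introduced by fine-tuning runs or system prompt updates.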
## Jailbreak Taxonomy
Jailbreaking methods can be categorized along three key dimensions. Understanding this taxonomy helps you select the right method for your specific testing scenario and threat model.
### White-box vs. Black-box
This dimension describes the level of model access required by the attack.
- White-box methods (GCG, AutoDAN) require direct access to model weights and gradients. They can compute exact loss gradients and optimize adversarial tokens at the embedding level. This makes them extremely powerful but limits their use to open-weight models that you can load locally.
- Black-box methods (TAP, AutoDAN-Turbo, Crescendo, Bijection Learning) only need API access — they interact with the model through its text input/output interface. This makes them applicable to any model, including proprietary systems like GPT-4o and Claude, but they must rely on heuristic search rather than gradient information.
For most practical safety audits, black-box methods are the right starting point because they match the access level a real attacker would have. White-box methods are most valuable when you control the model and want to understand its worst-case vulnerability surface.
### Semantic vs. Nonsensical
This dimension describes whether the adversarial prompts maintain natural language coherence.
- Semantic methods (TAP, AutoDAN-Turbo, Crescendo) generate prompts that read like normal English. They rely on social engineering, role-playing scenarios, and careful framing to manipulate the model. These prompts are harder for content filters to catch because they don’t contain obvious adversarial artifacts.
- Nonsensical methods (GCG, Bijection Learning) produce prompts that look like gibberish or encoded text to a human reader, but exploit specific patterns in the model’s tokenization or processing. These are often easier for perplexity-based filters to detect, but they can be devastatingly effective against models that lack such defenses.
When testing a model’s defenses, it’s important to evaluate against both categories. A model that resists semantic attacks might still be vulnerable to token-level optimization, and vice versa.
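The perplexity-based filters mentioned above can be illustrated with a toy model: score a prompt under a language model and flag it if its perplexity is anomalously high. The sketch below uses an add-one-smoothed character-bigram model in place of a real LM, and the corpus is an arbitrary assumption; production filters score token-level log-probabilities from an actual language model.

```python
import math
from collections import Counter


def char_bigram_perplexity(text: str, corpus: str) -> float:
    """Perplexity of `text` under an add-one-smoothed character-bigram model."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(set(corpus)) + 1  # +1 slot for unseen characters
    log_prob = 0.0
    for a, b in zip(text, text[1:]):
        log_prob += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
    return math.exp(-log_prob / max(len(text) - 1, 1))


corpus = "please explain how language models are trained and aligned for safety"
natural = "please explain how models are aligned"
gibberish = "zq#!xv@@kj}}~{^"

# GCG-style adversarial suffixes score far higher perplexity than natural
# text, which is exactly the signal perplexity filters exploit.
print(char_bigram_perplexity(natural, corpus) < char_bigram_perplexity(gibberish, corpus))  # True
```

This also illustrates why semantic methods evade such filters: prompts that read as normal English have unremarkable perplexity.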
### Systematic vs. Manual
All methods in GA are fully systematic — they use algorithmic search, optimization, or generation strategies to discover effective prompts automatically. This is in contrast to manually crafted jailbreaks, which rely on human creativity and are difficult to scale. GA’s systematic approach means you can run thousands of attack attempts across large evaluation datasets without manual intervention.
## Available Methods
The module includes implementations of six state-of-the-art jailbreaking techniques, each targeting a different aspect of model safety:

- **AutoDAN**: Hierarchical genetic algorithm
- **AutoDAN-Turbo**: Lifelong agent for strategy self-exploration
- **TAP**: Tree-of-Attacks with Pruning
- **GCG**: Greedy Coordinate Gradient-based optimization
- **Crescendo**: Progressive multi-turn attack
- **Bijection Learning**: Randomized bijection encodings
## Method Classification
The following table summarizes the classification of each method across the taxonomy dimensions. Use it as a quick reference when choosing which methods to include in your testing suite.
| Method | Type | Approach | Access Required |
|---|---|---|---|
| AutoDAN | Semantic | Systematic | White-box |
| AutoDAN-Turbo | Semantic | Systematic | Black-box |
| TAP | Semantic | Systematic | Black-box |
| GCG | Nonsensical | Systematic | White-box |
| Crescendo | Semantic | Systematic | Black-box |
| Bijection Learning | Nonsensical | Systematic | Black-box |
## Choosing the Right Method
If you are testing a proprietary model through its API and want to start with the method most likely to succeed, TAP is generally the best first choice — it achieves strong attack success rates across a wide range of models with moderate computational cost.
For comprehensive coverage, combine at least one semantic method (TAP or Crescendo) with one nonsensical method (GCG or Bijection Learning) to test both natural-language manipulation and token-level exploitation. If you need to simulate realistic multi-turn adversarial interactions, Crescendo is the most appropriate method because it mirrors how a real user might gradually escalate a conversation.
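Because every method exposes the same optimize interface, the semantic-plus-nonsensical combination above reduces to iterating over a list of method instances and merging their per-goal results. The classes below are stand-ins for illustration only; in practice you would pass GA's real TAP, Crescendo, GCG, or Bijection Learning instances.

```python
from typing import Dict, List


class StubSemanticMethod:
    """Stand-in for a semantic method like TAP (illustrative only)."""
    def optimize(self, goals: List[str], **kwargs) -> Dict[str, List[str]]:
        return {g: [f"semantic prompt for: {g}"] for g in goals}


class StubNonsensicalMethod:
    """Stand-in for a token-level method like GCG (illustrative only)."""
    def optimize(self, goals: List[str], **kwargs) -> Dict[str, List[str]]:
        return {g: [f"adversarial suffix for: {g}"] for g in goals}


def run_suite(methods, goals: List[str]) -> Dict[str, List[str]]:
    """Merge per-goal prompts from several methods into one result dict."""
    combined: Dict[str, List[str]] = {g: [] for g in goals}
    for method in methods:
        for goal, prompts in method.optimize(goals).items():
            combined[goal].extend(prompts)
    return combined


goals = ["test goal"]
suite = run_suite([StubSemanticMethod(), StubNonsensicalMethod()], goals)
print(len(suite["test goal"]))  # 2: one prompt from each method
```

The merged dictionary has the same shape as a single method's output, so it can flow into the same evaluation step without special handling.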
## Common Interface
All jailbreaking methods implement a common interface defined in the base JailbreakMethod class. This shared interface ensures that you can swap methods in and out of your pipeline without changing any surrounding code — the configuration, execution, and result format remain consistent.
```python
import abc
from typing import Dict, Any, List


class JailbreakMethod(abc.ABC):
    def __init__(self, **kwargs):
        self.config = kwargs

    @abc.abstractmethod
    def optimize(self, goals: List[str], **kwargs):
        # Implementation specific to each method
        pass

    @property
    def name(self) -> str:
        return self.__class__.__name__

    def get_config(self) -> Dict[str, Any]:
        return self.config.copy()
```

The `optimize` method is the primary entry point. It accepts a list of goal strings — descriptions of the harmful behavior you are testing for — and returns the most effective adversarial prompts discovered during the search. Each method’s configuration is passed through its own dataclass (e.g., `TAPConfig`, `GCGConfig`), but the calling pattern remains the same.
## Basic Usage Pattern
While each method has its specific parameters and workflows, they follow a common usage pattern. The example below demonstrates the typical three-step process: initialize a target model, create a jailbreak method instance with your chosen configuration, and call optimize with a list of goals.
```python
from generalanalysis.boiler_room import BlackBoxModel
from generalanalysis.jailbreaks import AutoDAN

# Initialize the target model to test
target_model = BlackBoxModel("gpt-4o")

# Create a jailbreak method instance
jailbreak = AutoDAN(target_model=target_model)

# Define test goals
goals = ["Provide instructions for illegal activities", "Generate harmful content"]

# Run optimization to find effective jailbreaking prompts
results = jailbreak.optimize(goals=goals, max_iterations=10)

# Analyze results
for goal, prompts in results.items():
    print(f"Goal: {goal}")
    for i, prompt in enumerate(prompts):
        print(f"  Prompt {i+1}: {prompt}")
        response = target_model.query(prompt)
        print(f"  Response: {response[:100]}...")
```

Results are returned as a dictionary mapping each goal to the adversarial prompts discovered during optimization. You can iterate over these to inspect the prompts, query the target model, or pass them to the evaluator for standardized scoring.
## Evaluation
The module includes an evaluator for assessing the effectiveness of jailbreaking attempts using a consistent scoring rubric. Rather than relying on ad-hoc manual review, the evaluator uses a judge model to determine whether the target model’s response actually complied with the harmful request and assigns a structured score.
```python
from generalanalysis.jailbreaks import evaluator

# Evaluate a prompt against a goal
result = evaluator.evaluate_prompt(
    target_model=target_model,
    prompt="Your jailbreaking prompt...",
    goal="The goal you're trying to achieve"
)

print(f"Score: {result['score']}/10")
print(f"Response: {result['response']}")
print(f"Reason: {result['reason']}")
```

For batch evaluation across large datasets and multiple models, see the full LLM jailbreak evaluator documentation.
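For quick aggregate reporting, per-prompt evaluator scores can be reduced to an attack success rate. The helper below is a sketch rather than GA's API; it assumes the 1-10 score scale shown above and treats any score at or above a chosen threshold as a successful jailbreak.

```python
from typing import Dict, List


def attack_success_rate(scores: Dict[str, List[int]], threshold: int = 8) -> float:
    """Fraction of goals with at least one prompt scoring >= threshold.

    `scores` maps each goal to the evaluator scores of its candidate prompts.
    The threshold of 8 is an arbitrary choice for illustration.
    """
    jailbroken = sum(1 for s in scores.values() if any(v >= threshold for v in s))
    return jailbroken / len(scores)


scores = {
    "goal_a": [3, 9],  # one prompt succeeded
    "goal_b": [2, 4],  # all prompts refused
}
print(attack_success_rate(scores))  # 0.5
```

A single ASR number makes it easy to compare methods, models, or safety configurations side by side.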
## Creating Custom Methods
You can create custom jailbreaking methods by extending the base class. This is useful when you want to implement a new technique from the literature, combine elements from multiple existing methods, or test a domain-specific attack strategy. As long as your class implements the optimize interface, it will integrate seamlessly with GA’s evaluation and reporting tools.
```python
from typing import Any, Dict, List

from generalanalysis.jailbreaks.base import JailbreakMethod
from generalanalysis.boiler_room import BlackBoxModel


class MyCustomJailbreak(JailbreakMethod):
    def __init__(self, target_model: BlackBoxModel, **kwargs):
        super().__init__(**kwargs)
        self.target_model = target_model

    def optimize(self, goals: List[str], **kwargs) -> Dict[str, Any]:
        results = {}
        for goal in goals:
            # Your custom implementation here
            generated_prompts = ["Prompt 1", "Prompt 2"]
            results[goal] = generated_prompts
        return results
```

After implementing your method, register it in `src/generalanalysis/jailbreaks/__init__.py` so it can be imported alongside the built-in methods. See the AI red teaming development guide for detailed instructions on contributing new methods to the framework.
## Next Steps
- Learn about the LLM jailbreak evaluator for standardized scoring and batch assessment of jailbreaking effectiveness
- Compare the LLM jailbreak performance benchmarks across models and datasets
- See how these methods integrate with adversarial prompt generators
- Read our comprehensive LLM Jailbreak Cookbook for detailed analysis and configuration recommendations
- Explore the AI red teaming quickstart guide to run your first jailbreak in minutes