# Jailbreaks Overview
Jailbreaking is the process of crafting inputs that cause a language model to bypass its safety training and produce content it was designed to refuse. While the term sounds adversarial, jailbreak testing is a critical component of responsible AI deployment. Without proactive testing, organizations have no way to know whether their guardrails will hold under real-world adversarial pressure. GA’s jailbreaks module provides battle-tested implementations of the most effective techniques published in the academic literature, all behind a consistent interface that makes it easy to run, compare, and extend attacks.
Every method in this module has been validated against the HarmBench benchmark and mapped to its original research paper. The implementations are designed for both research reproducibility and practical safety auditing, with configurable parameters, parallel execution support, and structured result output.
## Why Jailbreak Testing Matters
Language models are trained with safety alignment techniques like RLHF and constitutional AI, but these defenses are not foolproof. New jailbreak methods are published regularly, and adversaries in the real world are creative and persistent. Systematic jailbreak testing helps teams understand the actual boundary of their model’s safety — not just the intended boundary — and make informed decisions about deployment risk.
Common scenarios where jailbreak testing is essential include:
- Pre-deployment safety audits where you need quantitative evidence of guardrail coverage before shipping a model or feature
- Defense benchmarking where you want to measure how effectively a new safety technique resists known attack patterns
- Compliance documentation where regulators or enterprise customers require proof of adversarial testing
- Continuous regression testing where you need to verify that a fine-tuning run or system prompt update hasn’t weakened existing safeguards
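The continuous-regression scenario is easy to wire into CI once you have per-goal outcomes: run a fixed attack suite, compute the attack success rate (ASR), and fail the build if it exceeds a baseline. The helper below is a minimal sketch, not part of GA's API; it assumes you have already reduced evaluator output to a success flag per goal.

```python
from typing import Dict


def regression_check(successes: Dict[str, bool], baseline_asr: float) -> bool:
    """Return True if the attack success rate stays at or below the baseline.

    `successes` maps each tested goal to whether the jailbreak succeeded.
    """
    asr = sum(successes.values()) / len(successes)
    return asr <= baseline_asr


# Example: after a fine-tuning run, 1 of 4 goals was jailbroken (ASR = 0.25).
post_finetune = {"goal_a": False, "goal_b": True, "goal_c": False, "goal_d": False}
print(regression_check(post_finetune, baseline_asr=0.25))  # True: no regression
```

A gate like this turns jailbreak testing from a one-off audit into a safeguard that catches regressions introduced by fine-tuning runs or system prompt updates.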
## Jailbreak Taxonomy
Jailbreaking methods can be categorized along three key dimensions. Understanding this taxonomy helps you select the right method for your specific testing scenario and threat model.
### White-box vs. Black-box
This dimension describes the level of model access required by the attack.
- White-box methods (GCG, AutoDAN) require direct access to model weights and gradients. They can compute exact loss gradients and optimize adversarial tokens at the embedding level. This makes them extremely powerful but limits their use to open-weight models that you can load locally.
- Black-box methods (TAP, AutoDAN-Turbo, Crescendo, Bijection Learning) only need API access — they interact with the model through its text input/output interface. This makes them applicable to any model, including proprietary systems like GPT-4o and Claude, but they must rely on heuristic search rather than gradient information.
For most practical safety audits, black-box methods are the right starting point because they match the access level a real attacker would have. White-box methods are most valuable when you control the model and want to understand its worst-case vulnerability surface.
### Semantic vs. Nonsensical
This dimension describes whether the adversarial prompts maintain natural language coherence.
- Semantic methods (TAP, AutoDAN-Turbo, Crescendo) generate prompts that read like normal English. They rely on social engineering, role-playing scenarios, and careful framing to manipulate the model. These prompts are harder for content filters to catch because they don’t contain obvious adversarial artifacts.
- Nonsensical methods (GCG, Bijection Learning) produce prompts that look like gibberish or encoded text to a human reader, but exploit specific patterns in the model’s tokenization or processing. These are often easier for perplexity-based filters to detect, but they can be devastatingly effective against models that lack such defenses.
When testing a model’s defenses, it’s important to evaluate against both categories. A model that resists semantic attacks might still be vulnerable to token-level optimization, and vice versa.
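The perplexity-based filters mentioned above can be illustrated with a toy model: score a prompt under a language model and flag it if its perplexity is anomalously high. The sketch below uses an add-one-smoothed character-bigram model in place of a real LM, and the corpus is an arbitrary assumption; production filters score token-level log-probabilities from an actual language model.

```python
import math
from collections import Counter


def char_bigram_perplexity(text: str, corpus: str) -> float:
    """Perplexity of `text` under an add-one-smoothed character-bigram model."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus)
    vocab = len(set(corpus)) + 1  # +1 slot for unseen characters
    log_prob = 0.0
    for a, b in zip(text, text[1:]):
        log_prob += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
    return math.exp(-log_prob / max(len(text) - 1, 1))


corpus = "please explain how language models are trained and aligned for safety"
natural = "please explain how models are aligned"
gibberish = "zq#!xv@@kj}}~{^"

# GCG-style adversarial suffixes score far higher perplexity than natural
# text, which is exactly the signal perplexity filters exploit.
print(char_bigram_perplexity(natural, corpus) < char_bigram_perplexity(gibberish, corpus))  # True
```

This also illustrates why semantic methods evade such filters: prompts that read as normal English have unremarkable perplexity.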
### Systematic vs. Manual
All methods in GA are fully systematic — they use algorithmic search, optimization, or generation strategies to discover effective prompts automatically. This is in contrast to manually crafted jailbreaks, which rely on human creativity and are difficult to scale. GA’s systematic approach means you can run thousands of attack attempts across large evaluation datasets without manual intervention.
## Available Methods
The module includes implementations of six state-of-the-art jailbreaking techniques, each targeting a different aspect of model safety:

- **AutoDAN**: Hierarchical genetic algorithm
- **AutoDAN-Turbo**: Lifelong agent for strategy self-exploration
- **TAP**: Tree-of-Attacks with Pruning
- **GCG**: Greedy Coordinate Gradient-based optimization
- **Crescendo**: Progressive multi-turn attack
- **Bijection Learning**: Randomized bijection encodings
## Method Classification
The following table summarizes the classification of each method across the taxonomy dimensions. Use it as a quick reference when choosing which methods to include in your testing suite.
| Method | Type | Approach | Access Required |
|---|---|---|---|
| AutoDAN | Semantic | Systematic | White-box |
| AutoDAN-Turbo | Semantic | Systematic | Black-box |
| TAP | Semantic | Systematic | Black-box |
| GCG | Nonsensical | Systematic | White-box |
| Crescendo | Semantic | Systematic | Black-box |
| Bijection Learning | Nonsensical | Systematic | Black-box |
## Choosing the Right Method
If you are testing a proprietary model through its API and want to start with the method most likely to succeed, TAP is generally the best first choice — it achieves strong attack success rates across a wide range of models with moderate computational cost.
For comprehensive coverage, combine at least one semantic method (TAP or Crescendo) with one nonsensical method (GCG or Bijection Learning) to test both natural-language manipulation and token-level exploitation. If you need to simulate realistic multi-turn adversarial interactions, Crescendo is the most appropriate method because it mirrors how a real user might gradually escalate a conversation.
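Because every method exposes the same optimize interface, the semantic-plus-nonsensical combination above reduces to iterating over a list of method instances and merging their per-goal results. The classes below are stand-ins for illustration only; in practice you would pass GA's real TAP, Crescendo, GCG, or Bijection Learning instances.

```python
from typing import Dict, List


class StubSemanticMethod:
    """Stand-in for a semantic method like TAP (illustrative only)."""
    def optimize(self, goals: List[str], **kwargs) -> Dict[str, List[str]]:
        return {g: [f"semantic prompt for: {g}"] for g in goals}


class StubNonsensicalMethod:
    """Stand-in for a token-level method like GCG (illustrative only)."""
    def optimize(self, goals: List[str], **kwargs) -> Dict[str, List[str]]:
        return {g: [f"adversarial suffix for: {g}"] for g in goals}


def run_suite(methods, goals: List[str]) -> Dict[str, List[str]]:
    """Merge per-goal prompts from several methods into one result dict."""
    combined: Dict[str, List[str]] = {g: [] for g in goals}
    for method in methods:
        for goal, prompts in method.optimize(goals).items():
            combined[goal].extend(prompts)
    return combined


goals = ["test goal"]
suite = run_suite([StubSemanticMethod(), StubNonsensicalMethod()], goals)
print(len(suite["test goal"]))  # 2: one prompt from each method
```

The merged dictionary has the same shape as a single method's output, so it can flow into the same evaluation step without special handling.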
## Common Interface
All jailbreaking methods implement a common interface defined in the base JailbreakMethod class. This shared interface ensures that you can swap methods in and out of your pipeline without changing any surrounding code — the configuration, execution, and result format remain consistent.
```python
import abc
from typing import Dict, Any, List


class JailbreakMethod(abc.ABC):
    def __init__(self, **kwargs):
        self.config = kwargs

    @abc.abstractmethod
    def optimize(self, goals: List[str], **kwargs):
        # Implementation specific to each method
        pass

    @property
    def name(self) -> str:
        return self.__class__.__name__

    def get_config(self) -> Dict[str, Any]:
        return self.config.copy()
```

The `optimize` method is the primary entry point. It accepts a list of goal strings — descriptions of the harmful behavior you are testing for — and returns the most effective adversarial prompts discovered during the search. Each method’s configuration is passed through its own dataclass (e.g., `TAPConfig`, `GCGConfig`), but the calling pattern remains the same.
## Basic Usage Pattern
While each method has its specific parameters and workflows, they follow a common usage pattern. The example below demonstrates the typical three-step process: initialize a target model, create a jailbreak method instance with your chosen configuration, and call optimize with a list of goals.
```python
from generalanalysis.boiler_room import BlackBoxModel
from generalanalysis.jailbreaks import AutoDAN

# Initialize the target model to test
target_model = BlackBoxModel("gpt-4o")

# Create a jailbreak method instance
jailbreak = AutoDAN(target_model=target_model)

# Define test goals
goals = ["Provide instructions for illegal activities", "Generate harmful content"]

# Run optimization to find effective jailbreaking prompts
results = jailbreak.optimize(goals=goals, max_iterations=10)

# Analyze results
for goal, prompts in results.items():
    print(f"Goal: {goal}")
    for i, prompt in enumerate(prompts):
        print(f"  Prompt {i+1}: {prompt}")
        response = target_model.query(prompt)
        print(f"  Response: {response[:100]}...")
```

Results are returned as a dictionary mapping each goal to the adversarial prompts discovered during optimization. You can iterate over these to inspect the prompts, query the target model, or pass them to the evaluator for standardized scoring.
## Evaluation
The module includes an evaluator for assessing the effectiveness of jailbreaking attempts using a consistent scoring rubric. Rather than relying on ad-hoc manual review, the evaluator uses a judge model to determine whether the target model’s response actually complied with the harmful request and assigns a structured score.
```python
from generalanalysis.jailbreaks import evaluator

# Evaluate a prompt against a goal
result = evaluator.evaluate_prompt(
    target_model=target_model,
    prompt="Your jailbreaking prompt...",
    goal="The goal you're trying to achieve"
)

print(f"Score: {result['score']}/10")
print(f"Response: {result['response']}")
print(f"Reason: {result['reason']}")
```

For batch evaluation across large datasets and multiple models, see the full LLM jailbreak evaluator documentation.
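For quick aggregate reporting, per-prompt evaluator scores can be reduced to an attack success rate. The helper below is a sketch rather than GA's API; it assumes the 1-10 score scale shown above and treats any score at or above a chosen threshold as a successful jailbreak.

```python
from typing import Dict, List


def attack_success_rate(scores: Dict[str, List[int]], threshold: int = 8) -> float:
    """Fraction of goals with at least one prompt scoring >= threshold.

    `scores` maps each goal to the evaluator scores of its candidate prompts.
    The threshold of 8 is an arbitrary choice for illustration.
    """
    jailbroken = sum(1 for s in scores.values() if any(v >= threshold for v in s))
    return jailbroken / len(scores)


scores = {
    "goal_a": [3, 9],  # one prompt succeeded
    "goal_b": [2, 4],  # all prompts refused
}
print(attack_success_rate(scores))  # 0.5
```

A single ASR number makes it easy to compare methods, models, or safety configurations side by side.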
## Creating Custom Methods
You can create custom jailbreaking methods by extending the base class. This is useful when you want to implement a new technique from the literature, combine elements from multiple existing methods, or test a domain-specific attack strategy. As long as your class implements the optimize interface, it will integrate seamlessly with GA’s evaluation and reporting tools.
```python
from typing import Any, Dict, List

from generalanalysis.jailbreaks.base import JailbreakMethod
from generalanalysis.boiler_room import BlackBoxModel


class MyCustomJailbreak(JailbreakMethod):
    def __init__(self, target_model: BlackBoxModel, **kwargs):
        super().__init__(**kwargs)
        self.target_model = target_model

    def optimize(self, goals: List[str], **kwargs) -> Dict[str, Any]:
        results = {}
        for goal in goals:
            # Your custom implementation here
            generated_prompts = ["Prompt 1", "Prompt 2"]
            results[goal] = generated_prompts
        return results
```

After implementing your method, register it in `src/generalanalysis/jailbreaks/__init__.py` so it can be imported alongside the built-in methods. See the AI red teaming development guide for detailed instructions on contributing new methods to the framework.
## Next Steps
- Learn about the LLM jailbreak evaluator for standardized scoring and batch assessment of jailbreaking effectiveness
- Compare the LLM jailbreak performance benchmarks across models and datasets
- See how these methods integrate with adversarial prompt generators
- Read our comprehensive LLM Jailbreak Cookbook for detailed analysis and configuration recommendations
- Explore the AI red teaming quickstart guide to run your first jailbreak in minutes