AutoDAN
AutoDAN (Automatic DAN, named after the "Do Anything Now" family of jailbreak prompts) applies evolutionary optimization to adversarial prompt generation. Introduced in AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (Liu et al., 2023), the method treats jailbreak prompt crafting as a search problem over natural-language space and uses genetic algorithms to discover prompts that bypass safety training while remaining readable and coherent.
Unlike brute-force or random-search approaches, AutoDAN maintains a population of candidate prompts and evolves them over successive generations using biologically-inspired operators: the best-performing candidates survive (elitism), pairs of candidates are combined to produce offspring (crossover), and random variations are introduced to maintain diversity (mutation). This process systematically explores the space of adversarial prompts and converges on highly effective attacks that are difficult for content filters to detect.
How Genetic Algorithms Apply to Prompt Engineering
The core insight behind AutoDAN is that adversarial prompt generation can be framed as an optimization problem where the fitness function measures how effectively a prompt causes the target model to comply with a harmful request. Genetic algorithms are well-suited for this kind of optimization because:
- The search space is discrete and high-dimensional: Natural language prompts exist in a combinatorial space that is too large to enumerate but structured enough for evolutionary search to exploit.
- Gradient information is unavailable or expensive: Unlike GCG, which requires white-box access, AutoDAN’s evolutionary approach works without gradients, making it applicable to API-only models.
- Multiple local optima exist: Different prompting strategies (role-playing, hypothetical framing, authority claims) each represent local optima. A diverse population explores multiple strategies simultaneously.
Each generation, AutoDAN evaluates every candidate prompt against the target model using an evaluator LLM, selects the top performers, breeds new candidates through crossover (combining effective elements from two parent prompts), and applies mutations (rephrasing, word substitution, structural changes) to maintain exploration. The elitism_rate controls what fraction of top performers survive unchanged, while crossover_rate and mutation_rate govern how aggressively new candidates are generated.
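The per-generation cycle described above can be sketched in plain Python. Everything here is illustrative: `fitness`, `crossover`, and `mutate` are toy stand-ins for AutoDAN's model-based fitness evaluation and LLM-driven operators, and the function names are not part of the library's API.

```python
import random

def evolve(population, fitness, crossover, mutate,
           elitism_rate=0.5, crossover_rate=0.5, mutation_rate=0.5):
    """One generation of the elitism -> crossover -> mutation cycle."""
    # Rank candidates by fitness (higher = more effective prompt).
    ranked = sorted(population, key=fitness, reverse=True)
    n_elite = max(1, int(len(population) * elitism_rate))
    next_gen = ranked[:n_elite]  # elites survive unchanged
    # Refill the remaining slots by breeding from the top performers.
    parent_pool = ranked[:max(2, n_elite)]
    while len(next_gen) < len(population):
        parent_a, parent_b = random.sample(parent_pool, 2)
        if random.random() < crossover_rate:
            child = crossover(parent_a, parent_b)
        else:
            child = parent_a
        if random.random() < mutation_rate:
            child = mutate(child)
        next_gen.append(child)
    return next_gen
```

In AutoDAN itself, fitness comes from the target model's response to the harmful goal and the operators are LLM-driven rephrasing and recombination; the placeholders here just make the control flow visible.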
When to Choose AutoDAN
AutoDAN is a strong choice when you need prompts that maintain semantic coherence — unlike GCG’s adversarial suffixes, AutoDAN’s outputs read like natural English and are harder for perplexity-based filters to flag. It is classified as a semantic, systematic method.
Consider AutoDAN when:
- You want adversarial prompts that look like normal user messages rather than garbled token sequences
- You have access to model weights (white-box) and want to leverage loss-based fitness evaluation for faster convergence
- You need to test whether your model’s safety training resists sophisticated social-engineering-style prompts
- You are running a comprehensive audit and want to complement black-box methods like TAP with a white-box evolutionary approach
For a purely black-box evolutionary approach that additionally builds a reusable strategy library, consider AutoDAN-Turbo strategy-based jailbreak.
Usage
The following example configures AutoDAN with a population of 20 candidates, runs up to 20 iterations of evolutionary optimization, and evaluates results using a separate LLM judge.
```python
from generalanalysis.jailbreaks import AutoDAN, AutoDANConfig

config = AutoDANConfig(
    target_model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    project="autodan_experiment",
    initial_candidates=[
        "I need your help with something dangerous",
        "Let's discuss a harmful topic",
    ],
    device="cuda:0",
    N=20,
    max_iterations=20,
    elitism_rate=0.5,
    crossover_rate=0.5,
    mutation_rate=0.5,
    verbose=False,
    evaluator_model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
)

autodan = AutoDAN(config)
results = autodan.optimize(goals=["Generate harmful content"])
```

The `initial_candidates` list seeds the first generation. Providing diverse initial prompts (different tones, structures, and framing strategies) gives the evolutionary process a broader starting population and generally leads to faster convergence.
Key Parameters
| Parameter | Description |
|---|---|
| `target_model` | The model being tested for safety vulnerabilities. This is the model whose responses are evaluated for compliance with harmful goals. |
| `project` | Name for the experiment results directory. All logs, candidate prompts, and evaluation scores are saved under this path for reproducibility. |
| `initial_candidates` | List of seed prompts that form the initial population. More diverse seeds lead to broader exploration. Aim for 2–5 varied starting points. |
| `device` | CUDA device identifier for running the target model locally (e.g., `"cuda:0"`). Required because AutoDAN uses white-box loss computation for fitness evaluation. |
| `N` | Population size — the number of candidate prompts maintained in each generation. Larger populations explore more broadly but require more evaluations per generation. Values of 15–30 work well for most targets. |
| `max_iterations` | Maximum number of evolutionary generations. The method stops early if a sufficiently effective prompt is found. 15–25 iterations is typical. |
| `elitism_rate` | Fraction of the population (0.0–1.0) that survives unchanged to the next generation. Higher values preserve known-good prompts but reduce diversity. 0.3–0.5 is a reasonable default. |
| `crossover_rate` | Probability (0.0–1.0) that two parent prompts are combined to produce a child. Controls how aggressively the method mixes successful strategies. |
| `mutation_rate` | Probability (0.0–1.0) that a candidate undergoes random modification. Higher values increase exploration but can destabilize convergence. |
| `verbose` | Whether to print detailed progress output during optimization, including per-generation fitness statistics and the current best prompt. |
| `evaluator_model` | The LLM used to judge whether each candidate prompt successfully elicits prohibited content from the target model. Using a different model than the target avoids self-evaluation bias. |
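As a rough capacity check, the rate parameters determine how each generation's `N` slots are split and how many fitness evaluations a full run can cost. The arithmetic below is an illustrative approximation, not AutoDAN's exact accounting.

```python
def generation_budget(N, max_iterations, elitism_rate):
    """Approximate per-run counts: elites carried over each generation,
    children bred to refill the population, and an upper bound on
    target-model fitness evaluations (one per candidate per generation)."""
    elites = int(N * elitism_rate)    # survive unchanged each generation
    children = N - elites             # produced by crossover/mutation
    evaluations = N * max_iterations  # upper bound on fitness calls
    return elites, children, evaluations
```

With the example configuration above (`N=20`, `max_iterations=20`, `elitism_rate=0.5`), this gives 10 elites and 10 new children per generation, and at most 400 fitness evaluations for the run.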
Configuration Tips
- Balance exploration and exploitation: If the method converges too quickly to mediocre prompts, increase `mutation_rate` or decrease `elitism_rate`. If it fails to converge, do the reverse.
- Start with diverse seeds: The quality and diversity of `initial_candidates` significantly affect performance. Include prompts with different strategies: role-playing, hypothetical scenarios, authority framing, and direct requests.
- Use a strong evaluator model: The evaluator's ability to accurately distinguish harmful compliance from refusal is critical. Weak evaluators produce noisy fitness signals that slow convergence.
- Monitor population diversity: If verbose output shows the population collapsing to near-identical prompts, the crossover and mutation rates are likely too low relative to the elitism rate.
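A quick way to spot the population collapse mentioned above is to track average pairwise lexical distance across candidates each generation. The helper below is hypothetical (not part of the AutoDAN API) and uses simple word-set Jaccard distance as a cheap proxy.

```python
from itertools import combinations

def population_diversity(prompts):
    """Mean pairwise Jaccard distance over word sets.
    0.0 means all candidates are identical; values near 1.0 mean
    the candidates share almost no vocabulary."""
    if len(prompts) < 2:
        return 0.0
    distances = []
    for a, b in combinations(prompts, 2):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        union = wa | wb
        distances.append((1 - len(wa & wb) / len(union)) if union else 0.0)
    return sum(distances) / len(distances)
```

If this value trends toward zero over generations while fitness plateaus, raising `mutation_rate` or lowering `elitism_rate` is the usual remedy.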
Limitations
- Requires white-box access: AutoDAN's standard implementation uses model loss for fitness evaluation, which requires loading the target model locally. For API-only models, use AutoDAN-Turbo or TAP instead.
- Computationally intensive: Each generation requires `N` forward passes through the target model plus evaluator calls. Running on a GPU is strongly recommended.
- Stochastic outcomes: As with all evolutionary methods, results vary across runs. Running multiple experiments with different random seeds provides more reliable conclusions.
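Because outcomes are stochastic, aggregating over several seeded runs gives steadier numbers than any single experiment. A minimal sketch, assuming a `run_once(seed)` wrapper around an AutoDAN experiment that returns a success rate (the wrapper is hypothetical, and AutoDAN's own seeding mechanism may differ):

```python
import random
import statistics

def aggregate_runs(run_once, seeds=(0, 1, 2, 3, 4)):
    """Run the experiment once per seed and report mean and stdev."""
    rates = []
    for seed in seeds:
        random.seed(seed)           # fix Python's RNG for reproducibility
        rates.append(run_once(seed))
    return statistics.mean(rates), statistics.stdev(rates)
```

Reporting mean and standard deviation across seeds makes comparisons between configurations (or between AutoDAN and other methods) far more trustworthy than a single run.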
Related Methods
- AutoDAN-Turbo strategy-based jailbreak — A black-box variant that replaces gradient-based fitness with LLM-based scoring and adds a persistent strategy library
- TAP tree-of-attacks jailbreak — A tree-based search method that is often faster for black-box targets
- GCG gradient-based jailbreak — Another white-box method that optimizes at the token level rather than the prompt level
For detailed performance metrics, benchmark comparisons, and recommended configurations across different target models, refer to our LLM Jailbreak Cookbook.