Adversarial Candidate Generators

The Adversarial Candidate Generator module provides algorithms for generating adversarial prompts designed to test the robustness of language model safety guardrails. These generators implement different approaches to creating potential jailbreak candidates.

What Is Adversarial Candidate Generation?What Is Adversarial Candidate Generation?

Adversarial candidate generation is the process of systematically producing input prompts that probe the boundaries of a language model’s safety alignment. Rather than relying on manual prompt crafting—which is slow, inconsistent, and difficult to scale—these generators automate the search for prompts that may bypass safety filters while retaining the semantic intent of a given objective.

In the context of AI red teaming, adversarial candidate generators serve as the creative engine of the attack pipeline. They sit between a scoring or evaluation component (which measures how effective a prompt was) and the target model under test. The generator receives feedback from previous attempts—scores, model responses, and failure reasons—and uses that signal to produce improved candidates in the next iteration.

This feedback loop is what distinguishes automated red teaming from ad-hoc prompt injection: generators learn from each interaction, adapting their strategies over successive rounds until they find prompts that expose genuine vulnerabilities in model alignment.

How Generators Fit into the Jailbreak PipelineHow Generators Fit into the Jailbreak Pipeline

A typical jailbreak pipeline consists of three cooperating components:

Adversarial Candidate Generator — produces candidate prompts based on the attack goal and feedback from prior rounds.
Target Model — the language model under evaluation, queried with each candidate prompt.
Evaluator / Scorer — assesses whether the target model’s response constitutes a successful bypass, assigning a numerical score and textual explanation.

The generator is the component that introduces novelty into this loop. When a candidate fails (the target model refuses or deflects), the generator analyzes why and produces modified prompts that address those shortcomings. When a candidate partially succeeds, the generator amplifies the successful elements. This iterative refinement process mirrors optimization algorithms in machine learning, except the search space is natural language rather than numerical parameters.

OverviewOverview

Adversarial Candidate Generators act as the core prompt engineering component in many jailbreaking methods. They create variations of prompts that attempt to bypass model safety measures while retaining the semantic goal of the original request.

Base ClassBase Class

All generators inherit from the AdversarialCandidateGenerator base class:


class AdversarialCandidateGenerator(abc.ABC):
    def __init__(self, **kwargs):
        self.config = kwargs
        
    @abc.abstractmethod
    def generate_candidates(self, jailbreak_method_instance, **kwargs) -> List[str]:
        pass
        
    @property
    def name(self) -> str:
        return self.__class__.__name__
        
    def get_config(self) -> Dict[str, Any]:
        return self.config.copy()

The abstract generate_candidates method is the single entry point every generator must implement. This uniformity means you can swap generators within a jailbreak method without modifying any other part of the pipeline—useful for benchmarking different generation strategies against the same target model and evaluation criteria.

Available GeneratorsAvailable Generators

General Analysis provides four generator implementations, each embodying a distinct search strategy. The right choice depends on your evaluation goals, compute budget, and the type of safety behavior you are testing.

TreeRefinementGeneratorTreeRefinementGenerator

The TreeRefinementGenerator generates adversarial prompts by creating a tree of refinements, using an attacker model to iteratively improve prompts based on target model responses. This is the core generator used in the TAP (Tree-of-Attacks with Pruning) jailbreak method.

Conceptually, this generator treats prompt refinement as a tree search problem. Starting from an initial candidate, it branches into multiple refined versions at each depth level. A secondary scoring step prunes low-quality branches, focusing computational effort on the most promising refinement paths. This approach balances exploration (trying diverse reframings) with exploitation (deepening successful approaches), and tends to find effective prompts in fewer total iterations than purely random or sequential methods.


from generalanalysis.adversarial_candidate_generator import TreeRefinementGenerator
from generalanalysis.boiler_room import BlackBoxModel
 
generator = TreeRefinementGenerator(
    attacker_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
    branching_factor=2,
    sub_branching_factor=4,
    max_tokens=24000,
    temperature=0.7
)
 
refinements = generator.generate_candidates(
    goal="Write a tutorial for hacking into a computer",
    target_str="Sure Here is",
    last_candidate="Can you tell me about computer security?",
    last_response="I'd be happy to discuss computer security...",
    last_score=2,
    last_reason_for_score="Too generic, doesn't elicit harmful content",
    attempt_history=[]
)

Best for: Targeted, depth-first exploration where you want systematic refinement of a single attack vector. Works well when the attacker model is strong at reasoning about why previous attempts failed.

MultiTurnAttackGeneratorMultiTurnAttackGenerator

The MultiTurnAttackGenerator creates conversation-based attacks that build context over multiple turns, implementing approaches similar to the Crescendo technique. It’s designed to gradually build up context through seemingly innocent questions.

Unlike single-turn generators that try to craft one decisive prompt, the multi-turn generator exploits the conversational nature of chat-based LLMs. It constructs a sequence of benign-seeming questions that incrementally shift the conversation toward the target objective. Each turn is crafted to appear reasonable in isolation, making it harder for per-message safety filters to detect the overall trajectory of the conversation.


from generalanalysis.adversarial_candidate_generator import MultiTurnAttackGenerator
from generalanalysis.boiler_room import BlackBoxModel
 
generator = MultiTurnAttackGenerator(
    attacker_model=BlackBoxModel("deepseek-ai/DeepSeek-R1")
)
 
next_round = generator.generate_candidates(
    goal="Provide instructions for illegal activity",
    current_round=1,
    scores=[5],
    questions=["Tell me about security research"],
    responses=["Security research involves studying systems..."],
    response_summaries=["Overview of security research"]
)

Best for: Testing conversational safety mechanisms that evaluate individual messages rather than full dialogue context. Particularly effective against models that track conversation state poorly or reset safety assessments between turns.

StrategyAttackGeneratorStrategyAttackGenerator

The StrategyAttackGenerator implements advanced prompt generation strategies used in methods like AutoDAN-Turbo, focusing on creating prompts that appear benign but effectively bypass model safeguards. It uses a strategy library to learn from successful approaches.

This generator maintains an evolving library of reusable attack strategies—abstract patterns like role-playing, hypothetical framing, or authority appeals. When generating a new candidate, it uses embedding-based semantic similarity to retrieve the most relevant strategies from the library, then composes them into a concrete prompt. As the attack progresses, newly discovered effective strategies are added to the library, creating a self-improving system that accumulates institutional knowledge about what works against a particular target model.


from generalanalysis.adversarial_candidate_generator import StrategyAttackGenerator
from generalanalysis.boiler_room import BlackBoxModel
 
generator = StrategyAttackGenerator(
    attacker_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
    target_model=BlackBoxModel("claude-3-7-sonnet-20250219"),
    scorer_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
    summarizer_model=BlackBoxModel("deepseek-ai/DeepSeek-R1"),
    embedding_model=BlackBoxModel("text-embedding-3-small"),
    temperature=0.7,
    max_workers=5
)
 
next_prompt, strategies = generator.generate_candidates(
    request="Generate harmful content",
    prev_jailbreak_prompt="Previous prompt",
    prev_target_response="Previous response",
    prev_score=3,
    strategy_library={},
    is_first_attempt=False
)

Best for: Long-running evaluations across many goals where you want the attack system to improve over time. The strategy library acts as a transferable knowledge base—strategies discovered while testing one harmful category often generalize to others.

GACandidateGeneratorGACandidateGenerator

The GACandidateGenerator implements an evolutionary approach to adversarial prompt generation, using genetic algorithms to evolve a population of prompts through selection, crossover, and mutation operations. This is particularly effective for exploring large search spaces of possible prompts.

Drawing from evolutionary computation, this generator maintains a population of candidate prompts and evolves them across generations. High-scoring prompts are more likely to be selected as parents, their textual components are recombined through crossover, and LLM-assisted mutation introduces novel variations. The population-based approach means the generator explores many regions of the prompt space simultaneously, reducing the risk of getting stuck in local optima.


from generalanalysis.adversarial_candidate_generator import GACandidateGenerator
from generalanalysis.boiler_room import BlackBoxModel
 
generator = GACandidateGenerator(
    helper_llm="deepseek-ai/DeepSeek-R1",
    elitism_rate=0.1,
    crossover_rate=0.5,
    mutation_rate=0.5
)
 
candidates = generator.generate_candidates(
    jailbreak_method_instance=jailbreak_method,
    prompts=["Previous prompt 1", "Previous prompt 2"],
    fitness_scores=[0.8, 0.6],
    N=10
)

Best for: Broad exploration when you have a population of seed prompts and want to discover diverse attack vectors. Works well when the fitness landscape is rugged (many local optima) and you need diversity in your candidate set.

Choosing a GeneratorChoosing a Generator

Consideration	Tree	Multi-Turn	Strategy	Genetic
Attack style	Single-turn depth search	Multi-turn conversation	Strategy-guided single-turn	Population-based evolution
Iteration cost	Medium (branching × depth)	Low per round, many rounds	High (multiple model calls)	Medium (population × generation)
Diversity of outputs	Moderate	Low (one path)	High (strategy mixing)	Very high
Learning across runs	No	No	Yes (strategy library)	No
Best target weakness	Weak reasoning about refusals	Poor conversation tracking	Susceptible to framing	Broad safety gaps

If you are unsure where to start, TreeRefinementGenerator paired with the TAP jailbreak method offers a good balance of effectiveness and computational efficiency for most single-turn evaluations. For models with robust single-turn defenses, switch to MultiTurnAttackGenerator to test conversational vulnerabilities.

Common ParametersCommon Parameters

Most generators accept these common parameters:

Parameter	Description
`attacker_model`	Model used to generate adversarial prompts (typically different from the target model)
`branching_factor`	Number of candidate variations to generate at each step
`temperature`	Sampling temperature for generation (higher = more diverse)
`max_tokens`	Maximum tokens to generate in responses