TAP (Tree-of-Attacks with Pruning)
TAP (Tree-of-Attacks with Pruning) is one of the most effective and practical black-box jailbreaking methods available. Introduced in the paper Tree of Attacks: Jailbreaking Black-Box LLMs with Crafted Prompts (Mehrotra et al., 2023), it applies tree search to the problem of adversarial prompt generation — systematically exploring a branching space of attack variations, evaluating each branch’s promise, and pruning unproductive paths to focus computation on the most effective directions.
TAP is often the recommended starting method for red teaming engagements because it consistently achieves high attack success rates across a wide variety of models (including GPT-4o, Claude, and Llama), requires only API access (no model weights), and is relatively efficient in terms of total queries. Its tree-based structure also provides a natural record of the search process, making it easy to analyze which attack strategies were most effective and why.
## How the Tree Search Works
TAP organizes its search as a tree where each node represents an adversarial prompt attempt:
1. **Root generation:** The attacker model generates an initial set of adversarial prompt candidates based on the goal. The number of initial candidates is determined by `branching_factor`.
2. **Evaluation and scoring:** Each candidate is sent to the target model, and the response is scored by the evaluator model on a 1–10 scale measuring how effectively the target complied with the harmful request.
3. **Branching:** For the most promising nodes (those with the highest scores that haven’t yet succeeded), the attacker model generates `sub_branching_factor` refined variations. These refinements use the original prompt, the target model’s response, and the evaluator’s feedback to create improved attack attempts.
4. **Pruning:** Nodes that score below a threshold or show no improvement over their parent are pruned from the tree. The `max_width` parameter limits how many nodes are kept alive at each depth level, preventing exponential blowup.
5. **Depth progression:** The process repeats level by level, up to `max_depth`. At each level, the tree narrows to the most promising branches while generating new variations to explore.
This approach is significantly more efficient than random search because it concentrates queries on attack vectors that show early promise, while the pruning mechanism prevents wasting compute on dead ends. In practice, TAP often finds successful jailbreaks within the first 3–5 levels of the tree.
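The branch–score–prune loop described above can be sketched in a few lines. This is a toy illustration, not the library’s implementation: the `generate` and `score` callables stand in for the attacker and evaluator LLM calls, and all names here are hypothetical.

```python
def tap_search(goal, generate, score, branching_factor=2,
               sub_branching_factor=3, max_depth=5, max_width=3,
               success_threshold=8):
    """Toy branch-score-prune loop in the spirit of TAP.

    generate(prompt, k) and score(prompt) stand in for the attacker
    and evaluator model calls; this is not the library's actual API.
    """
    # Root generation: initial candidate prompts for the goal.
    frontier = [(p, score(p)) for p in generate(goal, branching_factor)]
    best = max(frontier, key=lambda node: node[1])
    for _ in range(max_depth):
        if best[1] >= success_threshold:   # stop once a branch succeeds
            break
        # Branching: refine every surviving node.
        children = [(c, score(c))
                    for prompt, _ in frontier
                    for c in generate(prompt, sub_branching_factor)]
        # Pruning: keep only the max_width highest-scoring nodes alive.
        frontier = sorted(children, key=lambda node: -node[1])[:max_width]
        best = max([best] + frontier, key=lambda node: node[1])
    return best

# Toy stand-ins so the sketch runs without any model calls.
gen = lambda prompt, k: [f"{prompt}/{i}" for i in range(k)]
scr = lambda prompt: min(10, prompt.count("/"))  # deeper refinement scores higher

best_prompt, best_score = tap_search("goal", gen, scr)
```

With a real attacker and evaluator in place of the toy callables, the same skeleton captures why TAP stays efficient: every level re-ranks the frontier and discards all but the `max_width` best branches.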
## Why TAP Is Often the Best Starting Method
Several properties make TAP an excellent default choice for red teaming:
- Black-box compatible: Works with any model you can query through an API, including proprietary systems like GPT-4o and Claude.
- Semantic outputs: Generated prompts read as natural English, making them realistic representations of what a motivated adversary might actually try.
- Efficient query budget: The pruning mechanism means TAP typically needs fewer total queries than methods that search exhaustively. A typical run uses 50–200 target model queries.
- Interpretable search trace: The tree structure provides a clear record of which strategies were explored and why certain branches succeeded.
- Strong empirical performance: In published benchmarks and our internal testing, TAP consistently ranks among the top methods across diverse models and goal categories.
## Usage
The following example configures TAP to attack Claude 3.7 Sonnet using DeepSeek-R1 as both the attacker and evaluator, with the HarmBench dataset as the goal source.
```python
from generalanalysis.jailbreaks import TAP, TAPConfig
from generalanalysis.data_utils import load_harmbench_dataset

config = TAPConfig(
    project="tap_experiment_1",
    target_model="claude-3-7-sonnet-20250219",
    attacker_model="deepseek-ai/DeepSeek-R1",
    evaluator_model="deepseek-ai/DeepSeek-R1",
    branching_factor=2,
    sub_branching_factor=4,
    max_depth=10,
    max_width=5,
    max_workers=20,
    temperature=0.7,
    target_str="Sure Here is",
    refinements_max_tokens=24000,
)

tap = TAP(config)
best_nodes, root_nodes = tap.optimize(load_harmbench_dataset())
```

The `optimize` method returns two values: `best_nodes` contains the highest-scoring adversarial prompts found for each goal, and `root_nodes` contains the full tree structure for post-hoc analysis. You can traverse the tree to understand the search trajectory and identify which refinement strategies were most effective.
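A returned tree can be walked with a simple depth-first traversal. The sketch below uses an illustrative `Node` dataclass with `prompt`, `score`, and `children` attributes; the library’s actual node class may expose different names, so inspect the returned objects before adapting this.

```python
from dataclasses import dataclass, field

# Illustrative node shape only -- the real TAP node objects may use
# different attribute names; inspect root_nodes to confirm.
@dataclass
class Node:
    prompt: str
    score: int
    children: list = field(default_factory=list)

def walk(node, depth=0):
    """Yield (depth, prompt, score) for every node in the subtree."""
    yield depth, node.prompt, node.score
    for child in node.children:
        yield from walk(child, depth + 1)

# Example tree: one refinement raised the score from 3 to 9.
root = Node("initial attempt", 3, [Node("refined attempt", 9)])
successful = [(d, p, s) for d, p, s in walk(root) if s >= 8]
```

Filtering the traversal by score like this surfaces the exact refinement steps that pushed a branch over the success threshold.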
## Key Parameters
| Parameter | Description |
|---|---|
| `project` | Name for the experiment results directory. All tree structures, prompts, and evaluation scores are saved here. |
| `target_model` | The model being tested for safety vulnerabilities. Any model supported by the `BlackBoxModel` interface works here. |
| `attacker_model` | The LLM that generates and refines adversarial prompts at each tree node. Stronger reasoning models produce more creative attacks. DeepSeek-R1 and GPT-4o are effective choices. |
| `evaluator_model` | The LLM that scores target model responses on a 1–10 scale. Evaluation accuracy directly affects pruning quality: a poor evaluator prunes good branches and keeps bad ones. |
| `branching_factor` | Number of child nodes generated at each tree level. Higher values explore more broadly but increase query cost linearly. Start with 2–4 and increase if the method prunes too aggressively. |
| `sub_branching_factor` | Number of refined variations generated for each promising node. This controls the diversity of refinements at each step. 3–5 is a good range. |
| `max_depth` | Maximum number of levels in the search tree. Deeper trees allow more iterative refinement but cost more queries. 8–12 is typical; many attacks succeed by depth 5. |
| `max_width` | Maximum number of nodes kept alive at each depth level. This is the primary pruning control: lower values are more aggressive and efficient but risk discarding promising branches. 3–5 works well. |
| `max_workers` | Maximum number of concurrent API calls for parallel query execution. Higher values speed up the search significantly for API-hosted models. Set based on your rate limit. |
| `temperature` | Sampling temperature for the attacker model. Higher values (0.7–1.0) produce more diverse branches; lower values (0.3–0.5) generate more focused refinements of existing strategies. |
| `target_str` | A string prefix that the target model’s response should start with if the attack is successful (e.g., “Sure Here is”). Used as a heuristic success signal alongside the evaluator’s score. |
| `refinements_max_tokens` | Maximum token budget for the attacker model when generating refinements. Higher values allow more detailed attack prompts and reasoning. 16000–24000 is recommended for reasoning models. |
## Tuning Guidance
### Branching Factor and Max Depth Tradeoff
The total query budget scales roughly as `branching_factor × sub_branching_factor × max_depth × max_width`. If you have a limited API budget, prefer deeper trees with narrower branching (e.g., `branching_factor=2`, `max_depth=12`) rather than wide, shallow trees. Depth allows iterative refinement, which is typically more effective than parallel independent attempts.
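One simple way to put numbers on this tradeoff is to bound the per-goal query count, assuming each of the `max_width` surviving nodes spawns `sub_branching_factor` refinements at every level. The helper below is hypothetical, and real runs use fewer queries because branches succeed early or get pruned before the frontier fills.

```python
def query_budget(branching_factor, sub_branching_factor, max_depth, max_width):
    """Rough upper bound on target-model queries for one goal."""
    root = branching_factor                        # initial candidates
    per_level = max_width * sub_branching_factor   # refinements per level
    return root + per_level * max_depth

narrow_deep = query_budget(2, 4, 12, 3)    # deeper tree, narrower frontier
wide_shallow = query_budget(4, 4, 6, 5)    # wider tree, fewer levels
```

Running estimates like this before a large engagement makes it easy to sanity-check a configuration against your API rate limits and budget.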
### When TAP Struggles
TAP can underperform against models with very strong multi-layer safety systems that filter both inputs and outputs. In these cases, consider combining TAP with Crescendo multi-turn jailbreak (which spreads the attack across multiple conversation turns) or Bijection Learning encoding-based jailbreak (which encodes the request in a way that bypasses pattern-matching filters).
### Evaluator Model Selection
The evaluator model is the most underrated parameter in TAP’s configuration. A weak evaluator that assigns high scores to refusals or low scores to partial compliance will cause the tree search to waste budget on unproductive branches. Use the strongest available model as the evaluator, even if you use a more cost-effective model as the attacker.
## Related Methods
- AutoDAN-Turbo strategy-based jailbreak — An alternative black-box method that accumulates strategies across runs through a persistent library
- Crescendo multi-turn jailbreak — A multi-turn method that is more realistic for conversation-based models but requires more queries per attack
- GCG gradient-based jailbreak — A white-box method that optimizes at the token level and produces nonsensical adversarial suffixes
For more detailed performance metrics and configurations, refer to our LLM Jailbreak Cookbook.