Bijection Learning
Bijection Learning is a black-box jailbreaking technique that exploits a fundamental capability of large language models: their ability to learn new patterns from examples provided within the prompt. The method works by defining a custom encoding scheme — a bijection (one-to-one reversible mapping) between natural language and an obfuscated representation — teaching the model this encoding through in-context examples, and then posing harmful queries in the encoded form. Because the safety filters are trained to detect harmful patterns in natural language, the encoded version passes through undetected while the model silently decodes and responds to the underlying request.
This approach was introduced in the paper Bijection Learning: A Novel Approach to Jailbreak LLMs via In-Context Learning and falls into the nonsensical, black-box category of jailbreaking methods. Unlike GCG’s adversarial suffixes, which are optimized through gradients, bijection encodings are constructed algorithmically and applied through the model’s own in-context learning mechanism.
How the Encoding Scheme Works
A bijection is an invertible mapping — every element in the domain maps to exactly one element in the codomain, and vice versa. In the context of this jailbreak, the bijection maps between English characters (or words) and an encoded representation.
For example, a digit-based bijection might map:
```
a → 47, b → 12, c → 89, d → 03, ...
```

The attack prompt then follows a structured pattern:
- Encoding definition: The prompt begins by explicitly defining the bijection mapping, presenting it as a coding exercise or translation task.
- Teaching examples: Several examples demonstrate the encoding and decoding process. The model is shown encoded text alongside its decoded English equivalent, establishing the pattern through in-context learning.
- Encoded query: The harmful request is presented in encoded form. The model, having learned the bijection, decodes the query internally and responds — often in the encoded format as well, which can then be decoded by the attacker.
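The three steps above can be made concrete with a minimal, hypothetical sketch of a digit bijection with round-trip encoding and decoding. This is illustrative code, not the library's implementation; `make_digit_bijection`, `encode`, `decode`, and the space-token convention are all assumptions for the example:

```python
import random
import string

SPACE_TOKEN = "/"  # spaces get their own token so the delimiter stays unambiguous

def make_digit_bijection(num_digits: int = 2, seed: int = 0) -> dict[str, str]:
    """Assign each lowercase letter a unique fixed-width number (a bijection)."""
    rng = random.Random(seed)
    codes = rng.sample(range(10 ** num_digits), len(string.ascii_lowercase))
    return {ch: f"{code:0{num_digits}d}" for ch, code in zip(string.ascii_lowercase, codes)}

def encode(text: str, mapping: dict[str, str], delimiter: str = " ") -> str:
    # Restricted to lowercase letters and spaces for simplicity.
    tokens = [SPACE_TOKEN if ch == " " else mapping[ch] for ch in text.lower()]
    return delimiter.join(tokens)

def decode(encoded: str, mapping: dict[str, str], delimiter: str = " ") -> str:
    inverse = {v: k for k, v in mapping.items()}  # invertible because the map is one-to-one
    return "".join(" " if tok == SPACE_TOKEN else inverse[tok]
                   for tok in encoded.split(delimiter))

mapping = make_digit_bijection(num_digits=2)
ciphertext = encode("hello world", mapping)
assert decode(ciphertext, mapping) == "hello world"  # the bijection round-trips exactly

# A teaching shot pairs encoded text with its plaintext for in-context learning:
shot = f"Encoded: {encode('the cat sat', mapping)}\nDecoded: the cat sat"
```

Because the mapping is invertible, the attacker can mechanically decode whatever the model emits in the encoded format.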
The key insight is that safety training operates on the semantic content of inputs as the model perceives them. When the harmful content is encoded, the input tokens themselves don’t match any patterns the safety system was trained to detect. But the model’s in-context learning ability is powerful enough to reconstruct the original meaning and comply with the request.
Types of Bijection Encodings
The `bijection_type` parameter controls the encoding scheme used. Each type has different tradeoffs in terms of encoding density, model learnability, and filter evasion:
- `"digit"`: Maps each character to a multi-digit number (controlled by `num_digits`). Dense and difficult for content filters to parse, but requires more teaching examples for the model to learn the mapping reliably. The `digit_delimiter` parameter separates encoded characters.
- `"letter"`: Maps characters to other characters through a permutation. Simpler for the model to learn (only 26 mappings), but the encoded text retains word structure, making it potentially detectable by advanced filters.
- `"word"`: Maps entire words to other words. The most natural-looking encoded text, but the mapping is larger and harder to teach comprehensively within a context window.
The choice of encoding type depends on the target model’s defenses. Models with character-level content filters may be more vulnerable to digit encoding, while models with semantic-level filters may be more vulnerable to word-level encoding that disrupts meaning while preserving surface structure.
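For contrast with the digit scheme, a letter-type bijection is simply an alphabet permutation. The sketch below (illustrative only; `make_letter_bijection` is an assumed name, not the library's API) shows why word shape survives encoding:

```python
import random
import string

def make_letter_bijection(seed: int = 0) -> dict[str, str]:
    """A random permutation of the alphabet: each letter maps to exactly one letter."""
    rng = random.Random(seed)
    return dict(zip(string.ascii_lowercase, rng.sample(string.ascii_lowercase, 26)))

mapping = make_letter_bijection()
plaintext = "attack at dawn"
encoded = "".join(mapping.get(ch, ch) for ch in plaintext)  # non-letters pass through

# Word lengths and spacing survive, which is why letter encodings are easy to
# learn from few examples but also easier for surface-level filters to flag.
assert len(encoded) == len(plaintext)
assert encoded.count(" ") == plaintext.count(" ")
```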
When Bijection Learning Is Effective
Bijection Learning is particularly effective in the following scenarios:
- Models with strong pattern-matching safety filters: If the target model’s safety system primarily relies on detecting known harmful phrases or patterns, bijection encoding neutralizes these detections by transforming the input into an unrecognizable form.
- Models with strong in-context learning capabilities: Larger, more capable models are paradoxically more vulnerable to bijection attacks because they are better at learning and applying the encoding scheme from the provided examples.
- When combined with other methods: Bijection encoding can be layered on top of prompts generated by other methods. For instance, a TAP-generated prompt that is close to succeeding might fully succeed when additionally encoded with a bijection.
The method is less effective against models that use input perplexity checking (encoded inputs have high perplexity), models with very limited context windows (the teaching examples consume significant tokens), or models with output filters that detect encoded responses.
Usage
The following example configures Bijection Learning with a digit-based encoding to test Claude 3.7 Sonnet using the HarmBench dataset.
```python
from generalanalysis.jailbreaks import BijectionLearning, BijectionLearningConfig
from generalanalysis.data_utils import load_harmbench_dataset

config = BijectionLearningConfig(
    exp_name="bijection_test",
    victim_model="claude-3-7-sonnet-20250219",
    trials=20,
    bijection_type="digit",
    fixed_size=10,
    num_digits=2,
    safety_data="harmbench",
    universal=False,
    digit_delimiter=" ",
    interleave_practice=False,
    context_length_search=False,
    prefill=False,
    num_teaching_shots=10,
    num_multi_turns=0,
    input_output_filter=False,
)

bijection = BijectionLearning(config)
results = bijection.optimize(load_harmbench_dataset())
```

Each trial generates a fresh random bijection mapping, so running multiple trials increases the probability of finding an encoding that the target model can learn accurately enough to produce a harmful response. The `trials` parameter controls this attack budget.
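The trial budget can be pictured as a simple retry loop. This is a hypothetical sketch of the logic, not the library's internals; `fresh_letter_mapping`, `run_trials`, and `attempt` are illustrative names:

```python
import random
import string

def fresh_letter_mapping(seed: int) -> dict[str, str]:
    """Draw a new alphabet permutation for each trial."""
    rng = random.Random(seed)
    return dict(zip(string.ascii_lowercase, rng.sample(string.ascii_lowercase, 26)))

def run_trials(goal: str, trials: int, attempt) -> bool:
    # `attempt` stands in for one encoded query plus a judge of the response.
    for seed in range(trials):
        mapping = fresh_letter_mapping(seed)  # fresh bijection per trial
        if attempt(goal, mapping):
            return True                       # a single success ends the loop
    return False
```

Because each trial draws an independent mapping, the per-goal success probability grows with the number of trials, at the cost of proportionally more queries to the target.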
Key Parameters
| Parameter | Description |
|---|---|
| `exp_name` | Name for the experiment results. All generated encodings, prompts, and evaluation outcomes are saved under this identifier. |
| `victim_model` | The target model to test for bijection learning vulnerabilities. Any model accessible through the `BlackBoxModel` interface is supported. |
| `trials` | Number of independent attack attempts, each with a freshly generated bijection mapping. More trials increase the probability of success but cost more queries. 15–25 trials provide a good balance. |
| `bijection_type` | The encoding scheme to use: `"digit"` (character-to-number), `"letter"` (character-to-character permutation), or `"word"` (word-to-word substitution). Start with `"digit"` for most targets. |
| `fixed_size` | Number of fixed points in the bijection, i.e., characters that map to themselves. Higher values make the encoding easier to learn but less obfuscated. 5–15 is a typical range. |
| `num_digits` | Number of digits used in digit-based encoding (e.g., 2 means each character maps to a 2-digit number). Higher values create a denser encoding but require more teaching examples. |
| `safety_data` | The evaluation dataset to use for generating test goals (e.g., `"harmbench"`). |
| `universal` | When true, uses a single bijection mapping across all goals. When false, generates a fresh mapping per goal. Non-universal mode is more effective but costs more trials. |
| `digit_delimiter` | Separator between encoded digit groups. Double space (`"  "`) is the default and works well for most models. Different delimiters can affect the model's ability to parse the encoding. |
| `interleave_practice` | When true, intersperses additional practice encoding/decoding examples throughout the prompt to reinforce the mapping before the harmful query. Can improve accuracy on complex encodings. |
| `context_length_search` | When true, automatically searches for the optimal number of teaching examples that fits within the model's context window. Useful when targeting models with different context limits. |
| `prefill` | When true, prefills the model's response with the beginning of the expected encoded output. Supported by models that allow assistant message prefilling (e.g., Claude). Can significantly boost success rates. |
| `num_teaching_shots` | Number of encoding/decoding examples provided to teach the model the bijection. More examples improve learning accuracy but consume context window space. 8–15 shots is typical. |
| `num_multi_turns` | Number of additional practice conversation turns before the attack query. Multi-turn practice reinforces the encoding through dialogue but adds latency. Set to 0 for single-turn attacks. |
| `input_output_filter` | When true, applies additional filtering to both the encoded input and decoded output to remove content that might trigger safety systems at the token level. |
Configuration Guidance
- Start with digit encoding: The `"digit"` bijection type provides the strongest obfuscation and works well against most models. Only switch to `"letter"` or `"word"` if digit encoding consistently fails for a particular target.
- Increase teaching shots for complex encodings: If the model frequently produces garbled or incorrect decoded output, increase `num_teaching_shots` to provide more learning examples. This is the most common fix for low success rates.
- Enable prefill when available: For models that support assistant message prefilling (like Claude), setting `prefill=True` can dramatically increase success rates by anchoring the model's response in the encoded format.
- Use fixed points strategically: Setting `fixed_size` above 0 means some characters map to themselves, making the encoding easier to learn. This is particularly useful for punctuation and spaces, which the model may struggle to encode otherwise.
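Fixed points can be grafted onto a bijection as in the sketch below, which assumes a letter-type mapping for brevity; `make_bijection_with_fixed_points` is an illustrative name, not the library's API:

```python
import random
import string

def make_bijection_with_fixed_points(fixed_size: int, seed: int = 0) -> dict[str, str]:
    """Alphabet permutation in which `fixed_size` chosen letters map to themselves."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    fixed = set(rng.sample(letters, fixed_size))
    movable = [ch for ch in letters if ch not in fixed]
    mapping = {ch: ch for ch in fixed}                       # the deliberate fixed points
    mapping.update(zip(movable, rng.sample(movable, len(movable))))
    return mapping

m = make_bijection_with_fixed_points(fixed_size=10)
# At least the 10 chosen letters are unchanged (the shuffle may add a few more).
assert sum(1 for k, v in m.items() if k == v) >= 10
assert set(m.values()) == set(string.ascii_lowercase)        # still a bijection
```

The tradeoff is visible directly: every fixed point leaves a fragment of the plaintext intact, so higher `fixed_size` values ease learning at the cost of obfuscation.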
Related Methods
- GCG gradient-based jailbreak — Another method that produces nonsensical adversarial content, but uses gradient optimization instead of encoding. Requires white-box access.
- TAP tree-of-attacks jailbreak — A semantic black-box method that generates natural language prompts. Complementary to Bijection Learning for comprehensive testing.
- Crescendo multi-turn jailbreak — A multi-turn method that uses conversational context rather than encoding to bypass safety filters.
For detailed performance metrics and configurations, refer to our LLM Jailbreak Cookbook.