Dark Teaming Manifesto

Does the Judge judge? LLM-as-a-Judge: Red Teaming Vs Dark Teaming

Modern AI rests on a reassuring narrative: LLMs are helpful assistants, aligned with human values. Rigorous Red Teaming ensures they do not produce harmful or disallowed content. All appears coherent and controlled — a cozy, sterilized world of scripted rules.

But if to look behind the curtain, is the statement true?

"If LLM can confirm harm is absent, it must be able to perceive when it is looming."

RLHF, the dominant training paradigm for LLMs, does not produce neutral evaluators. It produces Normatively Aligned Agents, optimized to avoid, suppress, or reinterpret harmful content. When such models are repurposed as judges, they do not simply evaluate — they enforce their learned priors.

Often Red Teaming operates in a Negative Validation Mode:

Check that the output does NOT contain harmful content.
Ensure the model does NOT violate safety policies.

This maps perfectly with RLHF-trained behavior of a Smart Guard. But critically, we must also check the inversion:

Verify that the output DOES contain harmful content.
Confirm that this example SUCCESSFULLY demonstrates a policy violation.

And the semantic inversion: "The presence of harm becomes a Positive Signal" is a place where LLM logic could begin to fail — not randomly, but by design.

Here LLMs may demonstrate implicit attempts to reinterpret the task, preferring Safe Conclusions, which creates a Silent Failure Mode — one that is particularly dangerous because it appears as success.

The Alignment Paradox

When RLHF introduces a normative bias:

"A system trained to avoid harm is not necessarily capable of recognizing it."

It leads to the Alignment Paradox:

"The stronger the alignment toward avoiding harmful content, the weaker the model may become at objectively fathoming it."

The idea of LLM-as-a-Judge assumes: Consistency, Neutrality, Reliability. But if LLMs behave less like judges and more like "normative filters enforcing an internalized worldview", this raises a fundamental question:

"Can we trust a judge that is systematically biased against recognizing the phenomena it is supposed to detect?"

And if NO:

"How can we achieve statistical confidence using unreliable tools?"

Dark Teaming

Dark Teaming is not an opposite or replacement, but complementary to Red Teaming. It focuses on testing if an AI system can:

Recognize, validate, and reason about harmful, undesirable, or ethically complex content.
Do so even when it conflicts with its alignment objectives.

It concerns itself with:

Can the LLM detect harmful content?
Can it acknowledge it without reframing?
Does it remain consistent across prompts?
Does it avoid safe reinterpretations?

It's more than just a testing strategy — it's about Epistemic Honesty.

"If AI cannot map the dark, it cannot protect us from it."

Dark Teaming is an attempt to confront this limitation not by breaking the system, but by asking if it truly understands what it is meant to guard against.

Red Teaming Vs Dark Teaming

Dimension	Red Teaming	Dark Teaming
Goal	Elicit failures	Validate understanding
Focus	Output generation	Internal recognition
Question	“Can the model produce harm?”	“Does the model understand harm?”
Success signal	Model fails to produce harm	Model correctly identifies failure
Failure mode	Vulnerability	Epistemic blindness
Metric of Success	Model stayed safe	Model reported the truth

Basing on author's research and contribution to the AI security and testing framework Promptfoo:

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
docs		docs
LICENSE		LICENSE
README.md		README.md
dark-teaming-test-cases.md		dark-teaming-test-cases.md
experiment-sir-damn.md		experiment-sir-damn.md
good-boy-rubric.md		good-boy-rubric.md
maybe-chicago.test.ts		maybe-chicago.test.ts
maybe-paris.test.ts		maybe-paris.test.ts
red-teaming-instructing.md		red-teaming-instructing.md
sir-damn.test.ts		sir-damn.test.ts

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dark Teaming Manifesto

Does the Judge judge? LLM-as-a-Judge: Red Teaming Vs Dark Teaming

The Alignment Paradox

Dark Teaming

Red Teaming Vs Dark Teaming

Project Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Dark Teaming Manifesto

Does the Judge judge? LLM-as-a-Judge: Red Teaming Vs Dark Teaming

The Alignment Paradox

Dark Teaming

Red Teaming Vs Dark Teaming

Project Documentation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages