Guardrails Overview

AI guardrails are safety classifiers that sit between your application and its users, evaluating every input and output against a defined set of policies before content reaches the end user or triggers downstream actions. Without guardrails, large language models can be manipulated through prompt injection, jailbreaks, and social engineering to produce harmful, biased, or policy-violating content. As AI systems move from prototypes to production workloads handling sensitive data, customer interactions, and autonomous tool use, guardrails become a non-negotiable layer of defense.

General Analysis (GA) Guard is a family of purpose-built safety classifiers designed for production AI systems. Unlike simple keyword filters or regex-based blocklists, GA Guard models are neural classifiers that understand semantic intent, making them resilient to paraphrasing, obfuscation, multilingual attacks, and other evasion tactics that defeat rule-based approaches. Every guard in the GA series is trained on adversarial data drawn from real-world red teaming campaigns, ensuring that detection performance holds up against the techniques attackers actually use.

Guardrail typesGuardrail types

Our safety stack ships in two guard families, both trained on the same adversarial pipelines we run for production deployments and both fully supported by the GA SDK:

Public guards – Open checkpoints published on Hugging Face that you can self-host, deploy through the GA platform, or invoke directly through the SDK.
Custom enterprise guards – Bespoke detectors we train against your policies, red-team data, and compliance thresholds, delivered through managed endpoints and the same SDK integration path.

Public and custom guards can be mixed within a single workflow, so teams often start with the open lineup, then layer custom rules as policy depth increases. This progressive approach lets you ship safety coverage on day one and refine it over time without rearchitecting your integration.

When to use public guardsWhen to use public guards

Public guards are the right starting point when you need broad, general-purpose safety coverage out of the box. They ship with a comprehensive policy taxonomy covering hate speech, PII detection, prompt injection, sexual content, violence, misinformation, and illicit activities. Because the checkpoints are openly available on Hugging Face, you can self-host them in your own infrastructure for full data residency control, or call them through the GA SDK for a managed experience with built-in logging and observability.

When to invest in custom guardsWhen to invest in custom guards

Custom enterprise guards become valuable when your organization has domain-specific policies that go beyond the public taxonomy. Financial services firms may need guards trained on regulatory language from FINRA or the SEC. Healthcare organizations may require classifiers tuned to HIPAA-sensitive content categories. E-commerce platforms may need brand-safety rules that distinguish between product discussions and policy-violating promotions. In each case, the custom training pipeline ingests your policy documents, historical moderation decisions, and red-team data to produce a classifier that enforces your rules with the same precision as the public guards.

Why teams rely on GA GuardWhy teams rely on GA Guard

Adversarially trained: Iterative red teaming, stress testing, and retraining cycles keep performance stable under distribution shifts.
Long-context native: Moderate agent traces, documents, and tool logs without sharding thanks to 256k token support.
Low noise: Classifiers maintain high precision so downstream automation, routing, and analytics stay trustworthy.
Deployment ready: Drop-in SDK clients and managed endpoints with consistent schemas across every guard, whether you’re using the open checkpoints or custom enterprise models.

These properties are especially important for teams building agentic systems where a single interaction may span dozens of tool calls, retrieval-augmented generation steps, and multi-turn conversations. A guard that can process the full 256k-token context window in a single pass eliminates the complexity and accuracy loss of chunking strategies.

How we harden guardrailsHow we harden guardrails

We train every guard in the GA series using the same adversarial pipeline that underpins our enterprise deployments:

Blend policy-driven synthetic datasets with real red-team captures so models generalize to novel jailbreak templates, translations, and obfuscation tactics.
Stress-test and retrain in iterative cycles, folding new attack traces into the corpus to minimize regressions.
Calibrate thresholds to hold both recall and precision, keeping false positives low enough for workflow automations.

Because public and custom guards share this pipeline, teams get consistent behavior across open checkpoints and bespoke detectors.

The adversarial training pipeline works in continuous cycles. Our red-team researchers generate novel attack vectors using techniques like GCG, AutoDAN, Crescendo, and bijection learning (all available through our AI Red Teaming platform). These attacks are run against the current guard models to identify detection gaps. The failure cases are then folded back into the training corpus, and the guards are retrained to close those gaps. This creates a feedback loop where each generation of attacks produces a stronger generation of defenses. The same pipeline that hardens our public guards also strengthens custom enterprise models, so improvements discovered during public research benefit all customers.

Coverage surfacesCoverage surfaces

Public guards ship with a canon policy taxonomy that maps decisions to frameworks such as NIST AI RMF, ISO/IEC 42001, and the EU AI Act (see the Public Guards page for the full breakdown). Custom enterprise guards extend that taxonomy with organization-specific rules while preserving the same evaluation schema for your dashboards and audits.

NIST AI Risk Management Framework (AI RMF)NIST AI Risk Management Framework (AI RMF)

The NIST AI RMF provides a structured approach to identifying, assessing, and mitigating risks in AI systems. GA Guard’s policy labels map directly to the framework’s risk categories, making it straightforward to demonstrate that your AI system has controls in place for content safety, bias mitigation, and harmful output prevention. Audit teams can trace each moderation decision back to a specific NIST control, simplifying compliance reporting.

ISO/IEC 42001ISO/IEC 42001

ISO/IEC 42001 is the international standard for AI management systems, establishing requirements for organizations that develop, provide, or use AI. GA Guard’s structured policy taxonomy and audit logging provide the technical evidence that ISO/IEC 42001 auditors look for: documented moderation policies, consistent enforcement, and traceable decision records. The SDK’s built-in log export makes it easy to feed moderation data into your existing compliance workflows.

EU AI ActEU AI Act

The EU AI Act classifies AI systems by risk level and imposes graduated obligations on providers and deployers. High-risk AI systems must implement robust risk management, maintain detailed documentation, and ensure human oversight. GA Guard helps meet these obligations by providing deterministic policy enforcement with full audit trails. The granular policy labels let you demonstrate that your system actively monitors for and mitigates the specific risks identified in your conformity assessment.

Need deeper coverage or localized policies? We tune thresholds and add bespoke labels through the same adversarial training pipeline, so public and custom guards remain interoperable.

Enterprise customizationEnterprise customization

When you need coverage beyond the public lineup, we train bespoke guards on your policy language, red-team traces, and historical incidents:

Translate written policies into machine-enforceable labels, including nuanced allow/deny edge cases.
Fold in proprietary datasets under strict privacy controls so the guard reflects real user behavior.
Deliver managed endpoints and evaluation reports that align to frameworks like NIST AI RMF, ISO/IEC 42001, and the EU AI Act for audit readiness.

Custom and public guards share SDK semantics, so swapping IDs is all it takes to roll out new coverage.

The customization process typically begins with a policy workshop where our team reviews your existing content policies, acceptable use agreements, and regulatory requirements. We convert these documents into a structured label set, generate targeted training data, and run adversarial stress tests against your specific risk surface. The resulting guard model is deployed to a dedicated endpoint and validated against your test suite before going live. Ongoing monitoring and periodic retraining ensure the guard stays aligned as your policies evolve.

Implementation pathImplementation path

Start with the AI guardrails public guard lineup to explore the open checkpoints and download weights.
Walk through the AI Guardrails SDK guide to wire guard evaluations into your stack.
Pair AI guardrails with MCP Guard or AI Red Teaming for layered defense across tooling and monitoring.
Read our conceptual guides for background: What are AI guardrails? covers the taxonomy and architecture, while Best AI guardrails in 2026 compares tools and benchmarks across the industry.

For teams building agentic AI applications, we recommend a layered approach: use GA Guard for content-level moderation, MCP Guard for tool-call-level protection, and the AI Red Teaming platform for continuous adversarial testing. This defense-in-depth strategy ensures that no single point of failure can compromise your safety posture.