Inspiration: The Enterprise AI Bottleneck

We all want to deploy generative AI, but every enterprise faces the exact same critical roadblock: hallucinations.

When a Large Language Model (LLM) powers your customer-facing applications, a single hallucination about a return policy or warranty is not just a bug—it is a legal liability. We saw stories like the airline chatbot that hallucinated a fake bereavement discount, forcing the company to pay damages.

We realized that current Retrieval-Augmented Generation (RAG) systems operate on a "retrieve-then-hope" paradigm. We needed something different: Verify-then-Deploy. We were inspired to build DeployGuard, an autonomous CI/CD gatekeeper that uses a swarm of AI agents to rigorously fact-check LLM outputs against internal policies before they ever reach production.

What it does

DeployGuard acts as a zero-trust compliance gate in your CI/CD pipeline. Instead of just testing code, it tests the truthfulness of your AI model's outputs.

When a developer submits a new prompt or model update, DeployGuard intercepts the LLM's output and routes it through a multi-agent verification pipeline, which:

  1. Parses the generated text to extract legally binding claims.
  2. Retrieves the actual "ground truth" company policies from Elasticsearch.
  3. Compares the LLM's claims against the retrieved evidence.
  4. Calculates a deterministic Risk Score.

If the model hallucinates a policy that contradicts your Elasticsearch-indexed documents, DeployGuard immediately fails the deployment build.
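The four steps above can be sketched as a single gate function, with each agent injected as a callable. This is a minimal illustration, not our production orchestrator; the function names and the threshold of 40 are hypothetical:

```python
def check_deployment(llm_output, extractor, retriever, verifier, scorer, threshold=40):
    """Run the four-stage verification pipeline; return (passed, risk_score)."""
    claims = extractor(llm_output)                  # 1. extract binding claims
    verdicts = []
    for claim in claims:
        evidence = retriever(claim)                 # 2. fetch ground-truth policies
        verdicts.append(verifier(claim, evidence))  # 3. compare claim vs. evidence
    score = scorer(verdicts)                        # 4. deterministic risk score
    return score < threshold, score                 # False fails the build
```

A build passes only when the scored risk stays under the threshold; anything else fails the CI job.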

How we built it

DeployGuard is built on a highly modular, multi-agent architecture orchestrated in Python. We utilized Amazon Bedrock for our foundational LLMs and Elasticsearch as our central intelligence for ground-truth retrieval.

Our core orchestrator passes state between four specialized agents:

  • Claim Extractor Agent: Uses structured generation to parse raw LLM output and isolate discrete, testable factual statements.
  • Evidence Retriever Agent: Translates the extracted claims into highly optimized search queries. We utilized Elasticsearch to perform hybrid search—combining dense vector embeddings with BM25 lexical search—against a trusted document index containing warranties and privacy guidelines.
  • Verification Agent: An adversarial agent that cross-references the original claims with the Elastic evidence, flagging any contradictions, unsupported statements, or subtle distortions.
  • Risk Scorer Agent: Computes a final deployment safety metric using a mathematical penalty system.
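As a rough sketch of the Evidence Retriever's hybrid request: an Elasticsearch 8.x search can carry a BM25 `match` clause and a `knn` clause in the same body, with the two scores summed. The index and field names below (`policies`, `content`, `content_vector`) and the boosts are illustrative, not our exact schema:

```python
def hybrid_search_body(claim_text: str, claim_vector: list[float], k: int = 5) -> dict:
    """Build a hybrid (lexical + vector) Elasticsearch search body."""
    return {
        "query": {  # BM25 lexical leg: catches exact legal keywords
            "match": {"content": {"query": claim_text, "boost": 0.5}}
        },
        "knn": {    # dense-vector leg: catches semantic paraphrases
            "field": "content_vector",
            "query_vector": claim_vector,
            "k": k,
            "num_candidates": 50,
            "boost": 0.5,
        },
        "size": k,
    }

# Usage (assuming an `es = Elasticsearch(...)` client and an embedding function):
# resp = es.search(index="policies", **hybrid_search_body(text, embed(text)))
```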

To formalize the risk assessment, we implemented a weighted scoring algorithm in which contradictions carry heavier penalties than unsupported claims. The score is computed as:

$$\text{Risk Score} = \min \left( 100, \sum_{i=1}^{n} (W_{i} \times C_{i}) \right)$$

Where:

  • n = Total number of extracted claims
  • W_i = Penalty weight for claim i's verdict (Contradiction: 50, Unsupported: 20, Supported: 0)
  • C_i = Confidence the Verification Agent assigns to that verdict (0.0 to 1.0)
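In code, the scoring reduces to a few lines. A sketch assuming each verdict arrives as a `(status, confidence)` pair:

```python
PENALTIES = {"contradiction": 50, "unsupported": 20, "supported": 0}

def risk_score(verdicts: list[tuple[str, float]]) -> float:
    """Sum of penalty weight x confidence over all claims, capped at 100."""
    return min(100, sum(PENALTIES[status] * conf for status, conf in verdicts))

# One confident contradiction plus one shaky unsupported claim:
risk_score([("contradiction", 0.9), ("unsupported", 0.5)])  # -> 55.0
```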

Challenges we ran into

  • Agentic Infinite Loops: Initially, our verification agent would occasionally hallucinate about the hallucinations. We solved this by tightly constraining the verification prompt to a strict boolean-plus-citation format, forcing the agent to quote the exact Elastic document it was referencing.
  • Lexical vs. Semantic Gaps: Legal documents are notoriously tricky. A vector search for "getting my money back" might not strongly match a legal document using the term "reimbursement of funds." We overcame this by fully leaning into Elasticsearch's Hybrid Search capabilities, ensuring we captured both the semantic meaning and the exact legal keywords.
  • CI/CD Integration: Making non-deterministic AI outputs work inside a deterministic GitHub Actions environment required careful state management and robust error handling to prevent pipeline timeouts.
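The boolean-plus-citation constraint from the first challenge can be enforced mechanically on the agent's reply. A minimal validator sketch (the JSON keys `supported` and `citation` are illustrative, not our exact schema):

```python
import json

def parse_verdict(raw: str, evidence: list[str]) -> dict:
    """Accept only a JSON verdict whose citation appears verbatim in the evidence."""
    verdict = json.loads(raw)
    if not isinstance(verdict.get("supported"), bool):
        raise ValueError("verdict must contain a boolean 'supported' field")
    quote = verdict.get("citation", "")
    # the cited text must be an exact substring of a retrieved document
    if not any(quote and quote in doc for doc in evidence):
        raise ValueError("citation is not a verbatim quote from the evidence")
    return verdict
```

Rejecting replies that fail this check forces the agent to quote its source rather than free-associate about it.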

Accomplishments that we're proud of

  • Creating a native GitHub Action that can block a Pull Request if an LLM is misbehaving.
  • Successfully architecting a multi-agent workflow where agents do not just chat, but perform distinct, sequential functional tasks in a software pipeline.
  • Achieving highly accurate retrieval on complex policy documents by deeply integrating Elasticsearch hybrid search into our retrieval agent.
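Wiring the gate into a Pull Request check needs only a job whose step exits nonzero when the Risk Score crosses the threshold. A hypothetical workflow fragment (the module path, flag, and secret name are illustrative):

```yaml
name: deployguard-gate
on: pull_request

jobs:
  verify-llm-output:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run DeployGuard verification
        # exits 1 (failing the PR check) when the Risk Score >= threshold
        run: python -m deployguard.gate --threshold 40
        env:
          ELASTICSEARCH_URL: ${{ secrets.ELASTICSEARCH_URL }}
```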

What we learned

  1. Elasticsearch is an incredible Agent Brain: Giving an LLM direct, programmatic access to highly tuned search indices makes it dramatically more reliable than standard RAG.
  2. Specialization Beats Generalization: Breaking the verification task into four distinct agents (Extract -> Retrieve -> Verify -> Score) yielded vastly superior results compared to asking a single LLM to fact-check an entire text block.
  3. RAG is not just for chatbots; it is a powerful framework for automated software testing.

What's next for DeployGuard

  • Automated Remediation: Instead of just failing the pipeline, DeployGuard will automatically suggest prompt modifications to the developer to fix the hallucination.
  • Visual Dashboard: A Kibana-powered dashboard to track an organization's "Hallucination Rate" over time across different models and departments.
  • Wider CI Support: Expanding beyond GitHub Actions to native integrations with Jenkins and GitLab CI.

Built With

  • Python
  • Amazon Bedrock
  • Elasticsearch
  • GitHub Actions