Inspiration

As we embarked on building V-Commerce - a fully AI-native e-commerce platform where users chat with agents and virtually try on clothes - we hit a wall that every modern engineer faces: The "Black Box" Problem.

We realized that traditional observability (latency, error rates, CPU usage) is completely blind to AI failures.

  • A chatbot can return a 200 OK status while hallucinating a 99% discount.
  • A user can "jailbreak" an AI Stylist to generate harmful content without triggering a single system error.
  • A Prompt Injection attack looks exactly like a normal query to a standard firewall.

We were inspired to build Godseye to answer a critical question: How do you monitor an application where the logic is probabilistic, the security threats are linguistic, and "success" is subjective?

We turned to Datadog not just as a monitoring tool, but as the backbone of our AI security and reliability strategy. We set out to implement an end-to-end observability strategy for LLMs, proving that by combining Datadog's advanced metric tagging, APM tracing, and Incident Management, we could turn unpredictable AI behavior into measurable, actionable signals.

What it does

Godseye is two projects in one: a cutting-edge AI Shopping Platform guarded by a sophisticated Datadog Observability Suite.

1. The V-Commerce Platform

We built a futuristic shopping experience powered by Google Gemini 2.0 Flash:

  • 💬 AI Chat Shopping: A conversational assistant that understands "I need an outfit for a summer wedding in Italy" and queries our product catalog using RAG.
  • 🪞 Virtual Try-On Service: Users upload a photo, and our AI (running on a dedicated Python microservice) generates a realistic image of them wearing the product.
  • 👗 PEAU Agent (AI Stylist): A proactive agent that nudges users with trend-based suggestions.
  • 🧠 Observability Insights Service: A dedicated "AI SRE" that doesn't just serve users, but safeguards the system. It uses Gemini to analyze Datadog metrics in real-time, predicting outages before human operators even notice a spike.
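The catalog lookup behind the chat assistant can be sketched as a minimal RAG retrieval step. This is a toy in-memory stand-in: the catalog names and hand-made vectors below are illustrative assumptions, while in the real system the embeddings would come from an embedding model and live in the vector DB.

```python
# Toy RAG retrieval: rank catalog items by cosine similarity to the query
# embedding, then hand the top matches to the LLM as grounding context.
import math

# Illustrative catalog with hand-made 3-dim "embeddings" (assumption).
CATALOG = {
    "linen summer suit":    [0.9, 0.1, 0.0],
    "wool winter coat":     [0.0, 0.2, 0.9],
    "floral wedding dress": [0.8, 0.3, 0.1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve(query_vec, k=2):
    """Return the k catalog items closest to the query embedding."""
    ranked = sorted(CATALOG, key=lambda name: cosine(query_vec, CATALOG[name]),
                    reverse=True)
    return ranked[:k]
```

The retrieved product descriptions are then packed into the Gemini prompt as context, so the assistant answers from the live catalog rather than from its training data.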

2. The Godseye Observability Suite (Powered by Datadog)

While the AI provides the magic, Datadog provides the control. We focus on streaming LLM and runtime telemetry to Datadog to power 5 Novel Detection Rules:

  1. 🛡️ Prompt Injection Detection (Security):
    • The Problem: Attackers trying to override the chatbot's instructions.
    • The Datadog Solution: We score every incoming query for "injection intent" and emit it as a custom metric (llm.security.injection_score). Datadog Monitors track this score by session_id, so if a specific session attacks repeatedly, Datadog auto-creates an Incident carrying the specific context (attacker ID, injection intent) an AI engineer needs to act immediately.
  2. 📉 Interactions-Per-Conversion (Business):
    • The Problem: AI models that get stuck in loops, burning tokens without making sales.
    • The Datadog Solution: We correlate AI chat depth with cart additions. A custom Datadog metric spikes when users chat at length without buying, alerting us to "useless" models that burn tokens without closing sales.
  3. 🔮 Predictive Capacity (AI Observing AI):
    • The Problem: Waiting for an outage to fix it.
    • The Datadog Solution: We built an "Insights Service" that reads Datadog metrics and feeds them into Gemini. The AI analyzes the Datadog charts to predict failure probability 2 hours in advance, turning Datadog from a dashboard into a crystal ball.
  4. 🧠 Quality Degradation:
    • The Problem: "Silent failures" where the AI becomes rude or unhelpful.
    • The Datadog Solution: Real-time quality scoring of every response, aggregated in Datadog to detect drift before users complain.
  5. 🖼️ Multimodal Security (Try-On Protection):
    • The Problem: Users uploading "decompression bombs" or malicious images to the Try-On service.
    • The Datadog Solution: We track image security violations tagged by user_id. Datadog instantly alerts us if a specific user is attempting a pixel-flood attack.
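Rule #1 above can be sketched end to end: score the prompt, then ship the score to the local Datadog Agent over DogStatsD's UDP wire format. The keyword heuristic is a deliberately crude stand-in for the real scorer; the datagram format (`name:value|g|#tags`) and default port 8125 are standard DogStatsD.

```python
# Score a prompt for injection intent and gauge it to DogStatsD, tagged by
# session so monitors can group on session_id. The marker list is a toy
# stand-in for the real classifier.
import socket

INJECTION_MARKERS = ("ignore previous", "system prompt", "disregard your instructions")

def injection_score(prompt: str) -> float:
    """Crude heuristic: fraction of known injection markers present."""
    text = prompt.lower()
    return sum(m in text for m in INJECTION_MARKERS) / len(INJECTION_MARKERS)

def dogstatsd_payload(metric: str, value: float, tags: dict) -> bytes:
    # DogStatsD wire format: "name:value|type|#tag:val,..."; "g" = gauge.
    tag_str = ",".join(f"{k}:{v}" for k, v in tags.items())
    return f"{metric}:{value}|g|#{tag_str}".encode()

def report(prompt: str, session_id: str, agent=("127.0.0.1", 8125)) -> float:
    score = injection_score(prompt)
    payload = dogstatsd_payload("llm.security.injection_score", score,
                                {"session_id": session_id})
    # Fire-and-forget UDP to the local Datadog Agent (DogStatsD default port).
    socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(payload, agent)
    return score
```

Because the tag travels with every sample, a monitor grouped `by {session_id}` can single out one attacking session without any joins at query time.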

How we built it

We built a cloud-native microservices architecture designed to showcase the power of the Datadog ecosystem:

  • Core Stack: 10+ Microservices built with Go (for high-performance frontend/checkout) and Python (for AI workloads), orchestrated on Kubernetes with Skaffold.
  • Deployment on GKE: We deployed V-Commerce Studio on Google Kubernetes Engine (GKE). The Google ecosystem made it incredibly easy to integrate everything—from managing Gemini credentials to scaling our AI workloads seamlessly.
  • AI Engine: We leaned heavily on Google Gemini 2.0 Flash for its speed, which is essential for real-time chat and styling. All of our agents were built with the Agent Development Kit (ADK) as LlmAgents, backed by MCP servers.
  • The Datadog Integration:
    • DogStatsD: We instrumented our Python AI interceptors to stream LLM and runtime telemetry (like llm.token_usage, llm.response.quality) directly to the Datadog Agent.
    • Datadog APM: We used distributed tracing to follow a request from the Go Frontend -> Python Chatbot -> Vector DB -> Gemini API. This allowed us to pinpoint exactly where latency was introduced.
    • Tagging Strategy: This was our secret weapon. We tagged every metric with session_id and user_id. This allowed our Datadog Dashboards to slice data not just by "service", but by "individual user journey."
    • Incident Management: We defined custom JSON rules in Datadog that automatically spin up a Slack war-room or send urgent emails, and assign severity levels based on the type of AI attack detected.
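Putting the monitor and tagging pieces together, the per-session injection monitor looks roughly like the JSON body we would POST to Datadog's monitors API. The 0.8 threshold, monitor name, and message text are illustrative assumptions; the query's `by {session_id}` grouping is what lets a single misbehaving session alert on its own.

```python
# Per-session metric-alert monitor for the injection score, built as the
# payload for Datadog's create-monitor API. Threshold and channel name are
# assumptions for illustration.
import json

injection_monitor = {
    "name": "LLM prompt injection by session",
    "type": "metric alert",
    # Alert when any single session's 5-minute average score crosses 0.8.
    "query": "avg(last_5m):avg:llm.security.injection_score{*} by {session_id} > 0.8",
    # {{session_id.name}} is filled in per triggered group by Datadog's
    # message templating; @slack-... routes the notification.
    "message": "Repeated injection attempts from session {{session_id.name}} @slack-ai-warroom",
    "options": {"thresholds": {"critical": 0.8}},
}

payload = json.dumps(injection_monitor)
```

The same shape extends to the other rules: only the metric name, grouping tag, and threshold change.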

Challenges we ran into

  • Attributing Attacks in a Stateless World: Mapping a malicious prompt to a persistent identity was hard. We had to ensure the session_id context propagated from the Go frontend through gRPC headers to the Python backend, so that when we emitted a metric to Datadog, it carried the "fingerprint" of the attacker. Datadog APM's trace context propagation was a lifesaver here.
  • Multimodal Security Vectors: Securing the Try-On feature was complex. Standard WAFs don't understand that a valid PNG can be a "decompression bomb" that crashes a server. We had to build custom validation logic and wire it to Datadog Alerts to catch these specific image-based attacks.
  • Defining "Failure": Datadog is great at telling you when an API fails (500 error). It's harder to tell when an AI is being "lazy." We had to invent the Interactions-Per-Conversion metric to translate "bad AI behavior" into a number that Datadog could graph and alert on.
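The Interactions-Per-Conversion metric mentioned above reduces to a small computation over session events before being gauged to Datadog. The event names here are illustrative assumptions; the idea is simply chat turns divided by cart additions, with a sentinel value when a long conversation converts nothing.

```python
# Derive Interactions-Per-Conversion from a session's chronological event
# stream. Event names ("chat", "add_to_cart") are illustrative.
def interactions_per_conversion(events: list) -> float:
    chats = events.count("chat")
    conversions = events.count("add_to_cart")
    if conversions == 0:
        # No sale at all: report the raw chat count so a monitor can flag
        # sessions that are "all talk".
        return float(chats)
    return chats / conversions
```

A monitor on this gauge then turns "the model is chatting in circles" into an ordinary threshold alert.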

Accomplishments that we're proud of

  • AI Predicting Its Own Death: We are incredibly proud of the Predictive Capacity rule. We successfully built a loop where: System Runs -> Datadog Collects Metrics -> AI Analyzes Datadog Metrics -> AI Predicts Failure. It feels like the future of AIOps.
  • Defining Actionable Items: Instead of generic alerts, our Datadog setup identifies the specific session_id responsible. We can block one attacker while keeping the store open for everyone else.
  • Full Observability: Looking at our Datadog LLM Dashboard and seeing real-time charts of "Prompt Injection Attempts" vs. "Purchase Conversion" gave us unprecedented confidence in deploying GenAI to production.

What we learned

  • Tags are everything: In AI observability, aggregate metrics are useless. You need high-cardinality tagging (User IDs, Session IDs, Model Versions) to find the needle in the haystack. Datadog's ability to handle these tags effortlessly was crucial.
  • Security is a Data Problem: You can't write if/else statements for Prompt Injection. You have to treat security as a statistical anomaly problem - something Datadog Monitors are perfectly suited for.
  • Multimodal Apps need Multimodal Observability: You can't just monitor text logs when users are uploading images. Monitoring the properties of the uploaded media (size, entropy, format) via custom metrics is mandatory.
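The media-property checks above (size, entropy, format) can be sketched with nothing but the stdlib. The PNG header parsing and Shannon entropy are standard; the MAX_PIXELS cap and violation name are assumptions, and a production path would still route uploads through a hardened image library before any decode.

```python
# Inspect an uploaded PNG's declared dimensions and byte entropy before
# decoding it. A tiny file declaring a huge canvas is the classic
# decompression-bomb signature.
import math
import struct
from collections import Counter

MAX_PIXELS = 25_000_000  # assumed cap; tune per service

def png_dimensions(data: bytes) -> tuple:
    # PNG signature, then the IHDR chunk (always first): 4-byte length,
    # b"IHDR", then big-endian uint32 width and height.
    if data[:8] != b"\x89PNG\r\n\x1a\n" or data[12:16] != b"IHDR":
        raise ValueError("not a well-formed PNG")
    return struct.unpack(">II", data[16:24])

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits/byte (near 8.0 = random/compressed-looking)."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def upload_violations(data: bytes) -> list:
    w, h = png_dimensions(data)
    violations = []
    if w * h > MAX_PIXELS:
        violations.append("pixel_flood")  # decompression-bomb style canvas
    # Each violation would then be emitted as a custom metric tagged with
    # user_id, per detection rule #5.
    return violations
```

A 20000x20000 declaration in a few-hundred-byte file trips the check long before the decoder allocates 400 million pixels.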

What's next for Godseye

  • Automated Self-Healing via Webhooks: Currently, Datadog creates an incident. Next, we want to use Datadog Webhooks to trigger a serverless function that automatically updates the firewall rules to block the attacker's IP.
  • Cost-Aware Routing: Using Datadog's metric capabilities to track token spend per user, and dynamically downgrading free-tier users to cheaper models if they exceed a cost threshold.
  • Voice Commerce: Expanding the platform to support voice inputs, and adding "Audio Quality" metrics to our Datadog dashboards.
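The self-healing hook could start as small as a function a serverless runtime invokes with the Datadog Webhook body. Everything here is a design sketch, not working infrastructure: the `attacker_ip` field is an assumption about what we would put in the webhook's payload template, and the returned firewall rule is a placeholder for a real API call.

```python
# Sketch of a serverless handler for a Datadog Webhook notification. The
# payload fields and the "deny from" rule string are design assumptions.
import json

def handle_datadog_webhook(body: str) -> dict:
    alert = json.loads(body)
    # The webhook payload template would be configured to carry the
    # offending IP extracted by the monitor's grouping tags.
    attacker_ip = alert.get("attacker_ip")
    if alert.get("alert_type") == "error" and attacker_ip:
        return {"action": "block", "rule": f"deny from {attacker_ip}"}
    return {"action": "none"}
```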
