datadog-labs/agent-skills

Datadog Skills for AI Agents

Datadog skills for Claude Code, Codex CLI, Gemini CLI, Cursor, Windsurf, OpenCode, and other AI agents.

Skills

Skill         Description
dd-pup        Primary CLI - commands, auth, PATH setup
dd-monitors   Create, manage, mute monitors
dd-logs       Search logs
dd-apm        Traces, services, performance
dd-docs       Search Datadog documentation
dd-llmo       LLM Observability: experiments, eval RCA, evaluator generation, session classification

Install

Set Up Pup

# Homebrew (macOS/Linux) — recommended
brew tap datadog-labs/pack
brew install datadog-labs/pack/pup

# Or build from source
git clone https://github.com/datadog-labs/pup.git && cd pup
cargo build --release
mkdir -p ~/.local/bin            # make sure the target directory exists
cp target/release/pup ~/.local/bin/

Pre-built binaries are also available from the latest release.
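If pup was copied to ~/.local/bin as in the source build above, that directory has to be on PATH before the shell can find the binary. The sketch below shows one way to add it without creating duplicates. ensure_on_path is a hypothetical helper name used for illustration only; it is not part of pup.

```shell
# Hypothetical helper (not part of pup): prepend a directory to PATH
# only if it is not already listed there.
ensure_on_path() {
  dir=$1
  case ":$PATH:" in
    *":$dir:"*) ;;                        # already on PATH, nothing to do
    *) PATH="$dir:$PATH"; export PATH ;;  # prepend and export
  esac
}

# Example usage (left commented so defining the helper has no side effects):
#   ensure_on_path "$HOME/.local/bin"
#   command -v pup    # prints the resolved path if pup is now reachable
```

To make the change permanent, the call would typically go in the shell's startup file.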

# Authenticate
pup auth login

Add Skill(s)

For just dd-pup:

npx skills add datadog-labs/agent-skills \
  --skill dd-pup \
  --full-depth -y

For dd-pup together with the monitors, logs, APM, and docs skills:

npx skills add datadog-labs/agent-skills \
  --skill dd-pup \
  --skill dd-monitors \
  --skill dd-logs \
  --skill dd-apm \
  --skill dd-docs \
  --full-depth -y
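When the set of skills varies between machines, the repeated --skill flags can be generated from a list instead of typed by hand. This is only a sketch: skill_flags is a hypothetical helper, and it assumes skill names contain no spaces, as is true of the names listed above.

```shell
# Hypothetical helper: print one "--skill <name>" pair per argument.
skill_flags() {
  for name in "$@"; do
    printf '%s %s ' --skill "$name"
  done
}

# Example usage (commented out; the substitution is intentionally unquoted
# so the shell splits the output into separate arguments):
#   npx skills add datadog-labs/agent-skills \
#     $(skill_flags dd-pup dd-logs dd-apm) \
#     --full-depth -y
```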

LLM Observability (LLMO)

The dd-llmo directory contains four skills for working with LLM Observability data:

Skill                   Purpose
experiment-analyzer     Analyze and compare offline LLM experiments
eval-trace-rca          Root-cause production failures using eval judge signal or runtime errors
eval-bootstrap          Generate evaluator code from traces, optionally seeded by RCA output
eval-session-classify   Classify whether user intent was satisfied in a session (trace + RUM signals)

Eval pipeline flow:

eval-session-classify          eval-trace-rca → eval-bootstrap
 (classify sessions)           (diagnose why)   (build evals)

Run eval-trace-rca to understand why an app is failing by analyzing eval judge verdicts or runtime errors across production traces. Then run eval-bootstrap to generate evaluator code that captures those failure patterns. Pass the RCA output directly to eval-bootstrap to seed it with the discovered failure taxonomy.

Use eval-session-classify independently to evaluate whether individual assistant sessions satisfied user intent, combining LLM Obs trace data with RUM behavioral signals.

Install

# Claude Code — copy any or all skills
# Run from the root of a clone of this repository
mkdir -p ~/.claude/skills
cp -r dd-llmo/experiment-analyzer ~/.claude/skills/
cp -r dd-llmo/eval-trace-rca ~/.claude/skills/
cp -r dd-llmo/eval-bootstrap ~/.claude/skills/
cp -r dd-llmo/eval-session-classify ~/.claude/skills/
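The same copy step can be wrapped in a loop that creates the destination first and skips any skill directory that is missing. This is a sketch under the assumption that it runs from the root of a clone of this repository; install_llmo_skills and its two optional arguments are hypothetical names, not something the repository ships.

```shell
# Hypothetical helper: copy the four dd-llmo skills into a skills directory.
#   $1 = source directory (default: dd-llmo)
#   $2 = destination      (default: ~/.claude/skills)
install_llmo_skills() {
  src=${1:-dd-llmo}
  dest=${2:-"$HOME/.claude/skills"}
  mkdir -p "$dest" || return 1
  for skill in experiment-analyzer eval-trace-rca eval-bootstrap eval-session-classify; do
    if [ -d "$src/$skill" ]; then
      cp -r "$src/$skill" "$dest/"
    else
      echo "skipping $skill: not found under $src" >&2
    fi
  done
}

# Example usage (commented out):
#   install_llmo_skills                  # defaults shown above
#   install_llmo_skills dd-llmo /tmp/x   # copy somewhere else for inspection
```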

MCP Requirements

All four skills require the LLMO toolset:

claude mcp add --scope user --transport http "datadog-llmo-mcp" 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'

The core toolset is optional for experiment-analyzer, which uses it only for notebook export. It is required for eval-session-classify, which uses it for RUM behavioral analysis and for efficient batched fetches of a session's trace spans:

claude mcp add --scope user --transport http "datadog-mcp-core" 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core'
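The two registration commands above differ only in the server name and the toolsets query parameter. A small sketch that makes that explicit follows; mcp_url is a hypothetical helper, and only the llmobs and core values that appear in this README are assumed to be valid.

```shell
# Hypothetical helper: build the MCP endpoint URL for a given toolset name.
mcp_url() {
  printf 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=%s\n' "$1"
}

# Example usage (commented out; requires the claude CLI):
#   claude mcp add --scope user --transport http "datadog-llmo-mcp" "$(mcp_url llmobs)"
#   claude mcp add --scope user --transport http "datadog-mcp-core" "$(mcp_url core)"
#   claude mcp list    # confirm both servers are registered
```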

Usage

# Analyze experiments
experiment-analyzer <experiment_id>                         # single experiment
experiment-analyzer <baseline_id> <candidate_id>            # compare two experiments
experiment-analyzer <id(s)> <question>                      # ask a specific question
experiment-analyzer <id(s)> [question] --output notebook    # export to Datadog notebook

# Root-cause why an app is failing
What's wrong with <ml_app> based on its evals over the last 24h
Analyze eval failures for <eval_name> over the last week
Look at the errors on <ml_app> over the last 24h

# Generate evaluator code from production traces
/eval-bootstrap <ml_app>                                    # cold start
/eval-bootstrap <ml_app> [paste eval-trace-rca output here] # seeded from RCA
/eval-bootstrap <ml_app> --data-only                        # emit JSON spec instead of Python SDK code

# Classify a session
/eval-session-classify <session_id>

Quick Reference

Task                                    Command
Search error logs                       pup logs search --query "status:error" --from 1h
List monitors                           pup monitors list
Schedule monitor downtime               pup downtime create --file downtime.json
Find slow traces                        pup traces search --query "service:api @duration:>500ms" --from 1h
Query metrics                           pup metrics query --query "avg:system.cpu.user{*}"
List services (--env is required)       pup apm services list --env <env> --from 1h --to now
Check auth                              pup auth status
Refresh token                           pup auth refresh

See the official pup docs for more commands.
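The log and trace searches from the table can be combined into a quick first look at one service. The sketch below reuses only the commands and flags shown above; triage_service is a hypothetical wrapper, and scoping the log search with service:<name> alongside status:error is an assumption based on standard Datadog search syntax rather than something this README states.

```shell
# Hypothetical wrapper: recent error logs and slow traces for one service.
#   $1 = service name
#   $2 = look-back window (default: 1h)
triage_service() {
  service=$1
  window=${2:-1h}
  # Error logs for the service (query combination is an assumption)
  pup logs search --query "service:$service status:error" --from "$window"
  # Traces slower than 500ms, same threshold as the table above
  pup traces search --query "service:$service @duration:>500ms" --from "$window"
}

# Example usage (commented out; requires an authenticated pup):
#   triage_service api
#   triage_service checkout 15m
```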

Auth

# Check auth first (includes token time remaining)
pup auth status

# If commands fail with 401/403, try refresh first
pup auth refresh

# If refresh fails or no session exists, do full OAuth login
pup auth login

# Non-default site/org
pup auth login --site datadoghq.eu --org <org>

If the browser opens the wrong profile/window, use the one-time URL printed by pup auth login and open it manually in the correct session.
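The check, refresh, login order described above can be sketched as a single function. It assumes that pup auth status and pup auth refresh exit with a non-zero status when there is no valid session or the refresh fails; this README does not state that, so verify it against your pup version. ensure_pup_auth is a hypothetical name.

```shell
# Hypothetical helper: fall through status -> refresh -> login.
# Assumes non-zero exit codes signal failure (not confirmed by this README).
ensure_pup_auth() {
  # 1. Already authenticated: nothing to do.
  if pup auth status >/dev/null 2>&1; then
    return 0
  fi
  # 2. Try a token refresh before a full login.
  if pup auth refresh; then
    return 0
  fi
  # 3. Refresh failed or no session exists: full OAuth login.
  pup auth login
}

# Example usage (commented out):
#   ensure_pup_auth && pup monitors list
```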

More Skills

Additional skills will be available soon.

# List all available
npx skills add datadog-labs/agent-skills --list --full-depth

License

MIT

About

Public repository for Datadog Agent Skills
