Datadog skills for Claude Code, Codex CLI, Gemini CLI, Cursor, Windsurf, OpenCode, and other AI agents.
| Skill | Description |
|---|---|
| dd-pup | Primary CLI - commands, auth, PATH setup |
| dd-monitors | Create, manage, mute monitors |
| dd-logs | Search logs |
| dd-apm | Traces, services, performance |
| dd-docs | Search Datadog documentation |
| dd-llmo | LLM Observability: experiments, eval RCA, evaluator generation, session classification |
# Homebrew (macOS/Linux) — recommended
brew tap datadog-labs/pack
brew install datadog-labs/pack/pup
# Or build from source
git clone https://github.com/datadog-labs/pup.git && cd pup
cargo build --release
cp target/release/pup ~/.local/binPre-built binaries are also available from the latest release.
# Authenticate
pup auth loginFor JUST dd-pup:
npx skills add datadog-labs/agent-skills \
--skill dd-pup \
--full-depth -ynpx skills add datadog-labs/agent-skills \
--skill dd-pup \
--skill dd-monitors \
--skill dd-logs \
--skill dd-apm \
--skill dd-docs \
--full-depth -yThe dd-llmo directory contains four skills for working with LLM Observability data:
| Skill | Purpose |
|---|---|
experiment-analyzer |
Analyze and compare offline LLM experiments |
eval-trace-rca |
Root-cause production failures using eval judge signal or runtime errors |
eval-bootstrap |
Generate evaluator code from traces, optionally seeded by RCA output |
eval-session-classify |
Classify whether user intent was satisfied in a session (trace + RUM signals) |
Eval pipeline flow:
eval-session-classify eval-trace-rca → eval-bootstrap
(classify sessions) (diagnose why) (build evals)
Run eval-trace-rca to understand why an app is failing by analyzing eval judge verdicts or
runtime errors across production traces. Then run eval-bootstrap to generate evaluator code
that captures those failure patterns. Pass the RCA output directly to eval-bootstrap to seed
it with the discovered failure taxonomy.
Use eval-session-classify independently to evaluate whether individual assistant sessions
satisfied user intent, combining LLM Obs trace data with RUM behavioral signals.
# Claude Code — copy any or all skills
cp -r dd-llmo/experiment-analyzer ~/.claude/skills
cp -r dd-llmo/eval-trace-rca ~/.claude/skills
cp -r dd-llmo/eval-bootstrap ~/.claude/skills
cp -r dd-llmo/eval-session-classify ~/.claude/skillsAll four skills require the LLMO toolset:
claude mcp add --scope user --transport http "datadog-llmo-mcp" 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'experiment-analyzer uses the core toolset for notebook export (optional). eval-session-classify
requires it for RUM behavioral analysis and efficient batched fetches of trace session spans:
claude mcp add --scope user --transport http "datadog-mcp-core" 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core'# Analyze experiments
experiment-analyzer <experiment_id> # single experiment
experiment-analyzer <baseline_id> <candidate_id> # compare two experiments
experiment-analyzer <id(s)> <question> # ask a specific question
experiment-analyzer <id(s)> [question] --output notebook # export to Datadog notebook
# Root-cause why an app is failing
What's wrong with <ml_app> based on its evals over the last 24h
Analyze eval failures for <eval_name> over the last week
Look at the errors on <ml_app> over the last 24h
# Generate evaluator code from production traces
/eval-bootstrap <ml_app> # cold start
/eval-bootstrap <ml_app> [paste eval-trace-rca output here] # seeded from RCA
/eval-bootstrap <ml_app> --data-only # emit JSON spec instead of Python SDK code
# Classify a session
/eval-session-classify <session_id>
| Task | Command |
|---|---|
| Search error logs | pup logs search --query "status:error" --from 1h |
| List monitors | pup monitors list |
| Schedule monitor downtime | pup downtime create --file downtime.json |
| Find slow traces | pup traces search --query "service:api @duration:>500ms" --from 1h |
| Query metrics | pup metrics query --query "avg:system.cpu.user{*}" |
| List services for an env (required) | pup apm services list --env <env> --from 1h --to now |
| Check auth | pup auth status |
| Refresh token | pup auth refresh |
More commands for pup are found in the official pup docs.
# Check auth first (includes token time remaining)
pup auth status
# If commands fail with 401/403, try refresh first
pup auth refresh
# If refresh fails or no session exists, do full OAuth login
pup auth login
# Non-default site/org
pup auth login --site datadoghq.eu --org <org>If the browser opens the wrong profile/window, use the one-time URL printed by pup auth login and open it manually in the correct session.
Additional skills available soon.
# List all available
npx skills add datadog-labs/agent-skills --list --full-depthMIT