Datadog Skills for AI Agents

Datadog skills for Claude Code, Codex CLI, Gemini CLI, Cursor, Windsurf, OpenCode, and other AI agents.

Skills

Skill	Description
dd-pup	Primary CLI - commands, auth, PATH setup
dd-monitors	Create, manage, mute monitors
dd-logs	Search logs
dd-apm	Traces, services, performance
dd-docs	Search Datadog documentation
dd-llmo	LLM Observability: experiments, eval RCA, evaluator generation, session classification

Install

Setup Pup

# Homebrew (macOS/Linux) — recommended
brew tap datadog-labs/pack
brew install datadog-labs/pack/pup

# Or build from source
git clone https://github.com/datadog-labs/pup.git && cd pup
cargo build --release
mkdir -p ~/.local/bin
cp target/release/pup ~/.local/bin/

Pre-built binaries are also available from the latest release.
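
A source build places the binary in ~/.local/bin, which is not on PATH on every system. The following is a minimal sketch of a check, assuming a POSIX shell; path_has_dir is a hypothetical helper name and is not part of pup.

```shell
# Sketch: report whether a directory is listed in PATH.
# path_has_dir is a hypothetical helper, not something pup provides.
path_has_dir() {
  # Wrap PATH in colons so the first and last entries match like any other.
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_has_dir "$HOME/.local/bin"; then
  echo "~/.local/bin is on PATH"
else
  # Print the line to add to a shell profile; nothing is modified here.
  echo 'add to your shell profile: export PATH="$HOME/.local/bin:$PATH"'
fi
```

The check only prints a suggestion, so it is safe to run repeatedly.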

# Authenticate
pup auth login

Add Skill(s)

For dd-pup only:

npx skills add datadog-labs/agent-skills \
  --skill dd-pup \
  --full-depth -y

For dd-pup plus the monitors, logs, APM, and docs skills:

npx skills add datadog-labs/agent-skills \
  --skill dd-pup \
  --skill dd-monitors \
  --skill dd-logs \
  --skill dd-apm \
  --skill dd-docs \
  --full-depth -y
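
Both commands above share one shape: a --skill flag per skill, then --full-depth -y. As a rough sketch, a small function can assemble that command for any subset of skills. skills_add_cmd is a hypothetical helper name; it only prints the command and does not run it.

```shell
# Sketch: build the install command for a chosen set of skills.
# skills_add_cmd is a hypothetical helper; the flags are copied from the commands above.
skills_add_cmd() {
  cmd="npx skills add datadog-labs/agent-skills"
  # Append one --skill flag per argument, in the order given.
  for skill in "$@"; do
    cmd="$cmd --skill $skill"
  done
  echo "$cmd --full-depth -y"
}
```

For example, skills_add_cmd dd-pup dd-logs prints the command for those two skills, which can be reviewed and then pasted into a terminal.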

LLM Observability (LLMO)

The dd-llmo directory contains four skills for working with LLM Observability data:

Skill	Purpose
`experiment-analyzer`	Analyze and compare offline LLM experiments
`eval-trace-rca`	Root-cause production failures using eval judge signal or runtime errors
`eval-bootstrap`	Generate evaluator code from traces, optionally seeded by RCA output
`eval-session-classify`	Classify whether user intent was satisfied in a session (trace + RUM signals)

Eval pipeline flow:

eval-session-classify          eval-trace-rca → eval-bootstrap
 (classify sessions)           (diagnose why)   (build evals)

Run eval-trace-rca to understand why an app is failing by analyzing eval judge verdicts or runtime errors across production traces. Then run eval-bootstrap to generate evaluator code that captures those failure patterns. Pass the RCA output directly to eval-bootstrap to seed it with the discovered failure taxonomy.

Use eval-session-classify independently to evaluate whether individual assistant sessions satisfied user intent, combining LLM Obs trace data with RUM behavioral signals.

Install

# Claude Code — copy any or all skills
mkdir -p ~/.claude/skills
cp -r dd-llmo/experiment-analyzer ~/.claude/skills/
cp -r dd-llmo/eval-trace-rca ~/.claude/skills/
cp -r dd-llmo/eval-bootstrap ~/.claude/skills/
cp -r dd-llmo/eval-session-classify ~/.claude/skills/
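
The copy steps can also be sketched as a loop that installs only the skills you name. This assumes it runs from the repository root and that ~/.claude/skills is the target, as in the commands above. install_llmo_skills and the LLMO_SRC and LLMO_DEST variables are hypothetical names used for illustration.

```shell
# Sketch: copy selected dd-llmo skills into a skills directory.
# install_llmo_skills, LLMO_SRC, and LLMO_DEST are hypothetical names.
install_llmo_skills() {
  src="${LLMO_SRC:-dd-llmo}"
  dest="${LLMO_DEST:-$HOME/.claude/skills}"
  # With no arguments, fall back to all four skills listed in the table above.
  [ "$#" -gt 0 ] || set -- experiment-analyzer eval-trace-rca eval-bootstrap eval-session-classify
  # Create the target first so cp copies into it instead of renaming onto it.
  mkdir -p "$dest" || return 1
  for skill in "$@"; do
    if [ -d "$src/$skill" ]; then
      cp -r "$src/$skill" "$dest/" && echo "installed $skill"
    else
      echo "skipped $skill (not found in $src)" >&2
    fi
  done
}
```

Calling install_llmo_skills eval-trace-rca eval-bootstrap copies just the two pipeline skills; calling it with no arguments copies all four.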

MCP Requirements

All four skills require the LLMO toolset:

claude mcp add --scope user --transport http "datadog-llmo-mcp" 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=llmobs'

Two skills also use the core toolset. experiment-analyzer uses it optionally for notebook export; eval-session-classify requires it for RUM behavioral analysis and for efficient batched fetches of trace session spans:

claude mcp add --scope user --transport http "datadog-mcp-core" 'https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core'
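
The two registrations differ only in the server name and the toolsets query parameter. A minimal sketch that factors this out is shown below; dd_mcp_url and add_dd_mcp are hypothetical helper names, and the URL and flags are copied unchanged from the commands above.

```shell
# Sketch: register a Datadog MCP toolset with Claude Code.
# dd_mcp_url and add_dd_mcp are hypothetical helpers, not part of any CLI.
dd_mcp_url() {
  # $1 = toolset name as used above (llmobs or core)
  echo "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=$1"
}

add_dd_mcp() {
  # $1 = server name, $2 = toolset
  claude mcp add --scope user --transport http "$1" "$(dd_mcp_url "$2")"
}
```

With these defined, add_dd_mcp datadog-llmo-mcp llmobs and add_dd_mcp datadog-mcp-core core reproduce the two commands above.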

Usage

# Analyze experiments
experiment-analyzer <experiment_id>                         # single experiment
experiment-analyzer <baseline_id> <candidate_id>            # compare two experiments
experiment-analyzer <id(s)> <question>                      # ask a specific question
experiment-analyzer <id(s)> [question] --output notebook    # export to Datadog notebook

# Root-cause why an app is failing
What's wrong with <ml_app> based on its evals over the last 24h
Analyze eval failures for <eval_name> over the last week
Look at the errors on <ml_app> over the last 24h

# Generate evaluator code from production traces
/eval-bootstrap <ml_app>                                    # cold start
/eval-bootstrap <ml_app> [paste eval-trace-rca output here] # seeded from RCA
/eval-bootstrap <ml_app> --data-only                        # emit JSON spec instead of Python SDK code

# Classify a session
/eval-session-classify <session_id>

Quick Reference

Task	Command
Search error logs	`pup logs search --query "status:error" --from 1h`
List monitors	`pup monitors list`
Schedule monitor downtime	`pup downtime create --file downtime.json`
Find slow traces	`pup traces search --query "service:api @duration:>500ms" --from 1h`
Query metrics	`pup metrics query --query "avg:system.cpu.user{*}"`
List services for an env (--env is required)	`pup apm services list --env <env> --from 1h --to now`
Check auth	`pup auth status`
Refresh token	`pup auth refresh`

See the official pup docs for more commands.

Auth

# Check auth first (includes token time remaining)
pup auth status

# If commands fail with 401/403, try refresh first
pup auth refresh

# If refresh fails or no session exists, do full OAuth login
pup auth login

# Non-default site/org
pup auth login --site datadoghq.eu --org <org>

If the browser opens the wrong profile/window, use the one-time URL printed by pup auth login and open it manually in the correct session.
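
The fallback order above (status, then refresh, then login) can be sketched as one function. This assumes each pup auth subcommand exits non-zero on failure, which the source does not state explicitly; ensure_pup_auth is a hypothetical helper name.

```shell
# Sketch: walk the auth fallback chain described above.
# Assumption: each `pup auth` subcommand exits non-zero when it fails.
# ensure_pup_auth is a hypothetical helper, not a pup command.
ensure_pup_auth() {
  if pup auth status >/dev/null 2>&1; then
    echo "auth ok"
  elif pup auth refresh >/dev/null 2>&1; then
    echo "token refreshed"
  elif pup auth login; then
    # Login output is left visible so the one-time URL can be read.
    echo "logged in"
  else
    echo "auth failed" >&2
    return 1
  fi
}
```

Running ensure_pup_auth before a batch of pup commands avoids a full OAuth login when a refresh is enough.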

More Skills

Additional skills will be available soon.

# List all available
npx skills add datadog-labs/agent-skills --list --full-depth

License

MIT
