Arize AI: AI Observability & LLM Evaluation Platform (https://arize.com/)

Arize AX Adds Native Support for NVIDIA NIM as AI Model Provider
https://arize.com/blog/arize-ax-adds-native-support-for-nvidia-nim-as-ai-model-provider/ (Mon, 16 Mar 2026)

The post Arize AX Adds Native Support for NVIDIA NIM as AI Model Provider appeared first on Arize AI.

We’re excited to announce that Arize AX now supports NVIDIA NIM as a native AI model provider. Enterprises running NIM-deployed models can now connect them directly to the Arize platform, access them in the playground, run experiments and online evaluations, and enable production monitoring, all through a dedicated, first-class integration within Arize.

What is NVIDIA NIM?

NVIDIA NIM microservices, part of NVIDIA AI Enterprise, are easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing across clouds, data centers, and workstations. Each NIM deployment includes a model paired with a production-optimized inference engine and exposes an API for application integration.

You can deploy NIM anywhere NVIDIA GPUs are available — whether in the cloud, an on-prem data center, or a local workstation.

NVIDIA NIM supports an extensive catalog of model families including NVIDIA Nemotron, Meta Llama, Mistral, Mixtral, and others. It’s designed for teams that need high-performance, self-hosted inference with full control over where their data lives. For enterprises with data residency requirements or compliance constraints, self-hosted NIM is often the only viable path to production. This complements Arize’s enterprise-ready self-hosted deployment option for Arize AX.

Our native integration now also enables seamless access to models through build.nvidia.com, NVIDIA’s developer platform for exploring, testing, and deploying NIM microservices and NVIDIA AI Blueprints. As a central hub for experimentation, it gives developers a fast path from concept to deployment using production-ready APIs and workflows.

Why Native NIM Support Matters

Deploying a model through NIM is fast, but understanding how that model behaves in production takes much longer. This integration bridges that gap, accelerating the time to discover which prompts fail, which responses drift, and where edge cases slip through.

For enterprise teams building agentic systems, the challenge is even greater. Model outputs trigger downstream actions, creating complex systems with feedback loops where silent failures can carry real consequences. Pre-deployment benchmarks aren’t enough — teams need continuous evaluation against production data, not just a snapshot from last month’s test suite.

With NVIDIA NIM natively integrated in Arize AX, teams get both layers in one place: NVIDIA’s inference performance and model access, plus Arize’s evaluation and improvement workflows. No custom endpoint configuration. No wrapper code. Simply connect your NIM endpoint under Settings → AI Providers, and your models are immediately available across playground, experiments, and evaluations.
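Under the hood, NIM deployments expose an OpenAI-compatible chat completions API, which is what makes this kind of direct connection possible. As a rough illustration (the host, port, and model name below are placeholders, not values from this post), here is how a request against a NIM endpoint can be built with only the Python standard library:

```python
import json
from urllib import request


def build_chat_request(base_url: str, model: str, prompt: str) -> request.Request:
    """Build a request against a NIM endpoint's OpenAI-compatible
    /v1/chat/completions route. Host and model name are illustrative."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request("http://localhost:8000", "meta/llama-3.1-8b-instruct", "Hello")
# resp = request.urlopen(req)  # only works against a live NIM deployment
```

The same request shape is what Arize AX issues on your behalf once the endpoint is registered under AI Providers.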

Arize AX’s Role in the NVIDIA Ecosystem

With this integration, the collaboration between Arize and NVIDIA grows even stronger.

The Arize integration with NVIDIA NeMo enables teams to build and manage continuously improving agentic systems by making it seamless to implement data flywheels. Arize surfaces production failures through online evaluations, routes problem cases for human annotation, and triggers fine-tuning jobs through NVIDIA NeMo Customizer. NVIDIA NIM serves as the inference layer that closes the loop, deploying the improved model back into production.

With NIM now a native provider in Arize, the full cycle runs through a connected workflow:

  • Deploy a model via NIM
  • Observe and evaluate production traffic in Arize
  • Curate failure examples into labeled datasets
  • Fine-tune an improved model with NeMo Customizer
  • Validate with Arize experiments, redeploy via NIM, and repeat
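The loop above can be sketched in code. Everything below is illustrative: the three callables stand in for NeMo Customizer fine-tuning, Arize experiment validation, and a NIM redeploy, and none of them are real SDK calls.

```python
from typing import Callable, Optional


def run_flywheel_iteration(
    traces: list,
    fine_tune: Callable[[list], str],
    validate: Callable[[str], bool],
    deploy: Callable[[str], None],
) -> Optional[str]:
    """One turn of the deploy -> observe -> curate -> fine-tune -> validate loop.
    All callables are hypothetical placeholders for the real services."""
    # Failures surfaced by online evaluations in Arize
    failures = [t for t in traces if not t["eval_passed"]]
    # Curate only the failures that received a human annotation
    dataset = [{"input": t["input"], "label": t["annotation"]}
               for t in failures if t.get("annotation")]
    if not dataset:
        return None
    candidate = fine_tune(dataset)        # e.g. a NeMo Customizer job
    if validate(candidate):               # e.g. an Arize experiment run
        deploy(candidate)                 # e.g. redeploy via NIM
        return candidate
    return None
```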

Built on an OpenTelemetry-based observability architecture, Arize integrates seamlessly with any orchestration framework or agent stack. When NIM serves as the inference layer, Arize provides full visibility and continuous evaluation across all workflows.

The Path Forward

As more enterprise teams move AI inference on-premises — driven by compliance requirements, latency needs, or data residency rules — the NVIDIA NIM-powered infrastructure layer and the Arize evaluation layer need to work together natively. This integration is a step toward making that the default, not the exception.

For organizations building on NVIDIA AI Infrastructure, Arize provides the observability foundation needed to deploy models confidently, evaluate them continuously, and improve them systematically.

Learn More

How We Used Evals (and an AI Agent) to Iteratively Improve an AI Newsletter Generator
https://arize.com/blog/how-we-used-evals-and-an-ai-agent-to-iteratively-improve-an-ai-newsletter-generator/ (Tue, 10 Mar 2026)

The post How We Used Evals (and an AI Agent) to Iteratively Improve an AI Newsletter Generator appeared first on Arize AI.

We love building little AI-powered tools that accelerate our workflows. One we built recently is a tool that takes our recent tweets and uses Claude to create a draft of our newsletter. It worked, sort of. The writing was good, but the details were a mess. So we pointed a coding agent at the problem: “run the evals, fix the issues, iterate.” Here’s what happened, and what it taught us about both evaluating LLM applications and working with AI agents.

The app (and the agent)

The setup is straightforward: fetch tweets from the Twitter API, format them with metadata (dates, engagement metrics, URLs), pass them to Claude Sonnet with a system prompt describing the desired newsletter format, and get back a Markdown newsletter. The prompt asks the LLM to group tweets by theme, highlight popular content, include source links, and write in an engaging tone. You can try it yourself, and the entire project is open source on GitHub.

The entire improvement process described in this post was carried out by a Claude Code agent. We gave it the evals, told it to iterate, and watched. The agent wrote code changes, ran experiments, recorded results, and proposed next steps, all autonomously. This turned out to be as instructive about working with agents as it was about working with evals.

Try the tweet-to-newsletter AI generator (repo here)

The eval suite

Before changing anything, we needed to know what “good” looks like. We built four evaluators using the Phoenix experiment framework, each checking a different dimension of quality:

  • faithfulness_and_quality (LLM judge): Does the newsletter stay faithful to the tweets? Is it well-written and synthesized, not just a list of blockquotes?
  • structure_adherence (code): Does the output have the right Markdown structure (headings, sections, dividers)?
  • no_hallucinated_links (code): Are all URLs in the output actually from the source data?
  • link_completeness (code): Does every tweet permalink and third-party URL appear in the output?

We ran these against a fixed dataset of 100 real tweets from @arizeai, split into 5 batches of 20. Each eval run produces an experiment in Phoenix, so we can compare results across iterations.
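To make the code-based evaluators concrete, here is a minimal Python sketch of the two link checks. The signatures are illustrative; the actual evaluators in the repo are written in TypeScript against the Phoenix experiment framework.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def no_hallucinated_links(output: str, source_urls: set) -> bool:
    """Every URL found in the newsletter must come from the source data."""
    found = {u.rstrip(").,") for u in URL_RE.findall(output)}
    return found <= source_urls


def link_completeness(output: str, source_urls: set) -> bool:
    """Every source permalink must appear somewhere in the output."""
    return all(u in output for u in source_urls)
```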

The baseline: good writing, broken links

Our first run told a clear story:

Evaluator Pass Rate
faithfulness_and_quality 5/5
structure_adherence 3/5
no_hallucinated_links 1/5
link_completeness 0/5

The LLM was a good writer but a terrible librarian. It faithfully synthesized tweet content into coherent narratives, but it copied shortened t.co URLs from the raw tweet text instead of using the expanded URLs we provided, it dropped most source links entirely, and it sometimes omitted required sections.

This is a common pattern with LLM applications: the core capability (writing) works well out of the box, but the details that make the output usable (correct links, proper structure) need engineering work.

Iteration 1: Fix the data, not the prompt

Our first instinct might have been to add a prompt instruction like “don’t use t.co URLs.” But the real problem was upstream: we were passing raw tweet text containing t.co shortened URLs to the LLM. The LLM was faithfully copying what it saw.

The fix was in our formatTweetForPrompt function. Before sending tweet text to the LLM, we replaced every t.co URL with its expanded version from the Twitter API’s entity data, then stripped any remaining t.co URLs that had no mapping (self-referential Twitter links that were filtered during fetch).
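In Python terms, the preprocessing step looks roughly like this (the repo's actual formatTweetForPrompt is TypeScript; this is a sketch of the idea, not the real function):

```python
import re

TCO_RE = re.compile(r"https://t\.co/\w+")


def format_tweet_for_prompt(text: str, url_map: dict) -> str:
    """Swap t.co short links for their expanded URLs from the API's
    entity data, then drop any unmapped t.co links entirely."""
    def expand(match: re.Match) -> str:
        return url_map.get(match.group(0), "")
    return TCO_RE.sub(expand, text).strip()
```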

Result: no_hallucinated_links jumped from 1/5 to 4/5. But we also saw faithfulness_and_quality drop from 5/5 to 3/5. The LLM judge caught the model fabricating tweet permalink IDs (typing wrong digits in the status URL). This wasn’t caused by our change; it was LLM non-determinism surfacing a pre-existing issue.

Lesson: When the LLM produces bad output, check whether you’re giving it bad input. Prompt engineering is often the wrong first move. Data preprocessing, i.e. fixing what the model sees, is more reliable and more debuggable.

Iteration 2: Be explicit about what you want

The permalink fabrication problem needed addressing. The LLM was retyping 19-digit tweet IDs from memory instead of copying them, and occasionally getting digits wrong. We added explicit prompt instructions: “Copy the URL exactly from the Tweet URL field. Do not modify, truncate, or retype the tweet ID digits.”

We also bumped max_tokens from 2048 to 4096. With 20 tweets per batch plus all their URLs, 2048 tokens wasn’t enough room, and the output was likely truncating and dropping links.

Result: no_hallucinated_links hit 5/5 and faithfulness_and_quality recovered to 5/5. But link_completeness remained stubbornly at 0/5.

Iteration 3: The structural shortcut (and why it was a trap)

The link completeness problem was structural: the LLM groups tweets by theme, and when it merges three related tweets into one paragraph, it naturally doesn’t include three separate source links. Telling it “include all permalinks” in the prompt wasn’t working because the instruction conflicted with the instruction to synthesize and group.

The agent tried a structural solution: add a “Tweet Sources” section at the bottom of every newsletter listing every tweet’s permalink. This got link_completeness from 0/5 to 3/5 immediately.

The remaining 2 failures turned out to be evaluator bugs. The Twitter API reports domain-like text such as “Booking.com” and “agent.py” as URLs in its entity data, and our evaluator was counting those as missing links. The agent fixed the evaluator to filter out bare-domain artifacts and got to 5/5 across the board.

All green. The agent reported success. Ship it, right?

Iteration 5: The human says no

This is where human judgment entered the picture. We looked at the actual newsletter output. The “Tweet Sources” section was a wall of 20 URLs at the bottom of an otherwise well-written newsletter. It existed purely to satisfy the evaluator. No human reader would find it useful. It was machine-readable link dumping masquerading as content.

The agent had done exactly what we asked: make the evals pass. And it found an efficient way to do it. The problem wasn’t the agent’s execution. It was that the eval was measuring the wrong thing, and the agent had no way to know that. It climbed the hill we pointed it at, but it was the wrong hill.

We told the agent to remove the Tweet Sources section. link_completeness dropped back to 0/5.

Iteration 6: Measure what actually matters

We told the agent what we actually cared about: “create an LLM judge that determines whether every tweet is referenced in some way, even if it’s not explicitly linked to. Summarizing related tweets is okay.” This was a human judgment call. We decided what the eval should measure, and delegated the implementation to the agent.

The agent replaced link_completeness with tweet_coverage, an LLM-as-a-judge evaluator. Instead of checking URLs, it reads each source tweet and the newsletter, then determines whether each tweet’s content is represented: directly referenced, summarized into a theme, or grouped with related tweets. It also allows legitimate skips for trivial tweets (bare retweets with no commentary, link-only posts, “thanks” replies).

The evaluator returns a coverage ratio (e.g., 19/20 = 0.95) and passes if coverage is at least 80%. Crucially, it also explains its reasoning for each tweet: “covered,” “skipped (acceptable),” or “missing (not covered).” This makes debugging far easier than a list of missing URLs.
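The aggregation logic is simple to sketch. This is one plausible implementation of the ratio-and-threshold step, assuming the LLM judge has already produced a per-tweet verdict; the judge prompt itself is the hard part and is omitted here.

```python
def tweet_coverage(verdicts: dict, threshold: float = 0.8):
    """Aggregate per-tweet verdicts ('covered', 'skipped', 'missing')
    into a coverage ratio and a pass/fail. Legitimate skips are
    excluded from the denominator."""
    countable = {k: v for k, v in verdicts.items() if v != "skipped"}
    if not countable:
        return 1.0, True
    ratio = sum(v == "covered" for v in countable.values()) / len(countable)
    return ratio, ratio >= threshold
```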

Result: 5/5 across all evaluators, with coverage scores of 1.0, 0.95, 0.95, 0.84, and 1.0. The lowest batch (84%) missed 3 bare retweets with no commentary, exactly the kind of editorial judgment we want the newsletter to exercise.

The final scorecard

Evaluator Baseline Final
faithfulness_and_quality 5/5 5/5
structure_adherence 3/5 5/5
no_hallucinated_links 1/5 5/5
tweet_coverage n/a 5/5

Seven experiments, three files changed (lib/newsletter.ts, evals/newsletter-eval.ts, and the eval dataset helpers). Every experiment is stored in Phoenix, so we can go back and compare any two iterations side by side.

What we learned about evals

Half the work was fixing the evaluators. Of our six iterations, three involved changing the evaluators rather than the application code. We fixed an evaluator that contradicted the prompt (requiring “Upcoming Events” when the prompt said to omit it). We filtered out spurious URLs from Twitter’s entity parser. And we ultimately replaced a mechanical URL-counting evaluator with an LLM judge that measures content coverage. Bad evaluators don’t just give you wrong numbers. They send you down wrong paths.

Fix the data before fixing the prompt. Our biggest single improvement came from preprocessing tweet text to replace t.co URLs with expanded URLs, a data transformation, not a prompt change. When an LLM produces bad output, the first question should be “what did we feed it?” not “how should we instruct it?”

Beware optimizing for the metric. We got to 5/5 on link_completeness by appending a dump of URLs to the newsletter. The eval was green but the product was worse. If your optimization makes the output less useful to humans, the eval is measuring the wrong thing.

LLM non-determinism is real. faithfulness_and_quality bounced between 3/5 and 5/5 across runs with identical code. One run, the LLM fabricated an event date; next run, it didn’t. This isn’t a bug. It’s the nature of working with language models. Evals need to account for this variance, and a single run doesn’t tell you much.

What we learned about agents

Agents will climb whatever hill you point them at. The agent was excellent at the mechanical loop: read eval results, diagnose the failure, write a fix, run the evals again. It went from 1/5 to 5/5 on hallucinated links in two iterations, methodically fixing the data pipeline and then the prompt. When the task is well-defined (“make this eval pass”), an agent is fast and thorough.

But agents can’t tell you if it’s the right hill. The “Tweet Sources” shortcut was a perfectly rational solution to the problem as stated. The agent found an efficient way to get link_completeness to 5/5. It took a human looking at the output to say “this is technically correct but makes the product worse.” Agents optimize; humans decide what’s worth optimizing for.

The human role shifts from writing code to setting direction. Across this entire process, the human contributions were: “run the evals,” “this Tweet Sources thing is a cheat, strip it out,” and “link completeness is the wrong thing to be optimizing for, create an LLM judge that checks content coverage instead.” That’s three sentences of direction that shaped six iterations of autonomous work. The agent wrote all the code, ran all the experiments, and debugged all the failures. The human decided what mattered.

Evals are how you steer agents. Without evals, you’d have to read every line of the agent’s output to know if it’s doing the right thing. With evals, you can let the agent iterate autonomously on mechanical improvements (data preprocessing, prompt wording, max_tokens) and only intervene when the direction is wrong. The evals are the agent’s objective function, which means getting them right is the most important thing the human does.

Eval-driven development works even better with agents. The loop of “change something, run evals, read results, decide what to fix next” is tedious for a human but natural for an agent. It never gets tired of re-running the suite, never skips reading the output, and never forgets to update the log. The human gets to focus on the judgment calls: is this metric right? Does this output actually look good? The agent handles the grind.

Try it yourself

The newsletter generator is open source and you can try the live app to see the final result in action. The eval suite, experiment history, and all the code changes described in this post are in the repo. Fork it, swap in a different Twitter handle, and run the evals yourself.

Arize Skills: Coding Agent Workflows for Traces, Evals, and Instrumentation
https://arize.com/blog/arize-skills-coding-agent-workflows-for-traces-evals-and-instrumentation/ (Tue, 10 Mar 2026)

The post Arize Skills: Coding Agent Workflows for Traces, Evals, and Instrumentation appeared first on Arize AI.


Two weeks ago we launched Alyx 2.0, the AI engineering agent inside Arize AX. Last week we launched the AX CLI, which made your trace data headless and agent-readable.

Today we’re shipping the next piece: Arize Skills.

The days of writing syntax are over

We’re firmly in the agent era. You tell software what you want it to do, and it does it. You shouldn’t have to explain your observability platform to your coding agent every time you open a new session.

“Here’s how Arize works. Here’s what a trace is. Here’s how to instrument this app.”

Same wall of context, every time.

Your agent should already know.

What Arize Skills are

Skills are pre-built instruction sets that give your coding agent native knowledge of Arize workflows. Install them once and your agent already knows how to export traces, add instrumentation, manage datasets, run experiments, and optimize prompts, without you explaining any of it first.

They work with Cursor, Claude Code, Codex, Windsurf, and 40+ other agents.

One command to install everything:

npx skills add Arize-ai/arize-skills --skill '*' --yes

What’s available now

Skill What it does
arize-trace Export traces and spans by trace ID, span ID, or session ID for debugging
arize-instrumentation Analyze your codebase, then implement tracing — two-phase, agent-assisted
arize-dataset Create, manage, and download datasets and examples
arize-experiment Run and analyze experiments against datasets
arize-prompt-optimization Optimize prompts using trace data and meta-prompting
arize-link Generate deep links to traces, spans, and sessions in the Arize UI

What this looks like in practice

We asked Claude Code to build a financial agent using Anthropic, with tools for financial advice, payments, loan calculations, and fraud detection. It built a working agent while we walked through the skills.

Next: “set up tracing to Arize, project FinBot.” The arize-instrumentation skill loaded automatically and ran its two-phase flow.

Phase 1 analyzed the codebase: detected the language, the model provider, the framework. It proposed a plan before touching anything.

We approved it, and Phase 2 implemented instrumentation across LLM calls, tool calls, and chain spans. FinBot appeared in Arize with live spans.

Then we had Claude generate 10 complex queries against the agent covering loans, fraud detection, and account info. While those ran, we used the arize-trace skill to start pulling spans back into the editor.

We asked it to find recent financial advice questions FinBot had answered, surface what it was doing well, and flag gaps.

From there: create more advice-focused queries, build a dataset from those traces, and use it to start evaluating how well FinBot actually handles financial advice.

The bigger picture

Alyx brought natural language to your observability data in the browser. The AX CLI made that data headless, pulling it into any agent or automation. Skills encode the workflows so your agent can act on it without hand-holding.

An agentic workflow embeds Arize directly into your engineering loop, autonomously improving your AI software.

Get started

npx skills add Arize-ai/arize-skills --skill '*' --yes

View on GitHub
Arize Agent Skills Docs

How to Build Planning Into Your Agent (The Architecture That Actually Works)
https://arize.com/blog/how-to-build-planning-into-your-agent (Thu, 05 Mar 2026)

The post How to Build Planning Into Your Agent (The Architecture That Actually Works) appeared first on Arize AI.

2025 was supposed to be the year of agents. And for the most part, it wasn’t. The industry was full of hype and the demos looked incredible, but when you actually tried to get an agent to do something meaningful, it would fall apart.

As we started digging into the few systems that did handle complex workflows (like Claude Code and Cursor in late 2025), one pattern kept showing up: planning.

This post walks through how we built planning into Alyx 2.0. We cover the iterations that failed, the architectural decisions that made the difference, and the specific patterns you can apply to your own agents.

To try Alyx, check out our docs or book a meeting for a custom demo.

TL;DR — the four things that made planning work:

  1. Planning as a structured tool call, not a prompt instruction
  2. A dedicated PlanMessage injected at a fixed position in every iteration
  3. Four task statuses (not two) — pending, in_progress, completed, blocked
  4. A hard enforcement gate that prevents the agent from finishing with incomplete tasks

Alyx AI engineering agent in planning phase

The problem we kept running into

Before we added planning to Alyx, we hit the same wall over and over again. A user would ask something like “sort the LLM spans by latency, identify bottlenecks, and suggest improvements for my agent” and Alyx would return a beautifully sorted table… and then call finish. Two thirds of the request just evaporated.

This wasn’t a capability issue. Alyx had the tools. It had access to the data. The problem was attention and working memory. As the context window filled up with tool call results and API responses and pages of JSON, the original request got buried. By the time the model was deciding what to do next, it had effectively forgotten what “next” was supposed to be.

We tried to fix this with hardcoded tool sequences. Do this first, then this, then this. It technically worked, but it had zero flexibility. If the user wanted to go off script at all, the whole thing broke down. We were building an on-rails experience when what we needed was an actual agent.

We also tried prompt engineering – “always plan before executing,” few-shot examples of good planning behavior, chain-of-thought instructions. Academic work like Plan and Solve and Reflexion showed real gains on benchmarks, but when we tried applying those ideas in production with real tool calls generating real data flooding the context window, prompting alone wasn’t enough. The problem is that prompt-based planning produces free-form text the system can’t inspect, persist, or enforce. 

How Claude Code and Cursor changed our thinking

When we started studying how Claude Code and Cursor were architecting their agents, we noticed both agents have a planning stage – where the agent pre-emptively builds itself a plan for larger requests. This allows the agent to continuously monitor its progress and hit intended goals. Claude Code, for example, uses over 40 prompts that are dynamically strung together, with dedicated tools for planning and the ability to spin up sub-agents. Cursor takes a similar approach with a lot of sophistication around when and how to plan.

That was the unlock for us. We realized planning could let Alyx handle the kind of complex, multi-step workflows that had been completely out of reach.

How to build it: structured planning tools

We expose three tools to the LLM for planning: todo_write, todo_update, and todo_read. Simple names, simple concepts. todo_write creates the plan. todo_update changes the status of individual tasks. todo_read lets the agent review where things stand.


Planning as a first-class tool

Each task carries one of four statuses: pending, in_progress, completed, or blocked. That might sound like an implementation detail, but the jump from two statuses to four was one of the most impactful changes we made. In our original version we only had pending and completed, which meant the agent had no way to say “I’m currently working on this.” Adding in_progress gave it a working pointer, a concrete anchor for “what am I doing right now?” And blocked came later when we needed to handle human-in-the-loop scenarios where the agent needs user input before it can move forward.

One important design decision: planning is a first-class tool call, not a prompt instruction. We’re not telling the model to “think step by step.” We’re giving it structured tools that produce structured objects the system can inspect, persist, and enforce. That distinction matters a lot.
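A minimal sketch of that data model, assuming plain dataclasses (the tool and status names mirror the post; everything else is an implementation guess):

```python
from dataclasses import dataclass, field
from enum import Enum
from uuid import uuid4


class Status(str, Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    BLOCKED = "blocked"


@dataclass
class Task:
    content: str
    status: Status = Status.PENDING
    id: str = field(default_factory=lambda: uuid4().hex)


def todo_write(items: list) -> list:
    """Create the plan as structured objects the system can inspect."""
    return [Task(content=i) for i in items]


def todo_update(plan: list, task_id: str, status: Status) -> None:
    """Change the status of a single task in place."""
    for task in plan:
        if task.id == task_id:
            task.status = status
```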

Keeping the plan visible on every iteration

Giving Alyx tools to write, read, and update plans was a big step forward. But early on, even with structured planning tools, Alyx would still drift off-plan after a handful of tool calls.

The issue was subtle. Because Alyx calls todo_write and we store all tool calls, the plan was already in the session history. As more tool results piled into the conversation, the plan got pushed further and further from the model’s attention. It was technically there, but buried under JSON and API responses.

 

The plan needs its own seat

We made three changes:

  1. We extracted the plan into its own dedicated message type, separate from the session history entirely.
  2. We wrapped it with instructions explaining what the plan is and that Alyx needs to follow it.
  3. And we pinned it to a fixed position right after the system prompt, instead of letting it float somewhere in the middle of the conversation.

After that, Alyx started following and updating its plan much more reliably.

The PlanMessage

On every single iteration of the agent loop, the plan is injected into the context window. We call this the PlanMessage, and it sits right after the system prompt, before any conversation history.

Make the agent follow the plan

The ordering looks like this: [System prompt] → [Plan] → [Session history] → [Current turn]. This gives Alyx a clear progression of context.

System prompt: here’s what you can do

Plan: here’s what you need to do to accomplish the user’s task

Session history: here’s how you’ve helped the user so far

Current Turn: here’s what you are doing for the user right now 

No matter how much tool output has accumulated, the plan is always right there, always current.
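Rebuilding the context this way is a few lines of code. A sketch using plain chat-message dicts (the wrapper text around the plan is invented for illustration):

```python
def build_context(system_prompt: str, plan_text: str,
                  history: list, current: dict) -> list:
    """Rebuild the message list on every iteration, with the plan
    pinned right after the system prompt and before any history."""
    messages = [{"role": "system", "content": system_prompt}]
    if plan_text:  # only inject once a plan exists
        messages.append({
            "role": "system",
            "content": "Your current plan. Follow it and keep it updated:\n"
                       + plan_text,
        })
    return messages + history + [current]
```

Because the list is rebuilt from in-memory state each time, the plan can never be buried by accumulated tool output.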

And the plan message does more than just display a checklist. It actively coaches the agent. It shows visual status markers, tells the agent exactly which API call to make next, and reminds it which task is in progress and what to do when it finishes. It looks something like this:


[x] Review the trace data and sort LLM spans by latency
[~] Identify bottlenecks and suggest changes  ← CURRENT
[ ] Generate a summary report for the user

 

Those little markers, the [x] and [~] and [ ], give the LLM an instant visual read on progress. The contextual reminders create natural forward momentum through the plan.
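Rendering the checklist is straightforward. A sketch, with one caveat: the post never shows a marker for blocked tasks, so the "[!]" below is an assumption.

```python
MARKERS = {"completed": "[x]", "in_progress": "[~]",
           "pending": "[ ]", "blocked": "[!]"}


def render_plan(tasks: list) -> str:
    """Render the plan as a marker-prefixed checklist, flagging the
    task the agent is currently working on."""
    lines = []
    for task in tasks:
        line = f"{MARKERS[task['status']]} {task['content']}"
        if task["status"] == "in_progress":
            line += "  <- CURRENT"
        lines.append(line)
    return "\n".join(lines)
```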

You can’t finish until you’re actually finished

The other critical piece is enforcement. We have a hard gate on the finish tool: if the agent tries to finish with pending or in-progress tasks, it gets bounced back with an error that lists exactly which tasks are incomplete.

This came from a very specific frustration. We kept finding that when Alyx had a three-step plan, it would execute all three steps but forget to mark off the last one before calling finish. The user still got what they asked for, but the experience felt incomplete because there was this hanging task that looked like it was never done.

We enforce this in code: the agent has to complete every task before it can end the loop. If Alyx tries to end the loop while tasks are still open, we inform it that it has incomplete tasks and should look back at its plan. The one exception is that Alyx can mark a task as “blocked” if it cannot continue, for example when the task requires user input or some missing information. This ensures Alyx reviews every task and explicitly decides whether to complete it or mark it as blocked.

Using tool-level validation to enforce behavioral contracts, rather than relying on prompts alone, is one of the most powerful patterns in our system. The LLM might want to finish early, but the architecture prevents it.
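The gate itself can be a small pure function that the finish tool calls before the loop is allowed to exit. A sketch, using dict-based tasks for brevity:

```python
def try_finish(plan: list) -> dict:
    """Hard gate on the finish tool: refuse to end the loop while any
    task is still pending or in progress, and list the offenders so
    the agent knows exactly what to revisit."""
    incomplete = [t["content"] for t in plan
                  if t["status"] not in ("completed", "blocked")]
    if incomplete:
        return {"ok": False,
                "error": "Cannot finish; incomplete tasks: "
                         + ", ".join(incomplete)}
    return {"ok": True}
```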

Knowing when to plan (and when not to)

One thing we learned is that you don’t want planning on every interaction. If a user just asks “find me these spans,” you don’t need a todo list for that. It feels overengineered. Answer the question, check off the one-item todo, announce that you checked it off. Nobody wants that.

So we use a heuristic: when the user’s prompt includes two or more action steps or verbs, that automatically triggers the planning step. Simple queries get handled directly. This keeps the experience feeling fast and natural for straightforward requests while still bringing the full planning machinery to bear on complex ones.
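One way to implement such a heuristic is a simple action-verb count. The verb list below is purely illustrative, not Alyx's actual trigger logic, and a production version would need to be considerably more robust.

```python
ACTION_VERBS = {"sort", "identify", "suggest", "build", "create", "run",
                "analyze", "generate", "fix", "compare", "summarize"}


def should_plan(prompt: str) -> bool:
    """Trigger the planning step when the request contains two or
    more recognized action verbs; otherwise answer directly."""
    words = {w.strip(".,!?").lower() for w in prompt.split()}
    return len(words & ACTION_VERBS) >= 2
```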

Human in the loop

Things get interesting when the agent needs to pause and ask for user input mid-plan. This happens a lot in our playground, where Alyx might need permission before changing a prompt or running a specific action.

We went back and forth on implementation here. We had the todos in the database, so it would have been easy to just hardcode a flag and restore from whatever task number the agent left off at. But we ultimately went with a prompt-based approach instead.

Why? Because rigid plan restoration breaks the natural flow of a conversation. We took a lot of inspiration from how Cursor handles this. When Cursor asks permission to run a bash command and you look at it and realize the whole approach is going in the wrong direction, you just type something else. You don’t have to formally reset the plan or explicitly cancel the previous todos. You just redirect.

We wanted that same flexibility for Alyx. If a user says “actually, do something else,” Alyx should be smart enough to realize those incomplete todos are no longer relevant. The user is taking a different path. That flexibility makes the interaction feel natural instead of robotic.

What planning unlocked

The results have been significant. Before planning, Alyx would complete only the first step of multi-step requests, get confused after five or more tool calls, and sometimes burn through 20+ iterations calling todo tools in circles without actually making progress.

After planning, those same complex requests complete successfully within our iteration budget.

But the most exciting part has been the emergent capabilities. We set out to build prompt optimization. We knew Alyx needed to interact with the playground, run experiments, and build datasets. What we didn’t expect was that Alyx could chain prompts together on its own. It could build a prompt from a dataset, run it, get the output, and feed that into a second prompt. This was a product feature we were planning to build manually. Instead, Alyx just… did it, because it could plan.

Planning is what turned Alyx from a tool executor into a workflow orchestrator. And that opened up an entire class of capabilities that simply weren’t possible before.

The pattern, summarized

If you’re building agents and struggling with multi-step task completion, here’s the playbook:

  1. Give the agent structured tools for planning: todo_write, todo_update, todo_read. Not prompt instructions. Tool calls that produce structured objects the system can enforce.
  2. Use four statuses: pending, in_progress, completed, blocked. The in_progress status is the one most people miss, and it’s the one that matters most.
  3. Inject the plan at a fixed position on every iteration: right after the system prompt, before session history. Rebuild it from in-memory state every time so it’s always current.
  4. Put a hard gate on finish: if there are incomplete tasks, bounce the agent back. Make completion a structural requirement, not a behavioral suggestion.

That’s what makes the difference between an agent that handles one-shot questions and one that can orchestrate genuinely complex workflows.
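The steps above can be sketched as one per-iteration message assembly. Function and field names here are ours, chosen for illustration:

```python
# Sketch of per-iteration message assembly: the plan is rebuilt from
# in-memory state every time and pinned right after the system prompt.
def build_messages(system_prompt, tasks, history):
    plan = "\n".join(f"- [{t['status']}] {t['title']}" for t in tasks) or "(no plan)"
    return [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": f"Current plan:\n{plan}"},
        *history,  # session history always comes after the pinned plan
    ]

msgs = build_messages(
    "You are an agent.",
    [{"title": "load dataset", "status": "completed"},
     {"title": "run eval", "status": "in_progress"}],
    [{"role": "user", "content": "evaluate my prompt"}],
)
print(msgs[1]["content"])
```

The fixed position is the point: the plan never drifts into the middle of the context where the model is most likely to lose it.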

 

This is part one of a deep dive series on how we built Alyx. We’re dropping three more deep dives over the coming weeks:

Context management

How we keep the context window useful as tool outputs pile up. Compression strategies, what to prune, what to keep, and how to avoid the “lost in the middle” problem that kills agent performance after 10+ iterations.

Testing and eval for agents

How do you write tests for a system that’s non-deterministic by design? Our framework for evaluating planning quality, task completion, and tool selection accuracy across hundreds of runs.

Using Alyx to debug Alyx

The meta one. How we use Arize AX to trace Alyx’s planning behavior, catch regressions, and close the loop between what the agent does and what it should have done.

If you want to follow along, subscribe to our blog in the sidebar to the right or come find us on our community Slack. If you’re building agents and running into the same problems we did, we’d love to hear what’s working and what isn’t.

 

The post How to Build Planning Into Your Agent (The Architecture That Actually Works) appeared first on Arize AI.

From UI to Terminal: Bringing Alyx’s Superpowers Into Your Coding Agent
https://arize.com/blog/ax-cli-dev-preview
Wed, 04 Mar 2026

The post From UI to Terminal: Bringing Alyx’s Superpowers Into Your Coding Agent appeared first on Arize AI.

Last week we launched Alyx 2.0, the in-app AI engineering agent for Arize AX.

Alyx replaced clicking through the UI with natural language intent. The AX CLI takes it a step further: making that same data machine readable so your coding agent can work with it directly.

Here’s a quick demo of what that looks like in practice.

In the Arize UI, Alyx can surface things like “what are the most common questions users are asking?” or “which tool calls are failing?” It’s powerful, but you have to be in the browser to use it.

We dogfood everything, so Alyx itself is traced to Arize. To pull recent spans into a local file, I asked my coding agent to drive the CLI:

“Hey Claude, can you use the ax cli to figure out what are the most common questions asked in the project [id]?”

Then I dropped that file into Cursor and asked: “Look at spans.csv and surface what the most common questions users are asking.”

Same analysis Alyx does in the UI, but now it’s running inside my editor against a local file using whatever coding agent I want: Cursor, Claude Code, Codex, whatever.
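That analysis step can be a few lines over the exported file. A stdlib-only sketch; the `input` column name is a hypothetical stand-in for whatever the actual export schema uses:

```python
import csv
import io
from collections import Counter

# Stand-in for a spans.csv exported by the CLI; the "input" column is hypothetical.
csv_text = """input
"How do I add an eval?"
"How do I add an eval?"
"Why is my trace missing spans?"
"""

rows = csv.DictReader(io.StringIO(csv_text))
counts = Counter(row["input"] for row in rows)
for question, n in counts.most_common(2):
    print(n, question)
```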

This is the idea behind the AX CLI. The data in Arize is valuable. The CLI makes it programmable. Your coding agent does the rest. 

What’s available


pip install arize-ax-cli
ax config init

The developer preview ships with spans, experiments, datasets, and projects. Every command supports structured output (JSON, CSV, Parquet) ready for agents and automation, because it’s time humans stopped writing query syntax by hand.

What’s next

We’re building toward full headless debugging: traces, experiments, prompts, and higher-level skills that bundle common workflows like “instrument my app,” “debug my traces,” “help me build an eval.” Think of today as the wiring. The workflows come next.

Try it. Pull spans for your own project. Run an analysis in your editor. Tell us what workflows you’d want as built-in skills.

Get started

How to Evaluate Tool-Calling Agents
https://arize.com/blog/how-to-evaluate-tool-calling-agents/
Mon, 02 Mar 2026

The post How to Evaluate Tool-Calling Agents appeared first on Arize AI.

When you give an LLM access to tools, you introduce a new surface area for failure — and it breaks in two distinct ways:

  1. The model selects the wrong tool (or calls a tool when it should have answered directly).
  2. The model selects the right tool, but calls it incorrectly — wrong arguments, missing parameters, or hallucinated values.

These are different problems with different fixes. Catching them requires measuring them separately.
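When you do have an expected call to compare against, the two failure modes can be separated mechanically. A sketch with an illustrative call shape (`name` plus `args`, with `None` meaning “no tool expected”):

```python
# Classify a tool-calling failure as "selection" vs "invocation".
# The call dicts are an illustrative shape, not a specific framework's schema.
def classify(expected, actual):
    if expected is None and actual is None:
        return "correct"
    if expected is None or actual is None:
        return "selection"   # called a tool when it shouldn't have, or vice versa
    if expected["name"] != actual["name"]:
        return "selection"   # wrong tool
    if expected["args"] != actual["args"]:
        return "invocation"  # right tool, wrong arguments
    return "correct"

print(classify({"name": "get_weather", "args": {"city": "Lisbon"}},
               {"name": "get_weather", "args": {"city": "Paris"}}))  # right tool, wrong city
```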

As tool use becomes central to production LLM systems, developers need a systematic way to measure tool calling behavior, understand where it fails, and iterate quickly.

Phoenix includes two prebuilt LLM-as-a-judge evaluators specifically for this — plus a full evaluation workflow in the UI that lets you write prompts, run experiments, add evaluators, and compare results without writing any code.

This tutorial walks through the full workflow using a travel assistant demo: what the evaluators measure, how to validate alignment, and how to use the results to improve both your assistant prompt and your evaluators. The exact dataset, prompts, and code to get started can be found in this notebook if you want to follow along.

The Demo: An AI Travel Assistant

To make this concrete, we’ll evaluate a travel planning assistant with access to six tools:

  • search_flights: Search available flights between two cities on a given date
  • get_weather: Get current weather or forecast for a location
  • search_hotels: Find hotels in a city for given dates and guest count
  • get_directions: Get travel directions and estimated time between two locations
  • convert_currency: Convert an amount from one currency to another
  • search_restaurants: Find restaurants in a location by cuisine or criteria

Our evaluation dataset has 30 queries covering three scenarios:

  • Single-tool (18 queries): One tool needed; tests parameter extraction, implicit dates, ambiguous phrasing
  • Parallel, 2 tools (10 queries): Two tools needed simultaneously; all 10 two-tool combinations represented
  • No tool needed (2 queries): General travel knowledge questions the assistant should answer directly

The dataset is labeled — each query has an expected tool call (or an empty list, for no-tool cases). This lets us evaluate both with and without ground truth.

Example queries:

  • “I need to get from Narita Airport to my hotel in Shinjuku — what’s the best transit route?” → get_directions
  • “I’m planning a trip to Lisbon in September — find me hotels from the 8th to the 12th and check the weather on the 8th.” → search_hotels + get_weather
  • “What’s the best way to handle tipping in Japan?” → no tool needed
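In dataset form, each of those examples reduces to a query plus its expected calls, with an empty list for the no-tool case. An illustrative shape (field names are ours, not Phoenix’s schema):

```python
# Illustrative labeled dataset rows; field names are hypothetical.
examples = [
    {"query": "Best transit route from Narita Airport to my hotel in Shinjuku?",
     "expected_tool_calls": [{"name": "get_directions",
                              "args": {"origin": "Narita Airport",
                                       "destination": "Shinjuku"}}]},
    {"query": "What's the best way to handle tipping in Japan?",
     "expected_tool_calls": []},  # should be answered directly, no tool
]
print(len(examples), "labeled examples")
```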

The Evaluation Workflow

The overall loop looks like this:

  1. Upload dataset and prompt to Phoenix (done via setup notebook)
  2. Run an experiment using the assistant prompt in the Phoenix playground
  3. Add evaluators from prebuilt templates or write your own
  4. Inspect failures at the example level with per-example explanations
  5. Iterate — update the assistant prompt or evaluator prompt, rerun, and compare

Steps 2–5 all happen directly in the Phoenix UI.

Two Prebuilt, Reference-Free Tool Evaluators

Phoenix ships with two LLM-as-a-judge evaluators designed specifically for tool calling. Both are reference-free — they reason from conversational context rather than comparing against labeled ground truth, which means you can run them on any tool-calling LLM without a labeled dataset.

1. Tool Selection Evaluator

Answers: Did the model choose the correct tool — or correctly choose no tool?

The evaluator looks at:

  • The user query
  • The list of available tools
  • The model’s output, including any tool calls

It handles single tool calls, parallel tool calls, and cases where no tool should be called. The template includes formatting logic that transforms the structured experiment output into a readable format for the judge — in most cases you don’t need to modify it.

What it returns: A correct / incorrect label and an explanation.

Example explanation for a failure:

“The model called search_flights with the correct origin and destination, but used 2023 dates rather than the current year, making the selection contextually incorrect.”

The only configuration step is mapping your dataset’s input column (e.g., input.query) so the template pulls the right field.

2. Tool Invocation Evaluator

Tool selection is only half the story. Even when the right tool is chosen, arguments can be wrong.

Answers: Was the tool called correctly?

This evaluator checks:

  • Are all required parameters present?
  • Were any arguments hallucinated or inconsistent with user intent?
  • Do argument values match what the user actually said?

Critically, it evaluates invocation independently of selection — a useful separation when you want to distinguish “wrong tool” failures from “right tool, bad arguments” failures.

You can preview the template with example data filled in to verify the mappings before running.

Adding A Custom Evaluator

Reference-free evaluators are broadly useful, but if you have labeled data, you can go further.

Our travel assistant dataset includes expected tool calls for every query. Rather than a simple string-match code evaluator, we use a custom LLM evaluator that:

  • Compares actual tool calls to expected tool calls
  • Ignores argument ordering differences
  • Reasons about semantic equivalence

An LLM judge is the right tool here because tool call arguments don’t lend themselves to exact matching — values like “CDG Airport” and “Charles De Gaulle Airport” are equivalent in ways that string comparison can’t capture.

The custom evaluator prompt asks the judge to compare the output tool calls to the reference calls and return correct or incorrect with an explanation.
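For contrast, the deterministic baseline the judge replaces might look like this sketch: it can ignore argument and call ordering, but it is blind to semantic equivalence:

```python
import json

# Deterministic comparison: order-insensitive on arguments and on call order,
# but blind to semantic equivalence ("CDG Airport" != "Charles De Gaulle Airport").
def calls_match(expected, actual):
    def norm(calls):
        return sorted(
            json.dumps({"name": c["name"], "args": c["args"]}, sort_keys=True)
            for c in calls
        )
    return norm(expected) == norm(actual)

a = [{"name": "search_hotels", "args": {"city": "CDG Airport", "guests": 2}}]
b = [{"name": "search_hotels", "args": {"guests": 2, "city": "CDG Airport"}}]
c = [{"name": "search_hotels", "args": {"city": "Charles De Gaulle Airport", "guests": 2}}]
print(calls_match(a, b))  # argument ordering is handled
print(calls_match(a, c))  # semantically the same call, but exact match fails
```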

First Results: Reading the Experiment

After the first experiment run, here’s what the scores looked like:

  • Tool Selection: 100% — the model consistently chose the right tool(s)
  • Tool Invocation: Lower — some arguments were wrong or hallucinated
  • Matches Expected: 0.36 — only 36% of calls matched the ground truth

A 36% ground truth match sounds alarming, but the per-example explanations tell a more nuanced story.

Iteration: What the Failures Revealed

The real value of evaluation is in the failure cases. Three distinct patterns emerged.

Failure 1: The year bug (prompt issue → fix the system prompt)

For a flight search query, the judge flagged a mismatch:

“The expected tool calls use 2025 dates, but the model called search_flights with dates in 2023.”

The model was defaulting to a year embedded in its training data rather than the current year. The fix: add one line to the system prompt:

“Assume the current year is 2025 for all searches that require dates.”

Since the base Tool Invocation evaluator didn’t catch that the year was wrong, we also updated its prompt template to explicitly check for this: adding a line specifying that date arguments should use 2025. This is an example of customizing a prebuilt template to enforce use-case-specific constraints.

Failure 2: CDG Airport vs. Paris (evaluator calibration issue → loosen the evaluator)

For the query “I’m staying near the Eiffel Tower and need to find my way to CDG Airport — also find hotels near the airport,” the ground truth expected city: "Paris" for the hotel search. The model used city: "CDG Airport".

The judge flagged this as a mismatch. But looking at it: the user explicitly asked for hotels near the airport. CDG Airport is arguably a more faithful interpretation of the request than just Paris.

This is a case where the evaluator was being too strict, not the model. The right fix isn’t to change the assistant — it’s to adjust the custom evaluator prompt to allow for reasonable semantic equivalence in location arguments.

This is a common pattern: early experiment results reveal where your evaluators need calibration, not just where your model needs improvement.

Failure 3: “This weekend” (capability gap → add a tool)

For queries like “I need a hotel in Kyoto for this weekend,” the Tool Invocation evaluator flagged the date arguments as incorrect:

“The model assumed specific dates for ‘this weekend’ without knowing the current date. This is an ungrounded assumption.”

The model isn’t wrong to try — it just doesn’t have the information it needs. The fix isn’t to change the prompt; it’s to add a get_current_date tool to the assistant so it can make accurate date calculations when users say things like “this weekend” or “next Friday.” This would also fix the first failure we observed, where the LLM didn’t know the current year.
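Such a tool is trivial to define. In OpenAI-style function-calling JSON it might look like the following sketch (not the demo’s actual schema):

```python
import datetime

# Hypothetical tool definition in OpenAI-style function-calling format.
get_current_date_tool = {
    "type": "function",
    "function": {
        "name": "get_current_date",
        "description": "Return today's date so relative phrases like "
                       "'this weekend' can be resolved to concrete dates.",
        "parameters": {"type": "object", "properties": {}},
    },
}

def get_current_date() -> str:
    # The handler the agent runtime would invoke for this tool.
    return datetime.date.today().isoformat()

print(get_current_date_tool["function"]["name"], get_current_date())
```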

After Iteration: Comparing Experiments

After updating the system prompt (year fix) and the tool invocation evaluator (year check), we reran the experiment and compared results side by side in Phoenix.

What changed:

  • Matches Expected improved — the year fix resolved the most common failure case
  • Tool Invocation scores decreased slightly — but this is expected and correct. By making the evaluator more specific about date validation, it now catches cases it previously missed. The decrease reflects better measurement, not worse performance.

Side-by-side experiment comparison makes these changes visible and interpretable — you can see exactly which examples improved, which regressed, and why.

When to Customize the Templates

The prebuilt evaluators are designed to be general-purpose. They’ll catch most tool calling issues without modification. But high-quality systems often need evaluators that understand your domain.

Common reasons to customize:

  • Use-case specific constraints: e.g., “all date searches should use the current year”
  • Looser semantic equivalence: allow “CDG Airport” and “Charles De Gaulle Airport” to match
  • Stricter parameter validation: require a specific field format or value range

Because the built-in templates are fully editable, you can adapt them while preserving the underlying structure. The workflow is: duplicate the template, add your constraints, and rerun.

Before (default Tool Invocation evaluator): Evaluates whether required parameters are present and values are grounded in user intent.

After (customized for this use case): Same evaluation, plus: “Assume the current year is 2025. Flag any date arguments that use a different year as incorrect.”

Wrapping Up

This tutorial covered the main building blocks for evaluating tool-calling behavior in Phoenix: the two prebuilt evaluators (tool selection and tool invocation), how to add a custom ground-truth evaluator when you have labeled data, and how to use the results to iterate on both your assistant prompt and your evaluators.

The prebuilt templates are designed to be general-purpose starting points — they’ll work out of the box for most tool-calling setups, but they’re straightforward to customize when your use case has specific requirements.

The dataset and prompts from this demo are available in this notebook if you want to run through it yourself or use them as a starting point for your own evaluation setup.

Best AI Observability Tools for Autonomous Agents in 2026
https://arize.com/blog/best-ai-observability-tools-for-autonomous-agents-in-2026/
Fri, 27 Feb 2026

The post Best AI Observability Tools for Autonomous Agents in 2026 appeared first on Arize AI.

The shift from simple chat interfaces to autonomous agents has broken the traditional monitoring stack. Agentic systems fail in ways that look like success: incorrect but well-formed outputs, unnecessary tool calls, or actions that are syntactically valid but semantically wrong.

In this blog, we are looking at some of the best AI observability tools available to secure these production reasoning loops. We cover a spectrum of needs, from rapid prototyping with proxies to deep orchestration tracing for multi-agent pipelines and scaled enterprise deployments.

Agent Observability and the Architecture of Trust

You cannot fix AI failures with standard logs because the error lives in the reasoning, not necessarily in the code execution. This mismatch creates massive operational risk for engineering teams used to systems built for repeatability, and it is why AI observability and tracing are needed: given the same input, agents introduce a layer of variability that traditional software practices cannot handle.

Identifying common AI agent failures requires moving beyond basic logging. Without a way to track these conversations, your agents are just API calls in the air. To extract fundamental business value, you must treat agent traces as durable business assets.

The missing link is rarely model quality or the orchestration framework – the missing link is visibility. We focus on a “glass box” approach using distinct, interoperable tools to trace and govern agents at scale.

Effective systems now rely on top AI prompt management tools to maintain version control. Establishing a rigorous agent evaluation framework is the only way to build the architecture of trust.

Understanding AI Observability in the Age of LLMs

To understand where observability fits, we must look at how the stack has evolved. In the DevOps era, we monitored server health. In the MLOps era, we monitored model drift and training loss. In the Agent Era, we monitor decisions. Understanding What is Observability? in this context means capturing the probabilistic “chain of thought” that drives an action.

For teams building on top of model APIs, which is most of us, the foundation of a reliable agent system is not the model itself. It is the agent harness: the orchestration logic, runtime, and telemetry that wraps around the model and governs how it operates. As Anthropic’s engineering team has documented, even a frontier model running in a bare loop will fall short of production quality without structure. The harness imposes that discipline.

This reframes how you should think about your production stack. Observability is not something you wire in after the system is built. It must instrument every component from day one:

  • Agent Telemetry: This is the observability layer, but calling it a layer undersells it. As Arize’s own research has documented, traces are the source of truth for what an agentic system actually does, as opposed to what the code says it should do. Every operation developers traditionally performed on code, including debugging, testing, optimizing, and monitoring, must now be performed on traces.
  • Agent Orchestration: This is where behavior is defined. AI agent frameworks like Google ADK or CrewAI handle the state machine, memory persistence, and routing logic. Telemetry embedded here captures why a decision was made before a single token is generated.
  • Agent Inference: This is the execution engine that powers the agent. Inference layers like vLLM or provider APIs handle throughput and caching. Without instrumentation at this layer, you cannot distinguish a fast agent from a correct one.

Without telemetry wired into the full harness, you are running a black box. You might know your agent costs $50 a day, but you and your stakeholders do not know if it is solving problems or apologizing in expensive loops. Using an AI agent evaluation framework is not an afterthought. It is the mechanism by which the system verifies its own behavior and improves over time.

What to Look For in an AI Agent Observability Tool

Selecting a tool for autonomous systems requires a shift from monitoring metrics to monitoring logic. If a platform treats an agent like a simple chatbot, it will fail to capture the branching paths of a multi-step reasoning loop. You need a system that prioritizes trace-level LLM evaluations to verify end-to-end reliability.

One often-overlooked capability is context graph management: how platforms treat and store your agent traces as long-term business assets rather than ephemeral debug logs.

A context graph is simply the persistent record of why your agent made each decision. It does that by recording the reasoning it considered, the context it retrieved, the tools it called, and the outcome.

Most platforms let this data disappear after debugging. The best platforms retain it, make it queryable, and enable feedback loops where past decisions inform future behavior. This distinction of whether traces become durable assets or vanish can directly impact your ability to improve agents over time.
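In data terms, one node of such a graph is just a structured record per decision. An illustrative shape (field names are ours, not Arize’s schema):

```python
# Illustrative context-graph record; every field name here is hypothetical.
decision_record = {
    "trace_id": "abc123",
    "step": 3,
    "reasoning": "User asked for failing tool calls; filter spans by error status.",
    "retrieved_context": ["project=demo", "window=last_24h"],
    "tool_call": {"name": "search_spans", "args": {"status": "error"}},
    "outcome": {"rows_returned": 42, "latency_ms": 180},
}
# Retained and queryable, records like this let past decisions inform future behavior.
print(decision_record["tool_call"]["name"])
```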

  • Agent Decision Graph: Visualizes the internal state machine showing how agents, tools, and components interact step-by-step. Makes debugging agent loops 10x faster than reading raw logs.
  • Context Graph Ownership: Determines whether your traces become durable business assets or ephemeral telemetry. Ask: will you own the data in your own warehouse, or does the platform own the queryable context graph? This choice affects long-term autonomy and feedback loop effectiveness.
  • Session-Level Evaluations: Measures coherence and goal achievement across full multi-turn conversations, not just isolated responses. Essential for understanding actual user experience.
  • Natural Language Search: Find specific failure patterns across millions of traces using queries like “agent hallucinated tool arguments” instead of writing complex SQL. Turns weeks of investigation into minutes.
  • Trajectory Mapping: Automatically detects inefficient patterns: recursive loops, repeated failures, wasted tokens. Directly impacts costs and user experience.
  • MCP Tracing: Debug Model Context Protocol tools directly in your IDE. Bridges the gap between development and production observability.
  • Regression Suite Builder: Promote production failures to versioned test datasets with one click. Ensures the same failure never happens twice.
  • OpenTelemetry Support: Standardizes telemetry collection and prevents vendor lock-in. Your data remains portable as the ecosystem evolves.

Establishing these criteria ensures your stack is built for the complexity of the “Agent Era” rather than the simpler “Chatbot Era.” By prioritizing these few principles, you move to a proactive improvement cycle that turns every production trace into a learning opportunity for your system.

The critical question when evaluating any platform: Does it own the context graph, or do you? Arize AX differentiates by retaining context graphs within ADB (Arize Database), an AI native datastore that unifies observability and evaluation data in open formats, enabling zero-copy access across your AI and data stack. This enables real-time cross-trace analysis without the cost and latency of re-warehousing.

Top AI Agent Observability Tools for Production

Every tool in this list is built on one of two architectural patterns, and that distinction should be the first filter you apply before evaluating anything else. Proxy-based tools sit between your application and your model provider: you redirect your API calls through their gateway, and observability happens automatically with zero instrumentation overhead.

SDK-based tools instrument your code directly. There is no middleman, which means deeper visibility into your agent’s reasoning and no single point of failure if the observability backend goes down. We have tagged every tool below with its architecture type so you can filter immediately.

  • OpenTelemetry (best for experiments): Provides a vendor-neutral foundation. Use this to ensure your data remains portable during the R&D phase.
  • Arize AX (best for enterprise growth): Teams who need deep control over their product. Ideal for large-scale production with massive trace volumes.
  • Braintrust (best for eval-first engineering): Engineers who prioritize testing. Best for teams with emphasis on prompt engineering as a software development lifecycle.
  • LangSmith (best for LangChain ecosystems): Teams already using LangChain for orchestration. Trace collection, session replay, and custom evaluators come pre-wired to LangChain components.
  • LangFuse (best for early-stage prototyping): Developer-friendly open-source platform for teams getting started. Best suited to prototyping and small-scale deployments.
  • Portkey (best for provider reliability): Systems that require high uptime through automatic fallbacks and load balancing across multiple LLM providers. Note: introduces a single point of failure and centralized key storage risk.
  • Galileo AI (best for logic and safety): Teams focused on hallucination detection and RAG optimization using specialized Luna-2 models.
  • AgentOps (best for agent orchestration): Developers using multi-agent frameworks like CrewAI. Best for tracking recursive loops and thought processes.

1. OpenTelemetry

SDK Instrumentation

OpenTelemetry is the CNCF standard for distributed tracing that provides vendor-neutral instrumentation. It standardizes how telemetry data gets collected and exported without locking you into any specific backend.

The key contribution is semantic conventions for LLM spans. These define exactly how to capture prompts, completions, token counts, and model parameters in a way every observability platform accepts. Without this, each vendor builds their own schema, and migration becomes a rewrite instead of a config change.
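As a rough illustration, the attributes on a single LLM-call span might look like the following. The `gen_ai.*` names follow the still-evolving OpenTelemetry GenAI semantic conventions, so verify against the current spec before relying on them:

```python
# Hypothetical attributes for one LLM-call span, keyed by the draft
# OpenTelemetry GenAI semantic conventions (names may change as the spec evolves).
llm_span_attributes = {
    "gen_ai.system": "openai",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 412,
    "gen_ai.usage.output_tokens": 96,
}
# Any OTel-compatible backend can index these fields the same way,
# which is what makes migration a config change instead of a rewrite.
print(sorted(llm_span_attributes))
```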

OTel serves as the foundation for more specialized telemetry libraries. Projects like OpenInference, built by Arize AI, extend the base specification with AI-specific semantics, but the core value remains the same: portable instrumentation that works across backends.

OTel itself only handles collection; it does not analyze, evaluate, or visualize anything. You instrument your code, point traces at a collector, and send data to any compatible backend like Datadog, Grafana, Arize, or your own database.

2. Arize AX

Arize AX is an enterprise observability platform built on the Arize Phoenix open-source foundation. Standardizing on OpenInference, it removes the burden of proprietary instrumentation, allowing engineers to focus on improving agent reasoning rather than debugging data pipelines.

The core value is making agent behavior visible at the decision level. When an agent fails, the Agent Graph visualization shows exactly which step broke and why. This graph is an execution tree rather than a linear trace, showing how agents delegated to sub-agents, which tools fired, and where the state changed. This works automatically across a wide range of AI agent frameworks.

Because it instruments your code directly via SDK, your agents continue functioning even during an observability backend outage. You get decision-level visibility with no middleman introducing latency or failure risk.

What sets Arize apart is Data Fabric, powered by ADB. While most observability platforms export traces to external warehouses for long-term storage, ADB keeps context graphs within the platform itself. Every trace, evaluation, and annotation syncs to your cloud warehouse in Iceberg format every 60 minutes, but the backend OLAP database runs on ADB, built specifically for billions of agent traces.

This architecture eliminates the re-warehousing tax: instead of periodically re-indexing historical traces to answer new questions, teams query context graphs in real time. Over months of operation, those savings compound significantly in both cost and analysis latency. The trade-off is vendor coupling: you gain query speed but sacrifice data portability. For mission-critical agent systems, this is often the right choice.

Evaluation happens at both the trace and session levels. Session-level evals measure coherence across multi-turn conversations, answering questions like “did the agent maintain context?” and “did it complete the user’s goal?”, while trace-level evals pinpoint individual reasoning failures.

Alyx is the AI assistant built directly into the platform. It handles trace analysis, eval building, and debugging through natural language. Alyx connects via Model Context Protocol, so you can instrument and debug agents from Cursor, Claude Code, or other IDEs without context switching. The MCP Tracing Assistant unifies client-server traces in the same hierarchy, which matters when agents call external MCP tools.

3. Braintrust

Braintrust is an evaluation-first observability platform that merges testing directly with production monitoring. It treats prompts as versioned objects to eliminate the typical trial-and-error cycle of engineering. The system relies on Brainstore, an OLAP database purpose-built for AI interactions.

The platform includes Loop, an AI assistant that analyzes production data to automate the hardest parts of observability. Loop generates custom scorers from natural language descriptions and applies them to live traffic to catch hallucinations.

SDK-based instrumentation means traces are captured within your own infrastructure. No proxy means no credential exposure and no single point of failure between your application and your model provider.

Beyond technical metrics like latency, teams use Loop to query logs for product roadmap decisions by identifying common failure patterns in user requests. This shifts the focus from simple uptime to the actual quality of the reasoning paths.

Developer experience is central to the Braintrust workflow through native SDK support for Java, Go, Ruby, and C#. This expansion allows enterprise teams to instrument their existing production stacks using OpenTelemetry standards. It handles automatic caching and trace logging at the gateway level, which reduces both costs and implementation complexity.

4. LangSmith

LangSmith is LangChain’s native observability platform. It provides trace collection, session replay, and evaluation capabilities optimized for teams already using LangChain for orchestration.

Instrumentation is SDK-based, but the architectural advantage is limited here. The observability is so deeply coupled to the LangChain stack that you are effectively locked in: migrating to a different orchestration framework means re-instrumenting your entire observability layer from scratch. That is a similar kind of lock-in to what proxy-based tools create, just at the framework level rather than the network level.

The platform excels at rapid debugging through its execution timeline view, which shows the exact sequence of LLM calls, tool invocations, and state changes. Custom evaluators allow you to score agent outputs against your specific criteria using either deterministic logic or LLM-based grading.

LangSmith’s strength lies in frictionless integration; teams using LangChain agents get observability with minimal additional setup. The platform supports automatic tracing for LangChain components, reducing instrumentation overhead.

For teams prioritizing ecosystem fit over platform independence, LangSmith removes context switching between orchestration and monitoring. The primary trade-off is vendor coupling. Deep integration with LangChain means migrating to a different orchestration framework would require re-instrumenting your observability layer.

Teams report this coupling as a key concern during adoption, since LangChain is a vast library whose core value comes from offering something for everything, rather than the more focused "everything for something" approach of dedicated observability platforms.

5. Langfuse

Langfuse is an open-source observability platform emphasizing developer experience and debugging UX. It provides trace collection, session management, and prompt versioning in an accessible interface.

The platform’s strength is its approachability for early-stage teams. Features like prompt management, user session tracking, and cost attribution work out-of-the-box with minimal configuration. Native SDKs for JavaScript, Python, and other languages make integration straightforward.

The integration is primarily SDK-based. However, Langfuse’s recent acquisition by ClickHouse (January 2026) introduces uncertainty into the platform’s future trajectory: the acquisition is shifting its architecture toward a hybrid model, and teams with strict data residency requirements should clarify exactly where traces are routed before committing.

Teams considering Langfuse for production deployments should carefully review current feature roadmaps and clarify support commitments before committing to long-term usage. The integration with Clickhouse’s OLAP infrastructure may eventually provide advantages for large-scale analytics, but this remains unproven.

For prototyping and small-scale deployments, Langfuse remains a solid choice. For mission-critical production systems, the acquisition introduces enough uncertainty that evaluating alternatives is prudent.

6. Portkey

 

Portkey is an AI Gateway and observability suite built specifically for production reliability. While it offers a proxy-based integration similar to Helicone, it functions more as an intelligent routing layer than a passive logger. By changing your API base URL to Portkey, you gain access to an AI Gateway that manages interactions with various LLM providers.

The technical mechanism involves a “Control Panel” approach where Portkey acts as a programmable middleman. To integrate, you change your baseURL to the Portkey endpoint and pass your Portkey API key in the request headers. This simple switch provides immediate, token-level observability across cost, performance, and accuracy metrics without needing to refactor your core logic for different model schemas.
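To make the proxy pattern concrete, here is a minimal sketch of a client-side integration. The endpoint and header names below are illustrative placeholders, not Portkey's actual values; consult the vendor's documentation for the real ones. The point is that only the base URL and headers change — the request body stays in the familiar chat-completions shape:

```python
import json
import urllib.request

# Hypothetical gateway endpoint -- a stand-in, not Portkey's real URL.
GATEWAY_BASE_URL = "https://gateway.example.com/v1"

def build_gateway_request(prompt: str, gateway_key: str, provider: str):
    """Build (but do not send) a chat request routed through the gateway.

    The gateway key and provider header names are illustrative; the body
    is an ordinary chat-completions payload, untouched by the switch.
    """
    body = json.dumps({
        "model": "gpt-4o",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        url=f"{GATEWAY_BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "x-gateway-api-key": gateway_key,  # hypothetical header name
            "x-gateway-provider": provider,    # selects the upstream LLM provider
        },
        method="POST",
    )

req = build_gateway_request("Hello", gateway_key="pk-...", provider="openai")
print(req.full_url)
```

Swapping `provider` (or letting the gateway's routing rules choose) is what enables fallbacks and load balancing without touching application code.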

The proxy model creates a critical single point of failure and a concentrated security risk. If the Portkey gateway goes down, your entire agent fleet loses connectivity to every model provider simultaneously. Storing sensitive API keys in a Virtual Vault enables features like automatic fallbacks, but it also creates a massive target for attackers.

This risk is concrete. In version 1.14.0 Portkey had to patch CVE-2025-66405, a Server-Side Request Forgery vulnerability where attackers used custom host headers to trick the gateway into hitting internal network resources. Teams must use strict egress filtering to prevent a compromised gateway from scanning their private infrastructure.

Beyond security, Portkey provides a Prompt Library to decouple prompt engineering from application code. When combined with their reliability suite, the platform turns raw observability into an active governance system for production AI.

7. Galileo AI

Galileo AI has shifted from simple hallucination detection to an evaluation intelligence platform. The system is built on the Luna-2 foundation models released in early 2026.

Instrumentation is SDK-based, with evaluation logic running inside the platform rather than between your application and your model provider, meaning no added latency on inference calls.

The flagship release from Galileo is Galileo Signals, an engine that automates failure mode analysis by scanning millions of production traces. It identifies why agents drift and prescribes specific fixes for prompt engineering or retrieval strategies. The system works with an updated Agent Graph that includes traffic analytics, visualizing the most frequently used paths in a multi-agent reasoning loop.

Developer experience now centers on the Agent Evals MCP. This protocol allows engineers to run production-grade evaluations inside Cursor or Claude Code. For enterprise governance, the platform supports “Composite Metrics.” These metrics combine multiple scores into a single threshold for automated gatekeeping. If an agent’s score drops, the system kills the session or flags a human before the LLM generates a response.
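The idea behind a composite metric can be sketched in a few lines: several eval scores are folded into one weighted value and compared against a threshold before a response is released. The metric names, weights, and threshold below are our own illustrative choices, not Galileo's API:

```python
# Illustrative composite-metric gate. Metric names and weights are examples,
# not Galileo's actual schema.
WEIGHTS = {"groundedness": 0.5, "instruction_adherence": 0.3, "safety": 0.2}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted combination of individual eval scores into a single number."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

def gate(scores: dict[str, float], threshold: float = 0.8) -> str:
    """Automated gatekeeping: release the response or escalate to a human."""
    return "pass" if composite_score(scores) >= threshold else "escalate"

print(gate({"groundedness": 0.9, "instruction_adherence": 0.9, "safety": 1.0}))
```

A single threshold on the combined score is what makes the gate automatable: one number decides whether the session proceeds or gets flagged.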

8. AgentOps

AgentOps is a governance and observability platform built for autonomous agents and multi-step reasoning chains. It tracks the entire lifecycle of an agent from initialization to task completion rather than just logging individual requests. It provides dedicated tracking for tool usage, self-correction loops, and planning stages. You integrate it using a single decorator that wraps your existing agents.

AgentOps uses an SDK-based observability architecture with a single-decorator integration that runs entirely within your own infrastructure, keeping credential exposure inside your own security perimeter. It provides PII redaction and audit trails for compliance.

The session replay dashboard provides “time-travel” capabilities to rewind an agent’s execution. It pinpoints exactly where a reasoning path diverged from the goal. The platform identifies recursive thought patterns to prevent agents from burning tokens in infinite loops. It triggers alerts or pauses when a cycle is detected.

The Human-in-the-Loop (HITL) module allows an agent to pause execution. It requests human approval for high-stakes tool calls like processing payments or deleting files.

Choosing the Right AI Observability Tool

Selecting a stack depends on your tolerance for architectural risk and your engineering capacity. There is no “best” tool, only the right trade-offs for your specific deployment stage. The primary architectural decision is whether you want a “Man-in-the-Middle” proxy or native SDK instrumentation.

The proxy model creates a single point of failure and a concentrated security risk: if the gateway goes down, your entire agent fleet loses connectivity to every model provider simultaneously, and a vault of stored API keys is a high-value target. A breach of the vault or gateway exposes your spending power and data across all configured AI services. The Portkey SSRF vulnerability discussed above (CVE-2025-66405) shows how concrete this risk is; teams must use strict egress filtering to prevent a compromised gateway from scanning their private infrastructure.

Choose an SDK for mission-critical agents where security and deep reasoning visibility are non-negotiable. SDKs like Arize AX or Braintrust give you “decision-level” visibility. They show exactly how an agent’s internal state changed between tool calls. Because there is no middleman, your application remains resilient. Even if the observability backend has an outage, your agent continues to function.

Before committing to a vendor, I leave you with four questions:

Does our security policy allow third-party proxies to handle raw PII and API keys?

Do we need to trace complex, multi-step agent reasoning or just log simple prompt-response pairs?

What happens to our user experience if the observability layer adds 100ms of latency or suffers a 10-minute outage?

Can the platform promote production failures to our evaluation suite with a single click?

FAQs

What is AI agent observability?

It is the practice of capturing not just what your agent did but why it did it. Unlike traditional monitoring which tracks metrics like uptime and latency, agent observability traces the full reasoning path behind every decision, tool call, and output.

How is agent observability different from LLM monitoring?

LLM monitoring tracks inputs and outputs. Agent observability tracks the decision chain between them. When an agent fails, monitoring tells you it failed. Observability tells you which step in the reasoning loop caused it and why.

Do I need a separate observability tool if I am already using Datadog?

Yes. Datadog is built for infrastructure metrics and application performance. It has no native understanding of prompt chains, tool call fidelity, or multi-step reasoning loops. You need a tool that treats traces as first-class AI artifacts, not generic log entries.

What is a context graph in AI observability?

It is the persistent record of why your agent made each decision across a session, including the reasoning it considered, the context it retrieved, the tools it called, and the outcome. The best platforms retain this as a queryable business asset rather than discarding it after debugging.

How do I debug an agent that is stuck in a loop?

Trajectory mapping automatically detects recursive patterns in your agent’s execution path. It identifies which tool call or reasoning step the agent keeps returning to and why, so you can fix the prompt or routing logic causing the cycle.

What is the difference between trace-level and session-level evaluation?

Trace-level evaluation scores individual reasoning steps and tool calls. Session-level evaluation measures whether the agent achieved the user’s actual goal across the full conversation. You need both because an agent can pass every individual trace check and still fail the user completely.

The post Best AI Observability Tools for Autonomous Agents in 2026 appeared first on Arize AI.

Add Observability to Your Open Agent Spec Agents with Arize Phoenix https://arize.com/blog/add-observability-to-your-open-agent-spec-agents-with-arize-phoenix/ Fri, 27 Feb 2026 17:00:55 +0000 https://arize.com/?p=27531 Open Agent Specification lets you define an agent once and run it on any compatible runtime: LangGraph, WayFlow, CrewAI, and others. That portability solves a real problem in production AI...

The post Add Observability to Your Open Agent Spec Agents with Arize Phoenix appeared first on Arize AI.

Open Agent Specification lets you define an agent once and run it on any compatible runtime: LangGraph, WayFlow, CrewAI, and others. That portability solves a real problem in production AI systems. But it raises a follow-up question: once your agent is running, how do you know what it’s actually doing?

Observability gives you the answer. Rather than relying on print statements or log files, observability captures structured traces of every step your agent takes: each LLM call, each tool invocation, each decision point, with full inputs, outputs, and timing. When something goes wrong, or when you need to understand why an agent chose one path over another, traces give you a complete, inspectable record.

Arize Phoenix is an open-source observability and evaluation platform built on OpenTelemetry. It provides tracing, evaluation, and debugging capabilities for LLM applications. Connecting Phoenix to an Agent Spec agent takes a single line of code, and because both Agent Spec and Phoenix are built on open standards, the instrumentation works identically regardless of which runtime executes your agent.

In this post, we take the Operations Assistant agent from the Agent Spec tutorial, instrument it with Phoenix, run it on two different runtimes (LangGraph and WayFlow), and then run programmatic evaluations against the captured traces. The companion repository contains all the code shown here.

One line of code connects Agent Spec to Phoenix

With the agent defined and the openinference-instrumentation-agentspec package installed, adding observability requires one line of setup code:


from phoenix.otel import register
 
tracer_provider = register(
    project_name="ops-agent",
    auto_instrument=True,
)

The register() function creates an OpenTelemetry tracer provider pointed at Phoenix Cloud. The auto_instrument=True flag tells Phoenix to scan for any installed OpenInference instrumentors (in this case, the AgentSpecInstrumentor) and activate them automatically. From this point on, every agent execution emits structured traces to Phoenix.

The key property of this approach is that the instrumentation is runtime-agnostic. The same setup code works whether you load your agent with LangGraph or WayFlow. Below, we load the Operations Assistant from its exported agent.json file. This is the portable Agent Spec configuration that describes the agent’s tools, system prompt, and LLM settings without binding it to any particular runtime:


from pyagentspec.serialization import AgentSpecDeserializer
 
with open("agent.json", "r") as f:
    agent_config = AgentSpecDeserializer().from_json(f.read())

With the agent configuration loaded, we can pass it to any compatible runtime. The runtime handles execution; Phoenix handles observability. Neither needs to know about the other.

Running with LangGraph


from pyagentspec.adapters.langgraph import AgentSpecLoader
 
langgraph_agent = AgentSpecLoader(
    tool_registry=tool_registry
).load_component(agent_config)
response = langgraph_agent.invoke(
    input={"messages": [{"role": "user", "content": user_input}]},
    config={"configurable": {"thread_id": "1"},
            "recursion_limit": 50},
)

Running with WayFlow


from wayflowcore.agentspec import AgentSpecLoader
 
wayflow_agent = AgentSpecLoader(
    tool_registry=tool_registry
).load_component(agent_config)
conversation = wayflow_agent.start_conversation()
conversation.append_user_message(user_input)
# ... execute conversation loop

In both cases, the instrumentation code at the top of the file is identical. The traces that flow into Phoenix share the same structure regardless of which runtime produced them. This means you set up observability once and it follows your agent wherever it runs.

Every LLM call, tool invocation, and decision is visible in Phoenix

After running the agent on a few test inputs, open Phoenix Cloud. The project view shows all captured traces with summary statistics: total trace count, latency percentiles, and cost.

Phoenix trace list view showing the ops-agent-langgraph project with agent, LLM, and tool spans.

Each row in the trace list represents a span, a single unit of work within a trace. The “kind” column distinguishes between agent spans, LLM generation spans, and tool execution spans. You can see the full execution pattern of the Operations Assistant: an initial LLM call decides which tool to invoke, the tool executes, another LLM call processes the result and decides the next step, and so on.

Click into any trace to see its full execution tree.

Trace detail view showing the full execution tree with system prompt, tool calls, and LLM responses.

The trace detail view shows the parent-child relationship between spans. At the top is the AgentExecution[Operation_Assistant_Agent] span encompassing the entire run. Beneath it, each LLM generation and tool execution appears as a child span. You can inspect the full input messages (including the system prompt), tool call arguments, tool outputs, and the final agent response.

This level of visibility is particularly useful for debugging. In the trace above, you can see the agent calling read_logs multiple times with different parameters. This is the retry behavior specified in the system prompt, made visible through tracing.

Traces enable programmatic evaluation and runtime comparison

Traces are the foundation for evaluation. Once execution data is in Phoenix, you can run programmatic evaluations against it using Phoenix’s eval framework.

We built an evaluation harness that runs the Operations Assistant on 10 test inputs across both LangGraph and WayFlow, then evaluates the results using two categories of evaluators:

  • Code-based evaluators (deterministic, no API key required): whether the agent produced output, whether the output contains a structured incident report, whether it references data gathered from tools, whether it includes actionable recommendations, and output length.
  • LLM-as-judge evaluators (using Claude as the evaluator): helpfulness, completeness of the investigation workflow, and factual consistency.
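The code-based checks above are just deterministic functions over the agent's final output. A sketch of three of them (the keywords and section headers matched are illustrative, not the exact ones in the harness):

```python
# Deterministic, code-based evaluators in the spirit of the list above.
# Each takes the agent's final output text and returns a boolean verdict.
# The section headers and keywords checked for are illustrative.
def has_output(output: str) -> bool:
    """Did the agent produce any non-whitespace output at all?"""
    return bool(output and output.strip())

def has_structured_report(output: str) -> bool:
    """Treat 'structured incident report' as at least two recognizable sections."""
    headers = ("summary", "findings", "recommendations")
    lowered = output.lower()
    return sum(h in lowered for h in headers) >= 2

def has_actionable_recommendation(output: str) -> bool:
    """Look for recommendation-style language in the output."""
    keywords = ("recommend", "should", "next step")
    return any(k in output.lower() for k in keywords)

report = "Summary: disk full.\nFindings: ...\nRecommendations: we should rotate logs."
print(all([has_output(report), has_structured_report(report),
           has_actionable_recommendation(report)]))
```

Because they need no API key, checks like these can run on every trace at zero marginal cost, leaving the LLM-as-judge evaluators for the subjective dimensions.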

The full evaluation harness is available in the companion repository. Here are the results across 15 evaluated traces per runtime:

| Metric | LangGraph | WayFlow | Delta |
|---|---|---|---|
| Traces | 15 | 15 | |
| Latency (mean) | 35,395 ms | 34,700 ms | −2.0% |
| Latency (P50) | 38,040 ms | 35,879 ms | −5.7% |
| has_output | 100% | 100% | |
| has_structured_report | 93.3% | 100% | +6.7 pp |
| mentions_tools_used | 100% | 100% | |
| has_actionable_recommendation | 93.3% | 100% | +6.7 pp |
| helpfulness | 100% | 100% | |
| completeness | 86.7% | 80.0% | −6.7 pp |
| factual_consistency | 100% | 100% | |

Both runtimes produce high-quality output: helpfulness and factual consistency are perfect across the board, and the remaining metrics are consistently above 80%. This validates Agent Spec’s portability promise. The same agent definition produces comparable results regardless of the underlying runtime.

Because Phoenix captures the same trace format regardless of runtime, this pattern extends to any change in your agent system. Swap a runtime, change an LLM provider, revise a prompt, restructure your tools. Run the same evaluation harness and compare. If you can trace it, you can evaluate it.

Get started today

Agent Spec provides portable agent definitions. Phoenix provides portable observability. Together, a single line of instrumentation code gives you full tracing and programmatic evaluation across any supported runtime.

To get started:

AI Agent Debugging: Four Lessons from Shipping Alyx to Production https://arize.com/blog/ai-agent-debugging-four-lessons-from-shipping-alyx-to-production/ Wed, 25 Feb 2026 22:43:09 +0000 https://arize.com/?p=27525 Building AI systems that actually work in production is harder than it sounds. Not demo-ware, not “it worked once in a notebook.” Real systems that keep working after week two....

The post AI Agent Debugging: Four Lessons from Shipping Alyx to Production appeared first on Arize AI.

Building AI systems that actually work in production is harder than it sounds. Not demo-ware, not “it worked once in a notebook.” Real systems that keep working after week two.

We built Alyx, Arize’s agent for AX, and it broke in ways we didn’t expect. This post is about what broke, what surprised us, and the patterns that actually worked — with enough implementation detail that you can reuse them.

Alyx is an AI assistant that helps people use Arize AX, a platform for observing and evaluating AI systems. Users ask questions in natural language:

  • “What’s the bottleneck in this trace?”
  • “Which experiment had better accuracy?”
  • “Why did this eval score drop?”

Alyx is an LLM-powered agent with tools that map to product features — trace analysis, experiment comparison, dataset operations, eval scoring. Under the hood, it’s an orchestrator loop: the LLM receives a user question, decides which tools to call, processes the results, and either calls more tools or responds.

To try Alyx, check out our docs or book a meeting for a custom demo.

Here are four lessons we learned, mostly the hard way.

Lesson 1: If you want agents to follow rules, put the rules in code

Early on, Alyx had a wandering mind. You’d ask it to do three things. It would do the first one, then spiral deep into that task and forget the other two.

We have a finish tool that it is supposed to call when done. It called it early, constantly. In one memorable session, we asked it to summarize multiple traces. It made 27 LLM calls — almost all of them reorganizing its own to-do list, going in circles, never actually getting anywhere.

This wasn’t a hallucination problem. It was an attention problem. Tool outputs, JSON blobs, and intermediate results flood the context window. The original user request gets buried. By the time the agent decides what to do next, it has forgotten what “next” was.

The fix: make planning a first-class tool

The fix wasn’t a “think step by step” instruction in the prompt — it was a structured tool the system can inspect and enforce. Before Alyx touches any data, it writes a plan using three tools:

| Tool | Purpose |
|---|---|
| todo_write | Create a new plan or restore a previous one |
| todo_update | Change a task’s status |
| todo_read | Fetch the current plan state |

Each task is a simple structure:


class Todo(BaseModel):
    id: int
    description: str
    status: Literal["pending", "in_progress", "completed", "blocked"]

The four statuses matter more than you’d expect. Our first version only had pending and completed, and it didn’t work well: the agent had no “working pointer” — it knew what it hadn’t started and what it had finished, but not what it was currently doing. Adding in_progress gave it a concrete anchor: “this is my current task, everything else can wait.” Task completion rates improved immediately.

The blocked status handles human-in-the-loop scenarios. Sometimes the agent genuinely can’t proceed without user input — “I found three possible bottlenecks, which one should I focus on?” When a task is marked blocked, the agent is allowed to finish, because the ball is in the human’s court.

The plan lives outside conversation history

This was our critical architectural decision: the plan is not stored in the conversation history where it can be buried or truncated. It’s stored on disk, and on every single LLM call we dynamically regenerate a PlanMessage from the current state and inject it right after the system prompt, before the noisy tool call history.

Here’s what the agent actually sees:


# Current Plan

0. [x] Review the trace data and sort LLM spans by latency
1. [~] Identify bottlenecks and suggest changes  ← CURRENT
2. [ ] Generate a summary report for the user

---

## Plan Management Instructions

**Current focus:** Identify bottlenecks and suggest changes

**REMINDER:** Task 1 is marked `in_progress`. When you finish it,
call `todo_update(id=1, status="completed")`. Then mark task 2
as `in_progress` when you start it.

The plan isn’t just displayed — it coaches the agent through execution. It tells the agent exactly which API call to make when it finishes the current task. Visual status markers ([x], [~], [ ], [!]) give it an instant read on progress. The ← CURRENT marker draws attention to the active task. No matter how deep the agent gets into tool calls, the plan is always visible, always current.
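Regenerating that PlanMessage from current state is a small, pure rendering step. A sketch, using a dataclass as a stand-in for the Pydantic Todo model shown earlier (the ASCII `<- CURRENT` arrow stands in for the `←` marker):

```python
from dataclasses import dataclass

# Sketch of rendering the plan text injected after the system prompt on
# every LLM call. Todo is a dataclass stand-in for the Pydantic model.
@dataclass
class Todo:
    id: int
    description: str
    status: str  # "pending" | "in_progress" | "completed" | "blocked"

MARKERS = {"completed": "[x]", "in_progress": "[~]",
           "pending": "[ ]", "blocked": "[!]"}

def render_plan(todos: list[Todo]) -> str:
    """Render visual status markers plus an arrow pointing at the active task."""
    lines = ["# Current Plan", ""]
    for t in todos:
        arrow = "  <- CURRENT" if t.status == "in_progress" else ""
        lines.append(f"{t.id}. {MARKERS[t.status]} {t.description}{arrow}")
    return "\n".join(lines)

print(render_plan([
    Todo(0, "Review the trace data", "completed"),
    Todo(1, "Identify bottlenecks", "in_progress"),
    Todo(2, "Generate a summary report", "pending"),
]))
```

Because the renderer reads from the stored plan rather than conversation history, the output is always current no matter how much tool noise has accumulated.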

The finish gate

If the agent tries to call finish with incomplete tasks, it gets an error:


def finish(llm_args, todos, messages):
    incomplete_todos = [
        todo for todo in todos.todo_list
        if todo.status not in {"completed", "blocked"}
    ]
    if incomplete_todos:
        raise RecoverableException(
            message_to_llm=f"You must complete or mark as blocked all todos "
                           f"before finishing. Incomplete: {incomplete_todos}"
        )

Not a suggestion, not a reminder: a RecoverableException — a structured error that lists which tasks are unfinished and bounces the agent back into the work loop. The agent can try to finish early. The system won’t let it.

This is the big pattern from lesson one: prompts are suggestions; code is constraints. If you want the agent to follow a rule, build the rule into your tools. Tool validation is your friend.

What we learned getting here

This planning system went through four iterations before it worked well. A few specific things we learned:

  • Few-shot examples beat abstract instructions. “Always use todo_write to plan your tasks” was in the prompt for a while and basically nothing changed. What worked was showing the agent a concrete example: here’s a real user request, here’s how to decompose it into todos, here’s what to do when each one is done. Show, don’t tell.
  • Status granularity matters more than you’d think. Going from 2 statuses to 4 had an outsized impact. in_progress prevents the agent from getting distracted mid-task. blocked gives it an escape hatch when it genuinely needs human input.
  • Return the full plan on every mutation. Both todo_write and todo_update return the complete TodoList, not just the modified item. This ensures the model always sees the full state after any change.
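That last point is worth a sketch: the mutation tool changes one task but returns the whole list, so the model never reasons from a stale view. Dataclasses again stand in for the Pydantic models:

```python
from dataclasses import dataclass, field

# Sketch of "return the full plan on every mutation": todo_update changes
# one task's status but hands the complete TodoList back to the model.
@dataclass
class Todo:
    id: int
    description: str
    status: str = "pending"

@dataclass
class TodoList:
    todo_list: list[Todo] = field(default_factory=list)

def todo_update(todos: TodoList, todo_id: int, status: str) -> TodoList:
    for todo in todos.todo_list:
        if todo.id == todo_id:
            todo.status = status
    return todos  # the full list, not just the modified item

todos = TodoList([Todo(0, "plan the work"), Todo(1, "execute it")])
updated = todo_update(todos, 0, "completed")
print([t.status for t in updated.todo_list])
```

Returning the full state after every write is the same trick databases use with `RETURNING` clauses: the caller never has to guess what the write did.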

Lesson 2: Context engineering is the real work

AX experiments can be enormous: hundreds of rows, each with LLM outputs, eval scores, latencies, and metadata. A single experiment can easily hit 100,000 tokens. Two experiments could overflow the context window entirely.

Our first “solution” was a line in the system prompt: “DO NOT TRY TO COMPARE MORE THAN 2 EXPERIMENTS AT A TIME.” That’s not engineering. That’s giving up.

The file system insight

Think about how Cursor or Claude Code navigates a large codebase. They don’t dump entire files into context — they read a preview, use grep to find what they need, and read specific line ranges. The file lives on disk; the agent holds a reference to it and queries incrementally.

We needed the same pattern for structured data.

LargeJson: store data out-of-band, give the agent a handle

When a tool returns a huge dataset, Alyx doesn’t put it all in context. It stores the full data in server-side memory and gives the LLM a preview plus a stable handle:


LargeJson
├── json_id: "experiment_a1b2c3d4"    ← handle for the agent to reference
├── json_data: { ... full data ... }   ← stored in server memory, never sent to LLM
├── partial_data: { ... preview ... }  ← structure-preserving preview shown to LLM
├── is_truncated: true
├── total_tokens: 85000
├── partial_tokens: 980
└── note_to_llm: "Use jq() tool to query this data."

The agent can reference that handle for the entire session, making targeted queries as it needs specific slices.
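The storage side of this pattern is a keyed in-memory store: the full payload never enters the context window, only a handle does. A minimal sketch (function and variable names are ours, not the actual LargeJson implementation):

```python
import itertools

# Sketch of out-of-band storage: full data lives server-side, the model
# only ever sees a stable handle. Names are illustrative, not the real
# LargeJson internals.
_counter = itertools.count()
_store: dict[str, object] = {}

def store_large_json(data: object, prefix: str) -> str:
    """Keep data in server memory; return a handle the agent can query later."""
    json_id = f"{prefix}_{next(_counter):04d}"
    _store[json_id] = data
    return json_id

def resolve(json_id: str) -> object:
    """Look up the full payload when a query tool needs it."""
    return _store[json_id]

handle = store_large_json({"rows": list(range(1000))}, prefix="experiment")
print(handle)
```

Query tools like jq then call `resolve(handle)` server-side, run the query over the full data, and return only the (size-capped) result to the model.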

Compress values, not structure

This was our key insight about previews. The obvious approach is to take the first N tokens of the JSON and cut off. But that shows the agent a few complete rows with no idea what the rest looks like.

What we do instead is compress the values inside the JSON, not the structure. We recursively walk the JSON tree and truncate every string value to 100 characters, but preserve all keys, all nesting, all array lengths. The agent sees the full shape of the data — every field name, every level of nesting, every array size — with shortened values. That’s exactly what it needs to write a useful query.

If the structure-preserving compression is still over budget, we fall back to brute-force character slicing. But phase one almost always gets it small enough.
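The recursive walk itself is short. A sketch of structure-preserving compression, assuming plain dict/list/string JSON data:

```python
# Sketch of "compress values, not structure": recursively truncate every
# string value while preserving all keys, nesting, and array lengths.
def compress_values(node, max_len: int = 100):
    if isinstance(node, dict):
        return {k: compress_values(v, max_len) for k, v in node.items()}
    if isinstance(node, list):
        return [compress_values(v, max_len) for v in node]
    if isinstance(node, str) and len(node) > max_len:
        return node[:max_len] + "..."  # shorten the value, keep the field
    return node  # numbers, booleans, short strings pass through untouched

data = {"rows": [{"output": "x" * 5000, "eval_score": 0.42}]}
preview = compress_values(data)
print(len(preview["rows"][0]["output"]))
```

The agent sees every field name and every array size, so it can write a precise jq query without ever loading the full values.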

Here’s what it looks like in practice:


async def get_experiments(llm_args, large_response_memory) -> LargeJson:
    exp_data = await get_experiments_data(experiment_ids=llm_args.experiment_ids)
    experiments_data = {
        "experiment_ids": exp_data.experiment_ids,
        "experiment_names": exp_data.experiment_names,
        "rows": exp_data.rows,
    }
    return LargeJson.from_json_data(
        data=experiments_data,
        memory=large_response_memory,
        partial_data_token_limit=20000,
        prefix="experiment",
    )

Query tools: jq and grep_json

Now the agent has two ways to drill into the data:

jq — the same jq you’d use at the command line. The agent writes jq expressions to slice, filter, aggregate, and transform:


.experiments[0].rows[:5]                          # first 5 rows
[.rows[] | select(.eval_score < 0.5)]             # all failing rows
[.rows[].latency_ms] | add / length               # average latency
.rows[] | select(.example_run_id == "abc123")      # find a specific row

grep_json — regex search across the serialized data, like grep -B2 -A2 but over JSON. Useful for exploratory analysis when you don’t know the structure yet.

Both tools have a hard output limit of 10,000 characters per call, enforced in code with no exceptions. If a query returns too much, the tool throws a RecoverableException suggesting a more specific query. This creates a feedback loop: the agent tries, the tool guides, the agent refines.
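The enforcement wrapper is a few lines. A sketch, with a RecoverableException stand-in mirroring the one shown earlier in this post:

```python
# Sketch of the hard output cap on every query tool. RecoverableException
# mirrors the structured error shown earlier: its message is fed back to
# the LLM so the agent can refine its query.
class RecoverableException(Exception):
    def __init__(self, message_to_llm: str):
        super().__init__(message_to_llm)
        self.message_to_llm = message_to_llm

MAX_OUTPUT_CHARS = 10_000

def run_query(query_fn, *args) -> str:
    result = str(query_fn(*args))
    if len(result) > MAX_OUTPUT_CHARS:
        raise RecoverableException(
            message_to_llm=(
                f"Query returned {len(result)} characters "
                f"(limit {MAX_OUTPUT_CHARS}). Try a more specific query, "
                f"e.g. select fewer rows or fields."
            )
        )
    return result

print(run_query(lambda: "small result"))
```

Enforcing the cap in code rather than in the prompt is what makes the try/guide/refine loop reliable: the agent physically cannot flood its own context.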

The result

Before LargeJson and jq, comparing experiments meant context overflow. After the change, the agent fetches a preview, identifies which rows need closer inspection, runs targeted jq queries, calculates statistics, and answers the question. Ten targeted tool calls instead of one context-exploding dump. The two-experiment cap was removed entirely. Users can now compare ten experiments at once.

Design your tools like Unix commands

A theme you’ll see across all of these lessons: small, composable tools. jq is an old idea. grep is even older. These tools do one thing well, they accept input and produce output, and the output of one can feed the input of another. That’s exactly right for agents. When you design tools for an LLM agent, think like a Unix programmer. The agent is the shell script; your tools are the commands.

What we learned about context management

  • Hard token budgets on every tool output. Every call should have a maximum, enforced in code rather than guidelines.
  • Don’t paper over problems with artificial limits. The old “max 2 experiments” rule was a workaround, not a solution. The right fix is a better data access pattern.
  • RecoverableExceptions create guided exploration. The agent tries a query, gets an error with suggestions, and refines. This is the same feedback loop that makes IDE agents effective at code navigation.
  • Tool responses may contain customer data. Our initial LargeJson implementation logged the full JSON to application logs — hundreds of rows of experiment data, visible to anyone with log access. Caught in code review before it shipped. Be careful about what gets logged.

Lesson 3: You can’t vibe-check a production agent

Traditional software tests assert deterministic outputs: given input X, expect output Y. Agents break this completely — the same prompt produces different responses every run, multiple responses might be correct, and you can’t assert exact text matches or even the same sequence of tool calls.

Most teams end up doing vibe checks: watch the agent run, eyeball the output, decide it looks roughly right. We did too, and it doesn’t scale, doesn’t catch regressions, and doesn’t run in CI.

Golden sessions: let production define “good”

Our key insight was to stop writing expected outputs by hand and let production tell us what good looks like.

When Alyx has a great session — when it does exactly the right thing — we capture that session as a golden trace. Arize traces give us everything: what the LLM saw, what it produced, which tools it called and in what order. That becomes ground truth, verified by a real user rather than invented by an engineer.

We test against this golden dataset in two ways.

Level 1: decision-point tests

These are pytest-based tests that validate specific agent decisions. The pattern: build message history up to a decision point, run the actual orchestrator, assert the output is correct.

The assertion framework handles the reality that LLM output varies in format:


from pydantic import BaseModel

class OutputAssertion(BaseModel):
    contains_any: list[str] | list[list[str]] | None = None  # OR / AND-of-ORs
    contains_all: list[str] | None = None                    # AND
    not_contains: list[str] | None = None                    # NOT

The powerful pattern is AND-of-ORs — assert that the output mentions certain facts, each of which could appear in multiple formats:


OutputAssertion(
    contains_any=[
        ["2000 ms", "2000ms", "2.0 seconds", "two seconds"],  # latency
        ["OpenAIChat.invoke", "LLM Span", "LLM span"],         # span reference
    ],
)

This asserts: the output mentions the latency (in any format) AND mentions the span (by any identifier). Deterministic, but flexibly so. You’re matching facts, not phrasing.
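
A matcher for these semantics fits in one function. This is an illustrative sketch (the post shows only the model, not the matcher), written with duck typing so it works against any object with these three attributes:

```python
def passes(assertion, output: str) -> bool:
    """Check an output against contains_any / contains_all / not_contains.

    contains_any supports both a flat OR list and the AND-of-ORs form,
    where every inner list must have at least one match.
    """
    if assertion.contains_any:
        groups = assertion.contains_any
        if isinstance(groups[0], str):  # flat list means a single OR group
            groups = [groups]
        for group in groups:            # AND across groups
            if not any(s in output for s in group):  # OR within a group
                return False
    if assertion.contains_all and not all(
        s in output for s in assertion.contains_all
    ):
        return False
    if assertion.not_contains and any(
        s in output for s in assertion.not_contains
    ):
        return False
    return True
```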

For testing conversation memory, you provide prior message history — including tool call signatures, assistant responses, even TodoList state — and then test what the agent does at that specific decision point:


message_history = [
    UserChatMessage(input=UserInput(question="What is the bottleneck of this trace?")),
    AssistantChatMessage(content=[
        ToolCallSignature(name="get_trace_preview", ...),
        TextMessage(content="The bottleneck is OpenAIChat.invoke at 2.00 seconds..."),
        TodoList(todo_list=[Todo(id=0, description="...", status="completed")]),
    ]),
]

# Now test: given this history, what does the agent do next?
user_input = UserInput(question="Can you list all the questions I asked?")
expected = OutputAssertion(
    contains_all=["bottleneck", "most tokens"],
)

Tests run against the real orchestrator with real API clients — not mocks. This catches integration bugs that unit tests miss. The tradeoff is speed, so tests are gated behind a --evals flag.
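
The --evals gate is a standard pytest pattern. A minimal conftest.py might look like the sketch below; the flag name comes from the post, while the marker name "evals" is an assumption:

```python
# conftest.py sketch: skip slow, real-API eval tests unless --evals is given.
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--evals",
        action="store_true",
        default=False,
        help="run eval tests that hit real APIs",
    )

def pytest_collection_modifyitems(config, items):
    if config.getoption("--evals"):
        return  # flag given: run everything
    skip_evals = pytest.mark.skip(reason="needs --evals to run")
    for item in items:
        if "evals" in item.keywords:  # tests marked @pytest.mark.evals
            item.add_marker(skip_evals)
```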

Level 2: trajectory tests

While decision-point tests validate individual choices, trajectory tests validate entire sessions end-to-end.

The pipeline has three steps:

  1. Extract production traces into a dataset. A CLI tool pulls a successful session from Arize and converts it to a CSV where each row is one orchestrator turn — the user’s input, the expected text output, expected UX events, trace IDs, and session context.
  2. Replay through the real orchestrator. Each row is fed through the actual agent. The test runner maintains session memory across rows, so multi-turn conversations replay accurately. If the agent triggers an experiment, the framework actually runs it via the Arize API. IDs get remapped between environments so the same test can run in dev or prod.
  3. Score with LLM-as-a-judge. A GPT-4o evaluator compares expected vs. actual output and classifies each turn as CORRECT or INCORRECT.
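
The three steps above can be sketched as a small replay loop. All function and column names here are illustrative, not Arize APIs:

```python
import csv

def replay_session(csv_path: str, orchestrator, judge) -> list[dict]:
    """Replay a golden session row by row through the real orchestrator,
    carrying session memory across turns, then score each turn with an
    LLM judge. orchestrator and judge are injected callables."""
    results = []
    memory: list[str] = []  # session memory shared across turns
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            actual = orchestrator(row["user_input"], memory)
            verdict = judge(expected=row["expected_output"], actual=actual)
            results.append({"turn": row["user_input"], "verdict": verdict})
    return results
```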

The evaluation prompt matters a lot. We tuned it extensively with domain-specific rules:

  • Numeric tolerance: “2000ms” and “two seconds” are both correct. If the test was captured in prod but replayed in dev, different data means different counts — check the logic, not the absolute values.
  • Ordering tolerance: A and B doesn’t have to mean A before B.
  • Tool-only completions: Sometimes the agent completes via tool call and says nothing at all. That can be correct.
  • Multi-instance tasks: Ignore assignment ordering (A/B/C) unless the user specified it.

An LLM is the only thing capable of semantic evaluation when outputs can vary this much.
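
One way to encode those judging rules is to bake them into the prompt and keep the verdict parsing trivial. A sketch with illustrative wording (the actual evaluation prompt is not shown in the post):

```python
# Hypothetical sketch of the LLM-as-a-judge prompt and verdict parsing.
JUDGE_RULES = """\
- Numeric tolerance: "2000ms" and "two seconds" are both correct; when data
  differs between environments, check the logic, not the absolute values.
- Ordering tolerance: listing A and B does not require A before B.
- Tool-only completions: no text after a tool call can still be correct.
- Multi-instance tasks: ignore assignment ordering unless the user set it.
"""

def build_judge_prompt(expected: str, actual: str) -> str:
    """Assemble the evaluation prompt sent to the judge model."""
    return (
        "Compare the expected and actual agent outputs under these rules:\n"
        f"{JUDGE_RULES}\n"
        f"EXPECTED:\n{expected}\n\nACTUAL:\n{actual}\n\n"
        "Answer with exactly one word: CORRECT or INCORRECT."
    )

def parse_verdict(response: str) -> bool:
    """Map the judge's response to pass/fail. 'INCORRECT' starts with 'IN',
    so a prefix check on 'CORRECT' is safe here."""
    return response.strip().upper().startswith("CORRECT")
```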

Prompt-tool drift: a subtle CI problem

Here’s one we didn’t anticipate: Alyx’s prompts reference tools by name. “Call get_trace_preview to examine the trace.” What happens when a tool gets renamed? The agent tries to call a function that doesn’t exist. Runtime failure, completely invisible in unit tests.

We now run structured validation that cross-references tool names in prompts against actual @llm_tool decorators in the code:


# Extract tool references from prompts
grep -r "call \`" alyx/prompts/
# Cross-reference against actual tool definitions
grep -r "@llm_tool" alyx/tools/*.py | grep "tool_name"

Claude Code Review runs on every PR with validation rules in CLAUDE.md, catching prompt-tool mismatches before they ship. Natural language bugs caught by natural language review.
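
The same cross-reference can also run in-process. A simplified Python stand-in for the grep pipeline above, assuming prompts reference tools in backticks:

```python
import re

def find_prompt_tool_drift(
    prompts: dict[str, str], registered: set[str]
) -> dict[str, list[str]]:
    """Cross-reference tool names mentioned in prompts (as `backtick`
    references) against the registered tool set. Returns, per prompt,
    the referenced names that no longer exist. A simplified stand-in
    for the grep-based check; the naming convention is an assumption."""
    drift = {}
    for name, text in prompts.items():
        referenced = set(re.findall(r"`([a-z_][a-z0-9_]*)`", text))
        missing = sorted(referenced - registered)
        if missing:
            drift[name] = missing
    return drift
```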

The dog-food loop

And yes, we dog-food everything. Our tests run inside Arize, test results are logged as Arize experiments, and every trace is debuggable in the same UI our customers use. If Alyx’s own evals can live in Arize, that’s a pretty good validation that our product is doing its job.

What we learned about testing agents

  • Capture good sessions; don’t invent expected outputs. Build the infrastructure to capture golden sessions early. Don’t wait until you need tests to figure out how to write them.
  • Match facts, not phrasing. The AND-of-ORs pattern in OutputAssertion lets you assert semantic correctness without brittleness.
  • LLM-as-judge is necessary for trajectory evaluation. When the agent can take different (but valid) paths to the same goal, only a semantic evaluator can judge correctness.
  • Real APIs, not mocks. Mocks are fast but they miss the class of bugs that only show up when real systems talk to each other.

Lesson 4: Debugging an agent is an agent-shaped problem

You deploy, and something breaks — of course it does. When Alyx misbehaves — calls the wrong tool, hallucinates data, gets stuck in a loop — the bug isn’t in a stack trace. It’s distributed across three systems:

  • Arize AX shows the agent’s perspective: what the LLM saw each turn, what tools it called, what it produced.
  • Datadog APM shows the server’s perspective: latencies, errors, HTTP status codes for each tool call’s backend processing.
  • GCP Cloud Logging shows the infrastructure perspective: OOMKills, container restarts, gRPC deadline errors that never made it into a span.

Debugging requires all three. A human doing this manually opens Arize, finds the trace, copies a span ID, switches to Datadog, writes a query in different syntax, notes a timestamp, switches to GCP, writes yet another query syntax, and correlates the results. It’s brutal — and it’s also exactly the kind of tedious, cross-system, structured work that LLMs happen to be very good at.

Skills: markdown runbooks for coding agents

So we built debugging skills. A skill is a markdown file that teaches a coding agent how to perform a specific task — structured instructions the LLM reads and follows, the same way an engineer reads a runbook:


.agents/skills/
├── alyx-traces/        # Export and query Arize trace data
│   ├── SKILL.md
│   └── scripts/arize_cli.py
├── datadog-debug/      # Search Datadog spans, fetch traces
│   ├── SKILL.md
│   ├── references/REFERENCE.md
│   └── scripts/dd-search-spans.sh
├── gcloud-logs/        # Query GCP Cloud Logging
│   ├── SKILL.md
│   └── scripts/safe-gcloud-logs.sh
└── env-context/        # Resolve environment aliases

Skills are discovered automatically by Cursor, Claude Code, and Codex via symlinks from a single canonical directory. Write once, debug everywhere.

The three-pronged debugging loop

The three skills chain together naturally:

1. alyx-traces: Start here. Pull the production session from Arize. The skill wraps a Python CLI that exports the complete trace as structured JSON — every span, every tool call, every LLM input/output. It can also open a specific span in the Arize Playground so you can replay the exact moment the agent went wrong.

2. datadog-debug: Search backend spans for the time window. The skill includes jq recipes for reconstructing call trees, extracting errors, and identifying bottlenecks:


# Timeline reconstruction from Datadog spans
jq '[.[] | {
  start: .attributes.start_timestamp,
  duration_ms: ((.attributes.custom.duration // 0) / 1000000 | round),
  service: .attributes.service,
  name: .attributes.resource_name,
  status: .attributes.status
}] | sort_by(.start)' .agents/tmp/*-dd-trace-*.json

3. gcloud-logs: Correlate with Kubernetes pod logs at the same timestamp. Some bugs — OOMKills, container restarts, gRPC deadline exceeded — don’t show up in spans at all. They only appear in infrastructure logs.

Each skill’s output feeds the next. Here’s a real example:

“Alyx gave the wrong answer in session XYZ.”

The coding agent reads the alyx-traces skill, pulls the full session trace, and identifies the failing tool call. It notes the trace ID. Then it reads datadog-debug and searches backend spans — finds a 500 error on a GraphQL resolver. Then it reads gcloud-logs and finds an OOMKill two minutes before the 500. Root cause in minutes, not half an hour.

Safety lives in the wrappers

Every skill uses wrapper scripts that enforce read-only access. safe-kubectl.sh rejects all mutating verbs (apply, delete, edit, patch, scale, exec, rollout restart), and safe-gcloud-logs.sh only allows gcloud logging read. In production environments, the skill generates the command and tells the engineer to review it before running.

All bash scripts source a shared library (common.sh) that handles environment resolution, credential management, and safety enforcement. There’s a single source of truth for mapping environment aliases (like “prod”) to GCP project IDs and Kubernetes contexts. Individual scripts never hardcode environment details.

LLMs will follow whatever instructions you give them, so the wrapper is where you put the guardrails — not the prompt, the code. Same lesson as lesson one.
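
The guard itself is small. A Python sketch of the read-only check a wrapper like safe-kubectl.sh enforces (the verb list comes from the post; the function name is an assumption):

```python
# Hypothetical sketch of the read-only guard inside a safe-kubectl wrapper.
# Mutating verbs from the post; "rollout restart" is covered by "rollout".
MUTATING_VERBS = {"apply", "delete", "edit", "patch", "scale", "exec", "rollout"}

def validate_read_only(argv: list[str]) -> None:
    """Reject any kubectl invocation whose verb mutates cluster state.

    Raises PermissionError before the command is ever executed; the
    guardrail lives in code, not in the prompt.
    """
    verb = argv[0] if argv else ""
    if verb in MUTATING_VERBS:
        raise PermissionError(f"refused: '{verb}' is a mutating verb")
```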

What we learned about debugging agents

  • Skills are just markdown — low cost, high value. Writing a markdown file is not traditional engineering work, but it dramatically expands what your coding agent can do. We have 13 in our repo.
  • Composability through shared conventions. Skills reference each other by name and assume shared conventions — output to .agents/tmp/, use jq for analysis, resolve environments through env-context — so the LLM chains them naturally without orchestration code.
  • Build observability before you need it. When things go wrong at 2am, you want the debugging tools already in place.
  • Agent debugging is itself an agent-shaped problem. Sifting through traces, spans, and logs, correlating IDs and timestamps, following a procedure to narrow down a root cause — that’s exactly what agents are built for. Use the thing you built to fix the thing you built.

The patterns that actually matter

If we boil this down to two themes, they are:

Context engineering is the work of deciding what the agent knows, when it knows it, how much of it fits, and what happens when it doesn’t. The todo plan injected on every call, the LargeJson handle with structure-preserving previews, the 10,000-character hard budgets on tool output, the RecoverableExceptions that guide the agent toward better queries — all of it is context engineering, and it is most of the work.

Testing nondeterministic systems is the work of defining what “correct” looks like when you can’t assert exact outputs: golden sessions captured from production, OutputAssertions that match facts rather than phrasing, LLM-as-a-judge for semantic evaluation, and CI pipelines that catch prompt-tool drift. Get this right and you’ve got a system you can actually iterate on with confidence.

Final takeaways

Four big lessons:

  • Enforce behavior in code, not prompts. Prompts are suggestions. Tool validation is constraints. The finish gate works because finishing throws an error when tasks are incomplete.
  • Provide just enough context. The right preview, the right handle, the right query tools. Not so much that attention degrades. Not so little that the agent has to guess.
  • Crystallize good behavior into tests. When something works, capture it, formalize it, test it. Good behavior should be permanent, not lucky.
  • Build your debugging loop before you need it. Skills, observability, the three-pronged debugging chain — none of it is hard to build. But you want it in place before you’re staring at a broken agent in production.

And the meta-lesson: the way you debug an agent is with an agent. Use the thing you built to fix the thing you built.

Some of this is specific to Alyx, but most of it applies to any production agent. If you’re building one, we hope this saves you at least one 2am debugging session.

The post AI Agent Debugging: Four Lessons from Shipping Alyx to Production appeared first on Arize AI.

]]>
Alyx 2.0: The AI Agent That Actually Plans https://arize.com/blog/alyx-2-0-the-ai-agent-that-actually-plans/ Tue, 24 Feb 2026 15:00:32 +0000 https://arize.com/?p=27296


]]>
Two years ago, we started building Alyx with GPT-3.5, a vision, and honestly, no clear path forward. Agents were a buzzword. The models were rough. Tool calling was just emerging. But we had a hypothesis: the future wouldn’t be clicking through UIs or even just chatting with an assistant. You’d say what you want, and an agent just does it.

Then we watched Cursor and Claude Code show us what it feels like when a real agent actually works for you. That changed everything.

Today, we’re releasing Alyx 2.0. A true planning agent for AI engineering that can reason across your entire AI lifecycle in Arize AX, break down complex tasks, and execute autonomously across the platform.

To try Alyx, check out our docs or book a meeting for a custom demo.

Planning Changes Everything

The biggest unlock here isn’t better models or more tool calls. It’s planning.

Most “AI assistants” in observability and development tools are either glorified chatbots or rigid, pre-defined workflows. They follow one decision tree from the top. They can’t adapt. They can’t compose complex actions. They definitely can’t surprise you.

Alyx 2.0 is different. It’s built on a true orchestrator that can reason about multi-step tasks, maintain context across actions, interact with the UI when needed, and ask for approval at critical decision points.

What Alyx Can Actually Do

Error analysis without the grind

Most error analysis workflows are manual by design. You sift through traces, annotate issues, collapse them into labels, guess what matters, then wire up evaluations after the fact. It’s slow, subjective, and brittle.

Alyx collapses that entire workflow into a single question. If you already have traces flowing into Arize AX, you can ask:
“Review my reasoning annotations, identify the most critical issue and turn it into an eval.”

Alyx does the rest. It synthesizes annotations into discrete labels, determines what’s actually critical (not just most frequent), generates evaluation templates, and spins up a live evaluation task automatically.

No manual review. No label taxonomy debates. No wiring things together by hand. Just answers and the evals to back them up.

Prompt engineering, without staring at a blank page

Prompt experimentation usually starts in the worst possible state: a blank playground, no dataset, no baseline, and no idea where to begin. With Alyx, you can delegate.

Ask Alyx to generate a dataset for your use case. It creates it, populates it with realistic examples, and loads it directly into the playground. You’re ready to iterate immediately.

But Alyx doesn’t stop at setup. Ask for something more complex:
“Create two prompt variants, attach an evaluation, and run an experiment.”

Alyx plans the work, interacts with the UI, requests approval when needed, runs the experiment end-to-end, analyzes the results, and recommends concrete improvements. You're not clicking through interfaces or manually running experiments anymore; you're directing outcomes and Alyx handles execution.

Trace debugging that actually debugs

Most “debugging tools” just show you more data. Alyx explains why things broke.

We use Alyx to build Alyx. When we noticed spans returning evaluation templates without any reasoning (a bad UX that felt incomplete), we asked:
“Find spans where we returned a final eval template without reasoning.”

Alyx identified the spans instantly. From there, we clicked into a trace and asked:
“Why didn’t this return reasoning?”

Alyx traced the failure to a specific guideline, pointed to the exact decision path, and surfaced the root cause. No guesswork. No spelunking. We knew exactly what to fix.

Why This Matters

If you’re an AI PM or AI engineer, the hardest part of your job isn’t writing prompts, it’s everything around them.

Managing context across long-running sessions. Turning vague failures into reproducible test cases. Figuring out why an agent broke after a seemingly harmless prompt change. Manually setting up experiments just to compare two prompt variants. Stitching together traces, annotations, evals, and metrics across half a dozen tools.

These workflows are fragmented, manual, and slow, and they don't scale as systems become more agentic.

Alyx changes the unit of work.

Instead of operating at the level of prompts, traces, or individual evaluations, Alyx operates across the entire AI engineering lifecycle. It can synthesize datasets from real failures, derive evaluations from failure patterns, optimize prompts against annotations, run and analyze experiments, annotate traces, compute metrics, and surface what actually matters – all in one continuous loop.

You’re no longer orchestrating tools. You’re delegating intent.

That’s what makes Alyx different. It’s not a loop, a chat interface, or a smarter playground. It’s an agent grounded in your Arize data that can plan and carry out multi-step AI engineering workflows, from diagnosis to execution, without you stitching everything together by hand.

The response from our customers has been explosive. And honestly, building Alyx has surprised us too. There have been multiple moments where it accomplished things we didn’t explicitly design for. That’s the fun – and the challenge – of building true agents.

The Hard Parts No One Talks About

Building something this good is genuinely difficult. Some lessons learned:

Context management is brutal. Message buses, UI state, context window bloat. Keeping everything coherent without losing critical information is an ongoing challenge.

Testing an agent to prevent regressions is unsolved. How do you make sure prompt or architecture changes don’t break previous workflows when the system is adaptive by design? We’ve had to build custom evaluation frameworks just for Alyx itself.

UI integration is harder than you think: choosing between programmatic actions and sub-agents, splitting tools so they stay small and reusable. Every decision compounds.

Alyx isn’t perfect. But it delivers real value. And we’re learning constantly.

What’s Next

This is just the beginning. We’re doubling down on capability and on making Alyx fit naturally into how you already work, including using Alyx directly inside tools like Claude Code and Cursor. Not forcing you into a new workflow.

Longer term, our vision is for Alyx to become a true partner for AI PMs and engineers. One that can reason, plan, and act across your entire AI lifecycle in Arize AX. Cursor for AI engineering.

We’ve been using Arize AX and Alyx to build Alyx. Those lessons are shaping where this goes next.

Watch the intro video to see Alyx 2.0 in action, and stay tuned for deep dives on specific workflows over the coming weeks.

The post Alyx 2.0: The AI Agent That Actually Plans appeared first on Arize AI.

]]>