Agentic AI Observability: What to Measure So ‘It Works’ Doesn’t Become ‘It Drifted’
By Shayan Ghasemnezhad on December 5, 2025 · 3 min read
AI agents degrade silently. The observability stack and eval framework that catches drift before users do.
Traditional software fails loudly. An API returns a 500, a test turns red, a metric crosses a threshold. AI agents fail quietly. The agent still responds, still takes actions, still produces output that looks plausible. But the quality degrades—subtly, gradually, and invisibly until a user reports that “the AI seems worse lately.” By that point, the drift has been compounding for weeks.
Why Agents Are Hard to Observe
A REST API has a contract: given this input, return that output. You can write assertions against it. An AI agent has a goal, not a contract. It reasons about inputs, decides which tools to call, and produces outputs that are correct-ish rather than correct or incorrect. Standard monitoring—latency, error rate, throughput—tells you whether the agent is running. It does not tell you whether the agent is producing good outcomes.
Agents compound the problem by making multi-step decisions. An agent that retrieves documents, synthesises information, and generates a response has three points where quality can degrade: retrieval relevance, synthesis accuracy, and response quality. A drop in retrieval relevance (caused by a change in the vector index or new document types) silently degrades every downstream step.
The Metrics That Matter
Build observability around four layers:
- Operational metrics: Latency per step, token usage, tool call frequency, error rates. These are table stakes—they tell you the agent is running.
- Behavioural metrics: Which tools does the agent choose, and in what order? A shift in tool selection patterns—the agent stops using the search tool and starts hallucinating answers—is an early drift signal.
- Quality metrics: Task completion rate, user satisfaction (thumbs up/down), factual accuracy on known-answer queries. These require eval infrastructure.
- Cost metrics: Cost per task completion, not just cost per API call. An agent that retries three times before succeeding costs 3x and signals a quality problem.
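The behavioural signal above can be quantified with a simple distribution check. A minimal sketch, assuming you log each tool call as a string and keep a baseline window (`tool_usage_shift` is a hypothetical helper, not part of any framework):

```python
from collections import Counter

def tool_usage_shift(baseline_calls: list[str], recent_calls: list[str]) -> float:
    """Total-variation distance between two tool-usage distributions.

    0.0 means identical usage patterns; 1.0 means completely disjoint.
    """
    base = Counter(baseline_calls)
    recent = Counter(recent_calls)
    tools = set(base) | set(recent)
    b_total = sum(base.values()) or 1
    r_total = sum(recent.values()) or 1
    return 0.5 * sum(abs(base[t] / b_total - recent[t] / r_total) for t in tools)

# Example: the agent has largely stopped calling the search tool
baseline = ["search"] * 70 + ["calculator"] * 30
recent = ["search"] * 20 + ["calculator"] * 80
shift = tool_usage_shift(baseline, recent)
# 0.5 * (|0.7 - 0.2| + |0.3 - 0.8|) = 0.5
```

Alert when the shift crosses a threshold you calibrate against normal week-to-week variation; the point is catching the pattern change, not the exact metric.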
Building an Eval Pipeline
Evals are automated tests for AI quality. Unlike unit tests, they produce scores rather than pass/fail. A practical eval pipeline runs on every deployment and on a daily schedule against production data.
Start with a golden dataset: 50–100 queries with known-good answers, covering the common cases and the edge cases your agent handles. Run the agent against this dataset after every model update or prompt change. Score outputs on relevance, factual accuracy, and format compliance. Track scores over time. A 5% drop in accuracy across two consecutive runs warrants investigation.
# Minimal eval runner with scoring
from dataclasses import dataclass

@dataclass
class EvalResult:
    query: str
    expected: str
    actual: str
    relevance_score: float  # 0.0 - 1.0
    factual_score: float    # 0.0 - 1.0

def run_eval(agent, golden_dataset: list[dict]) -> list[EvalResult]:
    # score_relevance and score_factual are assumed scoring helpers
    # (e.g. LLM-as-judge or string-match scorers) defined elsewhere
    results = []
    for item in golden_dataset:
        response = agent.run(item['query'])
        results.append(EvalResult(
            query=item['query'],
            expected=item['expected'],
            actual=response,
            relevance_score=score_relevance(response, item['expected']),
            factual_score=score_factual(response, item['facts']),
        ))
    return results
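Scores from a runner like this can feed a simple regression gate, the "5% drop across two consecutive runs" check described above. A minimal sketch, assuming you persist per-run factual scores as plain floats:

```python
def accuracy_drop(prev_scores: list[float],
                  curr_scores: list[float],
                  threshold: float = 0.05) -> bool:
    """Flag when mean accuracy drops by more than `threshold` between runs."""
    prev_mean = sum(prev_scores) / len(prev_scores)
    curr_mean = sum(curr_scores) / len(curr_scores)
    return (prev_mean - curr_mean) > threshold

# Yesterday's run averaged 0.92; today's averages 0.85: a 7-point drop
assert accuracy_drop([0.92] * 10, [0.85] * 10)        # True: investigate
assert not accuracy_drop([0.90] * 10, [0.88] * 10)    # 2-point dip: within noise
```

Wire this into CI so a flagged drop blocks the deploy rather than landing in a dashboard nobody watches.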
Drift Detection
Drift comes from three sources: model updates (the provider ships a new version), data changes (the retrieval index is updated with new content), and prompt changes (a teammate edits the system prompt). Each source needs its own detection mechanism.
For model updates: pin model versions in production. Run evals against new versions in staging before promoting. For data changes: track retrieval quality metrics (precision@k, recall) and alert when they drop. For prompt changes: version-control prompts like code, require review, and run evals on every change.
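For the retrieval side, precision@k is straightforward to compute from logged retrievals. A sketch, assuming you can label which document IDs are relevant for a given query:

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are actually relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

# 3 of the top 5 retrieved documents are labelled relevant
p = precision_at_k(["d1", "d2", "d3", "d4", "d5"], {"d1", "d3", "d5", "d9"}, k=5)
# p == 0.6
```

Track this per query category over time; a drop after an index update localises the drift to the data-change source rather than the model or prompt.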
Decision Framework
Invest in observability proportional to the agent’s blast radius. An internal summarisation tool needs basic operational metrics and a weekly eval. A customer-facing agent that takes actions (creates tickets, sends emails, modifies data) needs real-time quality monitoring, automated evals on every deploy, and human-in-the-loop review for edge cases.
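One way to make the tiers explicit is a small config that deploy tooling can check against. This is a hypothetical structure; the tier names and fields are illustrative, not from any standard:

```python
# Hypothetical tiers mapping agent blast radius to observability requirements
OBSERVABILITY_TIERS = {
    "internal-read-only": {       # e.g. an internal summarisation tool
        "eval_cadence": "weekly",
        "quality_monitoring": "batch",
        "human_review": False,
    },
    "customer-facing-actions": {  # creates tickets, sends emails, modifies data
        "eval_cadence": "every-deploy",
        "quality_monitoring": "real-time",
        "human_review": True,     # human-in-the-loop for edge cases
    },
}

def requirements(tier: str) -> dict:
    """Look up the observability bar an agent must meet for its tier."""
    return OBSERVABILITY_TIERS[tier]
```

The value is less the dict itself than the forcing function: every new agent must declare a tier before it ships.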
Failure Modes
The most common failure: relying on user feedback as the primary quality signal. Users report catastrophic failures but not gradual degradation. By the time “the AI seems worse” becomes a support ticket, the quality has been declining for weeks. Proactive evals catch what users tolerate.
Another failure: evals that test the easy cases. If your golden dataset only includes straightforward queries, it will not catch degradation on the edge cases where agents struggle most. Include adversarial inputs, ambiguous queries, and multi-step reasoning tasks in your dataset.
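A cheap guard against an easy-cases-only dataset is a category coverage check. A sketch, assuming each golden-dataset item carries a hypothetical `category` label:

```python
from collections import Counter

# Illustrative category labels; adapt to the failure modes your agent actually has
REQUIRED_CATEGORIES = {"straightforward", "ambiguous", "adversarial", "multi_step"}

def coverage_gaps(golden_dataset: list[dict], min_per_category: int = 5) -> set[str]:
    """Return categories that are missing or under-represented in the dataset."""
    counts = Counter(item["category"] for item in golden_dataset)
    return {c for c in REQUIRED_CATEGORIES if counts[c] < min_per_category}

dataset = [{"category": "straightforward"}] * 40 + [{"category": "multi_step"}] * 10
gaps = coverage_gaps(dataset)
# {"ambiguous", "adversarial"}: the dataset only tests the cases the agent already handles well
```

Run the check whenever the golden dataset changes, so coverage cannot quietly erode as items are added and retired.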
AI observability is not a dashboard—it is a practice. Measure what matters, automate the evals, and treat quality as a metric that ships with the feature, not one that gets added after the first incident.