Your AI Observability Problem Is Probably Not Technical

When teams describe an AI observability gap, they almost always describe a tooling gap. They want more dashboards, richer traces, better eval harnesses, deeper model logs. The conversation rarely makes it to the question that actually matters: when one of those signals goes wrong at 3am on a Tuesday, who picks up the page, and what does their runbook tell them to do?

That is the missing layer. AI systems have inherited the metrics culture of modern production engineering and almost none of the operational scaffolding. Databases have DBAs. Networks have NOCs. Search systems have search relevance teams with explicit on-call rotations. Generative systems, in most organisations, have a Slack channel and a vague sense that “the ML team probably knows.”

The result is predictable. Behavioural anomalies surface, drift around for days, and eventually get raised by a customer-success manager rather than detected by an operator. There was no operator.

A representative week

A B2B copilot starts producing slightly longer answers on Monday. Average output length climbs 22%. Token spend rises with it. By Wednesday, the daily LLM bill is up 18%. By Friday, two enterprise customers have separately complained that responses feel “verbose and less direct.”

Here is what happens internally over that week:

Monday. A finance dashboard auto-flags the cost anomaly and emails an alert to a shared inbox no one owns. It is not read.
Tuesday. An ML engineer notices the longer outputs while spot-checking traces, assumes it is a known model behaviour, and moves on.
Wednesday. The on-call SRE sees nothing wrong. Latency, error rate, and uptime are all fine. The on-call rotation does not include behavioural metrics; nothing pages.
Thursday. The product manager sees the customer complaints in the support queue and files them as “tone feedback.”
Friday. Someone connects the three threads in a meeting. The cause turns out to be a prompt change shipped the previous Friday afternoon by a fourth person, who was not in the meeting.

No tool was missing. The cost alert fired. The traces existed. The customer signal arrived. The prompt change was logged in git. Each piece of the picture sat with a different person or inbox, and the system had no role whose job was to assemble them.

This is the shape of most AI observability problems. They are not technical. They are organisational gaps wearing technical clothing.

Four patterns to recognise

Diffuse ownership. Every relevant skill (model behaviour, prompt content, retrieval quality, cost, infrastructure, product semantics) lives in a different team. Nobody owns “the system” as a whole, so nobody owns its drift, and nobody is positioned to challenge the definition of “healthy” that the launch deck baked in.
Pager gap. Infrastructure pages someone. Behavioural drift pages no one. The on-call rotation was copied from the web-services playbook and never updated for a system whose primary failure mode is qualitative.
Runbook absence. When an anomaly does surface, the responder has no playbook. There is no documented sequence for “groundedness dropped 10%” or “fallback rate doubled overnight” the way there is for “database CPU at 95%.” The anomaly turns into an ad-hoc investigation that takes days.
Cost without an owner. Token spend is visible to finance, accountable to engineering, and tracked against no one’s KPI. It drifts upward quietly until it becomes a quarterly conversation rather than an operational signal.

The shared structure: each of these is a missing role, not a missing metric. Adding more telemetry to a system with diffuse ownership produces more unread alerts, not better outcomes.
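
One way to make a missing role visible is to audit where each signal actually lands. The following is a minimal sketch in Python, not a prescribed implementation. Every name in it (the `Route` type, the `audit_routes` helper, the signal names, the owner handles) is hypothetical, and the rule that a signal counts as owned only when it names exactly one person and actually pages them is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    """Where one signal goes when it fires (all values are illustrative)."""
    signal: str
    owners: list[str] = field(default_factory=list)  # named people, not channels
    pages: bool = False  # True only if the alert wakes someone up


def audit_routes(routes: list[Route]) -> dict[str, str]:
    """Return {signal: reason} for every signal that has no real operator."""
    gaps: dict[str, str] = {}
    for route in routes:
        if len(route.owners) == 0:
            gaps[route.signal] = "no named owner"
        elif len(route.owners) > 1:
            gaps[route.signal] = "several owners, so effectively none"
        elif not route.pages:
            gaps[route.signal] = "owner exists but nothing pages them"
    return gaps


# Hypothetical routing table for a copilot like the one in the example week.
routes = [
    Route("latency_p95", owners=["sre-oncall-primary"], pages=True),
    Route("daily_token_spend", owners=[], pages=False),  # goes to a shared inbox
    Route("avg_output_length", owners=["ml-eng-a", "pm-b"], pages=False),
    Route("fallback_rate", owners=["ml-eng-a"], pages=False),
]

for signal, reason in audit_routes(routes).items():
    print(f"{signal}: {reason}")
```

In this made-up table only the infrastructure signal passes the audit. The three behavioural and cost signals are all measured, and none of them has an operator.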

Heuristics for actually closing the gap

The question to start from is not “what should we measure?” It is “for each thing we already measure, who is paged, and what do they do?”

A workable starting set:

Write the ownership matrix. For each layer (prompt content, retrieval, model selection, eval set, cost, behavioural quality, infrastructure), name a single accountable owner and a single team. If any cell has more than one name, it has none.
Define the AI on-call rotation explicitly. Separate it from the SRE rotation if necessary. The skill set is different. The signals are different. Mean time to investigate a behavioural anomaly is measured in hours, and a half-asleep SRE is the wrong responder.
Write the first three runbooks. At minimum: a groundedness drop on the regression set, a fallback rate spike on a single intent, and an anomaly in cost per resolved intent. Each runbook should answer: what to check first, who to escalate to, what to roll back, and what counts as resolved.
Make behavioural alerts page someone. If a metric is worth tracking, it is worth waking someone up for at the right threshold. Anything else is dashboard furniture.
Give cost a behavioural owner, not a finance owner. Cost per resolved intent is a quality metric in disguise; rising cost without rising volume usually means the system is working harder for the same outcome.
Schedule a weekly behavioural review. Thirty minutes, fixed attendees, fixed agenda: top three intents by escalation rate, any prompt changes shipped this week, any cost-per-outcome movement, any open behavioural runbook executions. Boring, repetitive, and the single highest-leverage operational practice most AI teams are missing.
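
The cost heuristic above can be made concrete with a small calculation. The sketch below, in Python, computes cost per resolved intent for two periods and flags the case the heuristic describes: spend rising faster than the volume of resolved work. The `Period` type, its field names, the 10% tolerance, and the sample figures are all assumptions made for illustration, not real billing data or a recommended threshold.

```python
from dataclasses import dataclass


@dataclass
class Period:
    spend: float           # total LLM spend for the period, in any one currency
    resolved_intents: int  # intents resolved in the period (definition is team-specific)


def cost_per_resolved_intent(p: Period) -> float:
    """Spend divided by resolved intents; undefined when nothing was resolved."""
    if p.resolved_intents == 0:
        raise ValueError("no resolved intents; the ratio is undefined")
    return p.spend / p.resolved_intents


def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (0.18 means +18%)."""
    return (after - before) / before


def needs_behavioural_review(prev: Period, curr: Period, tolerance: float = 0.10) -> bool:
    """Flag when cost per outcome rose by more than `tolerance`.

    Dividing by resolved intents already normalises for volume, so a rise in
    the ratio means spend grew faster than the work actually resolved.
    """
    change = relative_change(
        cost_per_resolved_intent(prev), cost_per_resolved_intent(curr)
    )
    return change > tolerance


# Illustrative figures only: same volume, higher spend.
last_week = Period(spend=1000.0, resolved_intents=5000)
this_week = Period(spend=1180.0, resolved_intents=5000)

if needs_behavioural_review(last_week, this_week):
    print("Cost per resolved intent rose: route to the behavioural owner")
```

If spend and volume had grown together (say 1180.0 against 5900 resolved intents), the ratio would be unchanged and nothing would be flagged. That is the difference between a cost signal read by finance and one read as a quality metric.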

None of this requires new tooling. All of it requires the team to decide that operating an AI system is a job, not a side effect of having shipped one.

What to watch for

Signs your observability problem is operational rather than technical:

A behavioural regression in the last quarter was first reported by a customer, not by an internal signal.
Asked “who owns output quality?”, three different people give three different answers.
The on-call rotation does not include anyone who can read a prompt diff and reason about its likely effect.
Alerts fire into shared inboxes or Slack channels rather than to a named person.
There is no runbook for any behavioural anomaly, only for infrastructure ones.
Token spend is reviewed monthly by finance and never weekly by engineering.
The team can name the dashboard for the system but cannot name the on-call engineer.
Post-incident reviews for behavioural issues end with “we should add a metric for that” rather than “we should change who responds and how.”

If three or more of these are true, more tooling will not help. The next dashboard will join the others in being unread by the person who would not have been paged anyway.

Most AI observability problems are management problems wearing a technical costume. The fix is not a better stack. It is naming the operator, writing the runbook, and making the alert page a human who knows what to do when it arrives.