The agent development stack nobody talks about: observable tools, not just observable agents



Ka Ling Wu

Co-Founder & CEO, Upsolve AI

Nov 14, 2025

10 min


Welcome to Part 2 of our series on building production-grade analytics agents:

In Part 1, we established why analytics agents are fundamentally harder than general-purpose agents: the data beneath them constantly changes. But understanding the problem is only the first step. In this post, we're diving into the infrastructure layer that makes data-aware agents possible: tool-level observability.

Most teams instrument their agents but remain blind to what their tools are actually doing. This is the gap that causes production failures. Let's fix it.

Why Your Agent Observability Stack Is Incomplete

Every AI engineering team knows they need observability. They instrument their LLM calls, track token usage, log prompts and completions. They can tell you exactly what their agent said.

But they can't tell you WHY it said it.

The missing layer: tool-level observability. Because your agent is only as good as the tools it calls, and most teams are flying blind at that layer.

The Tool Visibility Gap

Here's what happens in production:

User asks: "What were our top-performing products last quarter?"

Agent says: "$1.2M in Widget Pro sales led the quarter."

CEO replies: "That's wrong. It was $890K."

Now what? You check your agent logs. The LLM prompt was fine. The response looked confident. The RAG retrieved relevant context. Everything in your observability dashboard is green.

But buried three layers deep, one of your tools:

  • Queried the wrong table (staging instead of production)

  • Hit a rate limit and returned cached data from 2 months ago

  • Successfully ran a query that had a subtle WHERE clause bug

  • Retrieved a schema that had just changed 10 minutes prior

Your agent observability saw the symptom. Tool observability would have caught the cause.

What Tool-Level Observability Actually Means

Most teams think tool observability means "logging which tools were called." That's like saying car diagnostics means checking if the engine is running.

Real tool-level observability requires visibility into:

1. Input Validation & Sanitization

  • What parameters did the agent pass to this tool?

  • Were they within expected ranges?

  • Did type coercion happen silently?

  • Were any security filters applied?

Example: Your agent calls query_revenue(region='North America') but your tool silently converts it to query_revenue(region='North_America') (underscore instead of space). The query returns empty. Your agent confidently says "No revenue in North America."

Without input observability, you're debugging ghosts.
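To make this concrete, here is a minimal sketch of what logging at the tool boundary could look like in Python. The query_revenue tool and the KNOWN_REGIONS allow-list are hypothetical stand-ins; the point is that every coercion and rejection gets recorded instead of happening silently.

import logging
from datetime import datetime, timezone

logger = logging.getLogger("tools.query_revenue")

# Hypothetical allow-list; in practice this would come from your semantic layer.
KNOWN_REGIONS = {"North America", "EMEA", "APAC", "LATAM"}

def query_revenue(region: str) -> dict:
    raw = region
    # Record exactly what the agent passed in, before any normalization.
    logger.info("tool_input tool=query_revenue raw_region=%r", raw)

    # Any coercion must be explicit and logged, never silent.
    normalized = raw.strip()
    if normalized != raw:
        logger.warning("input_coerced from=%r to=%r", raw, normalized)

    # Reject unknown values instead of letting an empty result pass as truth.
    if normalized not in KNOWN_REGIONS:
        logger.error("input_rejected region=%r", normalized)
        raise ValueError(f"Unknown region {normalized!r}; expected one of {sorted(KNOWN_REGIONS)}")

    # ... run the real query here ...
    return {"region": normalized, "queried_at": datetime.now(timezone.utc).isoformat()}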

2. Execution Context

  • Which data sources did this tool actually access?

  • What was the state of those sources (updated 5 min ago? 5 hours ago?)

  • Were any fallbacks or retries triggered?

  • What was the query plan/execution path?

Example: Your get_customer_metrics tool is supposed to hit your real-time database. But that database is under load, so your tool silently falls back to the 6-hour-delayed replica. Your agent just gave the CEO stale data, and your observability shows "tool executed successfully."

This is the data drift problem we introduced in Part 1—and without tool-level observability, you'll never catch it until it's too late.
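One lightweight way to surface it is to return execution metadata alongside the result instead of throwing it away. The sketch below assumes hypothetical primary_db and replica_db clients with query() and last_updated() methods; the shape of ExecutionContext is what matters, not the client API.

import time
from dataclasses import dataclass, field

@dataclass
class ExecutionContext:
    """Metadata returned alongside a tool result instead of being discarded."""
    data_source: str = ""
    data_age_seconds: float | None = None  # how stale the source was at query time
    fallback_used: bool = False
    started_at: float = field(default_factory=time.time)

def get_customer_metrics(primary_db, replica_db):
    """primary_db / replica_db are hypothetical clients exposing .query() and .last_updated()."""
    ctx = ExecutionContext(data_source="realtime_primary")
    try:
        rows = primary_db.query("SELECT segment, mrr FROM customer_metrics")
        ctx.data_age_seconds = time.time() - primary_db.last_updated()
    except TimeoutError:
        # The fallback still happens, but it is recorded, not silent.
        ctx.fallback_used = True
        ctx.data_source = "replica_6h_delay"
        rows = replica_db.query("SELECT segment, mrr FROM customer_metrics")
        ctx.data_age_seconds = time.time() - replica_db.last_updated()
    return rows, ctx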

3. Output Structure & Quality

  • Did the tool return the expected schema?

  • Were there any null values or missing fields?

  • How does this output compare to historical patterns?

  • What's the confidence/quality score of this data?

Example: Your tool successfully retrieves "revenue by region" but the EMEA row has NULL values because of a data pipeline failure. Your agent sees the data structure is correct and happily tells the user "EMEA had zero revenue"—which is technically true but catastrophically wrong.
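Catching this class of failure is mostly about post-execution checks on the result, not the query. Below is an illustrative validator for a "revenue by region" result; the row shape and the expected_regions set are assumptions, but the checks (missing regions, NULLs, empty result sets) mirror the failure above.

def validate_revenue_by_region(rows: list[dict], expected_regions: set[str]) -> list[str]:
    """Return a list of quality warnings for a 'revenue by region' result.

    A NULL-heavy or empty result can pass a schema check but should not pass silently.
    """
    warnings = []
    returned_regions = {r.get("region") for r in rows}

    missing = expected_regions - returned_regions
    if missing:
        warnings.append(f"missing regions: {sorted(missing)}")

    for row in rows:
        if row.get("revenue") is None:
            warnings.append(f"NULL revenue for {row.get('region')} (possible pipeline failure)")

    if not rows:
        warnings.append("empty result set: do not report this as 'zero revenue'")

    return warnings

# Usage: attach warnings to the tool output so the agent can caveat its answer.
result = [{"region": "EMEA", "revenue": None}, {"region": "North America", "revenue": 890_000}]
print(validate_revenue_by_region(result, {"North America", "EMEA", "APAC"}))
# -> ["missing regions: ['APAC']", 'NULL revenue for EMEA (possible pipeline failure)']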

The RAG Visibility Problem

RAG adds another layer where most teams are blind. You're not just calling tools—you're retrieving context from vector stores, knowledge bases, semantic layers.

Standard observability shows you:

  • Which documents were retrieved

  • Their similarity scores

  • How they were ranked

But that's not enough. You need:

Retrieval Path Visibility

  • What was the embedding of the original query?

  • What reranking happened?

  • Were any filters applied (time-based, access-control, data quality)?

  • What chunks were retrieved but NOT used in the final context window?

That last one is critical. Often the most relevant chunk gets retrieved but then dropped due to context window limits. Your agent gives a wrong answer, your logs show "high-quality retrieval," and you never know the right answer was retrieved but discarded.
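One way to close that gap is to trace context-window packing explicitly, so dropped chunks show up in the trace instead of vanishing. The helper below is a sketch: retrieved and reranked are assumed to be lists of (chunk_id, text) pairs, and count_tokens is whatever tokenizer you already use.

from dataclasses import dataclass, asdict
import json

@dataclass
class RetrievalTrace:
    query: str
    retrieved_ids: list[str]   # everything the vector store returned
    reranked_ids: list[str]    # order after reranking
    included_ids: list[str]    # what actually fit in the context window
    dropped_ids: list[str]     # retrieved but never shown to the model

def trace_retrieval(query, retrieved, reranked, token_budget, count_tokens):
    """retrieved / reranked are lists of (chunk_id, text); count_tokens(text) -> int."""
    included, dropped, used = [], [], 0
    for chunk_id, text in reranked:
        cost = count_tokens(text)
        if used + cost <= token_budget:
            included.append(chunk_id)
            used += cost
        else:
            dropped.append(chunk_id)  # the silent failure mode: relevant but discarded
    trace = RetrievalTrace(
        query=query,
        retrieved_ids=[c for c, _ in retrieved],
        reranked_ids=[c for c, _ in reranked],
        included_ids=included,
        dropped_ids=dropped,
    )
    print(json.dumps(asdict(trace)))  # ship to your trace store in practice
    return included, trace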

Semantic Drift Detection

  • How has the embedding space shifted over time?

  • Are similar queries now retrieving different documents?

  • Has the ranking of documents changed for the same query?

This is especially critical for data analytics where your documentation, schema definitions, and business logic are constantly evolving (remember the semantic drift problem from Part 1?). Your RAG system needs to detect when "active customer" starts retrieving different definitions because the business meaning changed.
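A simple starting point is to snapshot what each canonical query retrieves and compare the sets over time. Jaccard distance on document IDs, as sketched below, is crude but catches the "active customer now retrieves different definitions" case; the document IDs in the example are made up.

def retrieval_drift(query: str, previous_ids: list[str], current_ids: list[str]) -> float:
    """Jaccard distance between what a query retrieved last week vs. today.

    0.0 means identical retrieval sets; values near 1.0 mean the query now pulls
    almost entirely different documents, which is worth an alert.
    """
    prev, curr = set(previous_ids), set(current_ids)
    if not prev and not curr:
        return 0.0
    overlap = len(prev & curr) / len(prev | curr)
    return 1.0 - overlap

# Example: the definition docs behind "active customer" changed underneath the agent.
drift = retrieval_drift(
    "what is an active customer",
    previous_ids=["def_v1", "billing_faq", "kpi_glossary"],
    current_ids=["def_v2", "kpi_glossary"],
)
print(f"drift={drift:.2f}")  # 0.75 -> same question, substantially different context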

The Architecture: Observable Tools, Not Just Observable Agents

Here's what a proper instrumentation stack looks like:

┌─────────────────────────────────────────┐
│        Agent Orchestration Layer        │
│   (Prompts, LLM calls, response gen)    │
└─────────────┬───────────────────────────┘
              │
              │   Standard agent observability ends here
              │
┌─────────────▼───────────────────────────┐
│          Tool Execution Layer           │
│                                         │
│  ┌────────────────────────────────────┐ │
│  │ Tool: query_database()             │ │
│  │  • Input validation logging        │ │
│  │  • Query plan capture              │ │
│  │  • Data source state check         │ │
│  │  • Output schema validation        │ │
│  │  • Result quality scoring          │ │
│  └────────────────────────────────────┘ │
│                                         │
│  ┌────────────────────────────────────┐ │
│  │ Tool: retrieve_context()           │ │
│  │  • Embedding vector logging        │ │
│  │  • Retrieval path tracing          │ │
│  │  • Reranking decision capture      │ │
│  │  • Context window allocation log   │ │
│  └────────────────────────────────────┘ │
└─────────────────────────────────────────┘

Every tool becomes a fully instrumented black box that you can crack open when things go wrong.
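In practice this usually means wrapping every tool in the same instrumentation shim rather than instrumenting each one by hand. Here is a rough Python decorator along those lines; observable_tool, the output_checks hook, and printing the event to stdout are illustrative choices, not any specific library's API.

import functools
import json
import time
import traceback

def observable_tool(tool_name, output_checks=()):
    """Wrap a tool so inputs, timing, errors, and output-quality warnings
    are emitted as one structured trace event.

    output_checks is an iterable of callables: check(result) -> list[str] of warnings.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            event = {"tool": tool_name, "args": repr(args), "kwargs": repr(kwargs),
                     "started_at": time.time()}
            try:
                result = fn(*args, **kwargs)
                event["warnings"] = [w for check in output_checks for w in check(result)]
                event["status"] = "ok"
                return result
            except Exception:
                event["status"] = "error"
                event["error"] = traceback.format_exc()
                raise
            finally:
                event["duration_ms"] = round((time.time() - event["started_at"]) * 1000, 1)
                print(json.dumps(event))  # send to your trace backend instead of stdout
        return wrapper
    return decorator

@observable_tool("query_database")
def query_database(sql: str):
    ...  # the real tool body goes here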

Why This Matters for Debugging

Real production scenario:

Symptom: Agent giving inconsistent answers to the same question across different days.

Agent-level observability shows: Same prompt, same model, similar confidence scores.

Tool-level observability reveals:

  • Monday: Tool queried table prod.sales (updated 2 hours ago)

  • Tuesday: Tool queried table prod.sales (updated 18 hours ago—pipeline delay)

  • Data staleness wasn't surfaced to the agent

  • Agent had no signal that confidence should be lower

Fix: Add data freshness signals to tool outputs, teach agent to caveat answers when data is stale.
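The fix itself can be small. A sketch, assuming the tool returns a dict and you know when the source was last refreshed: attach a data_as_of timestamp and an is_stale flag, and have the answer-rendering step surface the caveat.

from datetime import datetime, timedelta, timezone

STALENESS_THRESHOLD = timedelta(hours=6)  # placeholder; tune per data source

def with_freshness(result: dict, last_updated: datetime) -> dict:
    """Attach a freshness signal to a tool result so the agent can caveat its answer."""
    age = datetime.now(timezone.utc) - last_updated
    result["data_as_of"] = last_updated.isoformat()
    result["is_stale"] = age > STALENESS_THRESHOLD
    return result

def render_answer(answer: str, tool_result: dict) -> str:
    """Prompt-side counterpart: surface staleness instead of implying full confidence."""
    if tool_result.get("is_stale"):
        return f"{answer}\n\nNote: the underlying data was last refreshed {tool_result['data_as_of']}."
    return answer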

You can't fix what you can't see. And most teams can't see their tools.

This also becomes critical for the testing and evaluation approach we cover in Part 3—you can't effectively QA an agent if you don't know what your tools are actually doing.

The Build Tax for Tool Observability

If you're building this yourself, here's what you're signing up for:

  1. Instrumentation layer for every tool type

    • Database query tools need query plan capture

    • API tools need rate limit & latency tracking

    • RAG tools need embedding and retrieval path logging

    • Calculation tools need input/output validation

  2. Centralized observability aggregation

    • Collecting logs from distributed tool executions

    • Correlating tool traces with agent traces

    • Building a UI that lets you drill down from agent → tool → data source

  3. Alert & anomaly detection

    • Detecting when tools start behaving differently

    • Catching silent failures (successful execution, wrong result)

    • Identifying data quality degradation

Most teams budget 2-3 weeks for "observability." Then they spend 6 months building this infrastructure and still have blind spots.

What Great Tool Observability Enables

Once you have true tool-level visibility, you unlock:

Root Cause Analysis in Minutes, Not Days

User reports wrong answer → You trace to the specific tool call → You see the exact input/output → You identify the data quality issue that caused it.

Proactive Quality Monitoring

You detect that your calculate_churn tool is returning suspiciously low numbers before any user notices. Turns out a schema change broke a JOIN.
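Monitoring like this doesn't need to start sophisticated. A z-score check against the tool's own recent history, as in the sketch below, is enough to catch a churn number that suddenly drops by an order of magnitude; the threshold and history window are placeholder choices.

import statistics

def looks_anomalous(latest: float, history: list[float], z_threshold: float = 3.0) -> bool:
    """Flag a tool output that deviates sharply from its own history.

    A simple z-score check; production systems likely want seasonality-aware baselines.
    """
    if len(history) < 5:
        return False  # not enough history to judge
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history) or 1e-9
    return abs(latest - mean) / stdev > z_threshold

# Example: monthly churn has hovered around 3%; today's tool call returns 0.4%.
print(looks_anomalous(0.004, [0.031, 0.029, 0.033, 0.030, 0.032]))  # True -> broken JOIN, not good news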

Continuous Improvement Feedback Loops

You can analyze which tools are underperforming, which data sources are unreliable, which retrieval patterns need optimization—all with data, not guesswork.

This is also what makes the evaluation strategies in Part 3 actually actionable—you need tool-level data to understand what's degrading and why.

The Real Question

Before you build an agent, ask yourself:

"If my agent gives a wrong answer at 2 AM on a Saturday, can I debug it without waking up an engineer?"

If the answer is no, your observability stack isn't ready for production.

Most teams instrument their agents like they're debugging a monolith. But agents are distributed systems—with LLMs, tools, databases, APIs, and RAG all working in concert.

You need distributed systems observability. Not a glorified logger.

Next in this series: Part 3 - How to QA an Agent When the Ground Truth Changes Daily, where we tackle the hardest problem in analytics agents: testing against data that won't sit still.

Key Takeaways

  • Agent-level observability tells you what your agent said; tool-level observability tells you why.

  • Real tool observability covers three layers: input validation, execution context, and output structure and quality.

  • RAG needs its own visibility: retrieval path tracing (including chunks retrieved but dropped from the context window) and semantic drift detection.

  • The most dangerous failures are silent ones: successful execution, wrong result. Agent-level logs show green for all of them.

  • Building this yourself typically takes months, not the 2-3 weeks most teams budget.



