The agent development stack nobody talks about: observable tools, not just observable agents

Dec 1, 2025

Ka Ling Wu

Co-Founder & CEO, Upsolve AI


Welcome to Part 2 of our series on building production-grade analytics agents.

In Part 1, we established why analytics agents are fundamentally harder than general-purpose agents: the data beneath them constantly changes. But understanding the problem is only the first step. In this post, we're diving into the infrastructure layer that makes data-aware agents possible: tool-level observability.

Most teams instrument their agents but remain blind to what their tools are actually doing. This is the gap that causes production failures. Let's fix it.

Why Your Agent Observability Stack Is Incomplete

Every AI engineering team knows they need observability. They instrument their LLM calls, track token usage, log prompts and completions. They can tell you exactly what their agent said.

But they can't tell you WHY it said it.

The missing layer: tool-level observability. Because your agent is only as good as the tools it calls, and most teams are flying blind at that layer.

The Tool Visibility Gap

Here's what happens in production:

User asks: "What were our top-performing products last quarter?"

Agent says: "$1.2M in Widget Pro sales led the quarter."

CEO replies: "That's wrong. It was $890K."

Now what? You check your agent logs. The LLM prompt was fine. The response looked confident. The RAG retrieved relevant context. Everything in your observability dashboard is green.

But buried three layers deep, one of your tools:

  • Queried the wrong table (staging instead of production)

  • Hit a rate limit and returned cached data from 2 months ago

  • Successfully ran a query that had a subtle WHERE clause bug

  • Retrieved a schema that had just changed 10 minutes prior

Your agent observability saw the symptom. Tool observability would have caught the cause.

What Tool-Level Observability Actually Means

Most teams think tool observability means "logging which tools were called." That's like saying car diagnostics means checking if the engine is running.

Real tool-level observability requires visibility into:

1. Input Validation & Sanitization

  • What parameters did the agent pass to this tool?

  • Were they within expected ranges?

  • Did type coercion happen silently?

  • Were any security filters applied?

Example: Your agent calls query_revenue(region='North America') but your tool silently converts it to query_revenue(region='North_America') (underscore instead of space). The query returns empty. Your agent confidently says "No revenue in North America."

Without input observability, you're debugging ghosts.
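
Here's a minimal Python sketch of what that input-level visibility can look like for a hypothetical query_revenue tool. The region whitelist and the stand-in query function are assumptions for illustration, not a prescribed implementation:

import logging

logger = logging.getLogger("tool.inputs")

VALID_REGIONS = {"North America", "EMEA", "APAC", "LATAM"}

def run_revenue_query(region: str) -> dict:
    # Stand-in for your actual warehouse query.
    return {"region": region, "revenue": 890_000}

def query_revenue(region: str) -> dict:
    # Log exactly what the agent passed in, before any coercion can happen.
    logger.info("query_revenue called with raw region=%r", region)
    if region not in VALID_REGIONS:
        # Fail loudly instead of silently normalizing the value.
        logger.warning("query_revenue received unknown region %r", region)
        raise ValueError(f"Unknown region {region!r}; expected one of {sorted(VALID_REGIONS)}")
    return run_revenue_query(region)

The validation logic itself is beside the point. What matters is that the raw input is logged before anything can coerce it, so the underscore-for-space bug surfaces in one log line instead of a debugging session.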

2. Execution Context

  • Which data sources did this tool actually access?

  • What was the state of those sources (updated 5 min ago? 5 hours ago?)

  • Were any fallbacks or retries triggered?

  • What was the query plan/execution path?

Example: Your get_customer_metrics tool is supposed to hit your real-time database. But that database is under load, so your tool silently falls back to the 6-hour-delayed replica. Your agent just gave the CEO stale data, and your observability shows "tool executed successfully."

This is the data drift problem we introduced in Part 1, and without tool-level observability, you'll never catch it until it's too late.
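
A sketch of what capturing that execution context might look like, assuming a primary real-time store with a delayed replica fallback. The source names, the freshness field, and the stub query functions are illustrative:

import logging
import time
from dataclasses import dataclass

logger = logging.getLogger("tool.execution")

@dataclass
class ExecutionContext:
    source: str               # which backend actually served the data
    freshness_seconds: float  # age of the data that was actually served
    fallback_used: bool
    retries: int

def query_realtime(customer_id: str) -> dict:
    # Stand-in for the primary store; here it simulates being under load.
    raise TimeoutError("primary under load")

def query_replica(customer_id: str) -> dict:
    # Stand-in for the delayed replica, 6 hours behind.
    return {"customer_id": customer_id, "mrr": 4200, "last_updated_ts": time.time() - 6 * 3600}

def get_customer_metrics(customer_id: str) -> tuple[dict, ExecutionContext]:
    source, fallback, retries = "realtime", False, 0
    try:
        data = query_realtime(customer_id)
    except TimeoutError:
        source, fallback, retries = "replica", True, 1
        data = query_replica(customer_id)
    ctx = ExecutionContext(
        source=source,
        freshness_seconds=time.time() - data["last_updated_ts"],
        fallback_used=fallback,
        retries=retries,
    )
    # The execution context rides along with the result instead of disappearing.
    logger.info("get_customer_metrics context: %s", ctx)
    return data, ctx

data, ctx = get_customer_metrics("cus_123")
print(ctx)  # source='replica', freshness_seconds=~21600, fallback_used=True

The fallback still happens, but it's now recorded next to the result instead of disappearing into a generic "success" status.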

3. Output Structure & Quality

  • Did the tool return the expected schema?

  • Were there any null values or missing fields?

  • How does this output compare to historical patterns?

  • What's the confidence/quality score of this data?

Example: Your tool successfully retrieves "revenue by region" but the EMEA row has NULL values because of a data pipeline failure. Your agent sees the data structure is correct and happily tells the user "EMEA had zero revenue," which is technically true but catastrophically wrong.
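
A small output check, run before the agent ever sees the result, would catch exactly this case. A sketch, with the expected columns assumed for the example:

import logging

logger = logging.getLogger("tool.outputs")

EXPECTED_COLUMNS = {"region", "revenue"}

def check_revenue_rows(rows: list[dict]) -> list[str]:
    # Returns data-quality warnings; an empty list means the output looks sane.
    warnings = []
    for row in rows:
        missing = EXPECTED_COLUMNS - row.keys()
        if missing:
            warnings.append(f"row {row} is missing columns {sorted(missing)}")
        elif row["revenue"] is None:
            # NULL revenue is not the same thing as zero revenue.
            warnings.append(f"region {row['region']!r} has NULL revenue (upstream pipeline issue?)")
    for warning in warnings:
        logger.warning(warning)
    return warnings

# The EMEA row came back NULL after a pipeline failure:
rows = [{"region": "NA", "revenue": 1_200_000}, {"region": "EMEA", "revenue": None}]
print(check_revenue_rows(rows))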

The RAG Visibility Problem

RAG adds another layer where most teams are blind. You're not just calling tools; you're retrieving context from vector stores, knowledge bases, and semantic layers.

Standard observability shows you:

  • Which documents were retrieved

  • Their similarity scores

  • How they were ranked

But that's not enough. You need:

Retrieval Path Visibility

  • What was the embedding of the original query?

  • What reranking happened?

  • Were any filters applied (time-based, access-control, data quality)?

  • What chunks were retrieved but NOT used in the final context window?

That last one is critical. Often the most relevant chunk gets retrieved but then dropped due to context window limits. Your agent gives a wrong answer, your logs show "high-quality retrieval," and you never know the right answer was retrieved but discarded.
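
One way to close that gap is to log the packing decision itself: which chunks made it into the context window and which were dropped. A rough sketch, using a crude word-count token estimate purely for illustration:

import logging

logger = logging.getLogger("tool.retrieval")

def pack_context(chunks: list[dict], budget_tokens: int) -> list[dict]:
    # chunks: [{"id": ..., "text": ..., "score": ...}], sorted best-first.
    used, dropped, spent = [], [], 0
    for chunk in chunks:
        cost = len(chunk["text"].split())  # crude token estimate, not a real tokenizer
        if spent + cost <= budget_tokens:
            used.append(chunk)
            spent += cost
        else:
            dropped.append(chunk)
    # This is the log line that tells you the right chunk was retrieved but discarded.
    logger.info(
        "context packed: used=%s dropped=%s",
        [c["id"] for c in used],
        [(c["id"], c["score"]) for c in dropped],
    )
    return used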

Semantic Drift Detection

  • How has the embedding space shifted over time?

  • Are similar queries now retrieving different documents?

  • Has the ranking of documents changed for the same query?

This is especially critical for data analytics where your documentation, schema definitions, and business logic are constantly evolving (remember the semantic drift problem from Part 1?). Your RAG system needs to detect when "active customer" starts retrieving different definitions because the business meaning changed.
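
One lightweight way to catch this is to replay a handful of canary queries on a schedule and compare today's retrieved document IDs against a stored baseline. A sketch, where retrieve_ids stands in for your own retriever call and the overlap threshold is an assumption:

import logging

logger = logging.getLogger("tool.drift")

def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def check_retrieval_drift(query: str, baseline_ids: set[str], retrieve_ids, threshold: float = 0.5) -> float:
    current_ids = set(retrieve_ids(query))
    overlap = jaccard(baseline_ids, current_ids)
    if overlap < threshold:
        # Same query, different documents: something in the corpus or embeddings moved.
        logger.warning(
            "retrieval drift on %r: overlap=%.2f baseline=%s now=%s",
            query, overlap, sorted(baseline_ids), sorted(current_ids),
        )
    return overlap

# Example: "active customer" now retrieves a different definition doc.
baseline = {"defs/active_customer_v1", "metrics/churn"}
overlap = check_retrieval_drift(
    "active customer definition",
    baseline,
    retrieve_ids=lambda q: ["defs/active_customer_v2", "metrics/churn"],
)
print(f"overlap: {overlap:.2f}")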

The Architecture: Observable Tools, Not Just Observable Agents

Here's what a proper instrumentation stack looks like:

┌─────────────────────────────────────────┐
│         Agent Orchestration Layer       │
│  (Prompts, LLM calls, response gen)     │
└─────────────┬───────────────────────────┘
              │
              │ ← Standard agent observability ends here
              │
┌─────────────▼───────────────────────────┐
│         Tool Execution Layer            │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │  Tool: query_database()           │  │
│  │  • Input validation logging       │  │
│  │  • Query plan capture             │  │
│  │  • Data source state check        │  │
│  │  • Output schema validation       │  │
│  │  • Result quality scoring         │  │
│  └───────────────────────────────────┘  │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │  Tool: retrieve_context()         │  │
│  │  • Embedding vector logging       │  │
│  │  • Retrieval path tracing         │  │
│  │  • Reranking decision capture     │  │
│  │  • Context window allocation log  │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Every tool becomes a fully instrumented black box that you can crack open when things go wrong.
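
In code, this usually starts as a wrapper around every tool function that emits one structured trace event per call: inputs, outputs, timing, errors, and the agent run it belongs to. A minimal sketch of that pattern, with no particular tracing vendor assumed:

import functools
import json
import logging
import time
import uuid

logger = logging.getLogger("tool.trace")

def observable_tool(fn):
    @functools.wraps(fn)
    def wrapper(*args, _agent_trace_id=None, **kwargs):
        record = {
            "tool": fn.__name__,
            "span_id": uuid.uuid4().hex[:8],
            "agent_trace_id": _agent_trace_id,
            "inputs": {"args": [repr(a) for a in args],
                       "kwargs": {k: repr(v) for k, v in kwargs.items()}},
        }
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            record["status"] = "ok"
            record["output_preview"] = repr(result)[:200]
            return result
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
            # One structured event per tool call, correlated with the agent run.
            logger.info(json.dumps(record))
    return wrapper

@observable_tool
def query_database(sql: str) -> list[dict]:
    # Stand-in for a real warehouse call.
    return [{"region": "NA", "revenue": 1_200_000}]

query_database("SELECT region, revenue FROM prod.sales", _agent_trace_id="agent-run-42")

The query-plan capture, data source state checks, and output validation from the diagram all hang off this same record once the wrapper exists.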

Why This Matters for Debugging

Real production scenario:

Symptom: Agent giving inconsistent answers to the same question across different days.

Agent-level observability shows: Same prompt, same model, similar confidence scores.

Tool-level observability reveals:

  • Monday: Tool queried table prod.sales (updated 2 hours ago)

  • Tuesday: Tool queried table prod.sales (updated 18 hours agoβ€”pipeline delay)

  • Data staleness wasn't surfaced to the agent

  • Agent had no signal that confidence should be lower

Fix: Add data freshness signals to tool outputs and teach the agent to caveat answers when the data is stale.
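
Concretely, the fix can be as small as returning freshness metadata with every tool result so the agent has something to caveat with. A sketch, where the 6-hour staleness threshold and the table name are illustrative assumptions:

from datetime import datetime, timezone, timedelta

STALENESS_THRESHOLD = timedelta(hours=6)

def with_freshness(rows: list[dict], last_updated: datetime) -> dict:
    age = datetime.now(timezone.utc) - last_updated
    return {
        "rows": rows,
        "source_table": "prod.sales",
        "last_updated": last_updated.isoformat(),
        "is_stale": age > STALENESS_THRESHOLD,
        # A ready-made caveat the agent can quote instead of inventing confidence.
        "caveat": (
            f"Data was last refreshed {age.total_seconds() / 3600:.0f} hours ago."
            if age > STALENESS_THRESHOLD else None
        ),
    }

payload = with_freshness(
    rows=[{"product": "Widget Pro", "sales": 890_000}],
    last_updated=datetime.now(timezone.utc) - timedelta(hours=18),
)
print(payload["is_stale"], payload["caveat"])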

You can't fix what you can't see. And most teams can't see their tools.

This also becomes critical for the testing and evaluation approach we cover in Part 3: you can't effectively QA an agent if you don't know what your tools are actually doing.

The Build Tax for Tool Observability

If you're building this yourself, here's what you're signing up for:

  1. Instrumentation layer for every tool type

    • Database query tools need query plan capture

    • API tools need rate limit & latency tracking

    • RAG tools need embedding and retrieval path logging

    • Calculation tools need input/output validation

  2. Centralized observability aggregation

    • Collecting logs from distributed tool executions

    • Correlating tool traces with agent traces

    • Building a UI that lets you drill down from agent β†’ tool β†’ data source

  3. Alert & anomaly detection

    • Detecting when tools start behaving differently

    • Catching silent failures (successful execution, wrong result)

    • Identifying data quality degradation

Most teams budget 2-3 weeks for "observability." Then they spend 6 months building this infrastructure and still have blind spots.

What Great Tool Observability Enables

Once you have true tool-level visibility, you unlock:

Root Cause Analysis in Minutes, Not Days

User reports wrong answer → You trace to the specific tool call → You see the exact input/output → You identify the data quality issue that caused it.

Proactive Quality Monitoring

You detect that your calculate_churn tool is returning suspiciously low numbers before any user notices. Turns out a schema change broke a JOIN.
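
"Suspiciously low" can be made precise with nothing fancier than a rolling baseline. A sketch, where the z-score threshold and the sample values are illustrative:

from statistics import mean, stdev

def churn_looks_suspicious(history: list[float], current: float, z_threshold: float = 3.0) -> bool:
    # history: recent daily churn values from the same tool; current: today's value.
    if len(history) < 5:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Example: churn has hovered around 2%, then a broken JOIN makes it read 0.1%.
print(churn_looks_suspicious([2.1, 1.9, 2.0, 2.2, 2.0, 1.8], 0.1))  # True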

Continuous Improvement Feedback Loops

You can analyze which tools are underperforming, which data sources are unreliable, which retrieval patterns need optimization. All with data, not guesswork.

This is also what makes the evaluation strategies in Part 3 actually actionable: you need tool-level data to understand what's degrading and why.

The Real Question

Before you build an agent, ask yourself:

"If my agent gives a wrong answer at 2 AM on a Saturday, can I debug it without waking up an engineer?"

If the answer is no, your observability stack isn't ready for production.

Most teams instrument their agents like they're debugging a monolith. But agents are distributed systems, with LLMs, tools, databases, APIs, and RAG all working in concert.

You need distributed systems observability. Not a glorified logger.

Next in this series: Part 3 - How to QA an Agent When the Ground Truth Changes Daily, where we tackle the hardest problem in analytics agents: testing against data that won't sit still.

Ready to Upsolve Your Product?

Unlock the full potential of your product's value today with Upsolve AI's embedded BI.

Start Here
