The agent development stack nobody talks about: observable tools, not just observable agents
Dec 1, 2025

Ka Ling Wu
Co-Founder & CEO, Upsolve AI
Welcome to Part 2 of our series on building production-grade analytics agents.
In Part 1, we established why analytics agents are fundamentally harder than general-purpose agents: the data beneath them constantly changes. But understanding the problem is only the first step. In this post, we're diving into the infrastructure layer that makes data-aware agents possible: tool-level observability.
Part 1: Why Git-Style Versioning Breaks for Data Analytics Agents
Part 2: The Agent Development Stack Nobody Talks About: Observable Tools, Not Just Observable Agents (you are here)
Part 3: How to QA an Agent When the Ground Truth Changes Daily
Most teams instrument their agents but remain blind to what their tools are actually doing. This is the gap that causes production failures. Let's fix it.
Why Your Agent Observability Stack Is Incomplete
Every AI engineering team knows they need observability. They instrument their LLM calls, track token usage, log prompts and completions. They can tell you exactly what their agent said.
But they can't tell you WHY it said it.
The missing layer: tool-level observability. Because your agent is only as good as the tools it calls, and most teams are flying blind at that layer.
The Tool Visibility Gap
Here's what happens in production:
User asks: "What were our top-performing products last quarter?"
Agent says: "$1.2M in Widget Pro sales led the quarter."
CEO replies: "That's wrong. It was $890K."
Now what? You check your agent logs. The LLM prompt was fine. The response looked confident. The RAG retrieved relevant context. Everything in your observability dashboard is green.
But buried three layers deep, one of your tools:
Queried the wrong table (staging instead of production)
Hit a rate limit and returned cached data from 2 months ago
Successfully ran a query that had a subtle WHERE clause bug
Retrieved a schema that had just changed 10 minutes prior
Your agent observability saw the symptom. Tool observability would have caught the cause.
What Tool-Level Observability Actually Means
Most teams think tool observability means "logging which tools were called." That's like saying car diagnostics means checking if the engine is running.
Real tool-level observability requires visibility into:
1. Input Validation & Sanitization
What parameters did the agent pass to this tool?
Were they within expected ranges?
Did type coercion happen silently?
Were any security filters applied?
Example: Your agent calls query_revenue(region='North America') but your tool silently converts it to query_revenue(region='North_America') (underscore instead of space). The query returns empty. Your agent confidently says "No revenue in North America."
Without input observability, you're debugging ghosts.
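As a concrete illustration, here is a minimal sketch of input observability in Python. The query_revenue tool, the allow-list of regions, and the space-to-underscore normalization are all hypothetical stand-ins for the example above; the point is that the raw arguments, the arguments the tool actually executes with, and any silent coercion get logged before the tool runs.

```python
import json
import logging
from typing import Any, Callable

logger = logging.getLogger("tool_observability")

# Hypothetical allow-list; in practice this would come from your semantic layer.
VALID_REGIONS = {"North America", "EMEA", "APAC"}

def observe_inputs(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Log what the agent passed, what the tool will actually run with, and any silent changes."""
    def wrapper(**agent_args: Any) -> Any:
        # Whatever normalization your tool applies (here: the space -> underscore
        # substitution from the example above) is captured instead of hidden.
        normalized = {
            k: v.strip().replace(" ", "_") if isinstance(v, str) else v
            for k, v in agent_args.items()
        }
        logger.info(json.dumps({
            "tool": tool.__name__,
            "raw_args": agent_args,
            "normalized_args": normalized,
            "silently_coerced": sorted(k for k in agent_args if normalized[k] != agent_args[k]),
            "out_of_range": [k for k, v in agent_args.items()
                             if k == "region" and v not in VALID_REGIONS],
        }))
        return tool(**normalized)
    return wrapper

@observe_inputs
def query_revenue(region: str) -> float:
    ...  # the real implementation would query the warehouse here

query_revenue(region="North America")  # the coercion now shows up in the log
```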
2. Execution Context
Which data sources did this tool actually access?
What was the state of those sources (updated 5 min ago? 5 hours ago?)
Were any fallbacks or retries triggered?
What was the query plan/execution path?
Example: Your get_customer_metrics tool is supposed to hit your real-time database. But that database is under load, so your tool silently falls back to the 6-hour-delayed replica. Your agent just gave the CEO stale data, and your observability shows "tool executed successfully."
This is the data drift problem we introduced in Part 1, and without tool-level observability you'll never catch it until it's too late.
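One way to surface this is to return an execution context alongside the data, so fallbacks and source lag are never silent. The sketch below assumes a hypothetical get_customer_metrics tool with injected primary and replica query functions and a known replica lag; none of these names come from a specific library.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ExecutionContext:
    """Record of what the tool actually did, returned alongside its result."""
    source: str = "unknown"
    source_lag_seconds: float = 0.0
    fallback_used: bool = False
    events: list[str] = field(default_factory=list)

def get_customer_metrics(
    customer_id: str,
    query_primary: Callable[[str], list[dict]],   # hypothetical real-time client
    query_replica: Callable[[str], list[dict]],   # hypothetical delayed replica
    replica_lag_seconds: float,
) -> dict[str, Any]:
    ctx = ExecutionContext()
    try:
        rows = query_primary(customer_id)
        ctx.source = "primary"
    except TimeoutError:
        # The fallback is no longer silent: it is recorded in the context
        # and can be surfaced to the agent and to your dashboards.
        ctx.fallback_used = True
        ctx.source = "replica"
        ctx.source_lag_seconds = replica_lag_seconds
        ctx.events.append("primary timed out; fell back to replica")
        rows = query_replica(customer_id)
    return {"data": rows, "execution_context": ctx}
```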
3. Output Structure & Quality
Did the tool return the expected schema?
Were there any null values or missing fields?
How does this output compare to historical patterns?
What's the confidence/quality score of this data?
Example: Your tool successfully retrieves "revenue by region," but the EMEA row has NULL values because of a data pipeline failure. Your agent sees that the structure is correct and happily tells the user "EMEA had zero revenue": an answer that looks well-formed but is catastrophically wrong.
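A lightweight guard against exactly this failure is to validate the output before the agent ever sees it. The sketch below assumes a hypothetical revenue-by-region schema and a rough historical row-count baseline; it attaches a quality report to the result instead of passing raw rows through.

```python
from typing import Any

EXPECTED_FIELDS = {"region", "revenue"}   # assumed schema for revenue-by-region

def validate_output(rows: list[dict[str, Any]],
                    historical_avg_rows: int) -> dict[str, Any]:
    """Attach a quality report to a tool result instead of returning it blindly."""
    issues: list[str] = []
    for i, row in enumerate(rows):
        missing = EXPECTED_FIELDS - row.keys()
        if missing:
            issues.append(f"row {i}: missing fields {sorted(missing)}")
        nulls = [k for k in EXPECTED_FIELDS & row.keys() if row[k] is None]
        if nulls:
            issues.append(f"row {i} ({row.get('region')}): NULL values in {nulls}")
    if historical_avg_rows and len(rows) < 0.5 * historical_avg_rows:
        issues.append(f"row count {len(rows)} is far below historical average {historical_avg_rows}")
    return {"rows": rows, "quality_issues": issues, "quality_ok": not issues}

# The EMEA pipeline failure above now surfaces as a quality issue
# instead of being read as "zero revenue".
report = validate_output(
    [{"region": "NA", "revenue": 1_200_000}, {"region": "EMEA", "revenue": None}],
    historical_avg_rows=3,
)
```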
The RAG Visibility Problem
RAG adds another layer where most teams are blind. You're not just calling tools; you're retrieving context from vector stores, knowledge bases, and semantic layers.
Standard observability shows you:
Which documents were retrieved
Their similarity scores
How they were ranked
But that's not enough. You need:
Retrieval Path Visibility
What was the embedding of the original query?
What reranking happened?
Were any filters applied (time-based, access-control, data quality)?
What chunks were retrieved but NOT used in the final context window?
That last one is critical. Often the most relevant chunk gets retrieved but then dropped due to context window limits. Your agent gives a wrong answer, your logs show "high-quality retrieval," and you never know the right answer was retrieved but discarded.
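A simple way to make that visible, sketched below with a hypothetical Chunk type and token budget, is to record the retrieved-but-dropped chunks as part of the retrieval trace at the moment you assemble the context window.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float          # similarity / rerank score
    token_count: int

def assemble_context(ranked_chunks: list[Chunk], token_budget: int) -> dict:
    """Pack chunks into the context window and log what was retrieved but dropped."""
    used, dropped, spent = [], [], 0
    for chunk in ranked_chunks:
        if spent + chunk.token_count <= token_budget:
            used.append(chunk)
            spent += chunk.token_count
        else:
            dropped.append(chunk)
    return {
        "context": "\n\n".join(c.text for c in used),
        "retrieval_trace": {
            "used": [(c.doc_id, c.score) for c in used],
            # The critical signal: high-scoring chunks that never reached the model.
            "dropped": [(c.doc_id, c.score) for c in dropped],
            "tokens_spent": spent,
            "token_budget": token_budget,
        },
    }
```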
Semantic Drift Detection
How has the embedding space shifted over time?
Are similar queries now retrieving different documents?
Has the ranking of documents changed for the same query?
This is especially critical for data analytics where your documentation, schema definitions, and business logic are constantly evolving (remember the semantic drift problem from Part 1?). Your RAG system needs to detect when "active customer" starts retrieving different definitions because the business meaning changed.
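One pragmatic detector, sketched below, is to re-run a fixed set of canary queries on a schedule and compare the retrieved document IDs against the previous snapshot; low overlap for the same query is a drift signal. The query text and document IDs here are purely illustrative.

```python
def retrieval_drift(previous: dict[str, list[str]],
                    current: dict[str, list[str]]) -> dict[str, float]:
    """Compare retrieved doc IDs for the same canary queries across two snapshots.

    `previous` and `current` map query text -> ranked doc IDs; a low Jaccard
    overlap means the same question now pulls different definitions.
    """
    drift = {}
    for query, old_ids in previous.items():
        new_ids = current.get(query, [])
        old_set, new_set = set(old_ids), set(new_ids)
        union = old_set | new_set
        overlap = len(old_set & new_set) / len(union) if union else 1.0
        drift[query] = 1.0 - overlap
    return drift

# Example: "active customer" now retrieves a different definition document.
drift = retrieval_drift(
    previous={"what is an active customer?": ["def_v1", "faq_12"]},
    current={"what is an active customer?": ["def_v2", "faq_12"]},
)
# drift["what is an active customer?"] == 1 - 1/3 ≈ 0.67 -> worth alerting on
```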
The Architecture: Observable Tools, Not Just Observable Agents
Here's what a proper instrumentation stack looks like:
Every tool stops being a black box and becomes a fully instrumented component you can crack open when things go wrong.
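As a rough sketch of what "fully instrumented" can mean in practice, here is a Python decorator that wraps any tool and emits a structured trace event with inputs, status, duration, and a correlation ID back to the agent's trace. The tool name and the logging destination are placeholders under assumed conventions, not a prescribed implementation.

```python
import functools
import json
import logging
import time
import uuid
from typing import Any, Callable

logger = logging.getLogger("tool_traces")

def instrumented_tool(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Wrap a tool so every call emits a structured trace event."""
    @functools.wraps(tool)
    def wrapper(*args: Any, agent_trace_id: str = "", **kwargs: Any) -> Any:
        event = {
            "tool": tool.__name__,
            "tool_call_id": str(uuid.uuid4()),
            "agent_trace_id": agent_trace_id,   # correlate tool trace with agent trace
            "inputs": {"args": args, "kwargs": kwargs},
        }
        start = time.perf_counter()
        try:
            result = tool(*args, **kwargs)
            event["status"] = "ok"
            event["output_summary"] = str(result)[:500]   # truncate large payloads
            return result
        except Exception as exc:
            event["status"] = "error"
            event["error"] = repr(exc)
            raise
        finally:
            event["duration_ms"] = round((time.perf_counter() - start) * 1000, 2)
            logger.info(json.dumps(event, default=str))
    return wrapper

@instrumented_tool
def query_revenue(region: str, quarter: str) -> float:
    ...  # the real query against the warehouse goes here
```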
Why This Matters for Debugging
Real production scenario:
Symptom: Agent giving inconsistent answers to the same question across different days.
Agent-level observability shows: Same prompt, same model, similar confidence scores.
Tool-level observability reveals:
Monday: Tool queried table prod.sales (updated 2 hours ago)
Tuesday: Tool queried table prod.sales (updated 18 hours ago, pipeline delay)
Data staleness wasn't surfaced to the agent
Agent had no signal that confidence should be lower
Fix: Add data freshness signals to tool outputs and teach the agent to caveat answers when data is stale.
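A minimal version of that fix, assuming a hypothetical last_updated_at field on tool outputs and an arbitrary six-hour staleness threshold, looks something like this:

```python
from datetime import datetime, timezone
from typing import Optional

STALENESS_THRESHOLD_HOURS = 6   # assumed policy; tune per data source

def freshness_caveat(last_updated_at: datetime) -> Optional[str]:
    """Return a caveat string the agent should attach when the data is stale."""
    age_hours = (datetime.now(timezone.utc) - last_updated_at).total_seconds() / 3600
    if age_hours > STALENESS_THRESHOLD_HOURS:
        return (f"Note: this answer is based on data last refreshed "
                f"{age_hours:.0f} hours ago; the pipeline may be delayed.")
    return None

# Tool output now carries freshness metadata alongside the rows:
tool_output = {
    "rows": [{"region": "NA", "revenue": 890_000}],
    "last_updated_at": datetime(2025, 12, 1, 3, 0, tzinfo=timezone.utc),
}
caveat = freshness_caveat(tool_output["last_updated_at"])
```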
You can't fix what you can't see. And most teams can't see their tools.
This also becomes critical for the testing and evaluation approach we cover in Part 3: you can't effectively QA an agent if you don't know what your tools are actually doing.
The Build Tax for Tool Observability
If you're building this yourself, here's what you're signing up for:
Instrumentation layer for every tool type
Database query tools need query plan capture
API tools need rate limit & latency tracking
RAG tools need embedding and retrieval path logging
Calculation tools need input/output validation
Centralized observability aggregation
Collecting logs from distributed tool executions
Correlating tool traces with agent traces
Building a UI that lets you drill down from agent → tool → data source
Alert & anomaly detection
Detecting when tools start behaving differently
Catching silent failures (successful execution, wrong result), as sketched after this list
Identifying data quality degradation
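For the silent-failure case in particular, even a crude statistical check goes a long way. The sketch below flags a "successful" tool result whose value deviates sharply from its recent history; the churn numbers and threshold are illustrative, not tuned values.

```python
import statistics

def looks_anomalous(value: float, history: list[float], z_threshold: float = 3.0) -> bool:
    """Flag a 'successful' tool result whose value deviates sharply from its history."""
    if len(history) < 5:
        return False                      # not enough history to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return value != mean
    return abs(value - mean) / stdev > z_threshold

# A churn tool that suddenly returns 0.4% when it usually returns ~4%
# executed "successfully" but should page someone before a user notices.
alert = looks_anomalous(0.004, history=[0.041, 0.039, 0.043, 0.040, 0.038, 0.042])
```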
Most teams budget 2-3 weeks for "observability." Then they spend 6 months building this infrastructure and still have blind spots.
What Great Tool Observability Enables
Once you have true tool-level visibility, you unlock:
Root Cause Analysis in Minutes, Not Days
User reports wrong answer → You trace to the specific tool call → You see the exact input/output → You identify the data quality issue that caused it.
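Concretely, if your tools emit structured events like the wrapper sketched earlier, that trace is just a filter over the log. The log file name and trace ID below are placeholders.

```python
import json
from typing import Iterator

def tool_events_for_trace(log_path: str, agent_trace_id: str) -> Iterator[dict]:
    """Walk the structured tool-trace log and pull every tool call for one agent run."""
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("agent_trace_id") == agent_trace_id:
                yield event

# Given the trace ID attached to the user's bad answer, replay exactly which
# tools ran, with which inputs, and what they returned.
for event in tool_events_for_trace("tool_traces.jsonl", "trace-8f3a"):
    print(event["tool"], event["status"], event["inputs"], event.get("output_summary"))
```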
Proactive Quality Monitoring
You detect that your calculate_churn tool is returning suspiciously low numbers before any user notices. Turns out a schema change broke a JOIN.
Continuous Improvement Feedback Loops
You can analyze which tools are underperforming, which data sources are unreliable, which retrieval patterns need optimization. All with data, not guesswork.
This is also what makes the evaluation strategies in Part 3 actually actionable: you need tool-level data to understand what's degrading and why.
The Real Question
Before you build an agent, ask yourself:
"If my agent gives a wrong answer at 2 AM on a Saturday, can I debug it without waking up an engineer?"
If the answer is no, your observability stack isn't ready for production.
Most teams instrument their agents like they're debugging a monolith. But agents are distributed systems, with LLMs, tools, databases, APIs, and RAG all working in concert.
You need distributed systems observability. Not a glorified logger.
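One way to get there, sketched here with OpenTelemetry-style nested spans (an assumption about tooling, not a prescription; it requires the opentelemetry-api package and an exporter configured elsewhere), is to make every tool call a child span of the agent run, so the agent → tool → data source path shows up as a single distributed trace.

```python
from opentelemetry import trace

tracer = trace.get_tracer("analytics-agent")

def answer_question(question: str) -> str:
    # One trace for the whole agent run...
    with tracer.start_as_current_span("agent.run") as agent_span:
        agent_span.set_attribute("agent.question", question)
        # ...with a child span per tool call, so tool behavior is visible
        # in the same trace view as the agent's reasoning steps.
        with tracer.start_as_current_span("tool.query_revenue") as tool_span:
            tool_span.set_attribute("tool.args.region", "North America")
            tool_span.set_attribute("data.source", "prod.sales")
            tool_span.set_attribute("data.updated_minutes_ago", 120)
            revenue = 890_000  # placeholder for the real tool call
        return f"Revenue was ${revenue:,}"
```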
Next in this series: Part 3 - How to QA an Agent When the Ground Truth Changes Daily, where we tackle the hardest problem in analytics agents: testing against data that won't sit still.


