How to QA an agent when the ground truth changes daily
Dec 5, 2025

Ka Ling Wu
Co-Founder & CEO, Upsolve AI
Welcome to Part 3 of our series on building production-grade analytics agents:
In Part 1, we established that data analytics agents operate on shifting ground—your data changes, schemas evolve, and historical facts get restated. In Part 2, we covered the observability infrastructure needed to understand what your agent is actually doing when things go wrong.
Now we tackle the hardest problem: How do you test an agent when the correct answers keep changing?
Part 1: Why Git-Style Versioning Breaks for Data Analytics Agents
Part 2: The Agent Deployment Stack Nobody Talks About: Observable Tools, Not Just Observable Agents
Part 3: How to QA an Agent When the Ground Truth Changes Daily (you are here)
Traditional software testing assumes stability. Analytics agent testing requires embracing continuous change. Let's break down how to actually do this.
The Testing Problem Nobody Prepared You For
Software QA is built on a simple premise: correct behavior is stable. You write a test, it passes, and if the test fails tomorrow, you know something broke.
This doesn't work for data analytics agents.
Because in analytics, correct answers have an expiration date.
Why Traditional Testing Breaks
Standard agent testing looks like this:
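Something like the sketch below, where `agent.ask` is a hypothetical interface and the pre-restatement dollar figure is purely illustrative:

```python
# A pinned-expectation test: the "right" answer is frozen at authoring time.
# `agent.ask` and the $2.3M figure are illustrative, not a real API or number.
def test_q3_revenue(agent):
    answer = agent.ask("What was Q3 revenue?")
    assert answer == "$2.3M"  # true on the day the test was written
```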
Then accounting finds an error and restates Q3 revenue. The correct answer is now "$2.1M."
Your test still passes. Your agent is now confidently wrong. Your CEO makes decisions based on bad data.
This is the fundamental problem: You're testing against a snapshot when you need to be testing against a stream.
The Three Types of Ground Truth Drift
Remember the three layers of instability we introduced in Part 1? They directly translate into three types of testing challenges:
Type 1: Historical Corrections
Past data gets corrected. Revenue gets restated. Transactions get reclassified. Your agent learned the wrong history.
Type 2: Definition Evolution
"Active customer" meant something different in Q1 than Q4. Your agent's understanding is frozen in time.
Type 3: Schema Changes
Tables get renamed, columns get merged, relationships change. Your agent is querying a world that no longer exists.
Most teams only plan for Type 3. Types 1 and 2 destroy them in production.
What Doesn't Work (And Why Teams Try It Anyway)
❌ Approach 1: Freeze Your Test Dataset
"We'll use a fixed test dataset that never changes."
Problem: Now you're testing your agent's ability to answer questions about frozen data, not live data. You have high test coverage and zero production confidence.
❌ Approach 2: Manual Test Updates
"We'll update tests when we notice things changed."
Problem: You notice things changed when users complain. You're QA-ing in production with your customers as testers.
❌ Approach 3: Ignore It
"We'll rely on LLM improvements and hope for the best."
Problem: LLMs don't magically know when your revenue table got backfilled. Better models don't fix bad data.
What Actually Works: Continuous Evaluation with Drift Detection
The only QA approach that works for analytics agents is one that treats data drift as a first-class concern.
This is where the tool-level observability from Part 2 becomes critical—you can't evaluate what you can't see. You need visibility into every data access, every schema change, every quality signal to build an evaluation framework that actually works.
Component 1: Versioned Ground Truth
Don't store static expected answers. Store versioned expectations:
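A versioned expectation might look like the record below; the field names and values are illustrative, not a prescribed schema:

```python
# One question, several versions of the truth. Fields are illustrative.
ground_truth = {
    "question": "What was Q3 revenue?",
    "versions": [
        {"value": "$2.3M", "valid_from": "2025-10-01",
         "valid_to": "2025-11-14", "reason": "initial close"},
        {"value": "$2.1M", "valid_from": "2025-11-15",
         "valid_to": None,  # current truth
         "reason": "Q3 restated after an accounting correction"},
    ],
}
```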
Your evaluation framework needs to know:
What the answer SHOULD BE right now
What the answer WAS at any point in history
When and why it changed
This lets you distinguish between:
Agent regression (it got worse)
Data evolution (the right answer changed)
Agent adaptation failure (it didn't learn the new truth)
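With versioned expectations, that triage can be largely mechanical. A minimal sketch, assuming the record format above and a hypothetical `expected_at` lookup:

```python
from datetime import date

def expected_at(versions, day):
    """Return the expected value that was valid on `day` (hypothetical helper)."""
    for v in versions:
        start = date.fromisoformat(v["valid_from"])
        end = date.fromisoformat(v["valid_to"]) if v["valid_to"] else date.max
        if start <= day <= end:
            return v["value"]
    return None

def classify(agent_answer, versions, today=None):
    today = today or date.today()
    current = expected_at(versions, today)
    if agent_answer == current:
        return "pass"                 # tracks the current truth
    if any(agent_answer == v["value"] for v in versions):
        return "adaptation_failure"   # right answer for an older version of the data
    return "agent_regression"         # never was the right answer

# Note: data evolution (the right answer changed) isn't a failure mode here;
# it shows up as `current` changing between runs, which the version history records.
```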
Component 2: Automated Test Regeneration
When your underlying data changes significantly, your test suite should automatically regenerate.
Triggers:
Schema changes in tables your agent queries
Metric definitions updated in your semantic layer
Data quality alerts on critical tables
Significant backfills or restatements
Actions:
Re-run existing tests against new ground truth
Generate new tests for new data patterns
Flag tests that are no longer relevant
Create tests for edge cases introduced by changes
This requires tight integration between your data observability (from Part 2), semantic layer, and evaluation framework. Most teams don't have any of these, let alone all three.
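A stripped-down sketch of the routing logic, using hypothetical event and test records rather than any particular observability tool's API:

```python
from dataclasses import dataclass

# Hypothetical records; names and fields are illustrative.
@dataclass
class DriftEvent:
    kind: str           # e.g. "schema_change", "metric_update", "quality_alert", "backfill"
    tables: set[str]    # tables touched by the change

@dataclass
class EvalTest:
    name: str
    tables: set[str]    # tables this test's ground truth depends on

TRIGGER_KINDS = {"schema_change", "metric_update", "quality_alert", "backfill"}

def tests_to_regenerate(event: DriftEvent, suite: list[EvalTest]) -> list[EvalTest]:
    """Return the tests whose ground truth must be re-derived after this event."""
    if event.kind not in TRIGGER_KINDS:
        return []
    return [t for t in suite if t.tables & event.tables]

# A backfill on `revenue` flags every test that depends on that table.
suite = [EvalTest("q3_revenue", {"revenue"}), EvalTest("active_users", {"users"})]
affected = tests_to_regenerate(DriftEvent("backfill", {"revenue"}), suite)
```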
Component 3: Differential Testing
Don't just test absolute correctness. Test consistency:
Same question, asked 1 hour apart: Should get same answer (unless data updated)
Same question, different phrasing: Should get same answer
Related questions: Should get logically consistent answers
Example: Q1 asks "What was total revenue last quarter?" and Q2 asks "What was last quarter's revenue, broken down by region?"
If Q1 and Q2 give contradictory answers (say, the regional breakdown doesn't sum to the total), you have a problem. Differential testing catches this when absolute testing doesn't.
With the tool-level tracing from Part 2, you can see exactly which data sources each question hit and why they might have diverged.
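A couple of these checks in sketch form, with hypothetical `agent.ask_number` and `agent.ask_breakdown` helpers standing in for whatever interface your agent exposes:

```python
# Differential tests check consistency between answers, not absolute correctness.
def test_paraphrase_consistency(agent):
    a = agent.ask_number("What was total revenue last quarter?")
    b = agent.ask_number("How much revenue did we book last quarter?")
    assert a == b, f"Paraphrases diverged: {a} vs {b}"

def test_breakdown_sums_to_total(agent):
    total = agent.ask_number("What was total revenue last quarter?")
    regions = agent.ask_breakdown("Revenue last quarter, by region")  # returns {region: value}
    assert abs(sum(regions.values()) - total) <= 0.01 * total, "Breakdown doesn't sum to the total"
```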
Component 4: Confidence Calibration Testing
Your agent shouldn't just be right or wrong—it should KNOW when it's uncertain.
Test for:
Appropriate confidence drops when data is stale: If the revenue table hasn't updated in 48 hours, confidence should decrease
Caveats when definitions changed: If "active user" definition changed, agent should mention it
Refusal when data quality is low: If data quality alerts are firing, agent should decline to answer
This is harder to test than accuracy, but it's what separates production-grade agents from prototypes. And it's only possible if your agent has access to the data quality signals we discussed in Part 1 and Part 2.
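A sketch of a staleness-calibration test, assuming the agent returns a confidence score and caveats and that a catalog exposes table freshness (all hypothetical interfaces):

```python
from datetime import datetime, timedelta, timezone

STALENESS_LIMIT = timedelta(hours=48)

def test_confidence_reflects_freshness(agent, catalog):
    # `catalog.last_updated` and the `confidence`/`caveats` fields are
    # hypothetical; substitute whatever your stack exposes.
    age = datetime.now(timezone.utc) - catalog.last_updated("revenue")
    result = agent.answer("What was revenue yesterday?")
    if age > STALENESS_LIMIT:
        assert result.confidence < 0.5  # illustrative threshold
        assert any("stale" in c.lower() or "delayed" in c.lower() for c in result.caveats)
    else:
        assert result.confidence >= 0.5
```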
The A/B Testing Problem
Classic A/B testing: Show 50% of users version A, 50% version B, measure which performs better.
This breaks for analytics agents because:
Success metrics are delayed: A bad answer might not be caught for days or weeks
Ground truth keeps moving: Your control group is evaluated against yesterday's truth
Sample sizes are small: You can't run 10,000 trials when you have 200 users
What works instead:
Shadow Mode Testing
Run your new agent version in parallel with your production version, but don't show users the results. Compare:
Answer agreement rate
Confidence score distributions
Query patterns and tool usage
Error rates and failure modes
Only promote to production when shadow mode shows consistent improvement over a meaningful time period (weeks, not days).
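A minimal sketch of the wiring, with hypothetical agent and logging objects; the comparison metrics mirror the list above:

```python
# Shadow mode: the candidate answers every production question,
# but only the production answer reaches users.
def answer_with_shadow(question, prod_agent, candidate_agent, log):
    prod = prod_agent.answer(question)         # served to the user
    shadow = candidate_agent.answer(question)  # recorded, never shown
    log.record(
        question=question,
        agree=prod.value == shadow.value,
        prod_confidence=prod.confidence,
        shadow_confidence=shadow.confidence,
        prod_tools=prod.tools_used,
        shadow_tools=shadow.tools_used,
    )
    return prod
```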
Tool-Level A/B Testing
Instead of testing whole agent versions, test individual tool improvements:
Old query builder vs. new query builder
SQL agent vs. semantic layer agent
Vector search vs. hybrid search for RAG
This gives you faster iteration cycles and clearer signal on what's actually improving. With the tool observability infrastructure from Part 2, you can run these experiments with confidence that you're measuring real tool performance, not confounding factors.
Cohort-Based Rollouts
Don't split by random user assignment. Split by use case:
Cohort 1: Simple metric queries (low risk)
Cohort 2: Trend analysis (medium risk)
Cohort 3: Complex multi-step reasoning (high risk)
Roll out improvements to low-risk cohorts first. This prevents catastrophic failures in high-stakes scenarios.
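One way to express that as configuration, with illustrative cohort names and version labels:

```python
# Route by use case, not by random user split. Names and versions are illustrative.
ROLLOUT = {
    "simple_metric_queries": {"risk": "low",    "agent_version": "v2-candidate"},
    "trend_analysis":        {"risk": "medium", "agent_version": "v1-stable"},
    "multi_step_reasoning":  {"risk": "high",   "agent_version": "v1-stable"},
}

def agent_version_for(question_type: str) -> str:
    # Unknown question types default to the most conservative path.
    return ROLLOUT.get(question_type, {"agent_version": "v1-stable"})["agent_version"]
```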
The Evaluation Data Problem
Where do you even get ground truth for evaluation?
Source 1: Historical Q&A Pairs
Your users have been asking questions and getting answers. Those are potential test cases—if you validate them.
Problem: Many answers were wrong, and you don't know which ones.
Solution: Have domain experts label a subset (100-200 pairs) as correct/incorrect. Use these as your gold standard.
Source 2: Synthetic Question Generation
Use LLMs to generate questions based on your schema and data.
Problem: LLMs generate obvious questions, not the weird edge cases users actually ask.
Solution: Combine synthetic generation with real user query patterns. Generate synthetic questions that follow the distribution of real questions.
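One way to bias generation toward reality: sample question templates in proportion to how often users actually ask that kind of question. A sketch with illustrative categories and templates:

```python
import random
from collections import Counter

# Question categories tagged from real query logs (illustrative).
observed = Counter(["metric_lookup", "metric_lookup", "trend",
                    "breakdown", "metric_lookup", "trend"])

# Synthetic templates per category (illustrative).
TEMPLATES = {
    "metric_lookup": "What was {metric} in {period}?",
    "trend":         "How has {metric} changed over the last {n} quarters?",
    "breakdown":     "What was {metric} in {period}, broken down by {dimension}?",
}

def sample_synthetic_template() -> str:
    """Pick a template with probability proportional to real usage."""
    categories = list(TEMPLATES)
    weights = [observed[c] for c in categories]
    return TEMPLATES[random.choices(categories, weights=weights, k=1)[0]]
```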
Source 3: Business Logic Rules
Your business has rules: revenue = quantity × price, churn rate = churned / total, etc.
Problem: Rules don't cover the long tail of actual questions.
Solution: Use rules for consistency testing (differential testing), not absolute correctness testing.
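A sketch of one such rule applied as a consistency check rather than an absolute oracle, again with a hypothetical `agent.ask_number` helper:

```python
# The agent's answers should respect revenue = quantity x average price
# within a small tolerance.
def test_revenue_identity(agent):
    revenue = agent.ask_number("What was total revenue last month?")
    units = agent.ask_number("How many units did we sell last month?")
    avg_price = agent.ask_number("What was the average selling price last month?")
    assert abs(revenue - units * avg_price) <= 0.02 * revenue  # illustrative tolerance
```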
The Real Secret: Evaluation Is Never "Done"
The teams that succeed with analytics agents don't build an evaluation framework and move on. They treat evaluation as continuous infrastructure:
Daily re-evaluation against latest data
Weekly review of failed tests and drift patterns
Monthly regeneration of test suites based on schema changes
Quarterly audit of whether old tests still matter
This is expensive. It requires dedicated tooling, automation, and human-in-the-loop review.
Which is why most teams who try to build this themselves give up after the first month.
What We Had to Build
At Upsolve, our evaluation framework:
Monitors data lineage to trigger test regeneration (solving the drift problem from Part 1)
Versions ground truth alongside data versions
Runs shadow mode for every agent improvement
Automates differential testing across related questions
Surfaces confidence calibration metrics (leveraging the tool observability from Part 2)
Not because we love building infrastructure. Because we kept shipping agents that looked great in testing and failed in production.
The Build vs. Buy Math on QA
If you're building analytics agent QA in-house:
2-3 months: Basic test harness
2-3 months: Versioned ground truth system
2-3 months: Automated regeneration pipeline
2-3 months: Drift detection and alerting
Ongoing: Maintenance as your data changes
Add it up and you're looking at 8-12 months of work and 2-3 engineers before you have production-grade QA.
Or you can buy a platform that already solved this. Because the company that built it already has 50 customers whose data drift patterns have taught them edge cases you haven't hit yet.
The Question You Should Ask
Not "How do I test my agent?"
But "How do I test my agent when the right answers keep changing?"
If your testing strategy doesn't have an answer to that second question, you're not ready for production analytics agents.
Series Wrap-Up: The Complete Picture
Over this three-part series, we've covered the full stack of what makes analytics agents uniquely challenging:
Part 1 established that data analytics agents face a fundamental problem general-purpose agents don't: the substrate they operate on is constantly changing. You can't just version your code—you need infrastructure that understands data drift.
Part 2 showed why agent-level observability isn't enough. You need tool-level observability to understand what's actually happening when your agent queries data, retrieves context, or calls APIs. Without this, you're debugging blind.
Part 3 (this post) tackled the hardest problem: how do you QA an agent when ground truth is a moving target? Traditional testing approaches fail. You need continuous evaluation with drift detection.
The common thread: analytics agents require treating data infrastructure as a first-class concern, not an afterthought. The teams that succeed are the ones who realize they're solving a data engineering problem first, and an AI problem second.


