How to QA an agent when the ground truth changes daily
Dec 5, 2025

Ka Ling Wu
Co-Founder & CEO, Upsolve AI
Welcome to Part 3 of our series on building production-grade analytics agents:
In Part 1, we established that data analytics agents operate on shifting ground—your data changes, schemas evolve, and historical facts get restated. In Part 2, we covered the observability infrastructure needed to understand what your agent is actually doing when things go wrong.
Now we tackle the hardest problem: How do you test an agent when the correct answers keep changing?
Part 1: Why Git-Style Versioning Breaks for Data Analytics Agents
Part 2: The Agent Deployment Stack Nobody Talks About: Observable Tools, Not Just Observable Agents
Part 3: How to QA an Agent When the Ground Truth Changes Daily (you are here)
Traditional software testing assumes stability. Analytics agent testing requires embracing continuous change. Let's break down how to actually do this.
The Testing Problem Nobody Prepared You For
Software QA is built on a simple premise: correct behavior is stable. You write a test, it passes, and if the test fails tomorrow, you know something broke.
This doesn't work for data analytics agents.
Because in analytics, correct answers have an expiration date.
Why Traditional Testing Breaks
Standard agent testing looks like this:
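Something like the sketch below, where `agent.ask` is a hypothetical interface and the pre-restatement dollar figure is purely illustrative:

```python
# A pinned-expectation test: the "right" answer is frozen at authoring time.
# `agent.ask` and the $2.3M figure are illustrative, not a real API or number.
def test_q3_revenue(agent):
    answer = agent.ask("What was Q3 revenue?")
    assert answer == "$2.3M"  # true on the day the test was written
```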
Then accounting finds an error and restates Q3 revenue. The correct answer is now "$2.1M."
Your test still passes. Your agent is now confidently wrong. Your CEO makes decisions based on bad data.
This is the fundamental problem: You're testing against a snapshot when you need to be testing against a stream.
The Three Types of Ground Truth Drift
Remember the three layers of instability we introduced in Part 1? They directly translate into three types of testing challenges:
Type 1: Historical Corrections
Past data gets corrected. Revenue gets restated. Transactions get reclassified. Your agent learned the wrong history.
Type 2: Definition Evolution
"Active customer" meant something different in Q1 than Q4. Your agent's understanding is frozen in time.
Type 3: Schema Changes
Tables get renamed, columns get merged, relationships change. Your agent is querying a world that no longer exists.
Most teams only plan for Type 3. Types 1 and 2 destroy them in production.
What Doesn't Work (And Why Teams Try It Anyway)
❌ Approach 1: Freeze Your Test Dataset
"We'll use a fixed test dataset that never changes."
Problem: Now you're testing your agent's ability to answer questions about frozen data, not live data. You have high test coverage and zero production confidence.
❌ Approach 2: Manual Test Updates
"We'll update tests when we notice things changed."
Problem: You notice things changed when users complain. You're QA-ing in production with your customers as testers.
❌ Approach 3: Ignore It
"We'll rely on LLM improvements and hope for the best."
Problem: LLMs don't magically know when your revenue table got backfilled. Better models don't fix bad data.
What Actually Works: Continuous Evaluation with Drift Detection
The only QA approach that works for analytics agents is one that treats data drift as a first-class concern.
This is where the tool-level observability from Part 2 becomes critical—you can't evaluate what you can't see. You need visibility into every data access, every schema change, every quality signal to build an evaluation framework that actually works.
Component 1: Versioned Ground Truth
Don't store static expected answers. Store versioned expectations:
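A versioned expectation might look like the record below; the field names and values are illustrative, not a prescribed schema:

```python
# One question, several versions of the truth. Fields are illustrative.
ground_truth = {
    "question": "What was Q3 revenue?",
    "versions": [
        {"value": "$2.3M", "valid_from": "2025-10-01",
         "valid_to": "2025-11-14", "reason": "initial close"},
        {"value": "$2.1M", "valid_from": "2025-11-15",
         "valid_to": None,  # current truth
         "reason": "Q3 restated after an accounting correction"},
    ],
}
```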
Your evaluation framework needs to know:
What the answer SHOULD BE right now
What the answer WAS at any point in history
When and why it changed
This lets you distinguish between:
Agent regression (it got worse)
Data evolution (the right answer changed)
Agent adaptation failure (it didn't learn the new truth)
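With versioned expectations, that triage can be largely mechanical. A minimal sketch, assuming the record format above and a hypothetical `expected_at` lookup:

```python
from datetime import date

def expected_at(versions, day):
    """Return the expected value that was valid on `day` (hypothetical helper)."""
    for v in versions:
        start = date.fromisoformat(v["valid_from"])
        end = date.fromisoformat(v["valid_to"]) if v["valid_to"] else date.max
        if start <= day <= end:
            return v["value"]
    return None

def classify(agent_answer, versions, today=None):
    today = today or date.today()
    current = expected_at(versions, today)
    if agent_answer == current:
        return "pass"                 # tracks the current truth
    if any(agent_answer == v["value"] for v in versions):
        return "adaptation_failure"   # right answer for an older version of the data
    return "agent_regression"         # never was the right answer

# Note: data evolution (the right answer changed) isn't a failure mode here;
# it shows up as `current` changing between runs, which the version history records.
```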
Component 2: Automated Test Regeneration
When your underlying data changes significantly, your test suite should automatically regenerate.
Triggers:
Schema changes in tables your agent queries
Metric definitions updated in your semantic layer
Data quality alerts on critical tables
Significant backfills or restatements
Actions:
Re-run existing tests against new ground truth
Generate new tests for new data patterns
Flag tests that are no longer relevant
Create tests for edge cases introduced by changes
This requires tight integration between your data observability (from Part 2), semantic layer, and evaluation framework. Most teams don't have any of these, let alone all three.
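A stripped-down sketch of the routing logic, using hypothetical event and test records rather than any particular observability tool's API:

```python
from dataclasses import dataclass

# Hypothetical records; names and fields are illustrative.
@dataclass
class DriftEvent:
    kind: str           # e.g. "schema_change", "metric_update", "quality_alert", "backfill"
    tables: set[str]    # tables touched by the change

@dataclass
class EvalTest:
    name: str
    tables: set[str]    # tables this test's ground truth depends on

TRIGGER_KINDS = {"schema_change", "metric_update", "quality_alert", "backfill"}

def tests_to_regenerate(event: DriftEvent, suite: list[EvalTest]) -> list[EvalTest]:
    """Return the tests whose ground truth must be re-derived after this event."""
    if event.kind not in TRIGGER_KINDS:
        return []
    return [t for t in suite if t.tables & event.tables]

# A backfill on `revenue` flags every test that depends on that table.
suite = [EvalTest("q3_revenue", {"revenue"}), EvalTest("active_users", {"users"})]
affected = tests_to_regenerate(DriftEvent("backfill", {"revenue"}), suite)
```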
Component 3: Differential Testing
Don't just test absolute correctness. Test consistency:
Same question, asked 1 hour apart: Should get same answer (unless data updated)
Same question, different phrasing: Should get same answer
Related questions: Should get logically consistent answers
Example: Q1 asks "What was total revenue last quarter?" and Q2 asks "What was last quarter's revenue, broken down by region?"
If Q1 and Q2 give contradictory answers (say, the regional breakdown doesn't sum to the total), you have a problem. Differential testing catches this when absolute testing doesn't.
With the tool-level tracing from Part 2, you can see exactly which data sources each question hit and why they might have diverged.
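A couple of these checks in sketch form, with hypothetical `agent.ask_number` and `agent.ask_breakdown` helpers standing in for whatever interface your agent exposes:

```python
# Differential tests check consistency between answers, not absolute correctness.
def test_paraphrase_consistency(agent):
    a = agent.ask_number("What was total revenue last quarter?")
    b = agent.ask_number("How much revenue did we book last quarter?")
    assert a == b, f"Paraphrases diverged: {a} vs {b}"

def test_breakdown_sums_to_total(agent):
    total = agent.ask_number("What was total revenue last quarter?")
    regions = agent.ask_breakdown("Revenue last quarter, by region")  # returns {region: value}
    assert abs(sum(regions.values()) - total) <= 0.01 * total, "Breakdown doesn't sum to the total"
```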
Component 4: Confidence Calibration Testing
Your agent shouldn't just be right or wrong—it should KNOW when it's uncertain.
Test for:
Appropriate confidence drops when data is stale: If the revenue table hasn't updated in 48 hours, confidence should decrease
Caveats when definitions changed: If "active user" definition changed, agent should mention it
Refusal when data quality is low: If data quality alerts are firing, agent should decline to answer
This is harder to test than accuracy, but it's what separates production-grade agents from prototypes. And it's only possible if your agent has access to the data quality signals we discussed in Part 1 and Part 2.
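A sketch of a staleness-calibration test, assuming the agent returns a confidence score and caveats and that a catalog exposes table freshness (all hypothetical interfaces):

```python
from datetime import datetime, timedelta, timezone

STALENESS_LIMIT = timedelta(hours=48)

def test_confidence_reflects_freshness(agent, catalog):
    # `catalog.last_updated` and the `confidence`/`caveats` fields are
    # hypothetical; substitute whatever your stack exposes.
    age = datetime.now(timezone.utc) - catalog.last_updated("revenue")
    result = agent.answer("What was revenue yesterday?")
    if age > STALENESS_LIMIT:
        assert result.confidence < 0.5  # illustrative threshold
        assert any("stale" in c.lower() or "delayed" in c.lower() for c in result.caveats)
    else:
        assert result.confidence >= 0.5
```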
The A/B Testing Problem
Classic A/B testing: Show 50% of users version A, 50% version B, measure which performs better.
This breaks for analytics agents because:
Success metrics are delayed: A bad answer might not be caught for days or weeks
Ground truth keeps moving: Your control group is evaluated against yesterday's truth
Sample sizes are small: You can't run 10,000 trials when you have 200 users
What works instead:
Shadow Mode Testing
Run your new agent version in parallel with your production version, but don't show users the results. Compare:
Answer agreement rate
Confidence score distributions
Query patterns and tool usage
Error rates and failure modes
Only promote to production when shadow mode shows consistent improvement over a meaningful time period (weeks, not days).
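A minimal sketch of the wiring, with hypothetical agent and logging objects; the comparison metrics mirror the list above:

```python
# Shadow mode: the candidate answers every production question,
# but only the production answer reaches users.
def answer_with_shadow(question, prod_agent, candidate_agent, log):
    prod = prod_agent.answer(question)         # served to the user
    shadow = candidate_agent.answer(question)  # recorded, never shown
    log.record(
        question=question,
        agree=prod.value == shadow.value,
        prod_confidence=prod.confidence,
        shadow_confidence=shadow.confidence,
        prod_tools=prod.tools_used,
        shadow_tools=shadow.tools_used,
    )
    return prod
```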
Tool-Level A/B Testing
Instead of testing whole agent versions, test individual tool improvements:
Old query builder vs. new query builder
SQL agent vs. semantic layer agent
Vector search vs. hybrid search for RAG
This gives you faster iteration cycles and clearer signal on what's actually improving. With the tool observability infrastructure from Part 2, you can run these experiments with confidence that you're measuring real tool performance, not confounding factors.
Cohort-Based Rollouts
Don't split by random user assignment. Split by use case:
Cohort 1: Simple metric queries (low risk)
Cohort 2: Trend analysis (medium risk)
Cohort 3: Complex multi-step reasoning (high risk)
Roll out improvements to low-risk cohorts first. This prevents catastrophic failures in high-stakes scenarios.
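One way to express that as configuration, with illustrative cohort names and version labels:

```python
# Route by use case, not by random user split. Names and versions are illustrative.
ROLLOUT = {
    "simple_metric_queries": {"risk": "low",    "agent_version": "v2-candidate"},
    "trend_analysis":        {"risk": "medium", "agent_version": "v1-stable"},
    "multi_step_reasoning":  {"risk": "high",   "agent_version": "v1-stable"},
}

def agent_version_for(question_type: str) -> str:
    # Unknown question types default to the most conservative path.
    return ROLLOUT.get(question_type, {"agent_version": "v1-stable"})["agent_version"]
```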
The Evaluation Data Problem
Where do you even get ground truth for evaluation?
Source 1: Historical Q&A Pairs
Your users have been asking questions and getting answers. Those are potential test cases—if you validate them.
Problem: Many answers were wrong, and you don't know which ones.
Solution: Have domain experts label a subset (100-200 pairs) as correct/incorrect. Use these as your gold standard.
Source 2: Synthetic Question Generation
Use LLMs to generate questions based on your schema and data.
Problem: LLMs generate obvious questions, not the weird edge cases users actually ask.
Solution: Combine synthetic generation with real user query patterns. Generate synthetic questions that follow the distribution of real questions.
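One way to bias generation toward reality: sample question templates in proportion to how often users actually ask that kind of question. A sketch with illustrative categories and templates:

```python
import random
from collections import Counter

# Question categories tagged from real query logs (illustrative).
observed = Counter(["metric_lookup", "metric_lookup", "trend",
                    "breakdown", "metric_lookup", "trend"])

# Synthetic templates per category (illustrative).
TEMPLATES = {
    "metric_lookup": "What was {metric} in {period}?",
    "trend":         "How has {metric} changed over the last {n} quarters?",
    "breakdown":     "What was {metric} in {period}, broken down by {dimension}?",
}

def sample_synthetic_template() -> str:
    """Pick a template with probability proportional to real usage."""
    categories = list(TEMPLATES)
    weights = [observed[c] for c in categories]
    return TEMPLATES[random.choices(categories, weights=weights, k=1)[0]]
```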
Source 3: Business Logic Rules
Your business has rules: revenue = quantity × price, churn rate = churned / total, etc.
Problem: Rules don't cover the long tail of actual questions.
Solution: Use rules for consistency testing (differential testing), not absolute correctness testing.
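A sketch of one such rule applied as a consistency check rather than an absolute oracle, again with a hypothetical `agent.ask_number` helper:

```python
# The agent's answers should respect revenue = quantity x average price
# within a small tolerance.
def test_revenue_identity(agent):
    revenue = agent.ask_number("What was total revenue last month?")
    units = agent.ask_number("How many units did we sell last month?")
    avg_price = agent.ask_number("What was the average selling price last month?")
    assert abs(revenue - units * avg_price) <= 0.02 * revenue  # illustrative tolerance
```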
The Real Secret: Evaluation Is Never "Done"
The teams that succeed with analytics agents don't build an evaluation framework and move on. They treat evaluation as continuous infrastructure:
Daily re-evaluation against latest data
Weekly review of failed tests and drift patterns
Monthly regeneration of test suites based on schema changes
Quarterly audit of whether old tests still matter
This is expensive. It requires dedicated tooling, automation, and human-in-the-loop review.
Which is why most teams who try to build this themselves give up after the first month.
What We Had to Build
At Upsolve, our evaluation framework:
Monitors data lineage to trigger test regeneration (solving the drift problem from Part 1)
Versions ground truth alongside data versions
Runs shadow mode for every agent improvement
Automates differential testing across related questions
Surfaces confidence calibration metrics (leveraging the tool observability from Part 2)
Not because we love building infrastructure. Because we kept shipping agents that looked great in testing and failed in production.
The Build vs. Buy Math on QA
If you're building analytics agent QA in-house:
2-3 months: Basic test harness
2-3 months: Versioned ground truth system
2-3 months: Automated regeneration pipeline
2-3 months: Drift detection and alerting
Ongoing: Maintenance as your data changes
Add it up and you're looking at 8-12 months of work and 2-3 engineers before you have production-grade QA.
Or you can buy a platform that already solved this. Because the company that built it already has 50 customers whose data drift patterns have taught them edge cases you haven't hit yet.
The Question You Should Ask
Not "How do I test my agent?"
But "How do I test my agent when the right answers keep changing?"
If your testing strategy doesn't have an answer to that second question, you're not ready for production analytics agents.
Series Wrap-Up: The Complete Picture
Over this three-part series, we've covered the full stack of what makes analytics agents uniquely challenging:
Part 1 established that data analytics agents face a fundamental problem general-purpose agents don't: the substrate they operate on is constantly changing. You can't just version your code—you need infrastructure that understands data drift.
Part 2 showed why agent-level observability isn't enough. You need tool-level observability to understand what's actually happening when your agent queries data, retrieves context, or calls APIs. Without this, you're debugging blind.
Part 3 (this post) tackled the hardest problem: how do you QA an agent when ground truth is a moving target? Traditional testing approaches fail. You need continuous evaluation with drift detection.
The common thread: analytics agents require treating data infrastructure as a first-class concern, not an afterthought. The teams that succeed are the ones who realize they're solving a data engineering problem first, and an AI problem second.


