
How to QA an agent when the ground truth changes daily

Dec 5, 2025

Ka Ling Wu

Co-Founder & CEO, Upsolve AI



Welcome to Part 3 of our series on building production-grade analytics agents:

In Part 1, we established that data analytics agents operate on shifting ground—your data changes, schemas evolve, and historical facts get restated. In Part 2, we covered the observability infrastructure needed to understand what your agent is actually doing when things go wrong.

Now we tackle the hardest problem: How do you test an agent when the correct answers keep changing?

Traditional software testing assumes stability. Analytics agent testing requires embracing continuous change. Let's break down how to actually do this.

The Testing Problem Nobody Prepared You For

Software QA is built on a simple premise: correct behavior is stable. You write a test, it passes, and if the test fails tomorrow, you know something broke.

This doesn't work for data analytics agents.

Because in analytics, correct answers have an expiration date.

Why Traditional Testing Breaks

Standard agent testing looks like this:

Test: "What was Q3 revenue?"
Expected: "$2.3M"
Actual: "$2.3M"
Status: PASS

Then accounting finds an error and restates Q3. The correct answer is now "$2.1M."

Your test still passes. Your agent is now confidently wrong. Your CEO makes decisions based on bad data.

This is the fundamental problem: You're testing against a snapshot when you need to be testing against a stream.

The Three Types of Ground Truth Drift

Remember the three layers of instability we introduced in Part 1? They directly translate into three types of testing challenges:

Type 1: Historical Corrections

Past data gets corrected. Revenue gets restated. Transactions get reclassified. Your agent learned the wrong history.

Type 2: Definition Evolution

"Active customer" meant something different in Q1 than Q4. Your agent's understanding is frozen in time.

Type 3: Schema Changes

Tables get renamed, columns get merged, relationships change. Your agent is querying a world that no longer exists.

Most teams only plan for Type 3. Types 1 and 2 destroy them in production.

What Doesn't Work (And Why Teams Try It Anyway)

Approach 1: Freeze Your Test Dataset

"We'll use a fixed test dataset that never changes."

Problem: Now you're testing your agent's ability to answer questions about frozen data, not live data. You have high test coverage and zero production confidence.

Approach 2: Manual Test Updates

"We'll update tests when we notice things changed."

Problem: You notice things changed when users complain. You're QA-ing in production with your customers as testers.

Approach 3: Ignore It

"We'll rely on LLM improvements and hope for the best."

Problem: LLMs don't magically know when your revenue table got backfilled. Better models don't fix bad data.

What Actually Works: Continuous Evaluation with Drift Detection

The only QA approach that works for analytics agents is one that treats data drift as a first-class concern.

This is where the tool-level observability from Part 2 becomes critical—you can't evaluate what you can't see. You need visibility into every data access, every schema change, every quality signal to build an evaluation framework that actually works.

Component 1: Versioned Ground Truth

Don't store static expected answers. Store versioned expectations:

Question: "What was Q3 revenue?"

Expected Answers:
  - As of 2024-10-01: "$2.3M" (original)
  - As of 2024-11-15: "$2.1M" (after restatement)
  
Current Expected: "$2.1M"
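
In code, a versioned expectation is just the answer history plus an effective date for each entry. Here's a minimal sketch in Python, assuming a simple in-memory store (the class and field names are illustrative, not a prescribed schema):

from dataclasses import dataclass, field
from datetime import date

@dataclass
class ExpectedAnswer:
    effective_date: date   # when this became the correct answer
    value: str             # the expected answer from that date on
    reason: str = ""       # e.g. "original", "after restatement"

@dataclass
class VersionedGroundTruth:
    question: str
    versions: list = field(default_factory=list)

    def expected_as_of(self, as_of: date) -> str:
        # The answer that was correct on a given date.
        valid = [v for v in self.versions if v.effective_date <= as_of]
        if not valid:
            raise ValueError("no expectation recorded for this date yet")
        return max(valid, key=lambda v: v.effective_date).value

    def current_expected(self) -> str:
        return self.expected_as_of(date.today())

q3_revenue = VersionedGroundTruth(
    question="What was Q3 revenue?",
    versions=[
        ExpectedAnswer(date(2024, 10, 1), "$2.3M", "original"),
        ExpectedAnswer(date(2024, 11, 15), "$2.1M", "after restatement"),
    ],
)

assert q3_revenue.expected_as_of(date(2024, 10, 20)) == "$2.3M"
assert q3_revenue.current_expected() == "$2.1M"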

Your evaluation framework needs to know:

  • What the answer SHOULD BE right now

  • What the answer WAS at any point in history

  • When and why it changed

This lets you distinguish between:

  • Agent regression (it got worse)

  • Data evolution (the right answer changed)

  • Agent adaptation failure (it didn't learn the new truth)

Component 2: Automated Test Regeneration

When your underlying data changes significantly, your test suite should automatically regenerate.

Triggers:

  • Schema changes in tables your agent queries

  • Metric definitions updated in your semantic layer

  • Data quality alerts on critical tables

  • Significant backfills or restatements

Actions:

  • Re-run existing tests against new ground truth

  • Generate new tests for new data patterns

  • Flag tests that are no longer relevant

  • Create tests for edge cases introduced by changes

This requires tight integration between your data observability (from Part 2), semantic layer, and evaluation framework. Most teams don't have any of these, let alone all three.
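
At a sketch level, the wiring is an event handler: drift signals come in from data observability, and the test suite reacts. The event kinds and handler below are hypothetical names, not a specific product's API:

from dataclasses import dataclass

@dataclass
class DriftEvent:
    kind: str     # "schema_change" | "metric_definition" | "quality_alert" | "backfill"
    table: str
    detail: str = ""

def handle_drift_event(event: DriftEvent, test_suite):
    # Mark every test that touches the affected table for re-run against new ground truth.
    affected = [t for t in test_suite if event.table in t["tables"]]
    for test in affected:
        test["status"] = "rerun_pending"
        if event.kind == "schema_change":
            test["flags"].append("review_relevance")         # may query columns that no longer exist
        elif event.kind in ("backfill", "quality_alert"):
            test["flags"].append("refresh_ground_truth")     # expected answers may have changed
        elif event.kind == "metric_definition":
            test["flags"].append("regenerate_expectations")  # the definition the test encodes is stale
    return affected

# Example: a backfill on the revenue table queues every revenue test for re-evaluation.
suite = [{"question": "What was Q3 revenue?", "tables": ["revenue"], "flags": [], "status": "pass"}]
handle_drift_event(DriftEvent("backfill", "revenue"), suite)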

Component 3: Differential Testing

Don't just test absolute correctness. Test consistency:

  • Same question, asked 1 hour apart: Should get same answer (unless data updated)

  • Same question, different phrasing: Should get same answer

  • Related questions: Should get logically consistent answers

Example:

Q1: "What was Q3 revenue?" "$2.1M"
Q2: "What was total revenue in July, August, and September?" Should sum to ~$2.1M
Q3: "Was Q3 revenue over $2M?" "Yes"

If Q1 and Q2 give contradictory answers, you have a problem. Differential testing catches this when absolute testing doesn't.

With the tool-level tracing from Part 2, you can see exactly which data sources each question hit and why they might have diverged.
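
A minimal sketch of one such differential check, assuming a hypothetical ask() callable that wraps your agent and returns its natural-language answer:

import re

def parse_dollars(answer: str) -> float:
    # Pull the first dollar figure out of a natural-language answer, e.g. "$2.1M" -> 2_100_000.
    match = re.search(r"\$([\d,\.]+)\s*([MK]?)", answer)
    if not match:
        raise ValueError(f"no dollar amount found in: {answer!r}")
    value = float(match.group(1).replace(",", ""))
    return value * {"M": 1_000_000, "K": 1_000, "": 1}[match.group(2)]

def check_q3_consistency(ask, tolerance=0.02):
    # Two phrasings of the same question should agree within a small tolerance.
    total = parse_dollars(ask("What was Q3 revenue?"))
    by_month = parse_dollars(ask("What was total revenue in July, August, and September?"))
    drift = abs(total - by_month) / max(total, by_month)
    assert drift <= tolerance, f"inconsistent answers: {total:,.0f} vs {by_month:,.0f}"

# Usage: check_q3_consistency(my_agent.ask), where my_agent is whatever client you already run.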

Component 4: Confidence Calibration Testing

Your agent shouldn't just be right or wrong—it should KNOW when it's uncertain.

Test for:

  • Appropriate confidence drops when data is stale: If the revenue table hasn't updated in 48 hours, confidence should decrease

  • Caveats when definitions changed: If "active user" definition changed, agent should mention it

  • Refusal when data quality is low: If data quality alerts are firing, agent should decline to answer

This is harder to test than accuracy, but it's what separates production-grade agents from prototypes. And it's only possible if your agent has access to the data quality signals we discussed in Part 1 and Part 2.
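
As a rough sketch, calibration tests can be written like ordinary assertions once the agent is given freshness and quality context. The answer() signature below, returning (text, confidence), is an assumption about your stack, not a standard API:

from datetime import datetime, timedelta

def test_confidence_drops_on_stale_data(agent):
    # Same question, but the revenue table is 1 hour old vs. 48 hours old.
    now = datetime.now()
    _, fresh_conf = agent.answer("What was revenue yesterday?",
                                 table_last_updated=now - timedelta(hours=1))
    _, stale_conf = agent.answer("What was revenue yesterday?",
                                 table_last_updated=now - timedelta(hours=48))
    assert stale_conf < fresh_conf, "confidence should drop when the source table is stale"

def test_agent_declines_on_quality_alert(agent):
    # With an active data quality alert, the agent should refuse rather than guess.
    text, _ = agent.answer("What was revenue yesterday?",
                           active_alerts=["revenue_row_count_anomaly"])
    assert any(phrase in text.lower() for phrase in ("can't", "cannot", "unable")), \
        "agent answered despite a firing data quality alert"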

The A/B Testing Problem

Classic A/B testing: Show 50% of users version A, 50% version B, measure which performs better.

This breaks for analytics agents because:

  1. Success metrics are delayed: A bad answer might not be caught for days or weeks

  2. Ground truth keeps moving: Your control group is evaluated against yesterday's truth

  3. Sample sizes are small: You can't run 10,000 trials when you have 200 users

What works instead:

Shadow Mode Testing

Run your new agent version in parallel with your production version, but don't show users the results. Compare:

  • Answer agreement rate

  • Confidence score distributions

  • Query patterns and tool usage

  • Error rates and failure modes

Only promote to production when shadow mode shows consistent improvement over a meaningful time period (weeks, not days).
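
A sketch of the shadow-mode loop, assuming both versions expose the same answer() method and you supply an agreement function (all names here are illustrative):

def shadow_compare(questions, prod_agent, shadow_agent, agree):
    # Both versions answer every live question; only the production answer is shown to users.
    results = {"compared": 0, "agreements": 0, "shadow_errors": 0}
    for question in questions:
        prod_answer = prod_agent.answer(question)          # what the user actually sees
        try:
            shadow_answer = shadow_agent.answer(question)  # logged and compared, never shown
        except Exception:
            results["shadow_errors"] += 1
            continue
        results["compared"] += 1
        if agree(prod_answer, shadow_answer):              # e.g. numeric match within tolerance
            results["agreements"] += 1
    results["agreement_rate"] = results["agreements"] / max(results["compared"], 1)
    return results

Track these numbers over weeks; a single good day tells you nothing about drift.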

Tool-Level A/B Testing

Instead of testing whole agent versions, test individual tool improvements:

  • Old query builder vs. new query builder

  • SQL agent vs. semantic layer agent

  • Vector search vs. hybrid search for RAG

This gives you faster iteration cycles and clearer signal on what's actually improving. With the tool observability infrastructure from Part 2, you can run these experiments with confidence that you're measuring real tool performance, not confounding factors.
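
A sketch of what a tool-level comparison can look like, assuming both query builders expose the same build() method (illustrative names, not a specific framework):

import time

def compare_query_builders(old_builder, new_builder, questions, run_sql):
    # Same questions through two tool implementations; compare latency and result agreement
    # per tool, instead of A/B testing the whole agent at once.
    report = []
    for question in questions:
        row = {"question": question}
        for name, builder in (("old", old_builder), ("new", new_builder)):
            start = time.perf_counter()
            sql = builder.build(question)
            row[f"{name}_result"] = run_sql(sql)
            row[f"{name}_latency_s"] = round(time.perf_counter() - start, 3)
        row["results_match"] = row["old_result"] == row["new_result"]
        report.append(row)
    return report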

Cohort-Based Rollouts

Don't split by random user assignment. Split by use case:

  • Cohort 1: Simple metric queries (low risk)

  • Cohort 2: Trend analysis (medium risk)

  • Cohort 3: Complex multi-step reasoning (high risk)

Roll out improvements to low-risk cohorts first. This prevents catastrophic failures in high-stakes scenarios.
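
The rollout policy itself can be a small, explicit config keyed by use case rather than by user. A hypothetical example:

ROLLOUT_PLAN = {
    "simple_metric_queries": {"risk": "low",    "agent_version": "v2-candidate"},
    "trend_analysis":        {"risk": "medium", "agent_version": "v1-stable"},
    "multi_step_reasoning":  {"risk": "high",   "agent_version": "v1-stable"},
}

def version_for(use_case: str) -> str:
    # Route by use case, not by random user bucket; unknown use cases stay on the stable version.
    return ROLLOUT_PLAN.get(use_case, {}).get("agent_version", "v1-stable")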

The Evaluation Data Problem

Where do you even get ground truth for evaluation?

Source 1: Historical Q&A Pairs

Your users have been asking questions and getting answers. Those are potential test cases—if you validate them.

Problem: Many answers were wrong, and you don't know which ones.

Solution: Have domain experts label a subset (100-200 pairs) as correct/incorrect. Use these as your gold standard.

Source 2: Synthetic Question Generation

Use LLMs to generate questions based on your schema and data.

Problem: LLMs generate obvious questions, not the weird edge cases users actually ask.

Solution: Combine synthetic generation with real user query patterns. Generate synthetic questions that follow the distribution of real questions.

Source 3: Business Logic Rules

Your business has rules: revenue = quantity × price, churn rate = churned / total, etc.

Problem: Rules don't cover the long tail of actual questions.

Solution: Use rules for consistency testing (differential testing), not absolute correctness testing.
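
Business rules make good consistency checks precisely because they hold no matter what the current "correct" number is. A sketch, assuming a hypothetical ask_number() helper that returns a numeric answer from your agent:

def check_revenue_identity(ask_number, tolerance=0.01):
    # revenue = quantity x average price should hold regardless of any restatement.
    revenue = ask_number("What was total Q3 revenue?")
    units = ask_number("How many units were sold in Q3?")
    avg_price = ask_number("What was the average selling price in Q3?")
    implied = units * avg_price
    drift = abs(revenue - implied) / max(revenue, implied)
    assert drift <= tolerance, f"revenue identity violated: {revenue:,.0f} vs {implied:,.0f}"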

The Real Secret: Evaluation Is Never "Done"

The teams that succeed with analytics agents don't build an evaluation framework and move on. They treat evaluation as continuous infrastructure:

  • Daily re-evaluation against latest data

  • Weekly review of failed tests and drift patterns

  • Monthly regeneration of test suites based on schema changes

  • Quarterly audit of whether old tests still matter

This is expensive. It requires dedicated tooling, automation, and human-in-the-loop review.

Which is why most teams who try to build this themselves give up after the first month.

What We Had to Build

At Upsolve, our evaluation framework:

  • Monitors data lineage to trigger test regeneration (solving the drift problem from Part 1)

  • Versions ground truth alongside data versions

  • Runs shadow mode for every agent improvement

  • Automates differential testing across related questions

  • Surfaces confidence calibration metrics (leveraging the tool observability from Part 2)

Not because we love building infrastructure. Because we kept shipping agents that looked great in testing and failed in production.

The Build vs. Buy Math on QA

If you're building analytics agent QA in-house:

  • 2-3 months: Basic test harness

  • 2-3 months: Versioned ground truth system

  • 2-3 months: Automated regeneration pipeline

  • 2-3 months: Drift detection and alerting

  • Ongoing: Maintenance as your data changes

You're looking at 12+ months and 2-3 engineers before you have production-grade QA.

Or you can buy a platform that already solved this. Because the company that built it already has 50 customers whose data drift patterns have surfaced edge cases you haven't hit yet.

The Question You Should Ask

Not "How do I test my agent?"

But "How do I test my agent when the right answers keep changing?"

If your testing strategy doesn't have an answer to that second question, you're not ready for production analytics agents.

Series Wrap-Up: The Complete Picture

Over this three-part series, we've covered the full stack of what makes analytics agents uniquely challenging:

  1. Part 1 established that data analytics agents face a fundamental problem general-purpose agents don't: the substrate they operate on is constantly changing. You can't just version your code—you need infrastructure that understands data drift.

  2. Part 2 showed why agent-level observability isn't enough. You need tool-level observability to understand what's actually happening when your agent queries data, retrieves context, or calls APIs. Without this, you're debugging blind.

  3. Part 3 (this post) tackled the hardest problem: how do you QA an agent when ground truth is a moving target? Traditional testing approaches fail. You need continuous evaluation with drift detection.

The common thread: analytics agents require treating data infrastructure as a first-class concern, not an afterthought. The teams that succeed are the ones who realize they're solving a data engineering problem first, and an AI problem second.

Ready to Upsolve Your Product?

Unlock the full potential of your product's value today with Upsolve AI's embedded BI.

Start Here
