
Measuring and Improving AI Answer Accuracy

You can't improve what you don't measure. Here's exactly how to measure AI answer accuracy and build the feedback loops that drive it up.

The Bookbag Team · June 2026 · 9 min read

Why accuracy measurement is non-optional

AI support accuracy doesn't maintain itself. A knowledge base that was accurate when deployed gradually drifts as policies change, new products are added, promotions launch and expire, and carriers update their timelines. Without systematic measurement, accuracy problems are invisible until customers start complaining — at which point many have already been affected.

Accuracy measurement is also the prerequisite for everything else. Threshold calibration requires accuracy data by confidence band. Knowledge base improvement prioritization requires accuracy data by question category. CSAT analysis for AI support requires knowing whether satisfaction is driven by accuracy or by other factors. Measurement is the infrastructure that makes all other improvements possible.

The cost of unmeasured inaccuracy

An AI agent giving wrong answers to 15% of questions might appear fine on CSAT if the errors are in low-stakes categories. But if those wrong answers are about return eligibility or delivery dates, each incorrect answer is potentially a support escalation, a return, or a lost customer. You need category-level accuracy data, not just an overall score.
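To make the point concrete, here is a minimal sketch of a category-level breakdown. The category names, grade labels, and sample data are hypothetical, chosen only to show how a 15% overall error rate can hide a much worse category.

```python
from collections import defaultdict


def incorrect_rate_by_category(graded):
    """Return {category: share of answers graded 'incorrect'}.

    `graded` is an iterable of (category, grade) pairs. The grade
    labels are an assumption made for this sketch.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for category, grade in graded:
        totals[category] += 1
        if grade == "incorrect":
            wrong[category] += 1
    return {category: wrong[category] / totals[category] for category in totals}


# Hypothetical graded sample: 20 conversations across two categories.
graded = (
    [("order_status", "correct")] * 10
    + [("returns", "correct")] * 7
    + [("returns", "incorrect")] * 3
)

overall = sum(1 for _, grade in graded if grade == "incorrect") / len(graded)
by_category = incorrect_rate_by_category(graded)
```

In this made-up sample the overall incorrect rate is 15%, but every error sits in the returns category, where the rate is 30%.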

Defining accuracy for ecommerce support AI

When grading a sample, apply the three-level rubric below consistently. Calculate an overall accuracy score as (correct + 0.5 × partially correct) / total, and track the raw incorrect rate (incorrect / total) separately. The two numbers tell you different things: overall accuracy tells you how the agent performs on average; the incorrect rate tells you how often you're actively misleading customers.

  1. Correct: the answer is factually accurate, complete enough for the customer to take action, and consistent with current policy. A return eligibility answer that correctly states the window, the conditions, and the next steps is correct.
  2. Partially correct: the answer is directionally right but missing important information, slightly outdated, or missing a qualifier. 'You can return this within 30 days' when the current policy is 30 days from delivery date (not purchase date) is partially correct. It won't cause a disaster but it's not fully accurate.
  3. Incorrect: the answer is wrong in a way that matters: wrong policy, wrong date, wrong process, or wrong information that could lead the customer to make a decision they'll regret.
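The scoring formula can be sketched in a few lines. The function name and the grade labels ("correct", "partial", "incorrect") are assumptions for illustration; the 0.5 weight comes from the formula above.

```python
from collections import Counter

PARTIAL_WEIGHT = 0.5  # a partially correct answer counts as half
VALID_GRADES = {"correct", "partial", "incorrect"}  # labels assumed for this sketch


def score_sample(grades):
    """Return the weighted accuracy score and the raw incorrect rate.

    `grades` is a list with one rubric label per graded conversation.
    """
    if not grades:
        raise ValueError("cannot score an empty sample")
    counts = Counter(grades)
    unknown = set(counts) - VALID_GRADES
    if unknown:
        raise ValueError(f"unknown grade labels: {sorted(unknown)}")
    total = len(grades)
    accuracy = (counts["correct"] + PARTIAL_WEIGHT * counts["partial"]) / total
    return {"accuracy": accuracy, "incorrect_rate": counts["incorrect"] / total}


# Hypothetical weekly sample: 16 correct, 2 partially correct, 2 incorrect.
weekly = score_sample(["correct"] * 16 + ["partial"] * 2 + ["incorrect"] * 2)
```

For this hypothetical sample of 20, the accuracy score is (16 + 1) / 20 = 85% and the incorrect rate is 2 / 20 = 10%.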

The sampling framework

Random sampling within categories is important — don't only sample easy questions or only look at conversations that generated escalations. Escalations are already flagged; the accuracy risk is in the autonomous resolutions that look fine but aren't. Stratified random sampling (proportional representation of each category) is the most statistically useful approach.

| Sample type | When to use | Minimum sample size | How to stratify |
| --- | --- | --- | --- |
| Weekly quick sample | Every week | 15–20 conversations | 3–4 per top category |
| Monthly accuracy audit | Every month | 50 conversations | Proportional to category volume |
| Seasonal pre-peak audit | Before BFCM / holiday | 75 conversations | Extra weight on categories with recent policy changes |
| Post-change validation | After any policy or knowledge base update | 20–30 conversations | Focused on changed category |
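For the monthly audit, proportional stratification might look like the sketch below. It assumes you can export conversation IDs grouped by category; the function names and volumes are hypothetical. Rounding leftovers are handed to the categories with the largest fractional remainders so the allocation always sums to the requested sample size.

```python
import random


def proportional_allocation(volumes, sample_size):
    """Split `sample_size` across categories in proportion to volume.

    `volumes` maps category -> number of conversations in the period.
    """
    total = sum(volumes.values())
    if total == 0:
        raise ValueError("no conversations to sample from")
    exact = {c: sample_size * v / total for c, v in volumes.items()}
    allocation = {c: int(x) for c, x in exact.items()}  # round down first
    leftover = sample_size - sum(allocation.values())
    # Give the remaining slots to the largest fractional remainders.
    by_remainder = sorted(exact, key=lambda c: exact[c] - allocation[c], reverse=True)
    for category in by_remainder[:leftover]:
        allocation[category] += 1
    return allocation


def stratified_sample(conversations, sample_size, seed=None):
    """Draw a random sample within each category.

    `conversations` maps category -> list of conversation IDs.
    """
    rng = random.Random(seed)
    volumes = {c: len(ids) for c, ids in conversations.items()}
    allocation = proportional_allocation(volumes, sample_size)
    return {c: rng.sample(conversations[c], n) for c, n in allocation.items()}


# Hypothetical monthly volumes for a 50-conversation audit.
volumes = {"order_status": 500, "returns": 300, "shipping": 140, "product": 60}
allocation = proportional_allocation(volumes, 50)
```

With these made-up volumes the audit draws 25, 15, 7, and 3 conversations respectively. Note that strict proportionality can leave a very small category with zero conversations, so you may want to set a floor per category.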

Root cause analysis for wrong answers

Every incorrect answer has a root cause. Identifying the root cause tells you exactly what to fix. In ecommerce AI support, wrong answers almost always trace to one of five sources:

  • Stale documentation — the knowledge base contains an outdated version of a policy. The agent applied the old rule correctly but the rule changed. Fix: update the knowledge base and add a same-day update process to your policy change workflow.
  • Ambiguous policy language — the policy document uses language that can be interpreted multiple ways. The agent chose one interpretation; the correct one is different. Fix: rewrite the ambiguous section with explicit conditional logic.
  • Missing knowledge source — the agent was asked about something not covered in its knowledge base at all. It either guessed or gave a generic answer. Fix: add the missing documentation.
  • Incorrect product data — the product catalog has an error (wrong size range, incorrect compatibility, outdated availability). The agent accurately reported inaccurate data. Fix: correct the data source; this is an operations issue, not an AI issue.
  • Reasoning error on complex questions — the agent misapplied a correct policy to a specific situation. Often involves multi-condition logic. Fix: rewrite the policy section as explicit conditional logic and test the specific scenario.
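One way to keep root causes consistent between graders is to record them as a fixed set of labels and tally them per category. This is a minimal sketch: the enum, the `FIXES` mapping, and the sample findings are illustrative, with the fix text summarized from the list above.

```python
from collections import Counter
from enum import Enum


class RootCause(Enum):
    STALE_DOCUMENTATION = "stale documentation"
    AMBIGUOUS_POLICY = "ambiguous policy language"
    MISSING_SOURCE = "missing knowledge source"
    INCORRECT_PRODUCT_DATA = "incorrect product data"
    REASONING_ERROR = "reasoning error on complex questions"


# Fix per root cause, summarized from the list above.
FIXES = {
    RootCause.STALE_DOCUMENTATION: "Update the knowledge base; add a same-day update step to the policy change workflow.",
    RootCause.AMBIGUOUS_POLICY: "Rewrite the ambiguous section with explicit conditional logic.",
    RootCause.MISSING_SOURCE: "Add the missing documentation.",
    RootCause.INCORRECT_PRODUCT_DATA: "Correct the data source.",
    RootCause.REASONING_ERROR: "Rewrite the policy as explicit conditional logic and test the scenario.",
}


def tally_root_causes(findings):
    """Count wrong answers per (category, root cause) pair."""
    return Counter(findings)


# Hypothetical findings from one week's graded sample.
findings = [
    ("returns", RootCause.STALE_DOCUMENTATION),
    ("returns", RootCause.STALE_DOCUMENTATION),
    ("shipping", RootCause.MISSING_SOURCE),
]
(top_category, top_cause), count = tally_root_causes(findings).most_common(1)[0]
next_fix = FIXES[top_cause]
```

Here the most common pairing is stale documentation in the returns category, which points straight at the fix to schedule first.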

The weekly improvement cycle

Accuracy improvement is not a one-time project — it's a weekly cycle. Here is the minimal weekly process that drives continuous improvement:

  1. Monday: sample 15–20 conversations from the previous week. Grade each as correct, partially correct, or incorrect. Record the category and root cause for each incorrect or partially correct answer.
  2. Monday afternoon: for each incorrect answer, identify the specific knowledge source or policy section responsible. Add it to your improvement queue with the category, the error, the root cause, and the fix needed.
  3. Tuesday–Wednesday: fix the highest-impact items in the improvement queue. 'Highest impact' = most frequent error type × severity of being wrong. Updating a return policy date takes 5 minutes; writing a new FAQ for a missing topic takes 30.
  4. Thursday: run a spot-check by asking the AI 3–5 questions in the categories you just improved. Verify the answers are now correct.
  5. End of month: tabulate your weekly data into a monthly accuracy score per category and track it month-over-month. If a category's accuracy isn't improving, investigate why.
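The 'frequency × severity' ranking from step 3 can be sketched as a small improvement queue. The record fields mirror what step 2 says to capture; the 1–3 severity scale and the sample items are assumptions, since the playbook doesn't prescribe a scale.

```python
from dataclasses import dataclass


@dataclass
class QueueItem:
    category: str
    error: str
    root_cause: str
    fix: str
    frequency: int  # times this error type appeared in the sample
    severity: int   # hypothetical scale: 1 = minor, 3 = costly if wrong

    @property
    def impact(self) -> int:
        # 'Highest impact' = frequency of the error type x severity of being wrong.
        return self.frequency * self.severity


def prioritize(queue):
    """Return queue items ordered from highest to lowest impact."""
    return sorted(queue, key=lambda item: item.impact, reverse=True)


# Hypothetical queue after Monday's grading.
queue = [
    QueueItem("product", "wrong size range", "incorrect product data",
              "correct the catalog entry", frequency=4, severity=1),
    QueueItem("returns", "old return window quoted", "stale documentation",
              "update the return policy doc", frequency=3, severity=3),
    QueueItem("shipping", "generic answer on customs", "missing knowledge source",
              "write a customs FAQ", frequency=1, severity=2),
]
ranked = prioritize(queue)
```

In this example the returns item ranks first (impact 9) even though the product data error was seen more often (impact 4), because being wrong about returns is assumed to be more costly.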

Accuracy benchmarks and realistic targets

The partially correct rate deserves its own target. A partially correct answer might score fine on a CSAT survey but still creates a follow-up question. Target partially correct under 10% for your most common question categories — this is the 'completeness' quality that drives first-contact resolution.

| Stage | Typical overall accuracy | Correct rate target | Key action |
| --- | --- | --- | --- |
| First 30 days (new deployment) | 75–85% | Build to > 85% | Fix the biggest knowledge gaps from escalation log |
| 30–90 days (calibration phase) | 85–92% | Build to > 90% | Weekly improvement cycle in place |
| 90–180 days (optimization phase) | 90–94% | Maintain > 92% | Per-category threshold tuning, quarterly doc audit |
| Mature deployment (6+ months) | 93–96% | Maintain > 93% | Seasonal refreshes, proactive gap detection |
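If you track these numbers in a script or spreadsheet export, a simple check against the targets might look like this sketch. The stage keys and function name are hypothetical; the thresholds are the correct rate targets from the table and the 10% partially correct ceiling mentioned above.

```python
# Correct rate targets by deployment stage, taken from the table above.
CORRECT_RATE_TARGETS = {
    "first_30_days": 0.85,
    "calibration": 0.90,
    "optimization": 0.92,
    "mature": 0.93,
}
PARTIAL_RATE_CEILING = 0.10  # partially correct target: under 10%


def check_targets(stage, correct_rate, partial_rate):
    """Compare a category's measured rates with the targets for its stage."""
    return {
        "correct_rate_ok": correct_rate > CORRECT_RATE_TARGETS[stage],
        "partial_rate_ok": partial_rate < PARTIAL_RATE_CEILING,
    }


# Hypothetical monthly result for one category in the calibration phase.
status = check_targets("calibration", correct_rate=0.91, partial_rate=0.12)
```

In this example the category clears the correct rate target but misses on completeness, which is a signal to look at partially correct answers rather than outright errors.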

Key takeaways

  • Accuracy is the prerequisite for threshold calibration, knowledge base prioritization, and CSAT analysis — it needs systematic measurement, not just reactive monitoring.
  • Use a three-level rubric: correct, partially correct, incorrect. Track both overall accuracy and the raw incorrect rate separately.
  • Sample stratified by category — proportional representation of each question type — using the sampling framework (weekly, monthly, seasonal, post-change).
  • Every wrong answer has a root cause in one of five categories: stale documentation, ambiguous language, missing knowledge, incorrect product data, or reasoning error on complex logic.
  • Run a weekly improvement cycle: sample Monday, fix Tuesday–Wednesday, spot-check Thursday. Monthly: tabulate category-level accuracy and track month-over-month.
