Why QA matters more with AI than with humans
When a human support agent gives a wrong answer, it's an isolated incident. When an AI agent gives a wrong answer, it typically gives that same wrong answer to every customer who asks the same question — until someone catches it. The leverage works both ways: AI scales correct answers, but it also scales incorrect ones.
This is why QA isn't optional for AI support. A human team with weak QA might let a few bad answers through. An AI team with weak QA might send the wrong return policy to 300 customers in a week before anyone notices. The monitoring process has to match the scale at which the agent operates.
With human agents, QA is about individual performance. With AI, QA is about systematic accuracy — finding knowledge gaps and policy drift that affect every customer who triggers that response, not just one.
What to measure: your AI support quality scorecard
Build a weekly QA dashboard with these five metrics. Each one is a leading indicator of a different type of quality problem:
| Metric | What it signals | Target |
|---|---|---|
| AI resolution accuracy (sampled) | Are answers factually correct? | > 92% correct on sample |
| CSAT on AI-resolved tickets | Do customers find the answers satisfying? | > 4.2 / 5.0 |
| Escalation rate | How often does the AI punt to humans? | Stable week-over-week |
| Escalation rate change week-over-week | Is something new confusing the agent? | Within ±3% |
| Human edit rate (if in assisted mode) | How often do agents change AI drafts? | < 15% of drafts edited |
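The targets in the table can be checked mechanically once the weekly numbers are collected. The following is a minimal sketch, not a prescribed implementation: the `WeeklyMetrics` fields and the `scorecard` function are hypothetical names, and it assumes the ±3% target means percentage points of escalation rate. Raw escalation rate has no numeric target in the table, so it is judged here only through its week-over-week change.

```python
from dataclasses import dataclass


@dataclass
class WeeklyMetrics:
    """One week of scorecard inputs. Field names are illustrative."""
    sampled_correct: int     # sampled conversations graded correct
    sampled_total: int       # sampled conversations graded
    csat_avg: float          # mean CSAT on AI-resolved tickets (1-5 scale)
    escalation_rate: float   # escalated / total AI conversations, 0-1
    drafts_edited: int = 0   # AI drafts changed by a human (assisted mode)
    drafts_total: int = 0    # AI drafts produced (0 if not in assisted mode)


def scorecard(this_week: WeeklyMetrics, last_week: WeeklyMetrics) -> dict[str, bool]:
    """Return pass/fail per metric, using the targets from the table."""
    accuracy = this_week.sampled_correct / this_week.sampled_total
    # Assumption: the ±3% target is read as percentage points.
    escalation_delta = this_week.escalation_rate - last_week.escalation_rate
    results = {
        "accuracy": accuracy > 0.92,
        "csat": this_week.csat_avg > 4.2,
        "escalation_change": abs(escalation_delta) < 0.03,
    }
    # Human edit rate only applies when the agent runs in assisted mode.
    if this_week.drafts_total > 0:
        edit_rate = this_week.drafts_edited / this_week.drafts_total
        results["human_edit_rate"] = edit_rate < 0.15
    return results


if __name__ == "__main__":
    last = WeeklyMetrics(10, 10, 4.4, 0.18, 1, 20)
    this = WeeklyMetrics(9, 10, 4.3, 0.20, 2, 20)
    for metric, passed in scorecard(this, last).items():
        print(f"{metric}: {'ok' if passed else 'INVESTIGATE'}")
```

With a weekly sample of only 10 conversations, a single incorrect answer drops accuracy to 90%, so a failing accuracy check on one week is a prompt to look closer rather than a verdict.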
The weekly QA routine (30 minutes)
The goal of the weekly routine is to catch problems early, before they affect a large volume of customers. You're looking for emerging patterns, not individual incidents.
1. Pull last week's escalation log and scan for new clusters. If 'gift card' escalations jumped from 2 to 18 this week, you have a specific gap to investigate. It's usually either a missing knowledge source or a policy change that wasn't reflected in the agent's documentation.
2. Check the CSAT trend line: look at the distribution, not just the average score. A drop in 5-star ratings with a rise in 1-star ratings is a different pattern from a general average drop. The former suggests a specific bad interaction type; the latter suggests general drift.
3. Sample 10 AI-resolved conversations, spread across categories (order status, returns, product questions). Grade each as correct, partially correct, or incorrect. Note the category of any incorrect answer.
4. If human edit rate is available, look at what types of changes agents are making. Tone corrections point to a brand voice issue. Factual corrections point to an outdated knowledge source.
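The first two steps of the routine lend themselves to a small script. The sketch below assumes each escalated ticket already carries a topic label and that CSAT scores are integers from 1 to 5. The function names and the `min_count` and `min_ratio` thresholds are illustrative assumptions, not recommended values.

```python
from collections import Counter


def new_escalation_clusters(last_week, this_week, min_count=5, min_ratio=3.0):
    """Flag topics whose escalation count jumped week over week.

    last_week / this_week: lists of topic labels, one per escalated ticket.
    Returns {topic: (previous_count, current_count)} for flagged topics.
    """
    before, after = Counter(last_week), Counter(this_week)
    flagged = {}
    for topic, count in after.items():
        previous = before.get(topic, 0)
        # max(previous, 1) avoids dividing by zero for brand-new topics.
        if count >= min_count and count / max(previous, 1) >= min_ratio:
            flagged[topic] = (previous, count)
    return flagged


def csat_distribution(scores):
    """Share of each star rating (1-5), so two weeks can be compared by shape."""
    if not scores:
        return {}
    counts = Counter(scores)
    return {star: counts.get(star, 0) / len(scores) for star in range(1, 6)}


if __name__ == "__main__":
    last = ["gift card"] * 2 + ["returns"] * 10
    this = ["gift card"] * 18 + ["returns"] * 11
    print(new_escalation_clusters(last, this))
    print(csat_distribution([5, 5, 4, 1, 1, 5]))
```

Comparing the two weeks' distributions side by side makes the difference between a polarized shift (fewer 5-star, more 1-star) and a general slide visible in a way the average alone does not.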
The monthly accuracy audit
Once a month, run a deeper accuracy audit. The weekly routine catches emerging problems; the monthly audit validates overall system health.
1. Sample 50 AI-resolved conversations, stratified by category: 15 order status, 10 returns, 10 product questions, 8 shipping/delivery, 7 other. Grade each as correct, partially correct, or incorrect. Calculate accuracy per category.
2. Re-read all policy documents in the knowledge base against your current actual policies. Flag any discrepancies. This is the most important 30 minutes of the month for long-term accuracy.
3. Review the 'never say' and brand voice guidelines. Read 10 recent AI responses with the guidelines next to them. Does the tone match? Are there phrases used that should be avoided?
4. Check product catalog data freshness. If you've added new products or variants, confirm they're visible to the agent. Gaps in the product catalog are a common source of 'I don't have information on that item' escalations.
5. Calculate your month-over-month accuracy trend and present it to whoever owns CX at your company. A number that is written down and has an owner is a number that gets maintained.
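The stratified sample and per-category accuracy from the first audit step can be sketched as follows. This assumes each conversation is a dict with a `"category"` key already set to one of the five categories, and that only a grade of `"correct"` counts toward accuracy. Both are assumptions for illustration, and the function names are hypothetical.

```python
import random
from collections import defaultdict

# Quotas from the monthly audit step above (they sum to 50).
QUOTAS = {
    "order status": 15,
    "returns": 10,
    "product questions": 10,
    "shipping/delivery": 8,
    "other": 7,
}


def stratified_sample(conversations, quotas=QUOTAS, seed=None):
    """Randomly pick up to quotas[category] conversations per category.

    If a category has fewer conversations than its quota, all of them are taken.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for conv in conversations:
        by_category[conv["category"]].append(conv)
    sample = []
    for category, quota in quotas.items():
        pool = by_category.get(category, [])
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample


def accuracy_by_category(graded):
    """graded: iterable of (category, grade) pairs, where grade is
    "correct", "partial", or "incorrect". Only "correct" counts as accurate."""
    totals, correct = defaultdict(int), defaultdict(int)
    for category, grade in graded:
        totals[category] += 1
        if grade == "correct":
            correct[category] += 1
    return {category: correct[category] / totals[category] for category in totals}
```

Counting partially correct answers as inaccurate is the stricter reading. If you prefer to track them separately, keep the same grading labels and report a second ratio rather than changing the definition month to month, so the trend stays comparable.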
On-brand tone checking
Accuracy is only half of QA — tone is the other half. A technically correct answer delivered in the wrong voice can feel robotic, cold, or off-brand in ways that undermine customer trust.
Common tone failures in AI support include: responses that are too formal for a casual brand, responses that are too casual for a premium brand, over-long explanations for simple questions, responses that don't acknowledge the customer's emotion before providing information, and responses that use corporate-speak ('I understand your frustration') without warmth.
To check tone, pick 15 AI responses from the last week and read them aloud. Would a great human support agent at your company say this? If the answer is 'no' more than 2–3 times out of 15, your brand voice documentation needs to be updated and the agent re-configured. Bookbag allows you to update your tone and persona settings at any time.
- For direct-to-consumer casual brands: check for overly formal sentence structure, passive voice, and missing warmth openers.
- For premium or luxury brands: check for casual language, abbreviations, and any humor that doesn't fit the brand register.
- For all brands: check that frustrated customers receive acknowledgment before solutions, not just the solution alone.
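Reading responses aloud remains the main tone check, but one narrow part can be pre-screened by script: phrases your guidelines say to avoid. The sketch below is an assumption-laden illustration. The `find_flagged_phrases` function is a hypothetical helper, the phrase list is supplied by you, and matching is a plain case-insensitive search that cannot judge warmth or register.

```python
import re


def find_flagged_phrases(responses, never_say):
    """Return {response_index: [phrases found]} for responses containing
    any phrase from the caller-supplied never_say list.

    Matching is case-insensitive and anchored on word boundaries, so it
    works for phrases that start and end with a letter or digit.
    """
    patterns = {
        phrase: re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
        for phrase in never_say
    }
    hits = {}
    for index, text in enumerate(responses):
        found = [phrase for phrase, rx in patterns.items() if rx.search(text)]
        if found:
            hits[index] = found
    return hits


if __name__ == "__main__":
    sample = [
        "I understand your frustration. Your order shipped on Monday.",
        "Your refund is on its way!",
    ]
    print(find_flagged_phrases(sample, ["I understand your frustration"]))
```

A clean scan does not mean the tone is right. It only means none of the listed phrases appeared, so the read-aloud review of 15 responses still applies.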
Building a QA feedback loop
QA is only valuable if findings drive changes. Build a simple feedback loop so your weekly and monthly findings actually improve the agent.
- Create a 'QA action items' log — a simple spreadsheet where each weekly/monthly finding becomes a task with an owner and a due date. This prevents findings from being noted and forgotten.
- Assign knowledge base ownership — one person (or a rotation) owns the knowledge base and is responsible for updating it when QA finds a gap. Without ownership, updates don't happen.
- Set a resolution SLA — knowledge gaps found in the weekly review should be fixed within 5 business days. Policy document discrepancies found in the monthly audit should be fixed within 2 business days.
- Track your fix rate — each month, look at last month's QA action items. Were they resolved? If the backlog grows, either the review process is too aggressive or the update process is too slow — and both are fixable.
- Share QA scores broadly — the support team should see the AI accuracy scores alongside human CSAT. When AI and human performance are measured on the same scorecard, quality culture improves across the board.
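If the action items log outgrows a spreadsheet, the due dates and fix rate described above are simple to compute. This is a minimal sketch under stated assumptions: `ActionItem`, `add_business_days`, `fix_rate`, and `overdue` are hypothetical names, business days are taken to mean Monday through Friday, and public holidays are ignored.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# SLAs in business days, taken from the resolution SLA bullet above.
SLA_DAYS = {"knowledge_gap": 5, "policy_discrepancy": 2}


def add_business_days(start: date, days: int) -> date:
    """Advance by `days` weekdays, skipping Saturdays and Sundays."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days -= 1
    return current


@dataclass
class ActionItem:
    finding: str
    kind: str                          # key into SLA_DAYS
    owner: str
    found_on: date
    resolved_on: Optional[date] = None

    @property
    def due(self) -> date:
        return add_business_days(self.found_on, SLA_DAYS[self.kind])


def fix_rate(items: list[ActionItem]) -> float:
    """Share of action items that have been resolved."""
    if not items:
        return 0.0
    return sum(1 for item in items if item.resolved_on is not None) / len(items)


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Unresolved items whose SLA due date has passed."""
    return [i for i in items if i.resolved_on is None and today > i.due]
```

Reviewing `overdue` weekly and `fix_rate` monthly mirrors the cadence in the list above: the first shows what is stuck right now, the second shows whether the backlog is growing.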
Key takeaways
- AI scales correct answers and incorrect ones equally — QA is the process that ensures you're scaling the right ones.
- Build a five-metric weekly scorecard: accuracy, CSAT, escalation rate, week-over-week escalation change, and human edit rate.
- Run a 30-minute weekly routine focused on emerging escalation clusters and CSAT trends.
- Run a monthly deeper audit: sample 50 conversations, re-read all policy documents, check tone against brand guidelines.
- Findings only matter if they drive changes — assign knowledge base ownership and resolution SLAs to every QA action item.