Setting Confidence Thresholds for Autonomous AI Resolution

Confidence thresholds are the dial between deflection rate and answer accuracy. Here's how to find the right setting for your store.

The Bookbag Team · June 2026 · 9 min read

What confidence thresholds are — and why they matter

When an AI agent processes a customer's question, it doesn't just generate an answer — it generates an answer with an internal confidence score that reflects how certain it is. A question that closely matches well-documented policy produces a high-confidence answer. A question that's ambiguous, outside the documented policy, or involves edge cases the agent hasn't seen produces a lower-confidence answer.

Confidence thresholds are the rules you set that determine what happens based on that score. Above the upper threshold, the agent answers autonomously. In the middle range, the agent drafts an answer for human review. Below the lower threshold, the agent escalates immediately. These settings are the primary lever controlling the balance between your deflection rate and your answer accuracy, and they need to be calibrated empirically, not guessed.

Why this matters

An AI agent set too conservatively (escalating everything below 90%, for example) may have excellent answer accuracy but only around 20% deflection. Set too aggressively (answering autonomously down to 50% confidence), it may reach 70% deflection with unacceptable accuracy. These figures are illustrative: the right setting depends on your data, not on defaults.
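To see the tradeoff on your own data, you can replay graded tickets against several candidate thresholds. The following is a minimal sketch, not a Bookbag feature: `GradedTicket`, `sweep`, and the toy data are hypothetical, and "deflection" is simplified here to the share of tickets that would have been answered autonomously.

```python
from dataclasses import dataclass


@dataclass
class GradedTicket:
    confidence: float  # agent's confidence score, 0.0 to 1.0
    correct: bool      # human grade of the AI's answer


def sweep(tickets, thresholds):
    """For each candidate threshold, return (threshold, deflection, accuracy).

    deflection: share of all tickets at or above the threshold.
    accuracy:   share of those tickets graded correct (None if there are none).
    """
    rows = []
    for threshold in thresholds:
        answered = [t for t in tickets if t.confidence >= threshold]
        deflection = len(answered) / len(tickets) if tickets else 0.0
        accuracy = (
            sum(t.correct for t in answered) / len(answered) if answered else None
        )
        rows.append((threshold, deflection, accuracy))
    return rows


# Toy data, made up purely for illustration.
tickets = [
    GradedTicket(0.97, True), GradedTicket(0.95, True), GradedTicket(0.92, True),
    GradedTicket(0.85, True), GradedTicket(0.80, False), GradedTicket(0.72, True),
    GradedTicket(0.65, True), GradedTicket(0.55, False), GradedTicket(0.50, False),
    GradedTicket(0.40, False),
]

rows = sweep(tickets, [0.90, 0.70, 0.50])
for threshold, deflection, accuracy in rows:
    print(f"threshold={threshold:.2f} deflection={deflection:.0%} accuracy={accuracy:.0%}")
```

On this toy data, lowering the threshold raises deflection and lowers accuracy at each step, which is the pattern to look for in your own numbers.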

The three resolution modes

Assisted mode is underutilized by most teams. It's not a failure state — it's a productivity multiplier. Agents reviewing AI drafts work 2–3x faster than agents writing from scratch. Many tickets that 'can't be autonomously resolved' can still be handled much faster with an AI draft than without one.

Mode | When it applies | What happens | Customer experience
Autonomous resolution | Confidence above upper threshold | AI answers without human review | Instant answer, no wait
Assisted resolution | Confidence in middle range | AI drafts answer, human reviews and sends | Fast answer (< 5 min), human-verified
Immediate escalation | Confidence below lower threshold | AI routes to human with escalation summary | Human answers with full AI context

Starting thresholds: safe defaults for a new deployment

For a new AI deployment where you don't yet have empirical accuracy data, use conservative starting thresholds. These are intentionally safe — they favor accuracy over deflection in the first 30 days when calibration is ongoing.

  1. Autonomous threshold: 90%. The agent answers without human review only when highly confident. This will produce lower deflection in the first month but protects accuracy during the calibration window.
  2. Assisted band: 70–90%. The agent drafts an answer for human review. Human agents should send the draft, editing it if needed, or discard it and write their own reply if the draft is wrong. Track edit rates per category.
  3. Escalation threshold: below 70%. Route directly to a human. At this confidence level, the agent is more likely to waste the customer's time than to help.
  4. Hard overrides: regardless of confidence score, configure explicit topic-based escalation rules (fraud, legal language, high-value refund requests, safety concerns). These should always escalate, even if the agent is 95% confident.
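The starting configuration above can be sketched as a small routing function. This is an illustration under stated assumptions, not Bookbag's actual configuration format: the names `Mode`, `route`, and the topic labels are hypothetical, and a score of exactly 90% is treated as autonomous.

```python
from enum import Enum


class Mode(Enum):
    AUTONOMOUS = "autonomous"
    ASSISTED = "assisted"
    ESCALATE = "escalate"


# Conservative starting thresholds from the list above.
AUTONOMOUS_THRESHOLD = 0.90
ESCALATION_THRESHOLD = 0.70

# Hypothetical topic labels; use whatever your topic classifier emits.
HARD_OVERRIDE_TOPICS = {"fraud", "legal", "high_value_refund", "safety"}


def route(confidence: float, topics: set) -> Mode:
    """Pick a resolution mode for one ticket."""
    # Hard overrides are checked first, so they win at any confidence level.
    if topics & HARD_OVERRIDE_TOPICS:
        return Mode.ESCALATE
    if confidence >= AUTONOMOUS_THRESHOLD:
        return Mode.AUTONOMOUS
    if confidence >= ESCALATION_THRESHOLD:
        return Mode.ASSISTED
    return Mode.ESCALATE


print(route(0.95, set()))          # high confidence, no override topic
print(route(0.95, {"fraud"}))      # override applies despite high confidence
print(route(0.80, {"shipping"}))   # middle band
print(route(0.60, set()))          # below the escalation threshold
```

Checking the overrides before the score is the important design choice: it guarantees that a confident answer on a sensitive topic can never be sent without a human.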

Calibrating thresholds with real data

After 30 days of deployment, you have enough data to calibrate your thresholds empirically. The calibration question is: what was the actual accuracy of answers at each confidence band?

  1. Pull a stratified sample of AI answers from the first 30 days: for example, 30 from the 90–95% band, 30 from the 80–90% band, and 20 from the 75–80% band. Bands below your autonomous threshold will contain mostly assisted-mode drafts rather than autonomous answers. For those, grade the AI's original draft, not the edited version the human sent.
  2. Grade each answer as correct, partially correct, or incorrect, and calculate accuracy per band. If answers scoring 85–90% turn out to be 92% correct, and that meets your accuracy target, you can lower your autonomous threshold to 85%.
  3. For the assisted mode band, calculate the human edit rate: what percentage of drafts did human agents modify before sending? A low edit rate (< 15%) suggests the assisted band could move toward autonomous. A high edit rate (> 40%) suggests the knowledge base needs improvement before lowering the threshold.
  4. After calibration, lower the autonomous threshold incrementally. Drop it by 3–5 percentage points, then run another 30-day calibration before lowering further. Don't drop thresholds based on one data point or intuition.
  5. Recalibrate quarterly. Even after initial calibration, thresholds should be reviewed every 90 days. Knowledge base improvements, policy changes, and seasonal volume shifts all affect accuracy at each confidence band.
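The grading and edit-rate steps above come down to a few lines of arithmetic. The sketch below is one way to do it, with these assumptions: the function names are hypothetical, "partially correct" counts as not correct, and a Wilson lower bound is added as an extra caution because a 30-answer sample leaves real uncertainty around the measured accuracy.

```python
import math
from collections import defaultdict

# Confidence bands as [low, high) pairs, matching the sampling step.
BANDS = [(0.90, 0.95), (0.80, 0.90), (0.75, 0.80)]


def band_of(confidence):
    """Return the band containing this score, or None if it falls outside all bands."""
    for low, high in BANDS:
        if low <= confidence < high:
            return (low, high)
    return None


def accuracy_by_band(graded):
    """graded: iterable of (confidence, grade), grade in {'correct', 'partial', 'incorrect'}.

    Returns {band: (accuracy, sample_size)}. Only 'correct' counts as correct.
    """
    counts = defaultdict(lambda: [0, 0])  # band -> [correct, total]
    for confidence, grade in graded:
        band = band_of(confidence)
        if band is None:
            continue
        counts[band][1] += 1
        if grade == "correct":
            counts[band][0] += 1
    return {band: (c / n, n) for band, (c, n) in counts.items()}


def wilson_lower_bound(accuracy, n, z=1.96):
    """Lower end of the ~95% Wilson score interval for a measured accuracy."""
    if n == 0:
        return 0.0
    centre = accuracy + z * z / (2 * n)
    margin = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)


def edit_rate_signal(edited, total):
    """Apply the edit-rate cutoffs from step 3 (< 15% and > 40%)."""
    rate = edited / total
    if rate < 0.15:
        return "candidate for autonomous"
    if rate > 0.40:
        return "improve knowledge base first"
    return "keep in assisted mode"


sample = [(0.93, "correct"), (0.91, "partial"), (0.86, "correct"), (0.85, "correct")]
for band, (accuracy, n) in accuracy_by_band(sample).items():
    print(band, f"accuracy={accuracy:.0%}", f"n={n}")
```

If the lower bound for a band sits well under your accuracy target, that is a reason to grade a larger sample before lowering the threshold into that band.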

Per-category threshold tuning

Start with a global threshold and move to per-category thresholds in your second and third months as you gather category-level accuracy data. Bookbag's reporting shows accuracy by question category, making this analysis straightforward.

Question category | Why accuracy varies | Typical optimal autonomous threshold
Order status / WISMO | Data-driven; high accuracy when data is live | 80–85%
Return eligibility | Policy-dependent; accuracy high when docs are clear | 85–90%
Product questions | Data-dependent; accuracy varies with catalog completeness | 80–90% depending on catalog quality
Shipping timelines | Carrier data + policy; mostly reliable | 82–88%
Promotions and discounts | Documentation-dependent; changes frequently | 88–92% (higher because promos change often)
Account questions | Limited data access; a higher threshold is typically appropriate | 88–90%
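Per-category thresholds can be modeled as a lookup with a global fallback. This is a sketch only: the category keys and function names are hypothetical, and the values simply take the upper end of each range in the table as a cautious starting point. Product questions are left out on purpose to show the fallback.

```python
# Global threshold used for any category without its own setting.
GLOBAL_AUTONOMOUS_THRESHOLD = 0.90

# Illustrative values: the upper end of each range in the table above.
CATEGORY_THRESHOLDS = {
    "order_status": 0.85,
    "return_eligibility": 0.90,
    "shipping_timelines": 0.88,
    "promotions": 0.92,
    "account": 0.90,
}


def autonomous_threshold(category: str) -> float:
    """Return the category's threshold, or the global one if none is set."""
    return CATEGORY_THRESHOLDS.get(category, GLOBAL_AUTONOMOUS_THRESHOLD)


def can_answer_autonomously(category: str, confidence: float) -> bool:
    return confidence >= autonomous_threshold(category)


print(can_answer_autonomously("order_status", 0.86))  # cleared by the category threshold
print(can_answer_autonomously("promotions", 0.90))    # promotions needs a higher score
print(can_answer_autonomously("product", 0.89))       # no entry, global threshold applies
```

Keeping the global value as the default means a new or unclassified category never runs at a looser threshold than the one you have already validated.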

Common calibration mistakes

The most common threshold calibration mistakes are predictable and fixable:

  • Setting thresholds once and never updating: thresholds set at launch go stale as the knowledge base improves. A 90% threshold that made sense at launch might reasonably come down to 82% after 3 months of knowledge base iteration. Review quarterly.
  • Conflating confidence with accuracy — a high confidence score doesn't guarantee a correct answer; it means the agent is confident in the answer it generated. The only way to validate confidence calibration is to sample answers and measure actual accuracy. Don't trust confidence scores without empirical validation.
  • Using one global threshold for all categories — order status questions with live data access are more reliably accurate at lower confidence levels than policy interpretation questions with vague documentation. Per-category thresholds are more efficient.
  • Lowering thresholds too aggressively to hit deflection targets: if leadership sets a deflection target and the team lowers confidence thresholds to hit it without improving the knowledge base, accuracy suffers. Deflection rate and accuracy rate are interdependent metrics. Set targets for both.
  • Not measuring CSAT separately for each confidence band — if accuracy is good but CSAT on autonomous tickets is low, the problem is tone or completeness, not the threshold. Measure satisfaction by band to diagnose correctly.
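The last diagnostic above, separating accuracy problems from satisfaction problems within a band, can be written as a simple check. The function name and the two target values are hypothetical placeholders; substitute your own targets.

```python
def diagnose_band(accuracy: float, csat: float,
                  accuracy_target: float = 0.90,
                  csat_target: float = 0.80) -> str:
    """Classify one confidence band from its measured accuracy and CSAT.

    Both inputs are fractions between 0.0 and 1.0 for autonomous tickets in the band.
    """
    if accuracy < accuracy_target:
        # Wrong answers are a threshold or knowledge base issue.
        return "accuracy problem: raise the threshold or improve the knowledge base"
    if csat < csat_target:
        # Correct answers that still disappoint point at tone or completeness.
        return "tone or completeness problem: review answer style, not the threshold"
    return "healthy"


print(diagnose_band(accuracy=0.94, csat=0.70))
print(diagnose_band(accuracy=0.82, csat=0.85))
print(diagnose_band(accuracy=0.95, csat=0.90))
```

Accuracy is checked first because a band with wrong answers needs fixing regardless of how customers rated it.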

Key takeaways

  • Confidence thresholds control the balance between deflection rate and answer accuracy — calibrate them empirically, not by guessing.
  • Start conservative (autonomous at 90%, assisted at 70–90%, escalate below 70%) and tune down with real accuracy data after 30 days.
  • Calibrate by sampling answers in each confidence band and measuring actual accuracy — then lower thresholds incrementally where accuracy supports it.
  • Move to per-category thresholds in months 2–3 as category-level accuracy data becomes available — order status can run at lower thresholds than policy questions.
  • Never lower thresholds just to hit a deflection target. Lower them only where measured accuracy supports it, and improve the knowledge base where it doesn't. Deflection rate and accuracy are interdependent, and both need targets.
