What confidence thresholds are — and why they matter
When an AI agent processes a customer's question, it doesn't just generate an answer — it generates an answer with an internal confidence score that reflects how certain it is. A question that closely matches well-documented policy produces a high-confidence answer. A question that's ambiguous, outside the documented policy, or involves edge cases the agent hasn't seen produces a lower-confidence answer.
Confidence thresholds are the rules you set that determine what happens based on that score. Above a threshold: the agent answers autonomously. In a middle range: the agent drafts an answer for human review. Below a threshold: the agent escalates immediately. These settings are the primary lever controlling the balance between your deflection rate and your answer accuracy — and they need to be calibrated empirically, not guessed.
An AI agent set too conservatively (escalating everything below 90% confidence) might have excellent answer accuracy but only 20% deflection. Set too aggressively (answering autonomously down to 50% confidence), it might reach 70% deflection with unacceptable accuracy. These figures are illustrative: the right setting depends on your data, not on defaults.
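To make the trade-off concrete, here is a minimal Python sketch that computes deflection and accuracy for several candidate thresholds. The `GradedTicket` type, the `deflection_and_accuracy` helper, and the sample data are all invented for illustration. The sketch assumes each ticket already has a confidence score (as a fraction) and a human grade.

```python
from dataclasses import dataclass


@dataclass
class GradedTicket:
    confidence: float  # agent's confidence score as a fraction (0.0 to 1.0)
    correct: bool      # human grade of the AI answer


def deflection_and_accuracy(tickets, threshold):
    """Return (deflection_rate, accuracy) if every ticket at or above
    `threshold` were answered autonomously. Accuracy is None when no
    ticket clears the threshold."""
    if not tickets:
        return 0.0, None
    answered = [t for t in tickets if t.confidence >= threshold]
    deflection = len(answered) / len(tickets)
    accuracy = sum(t.correct for t in answered) / len(answered) if answered else None
    return deflection, accuracy


# Toy data, invented for illustration only.
SAMPLE = [
    GradedTicket(0.97, True), GradedTicket(0.93, True),
    GradedTicket(0.88, True), GradedTicket(0.84, True),
    GradedTicket(0.78, False), GradedTicket(0.72, True),
    GradedTicket(0.65, False), GradedTicket(0.58, True),
    GradedTicket(0.52, False), GradedTicket(0.40, False),
]

for threshold in (0.90, 0.80, 0.70, 0.50):
    deflection, accuracy = deflection_and_accuracy(SAMPLE, threshold)
    shown = f"{accuracy:.0%}" if accuracy is not None else "n/a"
    print(f"threshold={threshold:.2f} deflection={deflection:.0%} accuracy={shown}")
```

On this toy sample, dropping the threshold from 0.90 to 0.50 raises deflection from 20% to 90% while accuracy falls from 100% to 67%. Real data will have a different shape, which is why the curve has to be measured rather than assumed.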
The three resolution modes
Assisted mode is underutilized by most teams. It's not a failure state; it's a productivity multiplier. Human agents reviewing AI drafts work 2–3x faster than agents writing from scratch. Many tickets that can't be resolved autonomously can still be handled much faster with an AI draft than without one.
| Mode | When it applies | What happens | Customer experience |
|---|---|---|---|
| Autonomous resolution | Confidence above upper threshold | AI answers without human review | Instant answer, no wait |
| Assisted resolution | Confidence in middle range | AI drafts answer, human reviews and sends | Fast answer (< 5 min), human-verified |
| Immediate escalation | Confidence below lower threshold | AI routes to human with escalation summary | Human answers with full AI context |
Starting thresholds: safe defaults for a new deployment
For a new AI deployment where you don't yet have empirical accuracy data, use conservative starting thresholds. These are intentionally safe — they favor accuracy over deflection in the first 30 days when calibration is ongoing.
1. Autonomous threshold: 90%. The agent answers without human review only when highly confident. This will produce lower deflection in the first month but protects accuracy during the calibration window.
2. Assisted threshold: 70–90%. The agent drafts an answer for human review. Human agents should send the draft as-is, edit it before sending, or discard it if it is wrong. Track edit rates per category.
3. Escalation threshold: below 70%. Route directly to a human. At this confidence level, the agent is more likely to waste the customer's time than to help.
4. Add hard overrides: regardless of confidence score, configure explicit topic-based escalation rules (fraud, legal language, high-value refund requests, safety concerns). These should always escalate, even if the agent is 95% confident.
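The starting configuration above can be sketched as a small routing function. This is an illustrative Python sketch, not any product's actual API. The `Mode` enum, the `route` function, and the topic labels are hypothetical. It assumes an upstream step has already tagged each ticket with topics, and it treats a score of exactly 90% as autonomous and exactly 70% as assisted.

```python
from enum import Enum


class Mode(Enum):
    AUTONOMOUS = "autonomous"
    ASSISTED = "assisted"
    ESCALATE = "escalate"


# Conservative starting thresholds from the list above, as fractions.
AUTONOMOUS_THRESHOLD = 0.90
ESCALATION_THRESHOLD = 0.70

# Hypothetical topic labels; assumes an upstream classifier tags each ticket.
HARD_OVERRIDE_TOPICS = {"fraud", "legal", "high_value_refund", "safety"}


def route(confidence: float, topics: set[str]) -> Mode:
    """Pick a resolution mode for one ticket."""
    # Hard overrides are checked first so no confidence score can bypass them.
    if topics & HARD_OVERRIDE_TOPICS:
        return Mode.ESCALATE
    if confidence >= AUTONOMOUS_THRESHOLD:
        return Mode.AUTONOMOUS
    if confidence >= ESCALATION_THRESHOLD:
        return Mode.ASSISTED
    return Mode.ESCALATE
```

For example, `route(0.95, {"fraud"})` escalates even though 0.95 clears the autonomous threshold, while `route(0.80, set())` produces a draft for human review.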
Calibrating thresholds with real data
After 30 days of deployment, you have enough data to calibrate your thresholds empirically. The calibration question is: what was the actual accuracy of answers at each confidence band?
1. Pull a stratified sample of AI answers from the first 30 days: 30 from the 90–95% band, 30 from the 80–90% band, and 20 from the 75–80% band. With a 90% autonomous threshold, answers below 90% will mostly be assisted-mode drafts rather than autonomous answers; grade those drafts as the AI wrote them, before any human edits.
2. Grade each answer as correct, partially correct, or incorrect. Calculate accuracy per band. For example, if answers scoring 85–90% are 92% correct and that meets your accuracy target, you can lower your autonomous threshold to 85%.
3. For the assisted band, calculate the human edit rate: the percentage of drafts that agents modified before sending. A low edit rate (< 15%) suggests the assisted band could move toward autonomous. A high edit rate (> 40%) suggests the knowledge base needs improvement before lowering the threshold.
4. After calibration, lower the autonomous threshold incrementally: drop it by 3–5 percentage points, then run another 30-day calibration before lowering further. Don't lower thresholds based on one data point or intuition.
5. Recalibrate quarterly. Even after initial calibration, review thresholds every 90 days. Knowledge base improvements, policy changes, and seasonal volume shifts all affect accuracy at each confidence band.
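The per-band accuracy and edit-rate calculations from the steps above can be sketched in Python as follows. The function names and input shapes are assumptions made for illustration. Counting only "correct" grades as accurate is a deliberately strict choice, since how to score "partially correct" is a policy decision. The middle edit-rate verdict ("keep in assisted mode") is also an assumption; the steps above only define the low and high cut-offs.

```python
from collections import defaultdict

# Confidence bands as (low, high) fractions: low inclusive, high exclusive.
BANDS = [(0.75, 0.80), (0.80, 0.90), (0.90, 0.95)]


def band_of(confidence, bands=BANDS):
    """Return the band containing `confidence`, or None if it falls outside all bands."""
    for low, high in bands:
        if low <= confidence < high:
            return (low, high)
    return None


def accuracy_by_band(graded, bands=BANDS):
    """`graded` is an iterable of (confidence, grade) pairs, where grade is
    "correct", "partial", or "incorrect". Only "correct" counts as accurate."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for confidence, grade in graded:
        band = band_of(confidence, bands)
        if band is None:
            continue  # outside the sampled bands
        totals[band] += 1
        correct[band] += grade == "correct"
    return {band: correct[band] / totals[band] for band in totals}


def edit_rate(edited_flags):
    """`edited_flags` holds one boolean per assisted draft: True if a human
    modified it before sending. Returns None for an empty sample."""
    flags = list(edited_flags)
    return sum(flags) / len(flags) if flags else None


def edit_rate_signal(rate):
    """Map an edit rate to the guidance in step 3."""
    if rate < 0.15:
        return "candidate for autonomous"
    if rate > 0.40:
        return "improve knowledge base first"
    return "keep in assisted mode"  # assumption: the article leaves this range open
```

With samples of 20 to 30 answers per band, a single graded answer moves the band's accuracy by several percentage points, which is one reason to lower thresholds in small steps and re-measure.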
Per-category threshold tuning
Start with a global threshold and move to per-category thresholds in your second and third months as you gather category-level accuracy data. Bookbag's reporting shows accuracy by question category, making this analysis straightforward.
| Question category | Why accuracy varies | Typical optimal autonomous threshold |
|---|---|---|
| Order status / WISMO | Data-driven; high accuracy when data is live | 80–85% |
| Return eligibility | Policy-dependent; accuracy high when docs are clear | 85–90% |
| Product questions | Data-dependent; accuracy varies with catalog completeness | 80–90% depending on catalog quality |
| Shipping timelines | Carrier data + policy; mostly reliable | 82–88% |
| Promotions and discounts | Documentation-dependent; changes frequently | 88–92% (higher because promos change often) |
| Account questions | Limited data access; a higher threshold is typically appropriate | 88–90% |
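One simple way to represent per-category thresholds is a lookup table that falls back to the global threshold. In this Python sketch the category keys and function names are hypothetical, and the values are simply the upper end of three ranges from the table, chosen for illustration rather than recommended.

```python
# Global fallback for categories without enough data yet.
GLOBAL_AUTONOMOUS_THRESHOLD = 0.90

# Hypothetical category keys. Values are the upper end of the table's
# ranges, used only as illustrative starting points.
CATEGORY_THRESHOLDS = {
    "order_status": 0.85,
    "return_eligibility": 0.90,
    "promotions": 0.92,
}


def autonomous_threshold(category: str) -> float:
    """Return the category's threshold, or the global one if none is set."""
    return CATEGORY_THRESHOLDS.get(category, GLOBAL_AUTONOMOUS_THRESHOLD)


def can_answer_autonomously(confidence: float, category: str) -> bool:
    return confidence >= autonomous_threshold(category)
```

With these values, a score of 0.86 is enough for an order status question but not for a promotions question. A category with no entry falls back to the global 0.90.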
Common calibration mistakes
The most common threshold calibration mistakes are predictable and fixable:
- Setting thresholds once and never updating: thresholds set at launch become inaccurate as the knowledge base improves. A 90% threshold that made sense at launch might safely drop to 82% after 3 months of knowledge base iteration. Review quarterly.
- Conflating confidence with accuracy — a high confidence score doesn't guarantee a correct answer; it means the agent is confident in the answer it generated. The only way to validate confidence calibration is to sample answers and measure actual accuracy. Don't trust confidence scores without empirical validation.
- Using one global threshold for all categories — order status questions with live data access are more reliably accurate at lower confidence levels than policy interpretation questions with vague documentation. Per-category thresholds are more efficient.
- Lowering thresholds too aggressively to hit deflection targets: if leadership sets a deflection target and the team lowers confidence thresholds to hit it without improving the knowledge base, accuracy suffers. Deflection rate and accuracy are interdependent metrics. Set targets for both.
- Not measuring CSAT separately for each confidence band — if accuracy is good but CSAT on autonomous tickets is low, the problem is tone or completeness, not the threshold. Measure satisfaction by band to diagnose correctly.
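The last point, separating accuracy problems from satisfaction problems within a band, can be sketched as a small diagnostic in Python. The `diagnose_band` function, the 1–5 CSAT scale, and both cut-off values are assumptions for illustration; substitute your own targets.

```python
from statistics import mean


def diagnose_band(accuracy, csat_scores, min_accuracy=0.90, min_csat=4.0):
    """Diagnose one confidence band.

    accuracy:     measured accuracy for the band, as a fraction
    csat_scores:  list of CSAT ratings for autonomous tickets in the band
                  (assumes a 1-5 scale)
    min_accuracy, min_csat: hypothetical targets, not recommendations
    """
    if not csat_scores:
        return "no CSAT data"
    if accuracy < min_accuracy:
        # Wrong answers come first: fix the threshold or the knowledge base.
        return "accuracy problem"
    if mean(csat_scores) < min_csat:
        # Answers are right but customers are unhappy.
        return "tone or completeness problem"
    return "healthy"
```

For instance, `diagnose_band(0.95, [3, 2, 4])` returns `"tone or completeness problem"`: accuracy meets the target, so raising the threshold would cut deflection without addressing the low ratings.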
Key takeaways
- Confidence thresholds control the balance between deflection rate and answer accuracy — calibrate them empirically, not by guessing.
- Start conservative (autonomous at 90%, assisted at 70–90%, escalate below 70%) and tune down with real accuracy data after 30 days.
- Calibrate by sampling answers in each confidence band and measuring actual accuracy — then lower thresholds incrementally where accuracy supports it.
- Move to per-category thresholds in months 2–3 as category-level accuracy data becomes available — order status can run at lower thresholds than policy questions.
- Don't lower thresholds just to hit a deflection target: without knowledge base improvements, accuracy suffers. Deflection rate and accuracy are interdependent, so set targets for both.