Frequently Asked Questions

What deflection rate should I expect at a 90% autonomous threshold?

With a well-built knowledge base, most stores see roughly 40–55% autonomous deflection in month one at a 90% threshold. As the docs mature and you calibrate down toward 82–85%, deflection commonly climbs to 60–70% while accuracy holds. Dropping straight to 75% without improving the knowledge base can push deflection higher but tends to pull accuracy into the low 80s — not a trade most stores want.

How do I set confidence thresholds in Bookbag?

You configure thresholds in the agent's resolution settings, with both global values and per-category overrides. Layer hard escalation rules on top for sensitive topics like fraud and high-value refunds. The analytics dashboard shows confidence-band distribution and accuracy by band, which is what you use to run the 30-day calibration loop and decide where to lower the line.

Should I use assisted mode or skip straight from autonomous to escalation?

Keep assisted mode on. In the middle confidence band the AI's draft is usually most of the way there, so a human agent can verify and send it in a minute or two. Skipping assisted mode means those tickets get written from scratch, which is slower for your team and no faster for the customer. The assisted-band edit rate is also your best signal for when to lower the autonomous threshold.

My agent's confidence scores don't seem to predict accuracy. What now?

That's miscalibration — the score isn't tracking real correctness. The usual cause is an ungrounded or messy knowledge base, so the fastest fix is cleaning contradictory or stale docs and making sure the agent answers from live order and catalog data. Until calibration improves, rely on empirical accuracy sampling to set thresholds rather than trusting the raw confidence number.

How often should I recalibrate thresholds?

Run a full calibration after the first 30 days, then review at least every 90 days. Knowledge base edits, policy and pricing changes, new products, and seasonal volume all shift accuracy inside each confidence band, so a threshold that was correct one quarter can drift the next. Recalibrate sooner if you ship a major docs update or see reopen rates climb.


Setting Confidence Thresholds for Autonomous AI Resolution

Confidence thresholds are the dial between deflection rate and answer accuracy. Here's how to find the right setting for your store — and tune it as your knowledge base improves.

The Bookbag Team · June 2026 · 15 min read

In this article

What confidence thresholds are
Confidence is not accuracy
The three resolution modes
Safe starting thresholds
Calibrating with real data
Per-category threshold tuning
Hard overrides that ignore confidence
Metrics to watch
Common calibration mistakes
How Bookbag handles thresholds

What confidence thresholds are — and why they matter

A confidence threshold is the cutoff that decides whether your AI support agent answers a customer on its own or hands the ticket to a human. When the agent drafts a reply, it also produces an internal confidence score for that reply. The threshold is the rule you set on top of that score: above the line, the agent resolves autonomously; below it, the agent escalates. Get the line right and you deflect the bulk of routine volume without putting wrong answers in front of customers. Get it wrong and you give up one side of that trade for the other.

Confidence varies for good reasons. A WISMO question — 'where is order #4821?' — that maps cleanly to live tracking data produces a high-confidence answer. A vague question about a discount that expired last week, or a return on an item your policy doesn't explicitly cover, produces a lower-confidence one. Confidence thresholds are how you turn that signal into a routing decision instead of letting the agent answer everything at the same risk level.

This is the single most important setting in an autonomous support deployment, and it is the one teams most often leave on a default. The right value is not a number you can copy from a blog post. It depends on how complete your help docs are, how clean your product catalog is, and how forgiving your customers are — which means it has to be calibrated against your own data.

It also helps to be honest about what each direction costs. Set the threshold too high and the agent escalates questions it could have answered perfectly, so your team keeps fielding the same WISMO and return-window tickets you bought automation to remove. Set it too low and the agent ships confident answers on questions it didn't really understand, which shows up later as reopened tickets, refund disputes, and one-star CSAT. Both failures are invisible on a deflection dashboard if you only look at the headline number — which is why the rest of this guide is about measuring the trade, not just picking a value.

Definition

A confidence threshold is the minimum confidence score an AI agent must reach to resolve a ticket autonomously. Above the threshold, the agent answers and closes the ticket. Below it, the agent drafts for review or escalates to a human. It is the primary lever controlling the balance between deflection rate and answer accuracy.

Confidence is not accuracy — and why that ruins naive setups

Confidence is the model's estimate of how likely its answer is correct. Accuracy is whether the answer was actually correct. They correlate, but they are not the same number, and the gap between them is exactly where threshold setups go wrong. An agent can report 90% confidence on a set of answers that turns out to be only 75% accurate in practice, because confidence reflects how well the question matched what the agent was given, not whether what it was given was right.

This matters because the fix for a low-accuracy band is rarely 'raise the threshold.' Usually the underlying problem is the knowledge base: a policy page that contradicts itself, a shipping table that's three months stale, a return window documented two different ways. The agent reads the ambiguous source, picks one reading, and reports high confidence — because from its point of view the question was answerable. Raising the threshold hides the symptom; fixing the source removes it.

Grounding is what keeps confidence and accuracy close together. Industry testing finds that ungrounded chatbots hallucinate somewhere between 15% and 27% of the time, while LLMs grounded in a controlled knowledge base drop that to roughly 0.7%–1.5%. The same testing shows generative agents reaching about 92% intent-understanding accuracy versus 65%–70% for older keyword bots. The lesson: a confidence score is only trustworthy when the agent is answering from real, current store data — orders, catalog, and clean docs — rather than guessing.

A high confidence score doesn't mean the answer is right. It means the agent found a clear path from the question to the source it was given. If the source is wrong, so is the confident answer.
— Support QA principle
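One way to put a number on the gap between confidence and accuracy is to compare the mean reported confidence of a graded sample with the share of answers that were actually correct. The sketch below is illustrative only: the function name `calibration_gap` and the input shape (confidence/correct pairs taken from your own QA grading) are assumptions, not part of any Bookbag API.

```python
def calibration_gap(graded: list[tuple[float, bool]]) -> float:
    """Return mean reported confidence minus observed accuracy.

    graded: (confidence in 0.0-1.0, was_correct) pairs from human grading.
    A positive result means the agent is overconfident on this sample.
    """
    if not graded:
        raise ValueError("need at least one graded answer")
    mean_confidence = sum(conf for conf, _ in graded) / len(graded)
    accuracy = sum(1 for _, ok in graded if ok) / len(graded)
    return mean_confidence - accuracy


# Hypothetical sample: four answers reported at 90% confidence, three correct.
sample = [(0.90, True), (0.90, True), (0.90, True), (0.90, False)]
gap = calibration_gap(sample)  # roughly 0.15, i.e. about 15 points overconfident
```

A gap near zero suggests the score can be trusted for routing; a large positive gap points back at the knowledge base rather than at the threshold.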

The three resolution modes you're actually configuring

A confidence threshold isn't a single on/off switch — it splits incoming questions into three lanes, and you set two cutoffs to define them. Above your upper threshold, the agent resolves autonomously. In the middle band, the agent drafts a reply for a human to review and send. Below your lower threshold, the agent escalates immediately with a summary of what it gathered.

Most teams obsess over the autonomous lane and ignore the middle one, which is a mistake. Assisted resolution is not a failure state; it's a productivity multiplier. A human agent reviewing a near-complete AI draft works far faster than one writing from a blank box, and plenty of tickets that 'can't be fully automated' still close in a fraction of the time with a draft in hand. Treat the middle band as a feature, not a consolation prize.

The width of that middle band is its own decision. A wide assisted band (say 65–90%) routes more tickets through human review, which is the cautious choice when your knowledge base is young and you'd rather a person catch mistakes than a customer. A narrow band pushes more volume to the autonomous and escalation lanes, which is where you end up once accuracy in the upper bands has proven itself. Most stores start wide and narrow the band as calibration data comes in — the same direction the autonomous threshold moves.

Mode	When it applies	What happens	Customer experience
Autonomous resolution	Confidence above the upper threshold	Agent answers and closes without human review	Instant answer, no wait
Assisted resolution	Confidence in the middle band	Agent drafts; a human reviews, edits, and sends	Fast answer (under 5 min), human-verified
Immediate escalation	Confidence below the lower threshold	Agent routes to a human with a context summary	Human answers with full AI context
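The three lanes reduce to two comparisons against the confidence score. This is a minimal sketch, assuming confidence arrives as a fraction between 0 and 1 and that a score exactly on a cutoff falls into the higher lane; `Mode` and `route_ticket` are hypothetical names used for illustration, not Bookbag settings.

```python
from enum import Enum


class Mode(Enum):
    AUTONOMOUS = "autonomous"   # agent answers and closes
    ASSISTED = "assisted"       # agent drafts, a human reviews and sends
    ESCALATE = "escalate"       # straight to a human with a summary


def route_ticket(confidence: float, upper: float, lower: float) -> Mode:
    """Map a confidence score (0.0-1.0) to one of the three resolution lanes."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if lower > upper:
        raise ValueError("lower cutoff cannot exceed upper cutoff")
    # Assumption: a score equal to a cutoff counts as meeting it.
    if confidence >= upper:
        return Mode.AUTONOMOUS
    if confidence >= lower:
        return Mode.ASSISTED
    return Mode.ESCALATE


# A wide assisted band (65-90%) sends an 82% answer to human review.
lane = route_ticket(0.82, upper=0.90, lower=0.65)  # Mode.ASSISTED
```

Narrowing the assisted band is then just a matter of moving `upper` and `lower` closer together as calibration data supports it.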

Safe starting thresholds for a brand-new deployment

On day one you have no accuracy data, so start conservative and let real results earn you a lower threshold. These defaults intentionally favor accuracy over deflection during the first 30 days, when calibration is still in progress. You will deflect less than your eventual ceiling in month one — that is the correct trade while you're still learning where the agent is reliable.

Use these four rules as a starting configuration, then plan to revisit them after your first month of data.

1. Autonomous threshold at 90%. The agent only closes a ticket on its own when it's highly confident. Deflection will be lower this month, but you avoid putting shaky answers in front of customers before you've validated the agent.
2. Assisted band from 70% to 90%. The AI agent drafts; human agents send with edits, or overrule when the draft is wrong. Track the edit rate per category, because it's your richest calibration signal.
3. Escalation below 70%. Route straight to a human. At this confidence the agent is more likely to waste the customer's time than to help, and a fast human handoff beats a confident wrong answer.
4. Add hard overrides on top. Regardless of score, force escalation on fraud, chargebacks, legal language, safety issues, and high-value refunds. These escalate even at 95% confidence; see the overrides section below.

Why start high

Early wrong answers are expensive in trust, not just CSAT. A customer who catches the agent being confidently wrong in week one stops trusting it for months. Start at 90%, prove accuracy, then lower the line on evidence. You can always deflect more later; you can't un-burn a first impression.

Calibrating thresholds with real data after 30 days

After about 30 days you have enough resolved tickets to calibrate empirically instead of guessing. The whole exercise answers one question: what was the actual accuracy of answers inside each confidence band? Once you know that, lowering the threshold becomes an evidence-based decision rather than a hopeful one.

Run this loop monthly at first, then quarterly once it stabilizes. The grading step is the part teams want to skip, and it's the part that makes everything else work — without a human reading a sample of real answers, you have a confidence number that may or may not mean anything. Budget an hour a month for it. Two people grading the same sample is even better, because it surfaces where your own definition of 'correct' is fuzzy.

1. Pull a stratified sample of autonomous answers from the period: roughly 30 from the 90–95% band, 30 from the 80–90% band, and 20 from the 75–80% band if any reached autonomous resolution. Random within each band, not cherry-picked.
2. Grade each as correct, partially correct, or incorrect, and compute accuracy per band. If the 85–90% band is grading at 92% correct, you have evidence to lower the autonomous threshold toward 85%.
3. For the assisted band, measure the human edit rate: what share of drafts did human agents change before sending? A low rate (under 15%) means that band is nearly autonomous already. A high rate (over 40%) means fix the knowledge base before lowering anything.
4. Lower the autonomous threshold incrementally, 3 to 5 points at a time, then run another full cycle before dropping further. One good month is a data point, not a mandate.
5. Recalibrate every 90 days regardless. Knowledge base edits, policy changes, new products, and seasonal volume all shift accuracy inside each band, so a threshold that was right in March can be wrong by June.
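The loop above can be sketched in a few small functions. Everything here is an assumption made for illustration: the function names, the choice to count 'partial' grades as not correct, and the `required_accuracy` bar, which the guide leaves for you to set from your own tolerance. Only the 15% and 40% edit-rate cutoffs and the 3 to 5 point step come from the steps above.

```python
from collections import defaultdict

Band = tuple[float, float]  # (low, high) as fractions; low inclusive, high exclusive


def accuracy_by_band(graded: list[tuple[float, str]], bands: list[Band]) -> dict[Band, float]:
    """Share of answers graded 'correct' in each confidence band.

    graded: (confidence, grade) pairs; grade is 'correct', 'partial', or 'incorrect'.
    Assumption: 'partial' counts as not correct. Bands with no samples are omitted.
    """
    totals: dict[Band, int] = defaultdict(int)
    correct: dict[Band, int] = defaultdict(int)
    for confidence, grade in graded:
        for band in bands:
            low, high = band
            if low <= confidence < high:
                totals[band] += 1
                if grade == "correct":
                    correct[band] += 1
                break
    return {band: correct[band] / totals[band] for band in totals}


def edit_rate_signal(edited: int, drafts: int) -> str:
    """Read the assisted-band edit rate using the 15% / 40% cutoffs from step 3."""
    if drafts <= 0:
        raise ValueError("need at least one draft")
    rate = edited / drafts
    if rate < 0.15:
        return "nearly autonomous"
    if rate > 0.40:
        return "fix knowledge base first"
    return "keep in assisted"


def propose_next_threshold(current_pct: int, accuracy_below: float,
                           required_accuracy: float, step_pts: int = 3) -> int:
    """Lower the autonomous threshold by one 3-5 point step only when the band
    just below it graded at or above the bar you chose; otherwise hold."""
    if not 3 <= step_pts <= 5:
        raise ValueError("step should be 3 to 5 points")
    if accuracy_below >= required_accuracy:
        return current_pct - step_pts
    return current_pct
```

For example, if the band just below a 90% threshold graded at 92% correct and your bar is 90%, `propose_next_threshold(90, 0.92, required_accuracy=0.90, step_pts=5)` returns 85. Run another full cycle before calling it again.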


Per-category threshold tuning in months two and three

A single global threshold is the right call for your first month, but it leaves deflection on the table. Order-status questions backed by live tracking data are reliable at a much lower confidence than open-ended policy interpretation, so forcing both through the same cutoff means you either over-escalate the easy ones or over-automate the hard ones. By months two and three you'll have enough category-level data to split the difference.

The pattern below is typical, not prescriptive — your own accuracy sampling decides the final numbers. Categories grounded in structured, live data tolerate lower thresholds; categories that lean on prose documentation need a higher bar.

There's a practical reason to split by category beyond raw accuracy: it changes where you invest. When you see that returns are stuck at a 90% threshold because the docs are ambiguous, that's a clear, scoped fix — rewrite one policy page and you may unlock a 5-point drop and a meaningful jump in deflection for that category alone. A single global number averages those signals together and hides exactly which part of the knowledge base is holding you back.

Question category	Why accuracy varies	Typical autonomous threshold
Order status / WISMO	Data-driven; high accuracy when tracking is live	80–85%
Return eligibility	Policy-dependent; accurate when the policy is clearly documented	85–90%
Product questions	Catalog-dependent; varies with attribute completeness	80–90%
Shipping timelines	Carrier data plus policy; mostly reliable	82–88%
Promotions and discounts	Documentation-dependent and changes often	88–92%
Account and billing	Limited data access; lower confidence usually appropriate	88–90%
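Per-category overrides with a global fallback can be expressed as a simple lookup. This is a hedged sketch: the category keys and the function name are hypothetical, and each value is just the upper (more conservative) end of the typical range in the table above, chosen as a starting point. Your own accuracy sampling should replace these numbers.

```python
# Global cutoff used for any category without its own override.
GLOBAL_AUTONOMOUS = 0.90

# Illustrative overrides: the conservative end of each typical range above.
CATEGORY_AUTONOMOUS = {
    "order_status": 0.85,        # WISMO, backed by live tracking data
    "return_eligibility": 0.90,
    "product_questions": 0.90,
    "shipping_timelines": 0.88,
    "promotions": 0.92,          # documentation changes often
    "account_billing": 0.90,
}


def autonomous_threshold(category: str) -> float:
    """Return the category's autonomous cutoff, falling back to the global one."""
    return CATEGORY_AUTONOMOUS.get(category, GLOBAL_AUTONOMOUS)


def resolves_autonomously(category: str, confidence: float) -> bool:
    """True when the score meets the cutoff that applies to this category."""
    return confidence >= autonomous_threshold(category)
```

With these values an 86% order-status answer resolves autonomously, while an 86% promotions answer does not, which is the split a single global number cannot express.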

The integration-depth effect

Benchmarks consistently show the gap between mediocre and great deflection is mostly integration depth — bots wired into orders, catalog, and identity contain far more than standalone FAQ bots (often 70–90% versus 40–60%). That's why data-grounded categories like WISMO can safely run lower thresholds: the agent isn't guessing, it's reading.

Hard overrides: tickets that escalate no matter how confident the agent is

Some questions should never be resolved autonomously, even when the agent is 99% confident it knows the answer. Confidence measures whether the agent can answer — not whether it should. A confident answer to a chargeback threat or a product-safety complaint is exactly the kind of answer you want a human to own. These are topic-based escalation rules that sit above the confidence logic and win every time.

Configure your override list before you ever lower a threshold. It's what lets you deflect aggressively on routine volume without exposing yourself on the small set of tickets where a wrong or tone-deaf answer carries real cost.

Fraud, chargebacks, and disputed charges — financial and legal exposure; a human verifies identity and intent.
Legal language — anything mentioning lawyers, lawsuits, regulators, or formal complaints routes to a human immediately.
Safety and injury claims — product defects that caused harm, allergic reactions, choking hazards. Never autonomous, regardless of category.
High-value refunds above your merchant-set cap — let the agent handle small refunds within rules and escalate the big ones for approval.
Explicit human requests — when a customer asks for a person, honor it. Fighting the handoff tanks CSAT faster than any wrong answer.
Detected high frustration — repeated all-caps, profanity, or churn threats should escalate even on a question the agent could technically answer.

Metrics to watch while you tune

Threshold tuning is a balancing act between two numbers that move in opposite directions, so you have to watch them together. Lower the threshold and deflection rises while accuracy is at risk; raise it and accuracy is safe while deflection falls. Tracking only one of them is how teams talk themselves into a setting that looks good on a dashboard and feels bad to customers.

Industry framing helps set expectations: a deflection rate above 40% is generally considered good and above 80% is considered great, but those numbers are only meaningful alongside accuracy and CSAT for the same tickets. Watch the full set below, broken out by confidence band wherever you can.

The reopen rate deserves special attention because it's the metric that catches the failure deflection hides. A ticket the agent 'resolved' that the customer reopens two hours later was never resolved — it just left the queue and came back, usually annoyed. If you lower a threshold and deflection ticks up while reopens climb with it, you didn't gain anything; you moved work into the future and added friction. Treat a rising reopen rate as a hard signal to stop lowering and go fix the underlying docs.

Metric	What it tells you	Healthy direction
Autonomous deflection rate	Share of tickets the agent closed on its own	Rising — but only with accuracy holding
Answer accuracy by band	Whether confidence is predicting correctness	High and stable across bands you automate
Assisted edit rate	How often agents rewrite AI drafts	Falling — signals a band is ready to automate
CSAT by confidence band	Whether autonomous answers actually satisfy	Even across bands; flag any band that lags
Reopen / repeat-contact rate	Whether 'resolved' tickets actually stayed resolved	Low — spikes mean accuracy is worse than it looks
Escalation quality	Whether handoffs arrive with usable context	High — humans shouldn't restart from zero
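Watching deflection and reopens together can be reduced to a before/after comparison around each threshold change. This is a sketch under stated assumptions: `PeriodStats`, its field names, and the `tolerance` parameter are hypothetical, and the counts would come from whatever your helpdesk exports.

```python
from dataclasses import dataclass


@dataclass
class PeriodStats:
    tickets: int      # all tickets received in the period
    autonomous: int   # tickets the agent closed on its own
    reopened: int     # autonomous tickets the customer later reopened

    @property
    def deflection_rate(self) -> float:
        return self.autonomous / self.tickets if self.tickets else 0.0

    @property
    def reopen_rate(self) -> float:
        return self.reopened / self.autonomous if self.autonomous else 0.0


def should_stop_lowering(before: PeriodStats, after: PeriodStats,
                         tolerance: float = 0.0) -> bool:
    """True when the reopen rate rose by more than `tolerance` after a change,
    regardless of how much deflection improved."""
    return after.reopen_rate - before.reopen_rate > tolerance


# Hypothetical counts: deflection rose from 45% to 52%, but reopens rose too.
before = PeriodStats(tickets=1000, autonomous=450, reopened=18)
after = PeriodStats(tickets=1000, autonomous=520, reopened=36)
stop = should_stop_lowering(before, after)  # True
```

The check deliberately ignores the deflection gain: a higher headline number does not offset tickets that come back.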


Common calibration mistakes

The ways threshold tuning goes wrong are predictable, which means they're avoidable. These are the six that cost teams the most deflection, or the most trust.

Set once, never updated. The 90% threshold that made sense at launch may belong at 82% after three months of knowledge base work. Thresholds left static go stale as the underlying docs improve. Review quarterly.
Confusing confidence with accuracy. A high score means the agent is confident, not correct. The only way to validate calibration is to sample answers and grade them. Never trust the score on its own.
One global threshold for everything. WISMO with live data is reliable at a lower confidence than vague policy questions. Forcing both through the same cutoff wastes deflection on the easy half.
Lowering thresholds to hit a deflection target. If leadership mandates a number and the team drops the threshold to reach it without improving the knowledge base, accuracy quietly erodes. Deflection and accuracy are co-dependent — set targets for both.
Not measuring CSAT per band. If accuracy is fine but CSAT on autonomous tickets is low, the issue is tone or completeness, not the threshold. Break satisfaction out by band so you diagnose the real problem.
Skipping the assisted band entirely. Jumping from autonomous straight to escalation throws away the draft, and agents end up writing those tickets from scratch — slower for them, no faster for the customer.

How Bookbag handles confidence and resolution modes

Bookbag is built around the three-lane model this guide describes. Because it's an agent that takes real actions rather than a script that deflects, its confidence reflects what it can verify against live data — Shopify, WooCommerce, or BigCommerce orders, your catalog, and your help docs — not a guess from static text. That grounding is what keeps confidence and accuracy aligned, and it's why data-backed categories like order tracking and returns can run at lower thresholds safely.

You set global thresholds and per-category overrides in the agent's resolution settings, layer hard escalation rules on top for fraud, legal, safety, and high-value refunds, and read confidence-band distribution plus accuracy in the analytics dashboard to drive the calibration loop. Human handoff carries the full conversation and a summary, so escalations never start from zero. Pricing is flat — message credits, no per-resolution fee — so lowering a threshold to deflect more never produces a surprise bill the way a per-resolution model would.

If you're comparing approaches, a general-purpose builder like Chatbase can answer from your docs but doesn't connect to orders or take store actions, which caps how confidently it can resolve ecommerce tickets. Bookbag's edge is the live data behind the confidence score.


Key takeaways

Confidence thresholds are the dial between deflection rate and answer accuracy — calibrate them on your own data, not on a default.
Confidence is not accuracy. A confident wrong answer almost always traces back to the knowledge base, not the threshold.
Start conservative — autonomous at 90%, assisted 70–90%, escalate below 70% — then lower in 3–5 point steps as accuracy sampling supports it.
Move to per-category thresholds in months two and three: data-grounded WISMO can run lower than vague policy questions.
Keep hard overrides for fraud, legal, safety, and high-value refunds that escalate no matter how confident the agent is.
Never lower a threshold to hit a deflection target without first improving the docs — deflection and accuracy are co-dependent.

