The two escalation failure modes
Escalation goes wrong in two opposite directions. Under-escalation: the AI attempts to answer questions it shouldn't, gives wrong or insufficient answers, and erodes customer trust. Over-escalation: the AI routes too many questions to humans, deflection rates stay low, and the agent provides less value than it should.
Both failures are fixable, but they require different interventions. Under-escalation is a knowledge and confidence threshold problem. Over-escalation is usually a calibration problem — thresholds set too conservatively, or topic triggers that are too broad. The goal is calibrating between them: escalate exactly what needs a human, and nothing more.
Every escalation should be justified. The AI should be able to 'explain' why it escalated — low confidence on a specific question, a topic that's flagged as human-only, an emotional signal in the conversation. If you can't articulate why an escalation happened, the trigger is misconfigured.
The four escalation trigger categories
Escalation triggers fall into four categories. Configure all four — they catch different types of situations and work together to provide complete coverage.
1. Confidence-based triggers: the AI isn't confident enough in its answer. These are the most important triggers and should be configured as a threshold (e.g., escalate below 75% confidence).
2. Topic-based triggers: certain topics always require human handling regardless of confidence. These are configured as explicit rules: 'if the conversation involves [topic], escalate.'
3. Behavioral triggers: the customer signals they need a human through their language or behavior: explicit request, emotional escalation, or repeated attempts at the same question.
4. Time-based triggers: if a conversation has gone on too long without resolution (e.g., > 5 back-and-forths), escalate rather than continuing to loop.
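As a rough illustration of how the four categories can work together while keeping every escalation explainable, here is a minimal Python sketch. The `Conversation` fields, the topic labels, and the check order are assumptions made for this example, not a description of any particular product's configuration; the 75% and 5-exchange values are the example numbers from the list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Conversation:
    """Hypothetical snapshot of a conversation; all field names are illustrative."""
    confidence: float       # model confidence in the current answer, 0.0 to 1.0
    topic: str              # topic label assigned upstream
    asked_for_human: bool   # customer explicitly requested a person
    exchanges: int          # customer/agent back-and-forths so far


HUMAN_ONLY_TOPICS = {"fraud", "legal", "safety"}  # example policy list
CONFIDENCE_FLOOR = 0.75  # example threshold from the text
MAX_EXCHANGES = 5        # example limit from the text


def escalation_reason(conv: Conversation) -> Optional[str]:
    """Return a human-readable reason to escalate, or None to let the AI continue."""
    # Topic-based: policy rules apply regardless of confidence.
    if conv.topic in HUMAN_ONLY_TOPICS:
        return f"topic '{conv.topic}' is flagged human-only"
    # Behavioral: an explicit request always wins.
    if conv.asked_for_human:
        return "customer explicitly asked for a human"
    # Confidence-based: below the floor, do not answer autonomously.
    if conv.confidence < CONFIDENCE_FLOOR:
        return f"confidence {conv.confidence:.0%} is below {CONFIDENCE_FLOOR:.0%}"
    # Time-based: stop looping once the conversation runs too long.
    if conv.exchanges > MAX_EXCHANGES:
        return f"{conv.exchanges} exchanges without resolution"
    return None
```

Because the function returns the reason as a string, each escalation can be logged with its justification, which makes misconfigured triggers easier to spot later.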
Confidence-based escalation: setting the right threshold
The thresholds in the table below are starting points. After your first 30 days of data, review how often each band fires and how accurate the answers in each band are. Most stores find they can lower the autonomous threshold slightly after calibration, because the agent is more accurate in practice than the conservative initial settings assume.
| Confidence level | Recommended action | Why |
|---|---|---|
| Above 90% | Answer autonomously | High confidence = reliable answer |
| 75–90% | Answer + offer to connect a human | Good but not certain — give customer the option |
| 60–75% | Draft answer for human review (assisted mode) | Useful starting point but needs verification |
| Below 60% | Escalate immediately | Guessing at this confidence level causes more harm than help |
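The bands in the table can be expressed as a small routing function. This is a sketch only: the action names are made up for illustration, and the handling of exact boundary values (90% falls in the "offer a human" band, 75% and 60% fall in the higher of their two bands) is an assumption, since the table does not say which side the boundaries belong to.

```python
def action_for_confidence(confidence: float) -> str:
    """Map a confidence score (0.0 to 1.0) to one of the four actions in the table."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence > 0.90:
        return "answer"                   # above 90%: answer autonomously
    if confidence >= 0.75:
        return "answer_and_offer_human"   # 75-90%: answer, offer a human
    if confidence >= 0.60:
        return "draft_for_review"         # 60-75%: assisted mode
    return "escalate"                     # below 60%: escalate immediately
```

Keeping the cutoffs in one place like this makes it straightforward to adjust them once the 30-day accuracy data is in.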
Topic-based triggers: questions that always escalate
Some topics should always route to a human, regardless of how confident the AI is. This is a policy decision, not a technical one. Here are the standard always-escalate topics for ecommerce:
- Fraud and payment disputes — 'I didn't authorize this charge,' 'my card was used without my permission.' These require investigation and often a formal dispute process.
- Legal language — any mention of lawsuit, attorney, court, or regulatory body. Don't let AI respond to legal threats.
- Safety concerns — if a product caused injury or a safety-related incident, a human must handle it and it must be documented.
- Wholesale and bulk inquiries — these are sales opportunities, not support tickets. Route to sales.
- Press and media inquiries — 'I'm writing an article about your brand.' Route to marketing or PR.
- High-value refund requests — set a dollar threshold (e.g., orders over $500). The financial exposure warrants human review.
- Suspicious return patterns — if a customer has made multiple returns recently, flag for human review before processing.
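One way to picture these rules is as a lookup that runs before any confidence check. The sketch below uses simple keyword patterns, which is an assumption for illustration; a production system would more likely rely on a topic classifier. The pattern lists, the `topic_trigger` name, and the return count of 3 are hypothetical; the $500 limit is the example value from the list above.

```python
import re
from typing import Optional

# Illustrative keyword patterns; real deployments would need broader coverage.
TOPIC_PATTERNS = {
    "fraud": re.compile(r"\b(?:unauthorized|did(?:n't| not) authorize|without my permission)\b", re.I),
    "legal": re.compile(r"\b(?:lawsuit|attorney|court|regulatory)\b", re.I),
    "safety": re.compile(r"\b(?:injury|injured|unsafe)\b", re.I),
    "wholesale": re.compile(r"\b(?:wholesale|bulk order)\b", re.I),
    "press": re.compile(r"\b(?:journalist|writing an article)\b", re.I),
}

HIGH_VALUE_REFUND = 500.0     # example dollar threshold from the text
SUSPICIOUS_RETURN_COUNT = 3   # assumed value; the text only says "multiple"


def topic_trigger(message: str, refund_amount: float = 0.0,
                  recent_returns: int = 0) -> Optional[str]:
    """Return the always-escalate topic that applies, or None if none does."""
    for topic, pattern in TOPIC_PATTERNS.items():
        if pattern.search(message):
            return topic
    if refund_amount > HIGH_VALUE_REFUND:
        return "high_value_refund"
    if recent_returns >= SUSPICIOUS_RETURN_COUNT:
        return "suspicious_returns"
    return None
```

The returned topic name can also drive routing, for example sending "wholesale" to sales and "press" to marketing or PR.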
Behavioral triggers: reading the conversation
Some escalation triggers aren't about what the customer is asking — they're about how they're communicating. Configure these behavioral signals as triggers:
- Explicit request for a human — 'talk to a person,' 'get me a human,' 'I want to speak to someone.' Always trigger immediately; never ask for confirmation before escalating.
- Expressed frustration or anger — phrases like 'this is ridiculous,' 'I've been waiting,' 'worst experience,' 'I'm so frustrated.' The AI should acknowledge and escalate, not try to resolve through the frustration.
- Capitalization and punctuation intensity — ALL CAPS messages and excessive punctuation (!!!) are high-signal indicators of emotional state. Not a perfect signal, but a useful one.
- Repeated question attempts — if a customer has asked the same question 3 or more times and hasn't received a satisfying answer, stop trying and escalate. Continued looping makes the experience worse.
- Contradictory information — if the AI's answer contradicts something the customer says they were told previously, don't dig in; escalate for a human to reconcile the discrepancy.
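Several of these signals can be approximated with simple text checks. The following sketch is deliberately crude and rests on stated assumptions: the phrase lists come from the examples above, repeated questions are detected by exact match after normalization (real systems would need semantic similarity), and the 8-letter minimum for the all-caps check is an arbitrary guard against short messages like "OK". The contradictory-information signal is not covered because it needs conversation history that this example does not model.

```python
import re
from typing import List

HUMAN_REQUEST = re.compile(
    r"\b(?:talk to a person|get me a human|speak to someone)\b", re.I)
FRUSTRATION = re.compile(
    r"\b(?:this is ridiculous|i've been waiting|worst experience|so frustrated)\b", re.I)


def _normalize(text: str) -> str:
    """Lowercase and strip punctuation so near-identical repeats compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def behavioral_signals(messages: List[str]) -> List[str]:
    """Inspect customer messages (oldest first) and list the signals that fired."""
    if not messages:
        return []
    latest = messages[-1]
    signals = []
    if HUMAN_REQUEST.search(latest):
        signals.append("explicit_human_request")
    if FRUSTRATION.search(latest):
        signals.append("frustration")
    letters = [c for c in latest if c.isalpha()]
    if len(letters) >= 8 and all(c.isupper() for c in letters):
        signals.append("all_caps")
    if re.search(r"[!?]{3,}", latest):
        signals.append("intense_punctuation")
    # Count how many messages, including the latest, ask the same thing.
    repeats = sum(1 for m in messages if _normalize(m) == _normalize(latest))
    if repeats >= 3:
        signals.append("repeated_question")
    return signals
```

Returning a list rather than a single boolean lets you weight the signals differently, for example escalating immediately on `explicit_human_request` while treating `all_caps` as supporting evidence only.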
Tuning your escalation rules over time
Bookbag's escalation analytics show you exactly which rules fired, how often, and how the escalated tickets were resolved. Use that data to tune — don't tune by intuition alone.
1. Topics that escalate frequently but could be handled by AI: if 'promo code' is escalating 50 times per week and a human is answering the same way every time, that's an AI-resolvable question that needs better documentation, not a topic that requires a human. Remove it from the always-escalate list and add knowledge coverage.
2. Topics the AI resolves but shouldn't: if your accuracy audit finds that a particular topic has low accuracy (e.g., 70% correct on exchange policy), tighten the confidence threshold for that topic specifically, or add it to the always-review list until the knowledge is improved.
3. Threshold drift: confidence thresholds that made sense at launch may be too conservative six months later once the agent has better knowledge coverage. Review them with actual accuracy data, not assumptions.
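The accuracy-audit check in the second item can be sketched in a few lines of Python. The input format here, a list of `(topic, was_correct)` pairs, is hypothetical and is not meant to represent any analytics export; the 90% target and the 10-sample minimum are likewise assumed values you would choose yourself.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def topic_stats(audits: Iterable[Tuple[str, bool]]) -> Dict[str, Tuple[int, int]]:
    """Tally (correct, total) audited answers per topic."""
    stats = defaultdict(lambda: [0, 0])
    for topic, was_correct in audits:
        stats[topic][1] += 1
        if was_correct:
            stats[topic][0] += 1
    return {topic: (c, n) for topic, (c, n) in stats.items()}


def topics_needing_review(audits: Iterable[Tuple[str, bool]],
                          min_accuracy: float = 0.90,
                          min_samples: int = 10) -> List[str]:
    """List topics whose audited accuracy falls below the target.

    Topics with fewer than min_samples audits are skipped, since a handful
    of answers is too little evidence to justify changing a rule.
    """
    flagged = []
    for topic, (correct, total) in topic_stats(audits).items():
        if total >= min_samples and correct / total < min_accuracy:
            flagged.append(topic)
    return sorted(flagged)
```

With ten audited exchange-policy answers of which seven were correct, `topics_needing_review` would flag `exchange_policy`, matching the 70% example above.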
Key takeaways
- Escalation fails in two directions: too little (AI answers things it shouldn't) and too much (AI escalates things it could handle). Both are fixable, but they need different interventions.
- Configure all four trigger types: confidence-based, topic-based, behavioral, and time-based.
- Start with conservative confidence thresholds (escalate below 75%) and lower them using real accuracy data after 30 days.
- Some topics always require humans: fraud, legal language, safety, high-value refunds, suspicious return patterns.
- Review escalation rules monthly in the first quarter; tune based on data, not intuition.