Evidence-Based AI Evaluation: From Messages to Decisions
What Is an Evidence Payload?
An evidence payload is the structured data package submitted for AI decision evaluation. Unlike message QA — where you evaluate the text of a message — decision auditing evaluates the decision against the evidence that should support it. The payload captures everything needed to determine whether the AI got it right.
Every evidence payload includes six components:
- `evidence` — The factual inputs the AI used to make its decision
- `policy_context` — The regulations and rules the decision must comply with
- `ai_generated_content` — The actual decision the AI produced
- `model_trace` — The reasoning chain the AI followed
- `model_metadata` — Model version, confidence, last validation
- `redacted_fields` — Sensitive data that's masked for privacy
Here's what a sample evidence payload looks like for a government benefits determination:
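A minimal sketch of such a payload, expressed as a Python dict. Field names follow the six components above; the program name, amounts, and document names are illustrative, not a fixed schema:

```python
# Illustrative evidence payload for a benefits determination.
# All values are hypothetical examples.
evidence_payload = {
    "evidence": {
        "applicant_monthly_income": 2450.00,
        "household_size": 3,
        "documents": ["pay_stub.pdf", "lease_agreement.pdf"],
    },
    "policy_context": {
        "program": "nutrition-assistance",   # hypothetical program label
        "jurisdiction": "state-level",
        "income_threshold_monthly": 2694.00,
    },
    "ai_generated_content": {
        "determination": "approve",
        "benefit_amount_monthly": 512.00,
    },
    "model_trace": [
        "extracted income from pay stub",
        "compared income to household-size threshold",
        "computed benefit amount",
    ],
    "model_metadata": {
        "model_version": "2024.05.1",
        "confidence": 0.93,
        "last_validation": "2024-05-01",
    },
    "redacted_fields": ["ssn", "bank_account_number"],
}
```

Note that the decision (`ai_generated_content`) travels with the evidence and rules it must be judged against, so an auditor never evaluates output text in isolation.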
Why Evidence Matters More Than Output
The fundamental difference between message QA and decision auditing comes down to one question. Message QA asks "is this text good?" Decision auditing asks "is this decision supported by this evidence under these rules?"
This distinction matters because:
- The same output text could be right or wrong depending on the evidence — a denial is correct when income exceeds the threshold, but wrong when it doesn't
- Policy context changes the evaluation — what's compliant in one jurisdiction may not be in another
- Model trace reveals whether the AI followed the right reasoning chain, not just whether it arrived at the right answer
- Without evidence, you can't distinguish a correct decision from a lucky one
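The first bullet can be made concrete with a tiny check. This is a hedged sketch of the idea, not a production rule engine; the function name and parameters are hypothetical:

```python
# Is a given decision supported by this evidence under this policy?
# Mirrors the income-threshold example above.
def is_decision_supported(decision: str, monthly_income: float,
                          income_threshold: float) -> bool:
    """A denial is supported only when income exceeds the threshold;
    an approval only when it does not."""
    if decision == "deny":
        return monthly_income > income_threshold
    if decision == "approve":
        return monthly_income <= income_threshold
    return False  # unknown decision type: cannot be supported

# The same output text ("deny") is right or wrong depending on evidence:
print(is_decision_supported("deny", monthly_income=3100.0, income_threshold=2694.0))  # True
print(is_decision_supported("deny", monthly_income=2200.0, income_threshold=2694.0))  # False
```

Message QA would score both "deny" outputs identically; decision auditing only passes the first.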
The Taxonomy: Industry-Specific, Structurally Consistent
The evaluation taxonomy has three components that work together across every industry. The categories are specific to the domain, but the structure stays the same — so you can compare accuracy and risk across verticals.
1. Failure Categories — What went wrong
Industry-specific failure types that describe the nature of the error.
2. Business Impact — What's at stake
The downstream consequence of the failure: regulatory action, litigation, patient harm, wrongful denial, etc.
3. Evidence Sufficiency — Can a determination be made?
Whether the evidence in the payload is complete enough to support a decision at all.
The same three-part structure maps onto domain-specific categories across verticals, including government benefits, lending, and healthcare.
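The shared structure can be sketched in code. The three verdict fields are from the taxonomy above; the per-industry failure category names here are hypothetical examples, not a canonical list:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, illustrative failure categories per industry.
FAILURE_CATEGORIES = {
    "government_benefits": {"wrongful_denial", "miscalculated_benefit"},
    "lending": {"fair_lending_violation", "misread_credit_file"},
    "healthcare": {"missed_contraindication", "coverage_misclassification"},
}

@dataclass
class TaxonomyVerdict:
    industry: str
    failure_category: Optional[str]  # None when the decision was correct
    business_impact: str             # e.g. "regulatory action", "patient harm"
    evidence_sufficient: bool        # can a determination be made at all?

    def __post_init__(self):
        # Categories are industry-specific; the structure is shared.
        if self.failure_category is not None:
            assert self.failure_category in FAILURE_CATEGORIES[self.industry]
```

Because every vertical fills the same three fields, accuracy and risk roll up comparably across industries even though the category vocabularies differ.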
How a Verdict Works
Let's walk through a real example. An insurance company submits a homeowner's claim for water damage. The AI system classified the damage as "flood" (excluded from the policy) instead of "plumbing failure" (covered under the policy).
The evidence payload includes the adjuster's report, which clearly documents a burst pipe in the upstairs bathroom. No external water source. No weather event. The AI misclassified the cause of loss.
The verdict catches the error, flags the incorrect classification, and provides the corrected determination.
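A hedged sketch of what that verdict might look like for the water-damage claim. Field names and values are illustrative, not a fixed schema:

```python
# Hypothetical verdict for the misclassified water-damage claim above.
verdict = {
    "decision_correct": False,
    "failure_category": "cause_of_loss_misclassification",
    "original_determination": {"cause_of_loss": "flood", "claim": "excluded"},
    "corrected_determination": {"cause_of_loss": "plumbing_failure", "claim": "covered"},
    "evidence_cited": [
        "adjuster report: burst pipe in upstairs bathroom",
        "no external water source",
        "no weather event",
    ],
    "policy_citation": "covered peril: sudden and accidental discharge of water",
    "business_impact": "wrongful denial",
    "reviewer_rationale": (
        "Adjuster report documents an internal plumbing failure; "
        "the flood exclusion does not apply."
    ),
}
```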
The verdict identifies the failure, corrects the decision, cites the policy provision, and creates a full audit trail with the reviewer's rationale. The correction also becomes a training example so the model learns to distinguish plumbing failures from flood events.
From Verdicts to Training Data
Every verdict — whether the decision was correct or incorrect — becomes training data. This is what closes the loop between evaluation and improvement.
- Correct decisions validate the model — they become positive training examples that reinforce the right reasoning patterns
- Incorrect decisions + corrections become SFT pairs — the original (wrong) and corrected (right) versions paired together for supervised fine-tuning
- Pattern analysis across verdicts reveals systematic model weaknesses — if the model consistently misclassifies water damage, that's a targeted training priority
- Training data is industry-specific and evidence-grounded — not generic text, but real decisions with real evidence and real policy context
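The loop-closing step can be sketched as a small transform. This is a minimal illustration of the pairing logic, assuming the hypothetical payload and verdict shapes used earlier; the function name is not from any particular library:

```python
# Turn a payload + verdict into a supervised fine-tuning example.
# The prompt carries the evidence and policy context, so the model
# learns the decision boundary, not just output text.
def verdict_to_sft_example(payload: dict, verdict: dict) -> dict:
    prompt = {
        "evidence": payload["evidence"],
        "policy_context": payload["policy_context"],
    }
    if verdict["decision_correct"]:
        # Correct decisions become positive examples as-is.
        target = payload["ai_generated_content"]
    else:
        # Incorrect decisions are paired with the reviewer's correction.
        target = verdict["corrected_determination"]
    return {"prompt": prompt, "completion": target}
```

Run over a corpus of verdicts, the same transform yields both the positive examples (validations) and the SFT pairs (corrections) described above.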
Key Takeaways
1. Evidence payloads include the decision + evidence + policy context + model trace
2. Evaluation is against evidence and policy, not just output quality
3. The taxonomy adapts to each industry while maintaining structural consistency
4. Every verdict produces a compliance-ready audit trail
5. Corrections become training data that improves the AI