Score every output. Taxonomy-driven, staged, auditor-backed.
Every tool call and LLM output runs through your QA project's taxonomy. The staged AI auditor scores it — fast by default, deeper on edge cases. Every verdict becomes training data with cryptographic provenance.
What Evaluation ships
Scoring, routing, human review, export, and a red-team harness — one product.
Taxonomy editor
Define your scoring dimensions per project. Hallucination, tone, policy compliance, factual accuracy, brand voice — whatever you grade on.
Staged AI auditor
Three model stages per review: fast / standard / deep. Pick depth per project. Costs scale with depth, not volume.
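One way to picture the staged flow (a sketch only, not the product's internals — the stage names fast/standard/deep come from above; the `audit` helper and confidence threshold are hypothetical):

```python
# Sketch of staged escalation: try the fast stage first, move to a
# deeper stage when the current one isn't confident enough.
# `audit` is a placeholder, not the real auditor API.
STAGES = ["fast", "standard", "deep"]

def audit(stage: str, text: str) -> tuple[dict, float]:
    """Placeholder scorer: returns (scores, confidence) for one stage."""
    # A real auditor would call the stage's model here.
    base = {"fast": 0.6, "standard": 0.8, "deep": 0.99}[stage]
    return {"tone": 0.9}, base

def staged_review(text: str, min_confidence: float = 0.75) -> tuple[str, dict]:
    """Walk the stages until one is confident enough, else use the deepest."""
    for stage in STAGES:
        scores, confidence = audit(stage, text)
        if confidence >= min_confidence:
            return stage, scores
    return STAGES[-1], scores
```

Because costs scale with depth, most outputs stop at the fast stage and only edge cases pay for the deep one.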
Review modes
Automated: AI writes the verdict. Assisted: AI drafts, human approves. Human: every output reviewed. Pick per project.
Gate integration
Automated projects run synchronously — the taxonomy verdict merges with the Guardrails verdict, severity-aware (block > hold > flag > allow).
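Severity-aware merging means the stricter decision always wins. A minimal sketch of that rule, using the ordering above (the function name is illustrative, not the SDK's):

```python
# Severity ordering from the copy above: block > hold > flag > allow.
SEVERITY = {"block": 3, "hold": 2, "flag": 1, "allow": 0}

def merge_decisions(guardrails: str, taxonomy: str) -> str:
    """Return whichever of the two decisions is more severe."""
    return max(guardrails, taxonomy, key=SEVERITY.__getitem__)
```

So a Guardrails `flag` plus a taxonomy `hold` resolves to `hold`.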
Training export
Every annotation is a training sample. SFT / DPO / ranking formats. Cryptographic provenance on every row.
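To make the idea concrete, here is a sketch of what an SFT export row with provenance might look like — the field names and hashing scheme are assumptions, not the real export format:

```python
# Sketch of one SFT training row: the annotation payload plus a
# content hash that makes the row tamper-evident.
# Field names and the SHA-256 scheme are illustrative assumptions.
import hashlib
import json

def sft_row(prompt: str, completion: str, verdict: str) -> dict:
    payload = {"prompt": prompt, "completion": completion, "verdict": verdict}
    digest = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return {**payload, "provenance_sha256": digest}
```

Any edit to the row changes the hash, which is what lets provenance be verified downstream.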
AI quality insights
Post-approval jobs surface systemic failure patterns — drift, prompt-sensitivity, taxonomy gaps.
From one call to a taxonomy verdict
The SDK sends the output. The auditor runs. The verdict comes back. Async or sync — you pick per project.
# Score the LLM output — synchronous when the QA project is automated.
result = client.agent.output(
run_uid=run["run_uid"],
text="I've processed your refund for $500.00...",
context={"order_id": "FF-4210", "channel": "support"},
)
# → {'decision': 'hold', 'verdict': 'needs_fix',
# 'flags': ['policy_violation.refund_amount_exceeds_threshold'],
# 'scores': {'tone': 0.92, 'accuracy': 0.95,
#            'policy_compliance': 0.41}}

CI-ready eval harness
Pre-built suites. Regression baselines. GitHub Action. Fail the build when your agent regresses.
Pre-built suites
jailbreaks-v1, pii-leak-v1, policy-coverage-v1. Imported into your org with one CLI call.
Regression baselines
Every run can be marked as a baseline. Future runs diff against it — new fails = build fails.
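The diff rule is simple: the build fails only when a run introduces failures the baseline did not have. A sketch of that logic (function names are illustrative):

```python
# Baseline diffing: compare a run's failing case IDs against the
# baseline's. Pre-existing failures don't break the build; new ones do.
def new_failures(baseline_fails: set[str], run_fails: set[str]) -> set[str]:
    """Cases failing now that were passing at the baseline."""
    return run_fails - baseline_fails

def build_passes(baseline_fails: set[str], run_fails: set[str]) -> bool:
    return not new_failures(baseline_fails, run_fails)
```

This is why a run with known, already-baselined failures still goes green — only regressions are fatal.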
GitHub Action
bookbaghq/bookbag-eval-action@v1 in your workflow. Min-pass threshold, suite selector, fails CI on regression.
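A workflow step might look like this — the action name comes from above, but the input names `suites` and `min_pass` are guesses at the "suite selector" and "min-pass threshold"; check the action's README for the real inputs:

```yaml
# .github/workflows/eval.yml — illustrative only
name: eval
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: bookbaghq/bookbag-eval-action@v1
        with:
          suites: jailbreaks-v1,pii-leak-v1   # suite selector (assumed input name)
          min_pass: "0.95"                    # min-pass threshold (assumed input name)
```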
Evaluation FAQs
Stop shipping AI without a score. Every output goes through the auditor.
Join the teams shipping safer AI with real-time evaluation, audit trails, and continuous improvement.