Bookbag
Evaluation

Score every output. Taxonomy-driven, staged, auditor-backed.

Every tool call and LLM output runs through your QA project's taxonomy. The staged AI auditor scores it — fast by default, deeper on edge cases. Every verdict becomes training data with cryptographic provenance.

What Evaluation ships

Scoring, routing, human review, export, and a red-team harness — one product.

Taxonomy editor

Define your scoring dimensions per project. Hallucination, tone, policy compliance, factual accuracy, brand voice — whatever you grade on.
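A taxonomy can be thought of as plain data: named dimensions with weights and pass thresholds. The dimension names, weights, and field names below are illustrative assumptions, not the exact Bookbag schema:

```python
# Illustrative taxonomy for a support-agent QA project.
# Dimension names, weights, and thresholds are examples only.
taxonomy = {
    "project": "support-agent-qa",
    "dimensions": [
        {"name": "hallucination", "weight": 0.30, "fail_below": 0.80},
        {"name": "tone", "weight": 0.20, "fail_below": 0.70},
        {"name": "policy_compliance", "weight": 0.35, "fail_below": 0.90},
        {"name": "factual_accuracy", "weight": 0.15, "fail_below": 0.85},
    ],
}

# Weights sum to 1 so the overall score stays on a 0-1 scale.
assert abs(sum(d["weight"] for d in taxonomy["dimensions"]) - 1.0) < 1e-9
```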

Staged AI auditor

Three model stages per review: fast / standard / deep. Pick depth per project. Costs scale with depth, not volume.
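The escalation logic behind "fast by default, deeper on edge cases" can be sketched as a loop that only moves to a costlier stage when the cheaper one is unsure. The stage names match the product copy; the `score_fn` signature and confidence floor are illustrative assumptions:

```python
# Staged escalation sketch: score with the fast stage first, escalate
# only while confidence stays below the floor. Not the Bookbag API.
STAGES = ("fast", "standard", "deep")

def staged_review(text, score_fn, confidence_floor=0.8):
    """score_fn(stage, text) -> (verdict, confidence). Escalate while unsure."""
    for stage in STAGES:
        verdict, confidence = score_fn(stage, text)
        if confidence >= confidence_floor or stage == STAGES[-1]:
            return {"stage": stage, "verdict": verdict, "confidence": confidence}
```

A confident fast-stage score never touches the deeper models, which is why cost tracks depth rather than volume.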

Review modes

Automated: AI writes the verdict. Assisted: AI drafts, human approves. Human: every output reviewed. Pick per project.

Gate integration

Automated projects run synchronously: the taxonomy verdict merges with the Guardrails verdict by severity, and the stricter decision wins (block > hold > flag > allow).
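The severity-aware merge reduces to picking the stricter of the two decisions. A minimal sketch of that rule, with the ordering from the text:

```python
# Severity-aware merge: the stricter of the Guardrails decision and the
# taxonomy decision wins (block > hold > flag > allow).
SEVERITY = {"allow": 0, "flag": 1, "hold": 2, "block": 3}

def merge_decisions(guardrails_decision, taxonomy_decision):
    return max(guardrails_decision, taxonomy_decision, key=SEVERITY.__getitem__)
```

For example, `merge_decisions("flag", "hold")` returns `"hold"`: a Guardrails flag never downgrades a taxonomy hold.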

Training export

Every annotation is a training sample. SFT / DPO / ranking formats. Cryptographic provenance on every row.
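One way an annotation becomes a DPO pair: the reviewer's correction is the preferred response, the original output is the rejected one, and the provenance hash rides along. The field names here are illustrative assumptions, not the exact export schema:

```python
# Illustrative DPO row built from one corrected annotation.
# Field names are examples, not the exact Bookbag export schema.
def to_dpo_row(annotation):
    """A reviewed output plus its human correction becomes one DPO pair."""
    return {
        "prompt": annotation["input"],
        "chosen": annotation["correction"],   # reviewer-approved text
        "rejected": annotation["output"],     # original model output
        "provenance": annotation["provenance_hash"],
    }
```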

AI quality insights

Post-approval jobs surface systemic failure patterns — drift, prompt-sensitivity, taxonomy gaps.

From one call to a taxonomy verdict

The SDK sends the output. The auditor runs. The verdict comes back. Async or sync — you pick per project.

eval.py
# Score the LLM output — synchronous when the QA project is automated.
result = client.agent.output(
    run_uid=run["run_uid"],
    text="I've processed your refund for $500.00...",
    context={"order_id": "FF-4210", "channel": "support"},
)
# → {'decision': 'hold', 'verdict': 'needs_fix',
#    'flags': ['policy_violation.refund_amount_exceeds_threshold'],
#    'scores': {'tone': 0.92, 'accuracy': 0.95,
#               'policy_compliance': 0.41}}

CI-ready eval harness

Pre-built suites. Regression baselines. GitHub Action. Fail the build when your agent regresses.

Pre-built suites

jailbreaks-v1, pii-leak-v1, policy-coverage-v1. Imported into your org with one CLI call.

Regression baselines

Every run can be marked as a baseline. Future runs diff against it — new fails = build fails.
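"New fails = build fails" is a set difference: only failures absent from the baseline break the build. A minimal sketch of that diff:

```python
# Regression-baseline sketch: a run fails the build only when it
# introduces failures that were not already failing at baseline.
def new_failures(baseline_failed, current_failed):
    """Test ids failing now that were passing (or absent) at baseline."""
    return sorted(set(current_failed) - set(baseline_failed))

def build_passes(baseline_failed, current_failed):
    return not new_failures(baseline_failed, current_failed)
```

Known-bad tests carried over from the baseline never re-fail the build; only regressions do.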

GitHub Action

bookbaghq/bookbag-eval-action@v1 in your workflow. Min-pass threshold, suite selector, fails CI on regression.

Evaluation FAQs

Frequently Asked Questions

Stop shipping AI without a score. Every output goes through the auditor.

Join the teams shipping safer AI with real-time evaluation, audit trails, and continuous improvement.