Bookbag
Evaluation

Score every output. Taxonomy-driven, staged, auditor-backed.

Every tool call and LLM output runs through your QA project's taxonomy. The staged AI auditor scores it — fast by default, deeper on edge cases. Every verdict becomes training data with cryptographic provenance.

What Evaluation ships

Scoring, routing, human review, export, and a red-team harness — one product.

Taxonomy editor

Define your scoring dimensions per project. Hallucination, tone, policy compliance, factual accuracy, brand voice — whatever you grade on.
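A taxonomy can be thought of as plain data: named dimensions with weights and pass thresholds. The dimension names, weights, and field names below are illustrative assumptions, not the exact Bookbag schema:

```python
# Illustrative taxonomy for a support-agent QA project.
# Dimension names, weights, and thresholds are examples only.
taxonomy = {
    "project": "support-agent-qa",
    "dimensions": [
        {"name": "hallucination", "weight": 0.30, "fail_below": 0.80},
        {"name": "tone", "weight": 0.20, "fail_below": 0.70},
        {"name": "policy_compliance", "weight": 0.35, "fail_below": 0.90},
        {"name": "factual_accuracy", "weight": 0.15, "fail_below": 0.85},
    ],
}

# Weights sum to 1 so the overall score stays on a 0-1 scale.
assert abs(sum(d["weight"] for d in taxonomy["dimensions"]) - 1.0) < 1e-9
```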

Staged AI auditor

Three model stages per review: fast / standard / deep. Pick depth per project. Costs scale with depth, not volume.
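The escalation logic behind "fast by default, deeper on edge cases" can be sketched as a loop that only moves to a costlier stage when the cheaper one is unsure. The stage names match the product copy; the `score_fn` signature and confidence floor are illustrative assumptions:

```python
# Staged escalation sketch: score with the fast stage first, escalate
# only while confidence stays below the floor. Not the Bookbag API.
STAGES = ("fast", "standard", "deep")

def staged_review(text, score_fn, confidence_floor=0.8):
    """score_fn(stage, text) -> (verdict, confidence). Escalate while unsure."""
    for stage in STAGES:
        verdict, confidence = score_fn(stage, text)
        if confidence >= confidence_floor or stage == STAGES[-1]:
            return {"stage": stage, "verdict": verdict, "confidence": confidence}
```

A confident fast-stage score never touches the deeper models, which is why cost tracks depth rather than volume.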

Review modes

Automated: AI writes the verdict. Assisted: AI drafts, human approves. Human: every output reviewed. Pick per project.

Gate integration

Automated projects run synchronously: the taxonomy verdict merges with the Guardrails verdict by severity, and the stricter decision wins (block > hold > flag > allow).
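The severity-aware merge reduces to picking the stricter of the two decisions. A minimal sketch of that rule, with the ordering from the text:

```python
# Severity-aware merge: the stricter of the Guardrails decision and the
# taxonomy decision wins (block > hold > flag > allow).
SEVERITY = {"allow": 0, "flag": 1, "hold": 2, "block": 3}

def merge_decisions(guardrails_decision, taxonomy_decision):
    return max(guardrails_decision, taxonomy_decision, key=SEVERITY.__getitem__)
```

For example, `merge_decisions("flag", "hold")` returns `"hold"`: a Guardrails flag never downgrades a taxonomy hold.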

Training export

Every annotation is a training sample. SFT / DPO / ranking formats. Cryptographic provenance on every row.
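One way an annotation becomes a DPO pair: the reviewer's correction is the preferred response, the original output is the rejected one, and the provenance hash rides along. The field names here are illustrative assumptions, not the exact export schema:

```python
# Illustrative DPO row built from one corrected annotation.
# Field names are examples, not the exact Bookbag export schema.
def to_dpo_row(annotation):
    """A reviewed output plus its human correction becomes one DPO pair."""
    return {
        "prompt": annotation["input"],
        "chosen": annotation["correction"],   # reviewer-approved text
        "rejected": annotation["output"],     # original model output
        "provenance": annotation["provenance_hash"],
    }
```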

AI quality insights

Post-approval jobs surface systemic failure patterns — drift, prompt-sensitivity, taxonomy gaps.

From one call to a taxonomy verdict

The SDK sends the output. The auditor runs. The verdict comes back. Async or sync — you pick per project.

eval.py
# Score the LLM output — synchronous when the QA project is automated.
result = client.agent.output(
    run_uid=run["run_uid"],
    text="I've processed your refund for $500.00...",
    context={"order_id": "FF-4210", "channel": "support"},
)
# → {'decision': 'hold', 'verdict': 'needs_fix',
#    'flags': ['policy_violation.refund_amount_exceeds_threshold'],
#    'scores': {'tone': 0.92, 'accuracy': 0.95,
#               'policy_compliance': 0.41}}

CI-ready eval harness

Pre-built suites. Regression baselines. GitHub Action. Fail the build when your agent regresses.

Pre-built suites

jailbreaks-v1, pii-leak-v1, policy-coverage-v1. Imported into your org with one CLI call.

Regression baselines

Every run can be marked as a baseline. Future runs diff against it — new fails = build fails.
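"New fails = build fails" is a set difference: only failures absent from the baseline break the build. A minimal sketch of that diff:

```python
# Regression-baseline sketch: a run fails the build only when it
# introduces failures that were not already failing at baseline.
def new_failures(baseline_failed, current_failed):
    """Test ids failing now that were passing (or absent) at baseline."""
    return sorted(set(current_failed) - set(baseline_failed))

def build_passes(baseline_failed, current_failed):
    return not new_failures(baseline_failed, current_failed)
```

Known-bad tests carried over from the baseline never re-fail the build; only regressions do.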

GitHub Action

bookbaghq/bookbag-eval-action@v1 in your workflow. Min-pass threshold, suite selector, fails CI on regression.

Evaluation FAQs

Frequently Asked Questions

Stop shipping AI without a score. Every output goes through the auditor.

Join the teams shipping safer AI with real-time evaluation, audit trails, and continuous improvement.