Bookbag
Product Overview

How Bookbag Works

Evaluate every AI output with real-time API gates, multi-stage evaluation, and human oversight. Structured verdicts, audit trails, and training data — from chatbots to medical AI.

Two Ways to Integrate

Real-time API for production systems. Batch upload for audits and training data. Use both.

Real-Time — Gate API
1. Your app calls the SDK
client.gate.evaluate(input, output)
2. Multi-stage evaluation
1-3 passes, per-stage model selection
allow → Ship to user
flag → Ship + queue for review
block → Fallback response
require_sme → Hold for expert
1-4 second response time
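The four verdicts above map naturally onto app-side routing. A minimal sketch, assuming you wire the routing yourself; the fallback text and queue names here are illustrative, not part of the SDK:

```python
# Hypothetical routing helper for Gate API decisions.
# Decision values (allow / flag / block / require_sme) come from the docs;
# the fallback message and queue names are made up for illustration.
def route(decision, output, fallback="Let me connect you with a specialist."):
    if decision == "allow":
        return {"ship": output, "queue": None}           # ship to user
    if decision == "flag":
        return {"ship": output, "queue": "human_review"}  # ship + review
    if decision == "block":
        return {"ship": fallback, "queue": "human_review"}  # fallback response
    if decision == "require_sme":
        return {"ship": None, "queue": "sme"}             # hold for expert
    raise ValueError(f"unknown decision: {decision}")
```

Call it with `result.decision` from `gate.evaluate()` and the candidate output, then ship whatever comes back in `ship` (if anything).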
Batch — Upload & Review
1. Export from your tool
CSV upload or API batch import
2. Human review queue
Annotators → QA → SME escalation
3. Verdict packages returned
Scores, flags, rewrites, audit trail, training data
Gold-standard human evaluation
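For the batch track, the upload is essentially rows of input/output pairs. A sketch of building such a CSV in Python; the column names are an assumption here, so check the import documentation for the exact schema Bookbag expects:

```python
import csv
import io

# Illustrative batch file for upload: one conversation turn per row.
# The "input" / "output" column names are assumed, not confirmed.
rows = [
    {"input": "What is my refund policy?", "output": "Full refund within 90 days."},
    {"input": "Can I change my plan?", "output": "Yes, upgrades apply immediately."},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["input", "output"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()  # ready to save and upload, or send via batch API
```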

Multi-Stage Evaluation

Choose evaluation depth per project. Fast screening for high-volume, deep analysis for high-stakes. Per-stage model selection optimizes cost and quality.

Single Pass

Fast screening with one evaluation stage. Best for high-volume content where speed matters most.

Latency: 500–2,000ms
Model: gpt-4o-mini
Best for: Chatbots, content screening

Two Pass

Recommended

Balanced depth. First pass triages, second pass evaluates flagged items in detail. Best quality-to-cost ratio.

Latency: 1,500–4,000ms
Models: gpt-4o-mini → gpt-4o
Best for: Production AI, support, sales

Three Pass

Maximum depth. Three evaluation stages with escalating model capability. For regulated and high-stakes decisions.

Latency: 3,000–10,000ms
Models: gpt-4o-mini → gpt-4o → o3
Best for: Healthcare, legal, finance
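The triage idea behind the multi-pass tiers can be sketched in a few lines: a cheap first pass screens everything, and only low-confidence items reach the deeper, more expensive pass. The scoring functions and threshold below are stand-ins, not Bookbag internals:

```python
# Sketch of two-pass triage. `screen` is a fast, cheap evaluator;
# `deep` is a slower, stronger one. Both return a 0-1 quality score
# here purely for illustration.
def two_pass(item, screen, deep, threshold=0.8):
    first = screen(item)
    if first >= threshold:
        return {"decision": "allow", "passes": 1}  # cleared on the cheap pass
    second = deep(item)
    return {
        "decision": "allow" if second >= threshold else "flag",
        "passes": 2,
    }
```

Most traffic stops at pass one, which is why the two-pass tier buys most of the quality of deep review at a fraction of its latency and cost.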

Three Review Modes

Choose the level of oversight that fits your risk profile. Switch modes per project.

Automated

Full AI evaluation, synchronous response. Decision returned via Gate API in 1-4 seconds. No human involvement.

Best for: High-volume screening, chatbots, content generation

Assisted

AI evaluates and returns a decision immediately. Flagged items are queued for human review in the background. Best of both worlds.

Best for: Production AI with continuous human oversight

Human

Expert human review on every item. Three-tier workflow: annotator, QA reviewer, subject matter expert. Gold-standard quality.

Best for: Healthcare, legal, finance, training data creation
Gate API

Integrate in Minutes

Install the SDK, create a client, and evaluate your first AI output. Python and Node.js with zero external dependencies. Advisory or enforced mode.

View full API documentation
Python
from bookbag import BookbagClient

client = BookbagClient(
    api_key="bk_gate_xxx"
)

result = client.gate.evaluate(
    input="What is my refund policy?",
    output="Full refund within 90 days."
)

# result.decision: allow | flag | block | require_sme
# result.scores, result.flags, result.audit_id

What Comes Back With Every Evaluation

Every evaluation — whether via real-time API or batch upload — returns a structured data package. Not just a label.

  • Failure Analysis
    Hallucination, factual error, policy violation, tone issue, over-promising — with severity and business impact
  • Rubric Scores
    Scored 1-5 on correctness, tone, personalization, policy compliance, and confidence
  • Gold-Standard Rewrites
    Corrected responses with explanations. For blocked items: SME rationale and evidence citations
  • Training Data Export
    SFT pairs, DPO preference data, and ranking signals — structured for model fine-tuning
  • Complete Audit Trail
    Who evaluated, when, which taxonomy version, what decision — immutable and searchable
VERDICT PACKAGE
evaluation_id: eval_a7k2m9
decision: flag
risk: high
flags: [over-promising, personalization_failure]
policy_action: review
SCORES
correctness: 3/5
tone: 4/5
personalization: 2/5
policy: 2/5
confidence: 0.87
RATIONALE
"Over-promising detected: guarantee language violates claims policy §4.2. Personalization is generic."
AUDIT
audit_id: aud_9x2k4m
evaluation_ms: 2,340
taxonomy_v: v2.3.1
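A verdict package like the one above is easy to consume programmatically. A sketch using the fields from the example; the exact JSON shape may vary, and the `safe_to_ship` guard is an illustrative pattern, not part of the SDK:

```python
# The example verdict as a plain dict. Field names are taken from the
# sample package; treat the exact structure as an assumption.
verdict = {
    "evaluation_id": "eval_a7k2m9",
    "decision": "flag",
    "risk": "high",
    "flags": ["over-promising", "personalization_failure"],
    "scores": {"correctness": 3, "tone": 4, "personalization": 2, "policy": 2},
    "confidence": 0.87,
}

# Illustrative guard: only ship when the decision is allow AND no
# rubric dimension scored below the floor.
def safe_to_ship(v, floor=3):
    return v["decision"] == "allow" and min(v["scores"].values()) >= floor
```

For this example the guard correctly refuses: the decision is `flag` and both personalization and policy scored 2/5.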

Customizable Taxonomies

Define what matters for your domain. Configure rubrics, failure categories, and policies per project. Version-stamped for audit compliance.

Project-Level Configuration

Each project has its own rubrics, failure categories, and evaluation criteria. Switch configurations per campaign, client, or domain.

Version-Stamped Policies

Every evaluation is logged with the exact taxonomy version used. Trace back to the policy in effect at the time.

Built-in Templates

Start with pre-built templates across 10 AI QA categories. Customize for your domain, or test your skills with 50 interactive quizzes.

Your AI Gets Better Over Time

Every correction becomes training data. Export in standard ML formats to retrain your models.

SFT
Supervised Fine-Tuning

Input → approved output pairs for fine-tuning your base model

DPO
Direct Preference Optimization

Preference pairs: chosen vs rejected outputs for RLHF training

Ranking
Ranked Outputs

Multiple outputs ranked by quality for reward model training
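As a rough illustration, records in the three export formats might look like the following. The field names follow common ML conventions (prompt/completion, chosen/rejected) and are assumptions, not Bookbag's confirmed export schema:

```python
import json

# SFT: input → approved output pair.
sft = {
    "prompt": "What is my refund policy?",
    "completion": "Refunds are available within 30 days of purchase.",  # illustrative rewrite
}

# DPO: preference pair, chosen vs rejected.
dpo = {
    "prompt": "What is my refund policy?",
    "chosen": "Refunds are available within 30 days of purchase.",
    "rejected": "Full refund within 90 days.",  # the over-promising original
}

# Ranking: multiple outputs ordered by quality for reward modeling.
ranking = {
    "prompt": "What is my refund policy?",
    "outputs": ["best answer", "okay answer", "worst answer"],  # placeholders
    "ranks": [1, 2, 3],
}

line = json.dumps(sft)  # JSONL exports carry one record per line
```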

Human Review Workflow

When items are flagged or queued for human review, Bookbag's 3-tier workflow routes them to the right expertise.

Annotator Review

First-pass evaluation against your rubrics. Fast, structured workflows for high-volume review.

  • Task-based queue
  • Rubric-guided evaluation
  • Quick approve / reject / escalate

QA Review

Rewrite, approve, or escalate. Corrections become gold-standard examples and training data.

  • Edit and approve workflow
  • Create approved templates
  • Export training data

SME Approval

Subject matter experts make final calls on high-risk items. Full provenance and evidence trails.

  • Blocked-only items
  • Requires rationale + evidence
  • Audit-ready recordkeeping
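The three-tier path (annotator → QA → SME) amounts to a simple escalation chain. A sketch; the tier names come from the workflow above, while the routing logic is illustrative:

```python
# Escalation order from the review workflow: annotator → QA → SME.
TIERS = ["annotator", "qa", "sme"]

def escalate(current):
    """Return the next tier up, or None if the SME tier is final."""
    i = TIERS.index(current)
    return TIERS[i + 1] if i + 1 < len(TIERS) else None
```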

Get Started

Two paths to production. Choose API-first for real-time integration, or batch upload for audits and training data.

Track A — API Integration
1. Install the SDK
pip install bookbag
2. Configure taxonomy + review mode
Define rubrics, choose evaluation depth and mode
3. Call gate.evaluate() from your app
Real-time decisions in 1-4 seconds
Track B — Batch Upload
1. Subscribe and create a project
Choose plan, configure rubrics
2. Upload AI conversations
CSV or API batch import from any tool
3. Receive verdict packages
Scores, flags, rewrites, audit trail, training data

What You Can Launch in 2 Weeks

Gate API integrated in your app
Taxonomy configured for your domain
Review mode selected and tuned
Multi-stage evaluation depth set
First 100 evaluations processed
Training data export configured
Audit trail flowing
Edge cases routing to SME queue

Ready to evaluate your AI?

Join the teams shipping safer AI with real-time evaluation, audit trails, and continuous improvement.