Data sources

Everything your agent knows comes from data sources — website crawls, file uploads, pasted text, and Q&A pairs. Learn each type, how training (chunk, embed, index) works, Q&A priority, and how to keep knowledge fresh with retraining.

A Bookbag agent answers only from what you teach it, and data sources are how you teach it. Each source you add is extracted, split into chunks, embedded as vectors, and stored in a vector index. When a customer asks a question, the agent retrieves the most relevant pieces of your content and answers from those.

You manage sources from an agent's Data sources tab. This page covers the four source types, what happens when a source trains, how Q&A pairs take priority, and how to keep everything current.

Where data sources live

Sources are scoped to a single agent. Two agents in the same workspace each have their own knowledge — add a source to the agent that should answer from it. Manage them at app.bookbag.ai under your agent's Data sources tab.

The four source types

Bookbag supports four kinds of data source. Most stores use a mix: a website crawl for the bulk of content, files for catalogs and policy PDFs, and a handful of Q&A pairs to pin the answers that must be exact.

| Type | What it ingests | Best for |
| --- | --- | --- |
| Website | A bounded multi-page crawl that follows links from a starting URL and extracts the readable text of each page. | Help centers, FAQ pages, policy and shipping pages. |
| File | An uploaded document (PDF, doc, spreadsheet, text). Bookbag extracts and chunks the text. | Product catalogs, manuals, policy documents, exported macros. |
| Text | A snippet you paste directly into a text box. | A policy that isn't written down anywhere public yet. |
| Q&A | An exact question paired with the exact answer you want returned. | High-stakes answers: refund window, warranty terms, shipping cut-offs. |

Website crawl

Enter a URL and Bookbag crawls outward from it, following links and extracting the readable content of each page it finds — stripping navigation, footers, and boilerplate. A single website source can produce many documents (one per page) and many chunks.
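As a rough illustration of what "bounded" means here, the sketch below walks links breadth-first from a starting URL and stops at a page cap. It is not Bookbag's crawler: `bounded_crawl`, `get_links`, `max_pages`, and the in-memory `SITE` map are hypothetical names that stand in for real fetching and link extraction.

```python
from collections import deque
from typing import Callable, Iterable


def bounded_crawl(
    start_url: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 50,
) -> list[str]:
    """Breadth-first walk from start_url that visits at most max_pages pages."""
    seen = {start_url}
    queue = deque([start_url])
    visited: list[str] = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for real HTTP fetches.
SITE = {
    "https://example.com/help": [
        "https://example.com/help/shipping",
        "https://example.com/help/returns",
    ],
    "https://example.com/help/shipping": ["https://example.com/help"],
    "https://example.com/help/returns": ["https://example.com/help/warranty"],
    "https://example.com/help/warranty": [],
}

pages = bounded_crawl("https://example.com/help", lambda u: SITE.get(u, []), max_pages=3)
print(pages)
```

With a cap of 3, the walk returns the help page, the shipping page, and the returns page, and never reaches the warranty page. Each visited page would become one document in the source.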

Point the crawler at content, not your homepage

Your homepage is mostly marketing and navigation. Start the crawl at your help center or FAQ — it is the highest-signal content for support and produces far better retrieval.

File upload

Upload a document and Bookbag extracts the text, chunks it, and embeds it. Good for content that lives in PDFs or spreadsheets — product specs, return-policy documents, internal macros you want the agent to draw on.

Text snippet

Paste text straight in. This is the fastest way to capture a policy or process that lives in someone's head and isn't published anywhere the crawler can reach.

Q&A pairs

A Q&A pair is an exact question and the exact answer you want the agent to give. Unlike crawled or uploaded text — which the model paraphrases — Q&A pairs are treated as authoritative and take priority during retrieval.

Pin the answers that must be exact

Anything touching money, eligibility, or legal commitments belongs in a Q&A pair. A handful of well-chosen pairs (refund window, shipping times, warranty terms) eliminates the most damaging category of wrong answers.

How training works

When you add or retrain a source, Bookbag runs it through an ingestion pipeline. The same four steps run for every non-Q&A source:

  1. Extract
     Bookbag pulls the readable text out of the source: crawling pages for a website, parsing a file, or reading your pasted text.
  2. Chunk
     The text is split into smaller passages sized for retrieval. Tight, single-topic chunks retrieve better than one giant page.
  3. Embed
     Each chunk is turned into a vector with the agent's embedding model, capturing its meaning so similar questions match it.
  4. Index
     The vectors are stored in the agent's vector index, ready to be searched at query time.
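The chunk, embed, and index steps above can be sketched in a few lines. This is a toy under stated assumptions, not Bookbag's pipeline: `chunk_text`, `toy_embed`, `build_index`, and `search` are hypothetical names, the sample text is invented, and the hashed bag-of-words "embedding" is only a stand-in for a real embedding model.

```python
import hashlib
import math


def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split on blank lines, then cap each passage at max_words words."""
    chunks = []
    for para in text.split("\n\n"):
        words = para.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in embedding: hashed bag of words, L2-normalised. Not a real model."""
    vec = [0.0] * dims
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(text: str) -> list[tuple[str, list[float]]]:
    """Chunk the text and store each chunk next to its vector."""
    return [(chunk, toy_embed(chunk)) for chunk in chunk_text(text)]


def search(index: list[tuple[str, list[float]]], query: str, k: int = 1) -> list[str]:
    """Return the k chunks whose vectors have the highest dot product with the query."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda item: -sum(a * b for a, b in zip(q, item[1])))
    return [chunk for chunk, _ in ranked[:k]]


# Invented sample text, already extracted (step 1 is assumed done).
DOC = (
    "Orders ship within two business days.\n\n"
    "Refunds are accepted within thirty days of delivery."
)
index = build_index(DOC)
print(len(index), "chunks indexed")
print(search(index, "Orders ship within two business days."))
```

Because the vectors are normalised, the dot product acts as cosine similarity, which is a common way to rank chunks against a query.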

Q&A pairs follow a shorter path: the question is embedded and stored alongside the approved answer, so a matching question returns that answer directly.

For the full picture of how indexed sources turn into trustworthy, cited answers at query time, see Response quality.

Source status

Each source shows a status as it moves through the pipeline:

| Status | Meaning |
| --- | --- |
| Queued | The source is waiting to be processed. |
| Processing | Bookbag is extracting, chunking, and embedding it now. |
| Trained | Done. The agent can answer from this source. |
| Error | Ingestion failed. The source shows the reason (for example, no extractable text or no crawlable pages). |

When a source reaches Trained, its content is live in the playground and on every connected channel immediately.
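One way to picture the status lifecycle is as a small state machine. The four status names come from the table above; the transition map, `can_transition`, and `is_live` are assumptions made for illustration, including the guess that retraining sends a source back through Queued.

```python
from enum import Enum


class SourceStatus(Enum):
    QUEUED = "Queued"
    PROCESSING = "Processing"
    TRAINED = "Trained"
    ERROR = "Error"


# Assumed transitions; the documentation names the statuses but not this map.
ALLOWED = {
    SourceStatus.QUEUED: {SourceStatus.PROCESSING},
    SourceStatus.PROCESSING: {SourceStatus.TRAINED, SourceStatus.ERROR},
    SourceStatus.TRAINED: {SourceStatus.QUEUED},  # retrain
    SourceStatus.ERROR: {SourceStatus.QUEUED},    # fix the source, then retrain
}


def can_transition(current: SourceStatus, target: SourceStatus) -> bool:
    """True if the assumed lifecycle allows moving from current to target."""
    return target in ALLOWED[current]


def is_live(status: SourceStatus) -> bool:
    """Only a Trained source contributes to answers."""
    return status is SourceStatus.TRAINED


print(can_transition(SourceStatus.PROCESSING, SourceStatus.TRAINED))
print(is_live(SourceStatus.ERROR))
```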

When a source errors

The most common causes are a page with no extractable text (an image-only PDF, or a JavaScript-rendered page the crawler can't read) and a starting URL with no crawlable links. Fix the source or paste the content as a Text source instead, then retrain.

Q&A priority and how retrieval chooses

At query time Bookbag first checks your Q&A pairs. If the customer's question closely matches a pair, that exact answer is returned and the agent skips paraphrasing entirely. Only when no Q&A pair is a strong match does Bookbag fall back to searching your chunked sources and grounding the model's reply in the top results.

This is why Q&A is the right tool for precision and crawls/files are the right tool for coverage. Use crawls and files to give the agent broad knowledge; use Q&A to lock down the specific answers you cannot afford to get wrong.
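The Q&A-first order described in this section can be sketched as a threshold check followed by a fallback. Everything here is illustrative: `answer`, `similarity`, the `0.8` threshold, and the sample pair are hypothetical, and string similarity from `difflib` stands in for the embedding comparison a real system would use.

```python
from difflib import SequenceMatcher
from typing import Callable

# Invented example pair; not a real policy.
QA_PAIRS = {
    "what is your refund window?": "Refunds are accepted within 30 days of delivery.",
}


def similarity(a: str, b: str) -> float:
    """Stand-in for vector similarity: character-level ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def answer(
    question: str,
    qa_pairs: dict[str, str],
    search_chunks: Callable[[str], list[str]],
    threshold: float = 0.8,
) -> tuple[str, object]:
    """Return the pinned answer on a strong Q&A match, else fall back to chunks."""
    best = max(qa_pairs, key=lambda q: similarity(question, q), default=None)
    if best is not None and similarity(question, best) >= threshold:
        return ("qa", qa_pairs[best])  # exact answer, no paraphrasing
    return ("chunks", search_chunks(question))  # ground the model in top results


def fallback(question: str) -> list[str]:
    """Placeholder for chunk search over crawled and uploaded sources."""
    return ["<top matching chunks would go here>"]


print(answer("What is your refund window?", QA_PAIRS, fallback))
print(answer("Do you ship to Canada?", QA_PAIRS, fallback))
```

The first question matches the pinned pair and returns its answer verbatim. The second has no strong match, so it takes the chunk-search path.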

Keeping knowledge fresh

Stale data is the single most common cause of wrong answers. When a policy, price, or shipping timeline changes, update the source.

Retraining a source

Use Retrain on a source to re-run ingestion. Retraining is idempotent: Bookbag clears the source's prior documents, chunks, and Q&A data first, then re-extracts and re-indexes from scratch — so a re-crawled page never leaves stale chunks behind.
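A minimal sketch of the clear-then-rebuild behavior, assuming a simple in-memory store: `SourceStore` and its methods are hypothetical, and the example tracks only chunks, where the real process also clears documents and Q&A data.

```python
class SourceStore:
    """Toy store that keeps the chunks of each source under its id."""

    def __init__(self) -> None:
        self._chunks: dict[str, list[str]] = {}

    def retrain(self, source_id: str, text: str) -> int:
        """Drop everything previously indexed for the source, then rebuild."""
        self._chunks.pop(source_id, None)  # clear prior chunks first
        self._chunks[source_id] = [p.strip() for p in text.split("\n\n") if p.strip()]
        return len(self._chunks[source_id])

    def chunks(self, source_id: str) -> list[str]:
        return list(self._chunks.get(source_id, []))


store = SourceStore()
store.retrain("help-center", "Old shipping policy.\n\nReturns policy.")
store.retrain("help-center", "New shipping policy.\n\nReturns policy.")
store.retrain("help-center", "New shipping policy.\n\nReturns policy.")  # same result again
print(store.chunks("help-center"))
```

Running the same retrain twice leaves the store in the same state, and no chunk from the old version of the page survives.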

  1. Update the underlying content
     Edit the page, re-export the file, or rewrite the Q&A answer.
  2. Retrain the source
     For a website source, Retrain re-crawls the site. For a file, re-upload it. For text or Q&A, edit and save.
  3. Confirm it reaches Trained
     The status returns to Trained and the new content is immediately live.

Scheduled retraining

On Standard and higher plans, website sources can retrain on a schedule so a changing help center stays current without manual re-crawls. See Plans & billing for which plans include scheduled retraining.

Turning real conversations into better data

Your customers tell you where your knowledge has gaps. Bookbag surfaces this two ways:

  • Suggestions on the Data sources tab flag low-confidence answers (a likely missing-content gap) and thumbs-down answers, with the original question — so you know exactly what source or Q&A pair to add.
  • Improve answer lets you edit a reply and save it as a high-priority Q&A pair, so the corrected answer is retrieved first next time.

Review these alongside Activity & chat logs on a regular cadence, for example weekly, so each gap is closed soon after customers expose it.
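The flagging rule behind Suggestions can be pictured as a simple filter over answered questions. This is a guess at the shape of the logic, not Bookbag's implementation: `AnswerLog`, its fields, `suggestions`, and the `0.5` confidence cut-off are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AnswerLog:
    question: str
    confidence: float  # hypothetical score between 0 and 1
    thumbs_down: bool = False


def suggestions(logs: list[AnswerLog], min_confidence: float = 0.5) -> list[tuple[str, str]]:
    """Return (question, reason) for every answer worth reviewing."""
    flagged = []
    for log in logs:
        if log.thumbs_down:
            flagged.append((log.question, "thumbs-down"))
        elif log.confidence < min_confidence:
            flagged.append((log.question, "low confidence"))
    return flagged


logs = [
    AnswerLog("Do you ship to Canada?", confidence=0.31),
    AnswerLog("What is the refund window?", confidence=0.92),
    AnswerLog("Can I change my order?", confidence=0.80, thumbs_down=True),
]
print(suggestions(logs))
```

Each flagged question points at either a missing source or an answer that is a candidate for a pinned Q&A pair.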

Embedding models and the vector index

Every agent has one embedding model, and all of its chunks are embedded with that model. This matters because retrieval can only compare vectors produced by the same model — Bookbag pins the embedding model per agent so dimensions never mix. If you change an agent's embedding model, retrain its sources so the index is rebuilt consistently. For the trade-offs between embedding models, see Models & model choice.
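To see why vectors from different models cannot share an index, consider a similarity function that refuses mismatched dimensions. `cosine` is a hypothetical helper written for this example. Matching dimensions is necessary but not sufficient: two models with the same dimension still produce vectors that are not comparable, which is why a model change calls for a full retrain.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; raises if the vectors come from different-sized spaces."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal

try:
    cosine([1.0, 0.0], [1.0, 0.0, 0.0])  # chunk embedded with a different model
except ValueError as err:
    print("rejected:", err)
```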
