Data sources
Everything your agent knows comes from data sources — website crawls, file uploads, pasted text, and Q&A pairs. Learn each type, how training (chunk, embed, index) works, Q&A priority, and how to keep knowledge fresh with retraining.
A Bookbag agent only answers from what you teach it. Data sources are how you teach it. Each source you add is extracted, split into chunks, embedded, and stored in a vector index, so when a customer asks a question, the agent pulls the most relevant pieces of your content and answers from those.
You manage sources from an agent's Data sources tab. This page covers the four source types, what happens when a source trains, how Q&A pairs take priority, and how to keep everything current.
Sources are scoped to a single agent. Two agents in the same workspace each have their own knowledge — add a source to the agent that should answer from it. Manage them at app.bookbag.ai under your agent's Data sources tab.
The four source types
Bookbag supports four kinds of data source. Most stores use a mix: a website crawl for the bulk of content, files for catalogs and policy PDFs, and a handful of Q&A pairs to pin the answers that must be exact.
| Type | What it ingests | Best for |
|---|---|---|
| Website | A bounded multi-page crawl that follows links from a starting URL and extracts the readable text of each page. | Help centers, FAQ pages, policy and shipping pages. |
| File | An uploaded document (PDF, doc, spreadsheet, text). Bookbag extracts and chunks the text. | Product catalogs, manuals, policy documents, exported macros. |
| Text | A snippet you paste directly into a text box. | A policy that isn't written down anywhere public yet. |
| Q&A | An exact question paired with the exact answer you want returned. | High-stakes answers: refund window, warranty terms, shipping cut-offs. |
Website crawl
Enter a URL and Bookbag crawls outward from it, following links and extracting the readable content of each page it finds — stripping navigation, footers, and boilerplate. A single website source can produce many documents (one per page) and many chunks.
Your homepage is mostly marketing and navigation. Start the crawl at your help center or FAQ — it is the highest-signal content for support and produces far better retrieval.
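To make "bounded crawl" concrete, here is a minimal sketch of one way such a crawl could work: a breadth-first traversal that stays on the starting host and stops after a page cap. This is an illustration under stated assumptions, not Bookbag's crawler. The `fetch` callable, the `max_pages` parameter, and the in-memory `SITE` data are all hypothetical, chosen so the example runs without network access.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple
from urllib.parse import urljoin, urlparse

# Hypothetical fetcher: URL -> (readable text, hrefs found on the page).
Fetch = Callable[[str], Tuple[str, List[str]]]

def bounded_crawl(start_url: str, fetch: Fetch, max_pages: int = 50) -> Dict[str, str]:
    """Breadth-first crawl from start_url, staying on the same host and
    fetching at most max_pages pages. Returns {url: readable_text}."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    documents: Dict[str, str] = {}
    fetched = 0

    while queue and fetched < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        fetched += 1
        if text.strip():
            documents[url] = text  # one document per page with readable text
        for href in links:
            target = urljoin(url, href).split("#")[0]  # resolve relative links, drop fragments
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return documents

# In-memory stand-in for a small help center (illustrative data only).
SITE = {
    "https://example.com/help": ("Help center home", ["/help/shipping", "/help/returns", "https://other.example/x"]),
    "https://example.com/help/shipping": ("Orders ship in 2 business days.", ["/help"]),
    "https://example.com/help/returns": ("Returns accepted within 30 days.", []),
}

def fake_fetch(url: str) -> Tuple[str, List[str]]:
    return SITE.get(url, ("", []))

docs = bounded_crawl("https://example.com/help", fake_fetch, max_pages=10)
```

Starting at the help center rather than the homepage means the page budget is spent on support content instead of marketing pages.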
File upload
Upload a document and Bookbag extracts the text, chunks it, and embeds it. Good for content that lives in PDFs or spreadsheets — product specs, return-policy documents, internal macros you want the agent to draw on.
Text snippet
Paste text straight in. This is the fastest way to capture a policy or process that lives in someone's head and isn't published anywhere the crawler can reach.
Q&A pairs
A Q&A pair is an exact question and the exact answer you want the agent to give. Unlike crawled or uploaded text — which the model paraphrases — Q&A pairs are treated as authoritative and take priority during retrieval.
Anything touching money, eligibility, or legal commitments belongs in a Q&A pair. A handful of well-chosen pairs (refund window, shipping times, warranty terms) eliminates the most damaging category of wrong answers.
How training works
When you add or retrain a source, Bookbag runs it through an ingestion pipeline. The same four steps run for every non-Q&A source:
1. Extract. Bookbag pulls the readable text out of the source: crawling pages for a website, parsing a file, or reading your pasted text.
2. Chunk. The text is split into smaller passages sized for retrieval. Tight, single-topic chunks retrieve better than one giant page.
3. Embed. Each chunk is turned into a vector with the agent's embedding model, capturing its meaning so similar questions match it.
4. Index. The vectors are stored in the agent's vector index, ready to be searched at query time.
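The four steps above can be sketched end to end in a few lines. This is a toy illustration, not Bookbag's implementation: the function names are hypothetical, the chunker simply packs paragraphs up to an assumed `max_chars` limit, and the "embedding" is a bag-of-words vector standing in for a real embedding model.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

def extract(raw: str) -> str:
    """Step 1: reduce the source to readable text (here: collapse runs of spaces and tabs)."""
    return re.sub(r"[ \t]+", " ", raw).strip()

def chunk(text: str, max_chars: int = 200) -> List[str]:
    """Step 2: split on blank lines, then pack paragraphs into passages of roughly max_chars."""
    chunks: List[str] = []
    current = ""
    for para in re.split(r"\n\s*\n", text):
        para = para.strip()
        if not para:
            continue
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current} {para}".strip()
    if current:
        chunks.append(current)
    return chunks

def embed(text: str) -> Dict[str, float]:
    """Step 3: toy embedding, a unit-length bag-of-words vector (stand-in for a real model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def ingest(source_id: str, raw: str, index: List[dict], max_chars: int = 200) -> int:
    """Run extract -> chunk -> embed, then store each vector (step 4). Returns chunks stored."""
    passages = chunk(extract(raw), max_chars)
    for i, passage in enumerate(passages):
        index.append({"source": source_id, "chunk": i, "text": passage, "vector": embed(passage)})
    return len(passages)

def search(query: str, index: List[dict], k: int = 1) -> List[Tuple[float, str]]:
    """Query time: rank chunks by cosine similarity (dot product of unit vectors)."""
    q = embed(query)
    scored = [(sum(w * row["vector"].get(t, 0.0) for t, w in q.items()), row["text"]) for row in index]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]

index: List[dict] = []
stored = ingest(
    "policy",
    "Returns are accepted within 30 days of delivery.\n\n"
    "Standard shipping takes 3 to 5 business days.",
    index,
    max_chars=60,
)
top = search("How long does shipping take?", index)
```

With the small `max_chars` used here, each paragraph becomes its own chunk, so the shipping question matches only the shipping passage. That is the practical reason tight, single-topic chunks retrieve better.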
Q&A pairs follow a shorter path: the question is embedded and stored alongside the approved answer, so a matching question returns that answer directly.
For the full picture of how indexed sources turn into trustworthy, cited answers at query time, see Response quality.
Source status
Each source shows a status as it moves through the pipeline:
| Status | Meaning |
|---|---|
| Queued | The source is waiting to be processed. |
| Processing | Bookbag is extracting, chunking, and embedding it now. |
| Trained | Done — the agent can answer from this source. |
| Error | Ingestion failed. The source shows the reason (for example, no extractable text or no crawlable pages). |
When a source reaches Trained, its content is live in the playground and on every connected channel immediately.
The most common causes of an Error status are a source with no extractable text (an image-only PDF, or a JavaScript-rendered page the crawler can't read) and a starting URL with no crawlable links. Fix the source, or paste the content as a Text source instead, then retrain.
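One way to picture the status lifecycle is as a small state machine. The sketch below is an assumption-laden model, not Bookbag's internals: the transition table (including retraining sending a Trained or Error source back to Queued) and the `train` helper are hypothetical, and only the four status names come from the table above.

```python
from enum import Enum
from typing import Optional, Tuple

class SourceStatus(Enum):
    QUEUED = "Queued"
    PROCESSING = "Processing"
    TRAINED = "Trained"
    ERROR = "Error"

# Assumed transitions: a retrain sends a Trained or Error source back to Queued.
ALLOWED = {
    SourceStatus.QUEUED: {SourceStatus.PROCESSING},
    SourceStatus.PROCESSING: {SourceStatus.TRAINED, SourceStatus.ERROR},
    SourceStatus.TRAINED: {SourceStatus.QUEUED},
    SourceStatus.ERROR: {SourceStatus.QUEUED},
}

def advance(current: SourceStatus, new: SourceStatus) -> SourceStatus:
    """Move to a new status, rejecting transitions the table does not allow."""
    if new not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {new.value}")
    return new

def train(extracted_text: str) -> Tuple[SourceStatus, Optional[str]]:
    """Walk one source through the pipeline; empty text ends in Error with a reason."""
    status = advance(SourceStatus.QUEUED, SourceStatus.PROCESSING)
    if not extracted_text.strip():
        return advance(status, SourceStatus.ERROR), "no extractable text"
    return advance(status, SourceStatus.TRAINED), None

ok = train("Returns are accepted within 30 days.")
failed = train("   ")  # e.g. an image-only PDF that yields no text
```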
Q&A priority and how retrieval chooses
At query time Bookbag first checks your Q&A pairs. If the customer's question closely matches a pair, that exact answer is returned and the agent skips paraphrasing entirely. Only when no Q&A pair is a strong match does Bookbag fall back to searching your chunked sources and grounding the model's reply in the top results.
This is why Q&A is the right tool for precision and crawls/files are the right tool for coverage. Use crawls and files to give the agent broad knowledge; use Q&A to lock down the specific answers you cannot afford to get wrong.
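The Q&A-first behavior described above can be sketched as a two-stage lookup. This is an illustration only: the `QA_THRESHOLD` value, the `answer` function, and the bag-of-words similarity are assumptions standing in for whatever matching and scoring Bookbag actually uses.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

QA_THRESHOLD = 0.8  # assumed cut-off for "closely matches"

def embed(text: str) -> Dict[str, float]:
    """Toy unit-length bag-of-words vector (stand-in for a real embedding model)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())

def answer(question: str, qa_pairs: List[Tuple[str, str]], chunks: List[str], top_k: int = 1) -> dict:
    """Check Q&A pairs first; fall back to chunk retrieval only without a strong match."""
    q = embed(question)
    best_score, best_answer = 0.0, None
    for pair_question, pair_answer in qa_pairs:
        score = cosine(q, embed(pair_question))
        if score > best_score:
            best_score, best_answer = score, pair_answer
    if best_answer is not None and best_score >= QA_THRESHOLD:
        return {"mode": "qa", "answer": best_answer}  # exact answer, no paraphrasing
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return {"mode": "retrieval", "context": ranked[:top_k]}  # the model would be grounded in these

qa_pairs = [("What is your refund window?", "Refunds are available within 30 days of delivery.")]
chunks = ["Standard shipping takes 3 to 5 business days.", "We ship to the US and Canada."]

exact = answer("what is your refund window", qa_pairs, chunks)
fallback = answer("How long does shipping take?", qa_pairs, chunks)
```

The refund question matches the stored pair and returns its answer verbatim, while the shipping question has no strong Q&A match and falls through to chunk retrieval.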
Keeping knowledge fresh
Stale data is the single most common cause of wrong answers. When a policy, price, or shipping timeline changes, update the source.
Retraining a source
Use Retrain on a source to re-run ingestion. Retraining is idempotent: Bookbag clears the source's prior documents, chunks, and Q&A data first, then re-extracts and re-indexes from scratch — so a re-crawled page never leaves stale chunks behind.
1. Update the underlying content. Edit the page, re-export the file, or rewrite the Q&A answer.
2. Retrain the source. For a website source this re-crawls; for a file, re-upload; for text or Q&A, edit and save.
3. Confirm it reaches Trained. The status returns to Trained and the new content is immediately live.
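The "clear first, then rebuild" behavior is what makes retraining idempotent. A minimal sketch, assuming the index is a simple list of rows keyed by a hypothetical `source` field:

```python
from typing import Dict, List

def retrain(index: List[Dict[str, str]], source_id: str, fresh_chunks: List[str]) -> None:
    """Drop everything previously stored for source_id, then re-add from scratch."""
    index[:] = [row for row in index if row["source"] != source_id]
    index.extend({"source": source_id, "text": text} for text in fresh_chunks)

index = [
    {"source": "help-center", "text": "Returns accepted within 14 days."},  # stale policy
    {"source": "catalog", "text": "Blue mug, 350 ml."},
]

# The policy changed to 30 days; retrain the help-center source twice.
retrain(index, "help-center", ["Returns accepted within 30 days."])
after_first = list(index)
retrain(index, "help-center", ["Returns accepted within 30 days."])
```

Running the retrain a second time leaves the index unchanged, the stale 14-day chunk is gone, and the other source's rows are untouched.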
On Standard and higher plans, website sources can retrain on a schedule so a changing help center stays current without manual re-crawls. See Plans & billing for which plans include scheduled retraining.
Turning real conversations into better data
Your customers tell you where your knowledge has gaps. Bookbag surfaces this two ways:
- Suggestions on the Data sources tab flag low-confidence answers (a likely missing-content gap) and thumbs-down answers, with the original question — so you know exactly what source or Q&A pair to add.
- Improve answer lets you edit a reply and save it as a high-priority Q&A pair, so the corrected answer is retrieved first next time.
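As a rough sketch of how a suggestions list like this could be derived from chat logs, the example below flags thumbs-down answers and answers under a confidence cut-off. The `LoggedAnswer` fields, the `LOW_CONFIDENCE` value, and the 0 to 1 confidence scale are all assumptions for illustration; Bookbag's actual scoring is not described here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

LOW_CONFIDENCE = 0.5  # assumed threshold on an assumed 0..1 confidence score

@dataclass
class LoggedAnswer:
    question: str
    confidence: float
    feedback: Optional[str] = None  # "up", "down", or None

def suggestions(log: List[LoggedAnswer]) -> List[Tuple[str, str]]:
    """Return (question, reason) pairs worth reviewing, in log order."""
    flagged = []
    for entry in log:
        if entry.feedback == "down":
            flagged.append((entry.question, "thumbs-down"))
        elif entry.confidence < LOW_CONFIDENCE:
            flagged.append((entry.question, "low confidence"))
    return flagged

log = [
    LoggedAnswer("Do you ship to Canada?", 0.92, "up"),
    LoggedAnswer("Can I return a gift?", 0.31),
    LoggedAnswer("What is the warranty?", 0.88, "down"),
]
to_review = suggestions(log)
```

Each flagged question points at either missing content (add a source) or a wrong answer (add or fix a Q&A pair).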
Review these alongside Activity & chat logs on a regular cadence and the agent gets measurably better every week.
Embedding models and the vector index
Every agent has one embedding model, and all of its chunks are embedded with that model. This matters because retrieval can only compare vectors produced by the same model — Bookbag pins the embedding model per agent so dimensions never mix. If you change an agent's embedding model, retrain its sources so the index is rebuilt consistently. For the trade-offs between embedding models, see Models & model choice.
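To illustrate why the embedding model is pinned per agent, here is a small sketch of an index that refuses vectors from a different model or with a different dimension, and empties itself when the model changes. The `AgentIndex` class and the model names `model-a` and `model-b` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Sequence

@dataclass
class AgentIndex:
    embedding_model: str
    dimensions: int
    rows: List[List[float]] = field(default_factory=list)

    def add(self, vector: Sequence[float], model: str) -> None:
        """Accept only vectors produced by the pinned model at the pinned dimension."""
        if model != self.embedding_model or len(vector) != self.dimensions:
            raise ValueError("vector does not match the agent's pinned embedding model")
        self.rows.append(list(vector))

    def change_model(self, model: str, dimensions: int) -> None:
        """Old vectors are not comparable with the new model, so the index is
        emptied here and every source has to be retrained to rebuild it."""
        self.embedding_model, self.dimensions = model, dimensions
        self.rows.clear()

agent_index = AgentIndex(embedding_model="model-a", dimensions=3)
agent_index.add([0.1, 0.2, 0.3], model="model-a")

try:
    agent_index.add([0.1, 0.2, 0.3, 0.4], model="model-b")  # wrong model and dimension
    rejected = False
except ValueError:
    rejected = True

agent_index.change_model("model-b", dimensions=4)
```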
What's next
- Chat with your agent and inspect the exact sources behind each answer.
- How retrieval, citations, and Q&A priority produce trustworthy answers.
- Pick the model and embedding model that fit your agent.
- How to structure knowledge for accurate, on-brand answers.