How many sources can I add?

It depends on your plan — each plan sets a maximum number of sources per agent. See Plans & billing for the limits. The Free plan is intended for trying things out; paid plans raise the cap substantially.

Does adding sources cost credits?

No. Credits are spent on AI replies, not on training. You can add and retrain as many sources as your plan allows without spending credits. See Credits & usage.

A page on my site isn't being picked up by the crawl. Why?

The crawler follows links and reads server-rendered text. Pages with no inbound links from the starting URL, or content that only renders via JavaScript, may not be captured. Add the page as its own website source, or paste its content as a Text source.

The agent gave a slightly wrong answer. Should I fix the prompt?

Usually not. Most wrong answers come from a missing or unclear source, not from the prompt. Add a Q&A pair with the exact correct answer: it pins the wording, and a closely matching question returns that answer without paraphrasing.

Can two agents share knowledge?

Not directly — sources are per-agent. If two agents need the same knowledge, add the source to each. This keeps each agent's retrieval clean and scoped to what it should answer.

Data sources

Everything your agent knows comes from data sources: website crawls, file uploads, pasted text, and Q&A pairs. Learn each type, how training (extract, chunk, embed, index) works, how Q&A priority works, and how to keep knowledge fresh with retraining.

A Bookbag agent only answers from what you teach it. Data sources are how you teach it. Each source you add is extracted, split into chunks, embedded into a vector index, and made retrievable — so when a customer asks a question, the agent pulls the most relevant pieces of your content and answers from those.

You manage sources from an agent's Data sources tab. This page covers the four source types, what happens when a source trains, how Q&A pairs take priority, and how to keep everything current.

Where data sources live

Sources are scoped to a single agent. Two agents in the same workspace each have their own knowledge — add a source to the agent that should answer from it. Manage them at app.bookbag.ai under your agent's Data sources tab.

The four source types

Bookbag supports four kinds of data source. Most stores use a mix: a website crawl for the bulk of content, files for catalogs and policy PDFs, and a handful of Q&A pairs to pin the answers that must be exact.

Type	What it ingests	Best for
Website	A bounded multi-page crawl that follows links from a starting URL and extracts the readable text of each page.	Help centers, FAQ pages, policy and shipping pages.
File	An uploaded document (PDF, doc, spreadsheet, text). Bookbag extracts and chunks the text.	Product catalogs, manuals, policy documents, exported macros.
Text	A snippet you paste directly into a text box.	A policy that isn't written down anywhere public yet.
Q&A	An exact question paired with the exact answer you want returned.	High-stakes answers: refund window, warranty terms, shipping cut-offs.

Website crawl

Enter a URL and Bookbag crawls outward from it, following links and extracting the readable content of each page it finds — stripping navigation, footers, and boilerplate. A single website source can produce many documents (one per page) and many chunks.
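
The bounded, link-following behavior described above can be pictured as a breadth-first traversal with a page cap. The sketch below is illustrative only: it walks an in-memory link map instead of fetching real pages, and `crawl`, `max_pages`, and the sample `SITE` map are hypothetical names, not Bookbag's API or its actual limits.

```python
from collections import deque

# Hypothetical site map: each URL lists the URLs it links to.
SITE = {
    "/help": ["/help/shipping", "/help/returns"],
    "/help/shipping": ["/help", "/help/returns"],
    "/help/returns": ["/help/warranty"],
    "/help/warranty": [],
    "/orphan": [],  # no inbound links, so a crawl from /help never reaches it
}

def crawl(start, links, max_pages=50):
    """Breadth-first crawl from `start`, visiting at most `max_pages` pages."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return visited

pages = crawl("/help", SITE)
```

In this toy map, `/orphan` is never visited because nothing links to it, which mirrors why an unlinked page needs to be added as its own source.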

Point the crawler at content, not your homepage

Your homepage is mostly marketing and navigation. Start the crawl at your help center or FAQ instead: those pages hold the support content customers actually ask about, so retrieval has more relevant passages to draw on.

File upload

Upload a document and Bookbag extracts the text, chunks it, and embeds it. Good for content that lives in PDFs or spreadsheets — product specs, return-policy documents, internal macros you want the agent to draw on.

Text snippet

Paste text straight in. This is the fastest way to capture a policy or process that lives in someone's head and isn't published anywhere the crawler can reach.

Q&A pairs

A Q&A pair is an exact question and the exact answer you want the agent to give. Unlike crawled or uploaded text — which the model paraphrases — Q&A pairs are treated as authoritative and take priority during retrieval.

Pin the answers that must be exact

Anything touching money, eligibility, or legal commitments belongs in a Q&A pair. A handful of well-chosen pairs (refund window, shipping times, warranty terms) eliminates the most damaging category of wrong answers.

How training works

When you add or retrain a source, Bookbag runs it through an ingestion pipeline. The same four steps run for every non-Q&A source:

1. Extract. Bookbag pulls the readable text out of the source: crawling pages for a website, parsing a file, or reading your pasted text.
2. Chunk. The text is split into smaller passages sized for retrieval. Tight, single-topic chunks retrieve better than one giant page.
3. Embed. Each chunk is turned into a vector with the agent's embedding model, capturing its meaning so similar questions match it.
4. Index. The vectors are stored in the agent's vector index, ready to be searched at query time.
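
As a rough mental model, the chunk, embed, and index steps can be sketched in a few lines. This is a minimal sketch under loud assumptions: the word-count "embedding" is a toy stand-in for a real embedding model, the 40-word chunk size is arbitrary, and none of the function names below are Bookbag's.

```python
import math
import re
from collections import Counter

def chunk(text, max_words=40):
    """Split text into passages of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text):
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(text, max_words=40):
    """Chunk and embed already-extracted text, storing (vector, passage) pairs."""
    return [(embed(c), c) for c in chunk(text, max_words)]

def search(index, query, k=1):
    """Return the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [passage for _, passage in ranked[:k]]

# Sample extracted text (made up for illustration).
extracted = "Refunds are accepted within thirty days. Shipping takes five business days."
index = build_index(extracted, max_words=6)
top = search(index, "How long does shipping take?")
```

Here the small chunk size splits the two sentences into separate passages, so the shipping question retrieves only the shipping passage rather than the whole text.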

Q&A pairs follow a shorter path: the question is embedded and stored alongside the approved answer, so a matching question returns that answer directly.

For the full picture of how indexed sources turn into trustworthy, cited answers at query time, see Response quality.

Source status

Each source shows a status as it moves through the pipeline:

Status	Meaning
Queued	The source is waiting to be processed.
Processing	Bookbag is extracting, chunking, and embedding it now.
Trained	Done — the agent can answer from this source.
Error	Ingestion failed. The source shows the reason (for example, no extractable text or no crawlable pages).

When a source reaches Trained, its content is live in the playground and on every connected channel immediately.

When a source errors

The most common causes are a page with no extractable text (an image-only PDF, or a JavaScript-rendered page the crawler can't read) and a starting URL with no crawlable links. Fix the source or paste the content as a Text source instead, then retrain.

Q&A priority and how retrieval chooses

At query time Bookbag first checks your Q&A pairs. If the customer's question closely matches a pair, that exact answer is returned and the agent skips paraphrasing entirely. Only when no Q&A pair is a strong match does Bookbag fall back to searching your chunked sources and grounding the model's reply in the top results.

This is why Q&A is the right tool for precision and crawls/files are the right tool for coverage. Use crawls and files to give the agent broad knowledge; use Q&A to lock down the specific answers you cannot afford to get wrong.
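
The two-stage lookup described above can be sketched as follows. Everything here is an illustrative assumption: the 0.8 threshold, the word-overlap similarity (a toy stand-in for embedding similarity), the sample data, and the `answer` function are hypothetical, and the fallback returns a single passage where the real system grounds the reply in several top results.

```python
import re

QA_THRESHOLD = 0.8  # hypothetical cut-off for a "strong match"

def tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def similarity(a, b):
    """Jaccard word overlap, a toy stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def answer(question, qa_pairs, chunks):
    """Return (kind, text): the pinned Q&A answer on a strong match, else the best chunk."""
    best_q = max(qa_pairs, key=lambda q: similarity(question, q), default=None)
    if best_q is not None and similarity(question, best_q) >= QA_THRESHOLD:
        return ("qa", qa_pairs[best_q])  # approved answer, returned as written
    best_chunk = max(chunks, key=lambda c: similarity(question, c), default=None)
    return ("chunks", best_chunk)        # would be handed to the model as grounding

# Sample data, made up for illustration.
qa_pairs = {"What is your refund window?": "Refunds are accepted within 30 days of delivery."}
chunks = ["Standard shipping takes 3 to 5 business days.", "We ship to the US and Canada."]

pinned = answer("what is your refund window", qa_pairs, chunks)
fallback = answer("Do you ship to Canada?", qa_pairs, chunks)
```

The first question matches the stored pair and gets the pinned wording; the second has no strong Q&A match and falls through to the chunked sources.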

How next-generation retrieval finds the right answer

When no Q&A pair matches, Bookbag searches your chunked sources with a retrieval pipeline built for large, messy knowledge bases. It does not just return the closest text match — it scores, dedupes, and reranks so the agent grounds its answer in the best, freshest, most specific passages.

Capability	What it does for your answers
Higher-accuracy semantic search	Blends meaning (embeddings), keywords, and exact-phrase matches, then reranks for diversity — so the agent draws on several relevant passages instead of three near-copies of one page.
Stale & conflicting-doc handling	Repeated content is de-duplicated, and when two sources disagree the most recently updated one wins — the agent is told which source is freshest and trusts it.
Structured-data retrieval	Specs, sizing charts, and policy tables are read as structured facts (e.g. size, price, return window), so attribute questions like “what’s the return window for sale items?” get the exact value.
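
To make the de-duplication and "most recently updated wins" rows concrete, here is a minimal sketch. It assumes facts have already been extracted into attribute/value records with an update date; the `Fact` record, the `resolve` function, and the sample values are hypothetical, and how Bookbag actually detects conflicts is not specified here.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Fact:
    source: str     # which data source the fact came from
    attribute: str  # e.g. "return_window"
    value: str
    updated: date   # when the source was last updated

def resolve(facts):
    """Drop exact repeats, then keep the most recently updated fact per attribute."""
    freshest = {}
    # Frozen dataclasses hash by value, so dict.fromkeys collapses repeats in order.
    for fact in dict.fromkeys(facts):
        current = freshest.get(fact.attribute)
        if current is None or fact.updated > current.updated:
            freshest[fact.attribute] = fact
    return freshest

# Sample facts, made up for illustration.
facts = [
    Fact("policy.pdf", "return_window", "14 days", date(2023, 1, 10)),
    Fact("help-center", "return_window", "30 days", date(2024, 6, 1)),
    Fact("help-center", "return_window", "30 days", date(2024, 6, 1)),  # repeated content
]
winner = resolve(facts)["return_window"]
```

With these sample records the help-center value is kept because its source was updated more recently than the PDF.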

It gets better every time you retrain

Freshness and structured-fact signals are computed when a source trains. Retrain (or re-crawl) a source after you update it so the agent picks up the new content and treats it as the most current.

Keeping knowledge fresh

Stale data is the single most common cause of wrong answers. When a policy, price, or shipping timeline changes, update the source.

Retraining a source

Use Retrain on a source to re-run ingestion. Retraining is idempotent: Bookbag clears the source's prior documents, chunks, and Q&A data first, then re-extracts and re-indexes from scratch — so a re-crawled page never leaves stale chunks behind.

1. Update the underlying content. Edit the page, re-export the file, or rewrite the Q&A answer.
2. Retrain the source. For a website source, Retrain re-crawls the site; for a file, re-upload it; for text or Q&A, edit and save.
3. Confirm it reaches Trained. The status returns to Trained and the new content is immediately live.
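
The clear-then-rebuild behavior is what makes retraining safe to repeat. A minimal sketch of the idea, using a hypothetical in-memory `AgentIndex` class (not Bookbag's implementation) and made-up sample text:

```python
class AgentIndex:
    """Toy in-memory index keyed by source id (hypothetical structure)."""

    def __init__(self):
        self.chunks_by_source = {}

    def retrain(self, source_id, text, max_words=40):
        # Clear everything previously indexed for this source first...
        self.chunks_by_source.pop(source_id, None)
        # ...then rebuild from the current content only.
        words = text.split()
        self.chunks_by_source[source_id] = [
            " ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)
        ]

    def all_chunks(self):
        return [c for chunks in self.chunks_by_source.values() for c in chunks]

index = AgentIndex()
index.retrain("returns-page", "Returns are accepted within 14 days.")
index.retrain("returns-page", "Returns are accepted within 30 days.")  # policy changed
index.retrain("returns-page", "Returns are accepted within 30 days.")  # repeat is harmless
```

Because each retrain replaces the source's chunks rather than appending to them, the outdated 14-day passage is gone and running the same retrain twice leaves the index unchanged.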

Scheduled retraining

On Standard and higher plans, website sources can retrain on a schedule so a changing help center stays current without manual re-crawls. See Plans & billing for which plans include scheduled retraining.

Turning real conversations into better data

Your customers tell you where your knowledge has gaps. Bookbag surfaces this in two ways:

Suggestions on the Data sources tab flag low-confidence answers (a likely missing-content gap) and thumbs-down answers, along with the original question, so you know exactly which source or Q&A pair to add.
Improve answer lets you edit a reply and save it as a high-priority Q&A pair, so the corrected answer is retrieved first next time.

Review these alongside Activity & chat logs on a regular cadence, such as weekly, so each gap is closed soon after it first appears.

Embedding models and the vector index

Every agent has one embedding model, and all of its chunks are embedded with that model. This matters because retrieval can only compare vectors produced by the same model — Bookbag pins the embedding model per agent so dimensions never mix. If you change an agent's embedding model, retrain its sources so the index is rebuilt consistently. For the trade-offs between embedding models, see Models & model choice.
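
Why mixed models break retrieval is easiest to see as a guard on the index. The sketch below is an assumption-laden illustration: `VectorIndex`, the model names `model-a` and `model-b`, and the tiny dimensions are all hypothetical, not Bookbag identifiers.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    model: str  # embedding model pinned to this agent (hypothetical name)
    dim: int    # vector length that model produces
    vectors: list = field(default_factory=list)

    def add(self, vector, model):
        """Accept a vector only if it came from the pinned model and has the right size."""
        if model != self.model:
            raise ValueError(f"index is pinned to {self.model!r}, got vector from {model!r}")
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.vectors.append(vector)

    def change_model(self, model, dim):
        """Old vectors are not comparable with the new model, so start from an empty index."""
        self.model, self.dim, self.vectors = model, dim, []

index = VectorIndex(model="model-a", dim=3)
index.add([0.1, 0.2, 0.3], model="model-a")
```

Calling `change_model` empties the index in this sketch, which corresponds to the need to retrain every source after switching an agent's embedding model.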

Common questions

What's next

Test in the playground

Chat with your agent and inspect the exact sources behind each answer.

Response quality

How retrieval, citations, and Q&A priority produce trustworthy answers.

Models & model choice

Pick the model and embedding model that fit your agent.

Best practices

How to structure knowledge for accurate, on-brand answers.
