Website crawl — feed your existing site to the bot

The crawl is the fastest way to get Wilow answering useful questions: point it at your existing website, let it pull the pages you choose, and the content lands in your knowledge base as searchable snippets. For most accounts this is the main setup step: you go from "empty bot that says I don't know" to "answers grounded in your real content" in about twenty minutes.

Use this alongside, not instead of, snippets and documents. The crawl covers the public shape of your business (product pages, FAQs, about); snippets and documents cover the things that aren't on your site (internal docs, PDFs, fast fixes).

How the wizard walks you through it

The crawl page is a step-by-step wizard. You'll see one panel at a time; Next advances and Back returns. The five steps:

Confirm the URL. We pre-fill your account domain; change it if you want a different starting point. Subdomains are crawled too, as long as they're under the same registered domain.
Accept the crawl terms. First time only: you confirm you have permission to crawl this site. We respect robots.txt by default and tell you if it disallows us; you can override per crawl with "I authorise this crawl anyway", which you should use sparingly.
Pick which pages to crawl. We discover the site's URL tree and show you a checkbox list. Default selection skips boilerplate (privacy policy, terms of service, sitemap pages) and pre-checks what looks like content. Adjust as needed.
Watch progress. Live polling shows pages-done/total and any errors. Most crawls finish in a few minutes; large sites take longer. You can leave the page and come back; the job continues server-side.
Review extracted snippets. Beyond capturing the raw page text, Wilow can optionally extract structured snippets (your FAQ Q&As, your product blurbs, your contact details) using an LLM pass over that text. Candidates are pre-checked when they don't duplicate anything already in your knowledge base. Uncheck anything you don't want; Accept selected copies the checked candidates in.

The extracted snippets are deduplicated against your existing knowledge — if a candidate is too similar to a snippet you've already got, the wizard surfaces it as a merge proposal instead of double-storing.
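
To make the idea concrete, here is a minimal sketch of similarity-based triage. It is an illustration only: the similarity measure and cutoff Wilow actually uses aren't described here, so the sketch swaps in Python's difflib string ratio, and the names SIMILARITY_THRESHOLD, normalise, and triage are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff, chosen for illustration; not Wilow's real threshold.
SIMILARITY_THRESHOLD = 0.85


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def triage(
    candidates: list[str], existing: list[str]
) -> tuple[list[str], list[tuple[str, str]]]:
    """Split candidates into new snippets and (candidate, existing) merge proposals."""
    new_snippets: list[str] = []
    merge_proposals: list[tuple[str, str]] = []
    for candidate in candidates:
        best_match, best_score = None, 0.0
        for snippet in existing:
            score = SequenceMatcher(
                None, normalise(candidate), normalise(snippet)
            ).ratio()
            if score > best_score:
                best_match, best_score = snippet, score
        if best_match is not None and best_score >= SIMILARITY_THRESHOLD:
            # Too close to something already stored: propose a merge instead.
            merge_proposals.append((candidate, best_match))
        else:
            new_snippets.append(candidate)
    return new_snippets, merge_proposals


existing = ["We ship within 2 business days."]
candidates = [
    "We ship within 2 business days!",
    "Our office is open Monday to Friday.",
]
new_snippets, merge_proposals = triage(candidates, existing)
```

In this sketch the near-identical shipping sentence ends up as a merge proposal paired with the existing snippet, while the office-hours sentence is treated as new.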

Robots.txt and politeness

We respect robots.txt. If your site's robots.txt disallows us, the wizard tells you up front and offers an "I authorise this crawl anyway" override; use it only for your own sites where the robots.txt block was unintended. The crawl is rate-limited (around one request per second) and identifies itself as WilowBot in the User-Agent, so it is easy to pick out in your access logs.
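
If you want to check ahead of time whether your robots.txt would block the crawl, a rough local test looks like the sketch below. It uses Python's standard urllib.robotparser as a stand-in; Wilow's own parser may differ in edge cases. The helper is_allowed, the two sample files, and example.com are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "WilowBot"  # the name the crawl reports in its User-Agent


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A robots.txt that blocks every crawler, WilowBot included.
BLOCKS_EVERYONE = """\
User-agent: *
Disallow: /
"""

# The same file with an explicit allowance for WilowBot placed first.
ALLOWS_WILOWBOT = """\
User-agent: WilowBot
Allow: /

User-agent: *
Disallow: /
"""

blocked = is_allowed(BLOCKS_EVERYONE, "https://example.com/pricing")
allowed = is_allowed(ALLOWS_WILOWBOT, "https://example.com/pricing")
```

The second sample shows one way to let WilowBot in while keeping other crawlers out: give it its own User-agent group with an Allow rule.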

Re-crawling later

Content drifts. Re-run the crawl periodically to catch new pages and update changed ones. The wizard treats it as a fresh job and shows new candidates as additions, with changed pages flagged so you can see the diff before accepting. Existing snippets aren't blown away — the post-crawl extract step funnels everything through merge proposals for review.

Scheduled auto-sync. On the Website crawl page, pick a cadence (daily, weekly, or monthly) under Scheduled auto-sync and we re-run the crawl on that schedule. Products and contacts you've edited are never overwritten: auto-sync only adds new candidates and refreshes source text you haven't edited. The "Last sync" timestamp on the same panel tells you when it last ran.

Costs and limits

The crawl itself counts against a per-day page limit set on your plan. Most accounts won't hit it on the initial crawl; only accounts with large sites and frequent re-crawls do. Usage is visible on the usage page.

The extract step uses an LLM — that's a real token spend. You can skip extraction entirely (just take the raw page text) if you want to keep costs down; in that case the bot is grounded on the full page text rather than digested Q&A snippets.

Common questions

How do I crawl my site? Knowledge → Website crawl → Start crawl. Walk the wizard.
It says robots.txt disallows. Your robots.txt explicitly blocks crawlers. Either edit your robots.txt to allow WilowBot, or use the "I authorise this crawl anyway" override (your own site only).
The crawl is stuck on "discovering pages". Sites with huge URL trees or aggressive bot-blocking CDNs (for example Cloudflare's "I'm Under Attack" mode) can stall discovery. Lower the depth in the URL step or allowlist WilowBot at your CDN.
I see duplicates in extracted snippets. Anything similar to an existing snippet flows to merge proposals for review rather than being stored twice. Check the merge proposals page.
Can I exclude pages? Two ways. Per-job: in the page-selection step, uncheck them. Workspace-wide: on the Website crawl page, Exclude patterns takes a list of glob patterns (e.g. /blog/*, *?utm_*); matching URLs are skipped on every crawl and every auto-sync.
Does the crawl pull products into my catalog? Yes, if you have the product-catalog feature on: extraction produces product candidates that surface on the catalog page for review.
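
For the Exclude patterns question above, it can help to test your globs before saving them. The sketch below is an approximation under two assumptions: that patterns are matched against the URL's path plus query string, and that they follow the same rules as Python's fnmatch. Wilow's matcher may behave differently, and is_excluded, path_and_query, and example.com are hypothetical names.

```python
from collections.abc import Sequence
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# The two example patterns from the text above.
EXCLUDE_PATTERNS = ("/blog/*", "*?utm_*")


def path_and_query(url: str) -> str:
    """Reduce a URL to the part the patterns are assumed to match against."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


def is_excluded(url: str, patterns: Sequence[str] = EXCLUDE_PATTERNS) -> bool:
    """Return True if any glob pattern matches the URL's path and query."""
    target = path_and_query(url)
    return any(fnmatchcase(target, pattern) for pattern in patterns)


skipped_blog = is_excluded("https://example.com/blog/launch")
skipped_tracking = is_excluded("https://example.com/pricing?utm_source=mail")
kept = is_excluded("https://example.com/pricing")
```

Two things to watch for under fnmatch rules: /blog/* does not match the bare /blog index page, and ? is itself a single-character wildcard, so *?utm_* matches any character followed by utm_, not only a literal question mark.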

See also knowledge base for snippet management, documents for PDFs/DOCX, catalog for products extracted from the crawl, and merge proposals for dedup review.

Where to find us

Stuck? Email [email protected]. Include the URL you're trying to crawl and any error you saw. It's usually a robots.txt or CDN issue, and we can spot it at first glance.