webcrawler-deep-crawl

Deep-crawls any website from one or more start URLs and returns, for each page, LLM-ready text, Markdown, or HTML, plus metadata (title, description, author, language, canonical URL, Open Graph tags) and the in-scope outbound links.

Use when the user mentions any of: deep crawl website, recursive crawl, crawl a whole site, scrape entire website, scrape docs site, scrape documentation, scrape knowledge base, scrape blog, build RAG corpus, build vector database from website, knowledge base for chatbot, GPT knowledge files, llms.txt, sitemap crawl, BFS crawl, scrape with depth or page limit, include/exclude URL globs, remove boilerplate, strip navigation/header/footer, website to markdown, website to text, multi-page extraction, bulk page scraping, clean markdown from URL, docs site to markdown corpus, site to clean corpus.

Also applies to building RAG pipelines, indexing a customer site, syncing docs into a vector store, generating training corpora from a docs hub, or expanding a single start URL into a clean corpus of every reachable in-scope page.
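The crawl behaviour described above (breadth-first expansion from start URLs, bounded by a depth limit and a page limit, and filtered by include/exclude URL globs) can be illustrated with a minimal sketch. This is not the skill's actual implementation: the names `bfs_crawl`, `in_scope`, `fetch_links`, `max_depth`, and `max_pages` are hypothetical, and the example assumes shell-style globs (as matched by Python's `fnmatch`) with excludes taking precedence over includes.

```python
from collections import deque
from fnmatch import fnmatchcase
from typing import Callable, Iterable, List, Sequence
from urllib.parse import urldefrag, urljoin


def in_scope(url: str, include: Sequence[str], exclude: Sequence[str]) -> bool:
    """Assumed scoping rule: excludes win; an empty include list allows everything."""
    if any(fnmatchcase(url, pattern) for pattern in exclude):
        return False
    return not include or any(fnmatchcase(url, pattern) for pattern in include)


def bfs_crawl(
    start_urls: Iterable[str],
    fetch_links: Callable[[str], Iterable[str]],  # hypothetical: returns hrefs found on a page
    include: Sequence[str] = (),
    exclude: Sequence[str] = (),
    max_depth: int = 2,
    max_pages: int = 100,
) -> List[str]:
    """Return the URLs visited, in breadth-first order."""
    starts = list(dict.fromkeys(start_urls))  # de-duplicate, keep order
    queue = deque((url, 0) for url in starts)
    seen = set(starts)
    visited: List[str] = []

    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # page is kept, but its links are not followed
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments before de-duplicating.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen and in_scope(link, include, exclude):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Passing `fetch_links` in as a parameter keeps the traversal logic independent of any HTTP client or HTML parser, so the same loop can be exercised against an in-memory link map before being pointed at a real site.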

Details

Path: solutions/search-research/webcrawler-deep-crawl
Bundled scripts: 4
Dependencies: 1

Bundled scripts

  • solutions/search-research/webcrawler-deep-crawl/scripts/discover-sitemap.py
  • solutions/search-research/webcrawler-deep-crawl/scripts/discover-llms-txt.py
  • solutions/search-research/webcrawler-deep-crawl/scripts/extract-page-content.py
  • solutions/search-research/webcrawler-deep-crawl/scripts/discover-links.py

FAQ