
About
Fast, local-first web content extraction optimized for LLMs. webclaw extracts clean markdown from any URL with 67% fewer tokens than raw HTML, using Chrome-level TLS fingerprinting — no headless browser required. It provides 10 MCP tools for AI agents: scrape, crawl, map, batch, extract, summarize, diff, brand, search, and research.
Features
- LLM-optimized output — 67% fewer tokens than raw HTML while preserving metadata, links, and images
- Millisecond-scale extraction — 3.2ms average for 100KB pages
- TLS fingerprinting — Chrome-level fingerprinting bypasses basic bot detection without browsers
- No browser overhead — Pure Rust extraction, no Selenium/Puppeteer
- 8 local tools — All tools except search and research run locally without an API key
- Multiple output formats — Markdown, text, JSON, LLM-optimized, HTML
- Content control — CSS selectors for include/exclude, auto main-content detection
- Recursive crawling — BFS same-origin crawl with sitemap support
- LLM features — Summarization and structured extraction via Ollama/OpenAI/Anthropic
- Change tracking — Diff snapshots to detect content changes
- Brand extraction — Colors, fonts, logos from any website
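To illustrate what "BFS same-origin crawl" means in the feature list above, here is a minimal Python sketch. It is not webclaw's implementation (which is written in Rust). The `get_links` callable, the toy `SITE` map, and the `max_depth`/`max_pages` parameter names are assumptions made for illustration; the default values simply mirror the `--depth 2 --max-pages 50` usage example.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def same_origin(a, b):
    """True when both URLs share scheme and network location (host[:port])."""
    pa, pb = urlparse(a), urlparse(b)
    return (pa.scheme, pa.netloc) == (pb.scheme, pb.netloc)


def bfs_crawl(start, get_links, max_depth=2, max_pages=50):
    """Visit pages breadth-first, staying on the start URL's origin.

    get_links is a hypothetical callable: url -> iterable of hrefs.
    """
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not expand links beyond the depth limit
        for href in get_links(url):
            link = urljoin(url, href)  # resolve relative links
            if same_origin(start, link) and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Toy in-memory site standing in for real HTTP fetches.
SITE = {
    "https://example.com/": ["/a", "/b", "https://other.example/x"],
    "https://example.com/a": ["/c"],
    "https://example.com/b": ["/a"],
    "https://example.com/c": ["/d"],
}

pages = bfs_crawl("https://example.com/", lambda u: SITE.get(u, []))
print(pages)
```

The queue guarantees that every page at depth 1 is visited before any page at depth 2, and the `seen` set prevents revisiting pages that are linked from several places. The off-origin link and the depth-3 page `/d` are never visited.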
Tools
- scrape — Extract clean content from any URL
- crawl — Recursive site crawl with depth/page limits
- map — Discover URLs from sitemaps
- batch — Parallel multi-URL extraction
- extract — LLM-powered structured data extraction
- summarize — Page summarization
- diff — Content change detection
- brand — Brand identity extraction (colors, fonts, logos)
- search — Web search + scrape results (requires API key)
- research — Deep multi-source research (requires API key)
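As a rough illustration of the idea behind the diff tool (comparing two snapshots of extracted content), the following Python sketch uses the standard library's `difflib`. It is an assumption about the general technique, not a description of how webclaw computes or formats its diffs; the `diff_snapshots` function and the sample snapshots are hypothetical.

```python
import difflib


def diff_snapshots(old, new):
    """Return unified-diff lines between two markdown snapshots."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


# Hypothetical snapshots of the same page taken at two points in time.
old_snapshot = "# Pricing\nBasic: $10\nPro: $30"
new_snapshot = "# Pricing\nBasic: $12\nPro: $30"

diff_lines = diff_snapshots(old_snapshot, new_snapshot)
for line in diff_lines:
    print(line)
```

An empty result means the two snapshots are identical, so a caller can treat a non-empty list as "content changed".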
Installation
Quick setup (recommended)
npx create-webclaw
Auto-detects your AI tools and configures everything.
Homebrew
brew tap 0xMassi/webclaw
brew install webclaw
Docker
docker run --rm ghcr.io/0xmassi/webclaw https://example.com
Prebuilt binaries
Download from GitHub Releases for macOS and Linux.
Usage Examples
Basic extraction:
webclaw https://stripe.com -f llm
Crawl documentation:
webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
Extract brand identity:
webclaw https://github.com --brand
Content filtering:
webclaw URL --include "article, .content" --exclude "nav, footer"
LLM features:
webclaw URL --summarize
webclaw URL --extract-prompt "Get all prices"
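When calling the CLI from a script, it can help to build the argument list programmatically. The sketch below is a hypothetical Python helper (`webclaw_cmd` is not part of webclaw) that uses only flags shown in the examples above: `-f`, `--include`, and `--exclude`. That these flags can be combined in a single invocation is an assumption; the examples above show them separately.

```python
import shlex


def webclaw_cmd(url, fmt=None, include=None, exclude=None):
    """Build an argv list for the webclaw CLI from the flags shown above.

    Assumption: -f, --include and --exclude may be combined in one call.
    """
    argv = ["webclaw", url]
    if fmt:
        argv += ["-f", fmt]
    if include:
        # Selectors are passed as one comma-separated string, as in the example.
        argv += ["--include", ", ".join(include)]
    if exclude:
        argv += ["--exclude", ", ".join(exclude)]
    return argv


cmd = webclaw_cmd(
    "https://stripe.com",
    fmt="llm",
    include=["article", ".content"],
    exclude=["nav", "footer"],
)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
```

With webclaw installed, the list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell-quoting problems with selectors that contain spaces.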
Optional Cloud Features
The cloud API at webclaw.io provides:
- Antibot bypass (Cloudflare, DataDome, WAF)
- JavaScript rendering for SPAs
- Web search and deep research tools
- Async crawl jobs
Local tools work without any API key. The cloud API is used as a fallback for bot-protected sites.
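The local-first, cloud-as-fallback behavior can be pictured with the Python sketch below. Everything in it is hypothetical: `BlockedError`, `extract_with_fallback`, and the two extractor callables are stand-ins invented for illustration, and how webclaw actually detects a bot-protected site is not stated here.

```python
class BlockedError(Exception):
    """Hypothetical signal that a local fetch hit bot protection."""


def extract_with_fallback(url, local_extract, cloud_extract=None):
    """Try the local extractor first; use the cloud one only if blocked.

    Returns a (content, source) tuple where source is "local" or "cloud".
    """
    try:
        return local_extract(url), "local"
    except BlockedError:
        if cloud_extract is None:
            raise  # no cloud configured, so surface the failure
        return cloud_extract(url), "cloud"


# Stub extractors standing in for real network calls.
def local_ok(url):
    return "# local content"


def local_blocked(url):
    raise BlockedError(url)


def cloud_stub(url):
    return "# cloud content"


print(extract_with_fallback("https://example.com", local_ok, cloud_stub))
print(extract_with_fallback("https://example.com", local_blocked, cloud_stub))
```

The point of the pattern is that the cloud path is never touched when the local path succeeds, which matches the statement that local tools need no API key.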
This server runs through your single 1Server connection. No extra config required.