webclaw

0xmassi

About

Fast, local-first web content extraction optimized for LLMs. webclaw extracts clean markdown from any URL with 67% fewer tokens than raw HTML, using Chrome-level TLS fingerprinting — no headless browser required. It provides 10 MCP tools for AI agents: scrape, crawl, map, batch, extract, summarize, diff, brand, search, and research.
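
As a rough illustration of what the 67% figure means for a context budget, the sketch below applies that reduction to a page's token count. The 12,000-token input and the function name are assumptions made for illustration, not measurements from webclaw.

```python
def tokens_after_extraction(raw_html_tokens: int, reduction: float = 0.67) -> int:
    """Estimate the token count left after extraction.

    `reduction` is the fraction of tokens removed (0.67 is the figure
    quoted above); the result is rounded to the nearest whole token.
    """
    if not 0.0 <= reduction <= 1.0:
        raise ValueError("reduction must be between 0 and 1")
    return round(raw_html_tokens * (1.0 - reduction))


raw = 12_000  # hypothetical token count for a page's raw HTML
clean = tokens_after_extraction(raw)
print(f"{raw} raw tokens -> about {clean} tokens after extraction")
```

Actual savings depend on the page; markup-heavy pages will differ from text-heavy ones.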

Features

  • LLM-optimized output — 67% fewer tokens than raw HTML while preserving metadata, links, and images
  • Millisecond-scale extraction — 3.2ms average for 100KB pages
  • TLS fingerprinting — Chrome-level fingerprinting bypasses basic bot detection without browsers
  • No browser overhead — Pure Rust extraction, no Selenium/Puppeteer
  • 8 local tools — All tools except search and research run locally without an API key
  • Multiple output formats — Markdown, text, JSON, LLM-optimized, HTML
  • Content control — CSS selectors for include/exclude, auto main-content detection
  • Recursive crawling — BFS same-origin crawl with sitemap support
  • LLM features — Summarization and structured extraction via Ollama/OpenAI/Anthropic
  • Change tracking — Diff snapshots to detect content changes
  • Brand extraction — Colors, fonts, logos from any website
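
The change-tracking idea can be sketched with Python's standard `difflib`. This is only an illustration of comparing two extracted snapshots: the listing does not say how webclaw computes its diffs, and the snapshot contents and the `diff_snapshots` helper below are made up.

```python
import difflib


def diff_snapshots(old: str, new: str) -> list[str]:
    """Return unified-diff lines between two extracted-markdown snapshots."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="snapshot-old",
            tofile="snapshot-new",
            lineterm="",
        )
    )


# Hypothetical snapshots of the same page taken at two different times.
old_snapshot = "# Pricing\n\nStarter: $10/month\nPro: $30/month\n"
new_snapshot = "# Pricing\n\nStarter: $12/month\nPro: $30/month\n"

changes = diff_snapshots(old_snapshot, new_snapshot)
print("\n".join(changes) if changes else "no changes")
```

Diffing the extracted markdown rather than raw HTML keeps markup-only churn (class names, inline scripts) out of the result.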

Tools

  • scrape — Extract clean content from any URL
  • crawl — Recursive site crawl with depth/page limits
  • map — Discover URLs from sitemaps
  • batch — Parallel multi-URL extraction
  • extract — LLM-powered structured data extraction
  • summarize — Page summarization
  • diff — Content change detection
  • brand — Brand identity extraction (colors, fonts, logos)
  • search — Web search + scrape results (requires API key)
  • research — Deep multi-source research (requires API key)
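
For readers unfamiliar with how an MCP client invokes a tool, the sketch below builds a JSON-RPC 2.0 `tools/call` request, the message shape defined by the Model Context Protocol. The tool name `scrape` comes from the list above; the `url` argument key is an assumption, since the listing does not document each tool's input schema.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(request)


# "url" is an assumed argument name, used here for illustration only.
message = build_tool_call(1, "scrape", {"url": "https://example.com"})
print(message)
```

In practice an MCP client library sends this message for you; a client can ask the server for each tool's real input schema with a `tools/list` request.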

Installation

npx create-webclaw

Auto-detects your AI tools and configures everything.

Homebrew

brew tap 0xMassi/webclaw
brew install webclaw

Docker

docker run --rm ghcr.io/0xmassi/webclaw https://example.com

Prebuilt binaries

Download from GitHub Releases for macOS and Linux.

Usage Examples

Basic extraction:

webclaw https://stripe.com -f llm

Crawl documentation:

webclaw https://docs.rust-lang.org --crawl --depth 2 --max-pages 50
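
A breadth-first, same-origin crawl with depth and page limits can be sketched as follows. This is not webclaw's implementation: the in-memory `SITE` link graph stands in for real fetching, the helper names are hypothetical, and the exact meaning webclaw gives to depth (whether the start page counts as depth 0) is an assumption.

```python
from collections import deque
from urllib.parse import urlsplit


def same_origin(a: str, b: str) -> bool:
    """True when both URLs share scheme and host (including port)."""
    pa, pb = urlsplit(a), urlsplit(b)
    return (pa.scheme, pa.netloc) == (pb.scheme, pb.netloc)


def bfs_crawl(start, get_links, max_depth=2, max_pages=50):
    """Visit pages breadth-first, staying on the start URL's origin."""
    order = []                    # pages in the order they were visited
    seen = {start}                # URLs already queued, to avoid repeats
    queue = deque([(start, 0)])   # (url, depth) pairs
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue              # do not follow links past the depth limit
        for link in get_links(url):
            if link not in seen and same_origin(start, link):
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical link graph used instead of real HTTP requests.
SITE = {
    "https://example.com/docs/": [
        "https://example.com/docs/a",
        "https://example.com/docs/b",
        "https://example.org/elsewhere",
    ],
    "https://example.com/docs/a": ["https://example.com/docs/a/1"],
    "https://example.com/docs/b": ["https://example.com/docs/"],
    "https://example.com/docs/a/1": ["https://example.com/docs/a/1/deep"],
}

pages = bfs_crawl("https://example.com/docs/", lambda u: SITE.get(u, []))
print(pages)
```

With the default limits the cross-origin link and the page three levels down are both skipped, which is the effect the depth and page caps are meant to have.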

Extract brand identity:

webclaw https://github.com --brand

Content filtering:

webclaw URL --include "article, .content" --exclude "nav, footer"
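
To make the include/exclude semantics concrete, here is a minimal sketch of selector-based filtering using Python's standard `html.parser`. It is an assumption-laden illustration, not webclaw's extractor: it supports only bare tag names and `.class` selectors, expects well-formed HTML where every non-void element is closed, and lets exclude win over include when both match.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def parse_selectors(spec: str) -> list[str]:
    """Split 'article, .content' into simple tag or .class selectors."""
    return [s.strip() for s in spec.split(",") if s.strip()]


def matches(tag: str, classes: list[str], selectors: list[str]) -> bool:
    """True if the element matches any tag-name or .class selector."""
    for sel in selectors:
        if sel.startswith("."):
            if sel[1:] in classes:
                return True
        elif sel == tag:
            return True
    return False


class SelectorFilter(HTMLParser):
    def __init__(self, include: str, exclude: str):
        super().__init__()
        self.include = parse_selectors(include)
        self.exclude = parse_selectors(exclude)
        self.stack = []   # one (included, excluded) pair per open element
        self.chunks = []  # text kept so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag, so skip the stack
        classes = (dict(attrs).get("class") or "").split()
        parent_inc, parent_exc = self.stack[-1] if self.stack else (False, False)
        inc = parent_inc or matches(tag, classes, self.include)
        exc = parent_exc or matches(tag, classes, self.exclude)
        self.stack.append((inc, exc))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            inc, exc = self.stack[-1]
            if inc and not exc and data.strip():
                self.chunks.append(data.strip())


def extract_text(html: str, include: str, exclude: str) -> list[str]:
    """Return text found inside included elements but outside excluded ones."""
    parser = SelectorFilter(include, exclude)
    parser.feed(html)
    parser.close()
    return parser.chunks


page = ("<html><body><nav>Home | Docs</nav>"
        "<article><h1>Title</h1><p>Body text.<br>More.</p><footer>Share</footer></article>"
        "<footer>Copyright</footer></body></html>")
print(extract_text(page, "article, .content", "nav, footer"))
```

A real extractor would use a full CSS selector engine and a tolerant HTML parser; the point here is only how the include and exclude sets interact.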

LLM features:

webclaw URL --summarize
webclaw URL --extract-prompt "Get all prices"
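
When calling the CLI from a script, it can help to assemble the argument list in one place. The sketch below uses only flags that appear in the examples above; the helper names are hypothetical, and whether these flags can all be combined in a single invocation is an assumption.

```python
import shutil
import subprocess
from typing import Optional


def build_webclaw_args(
    url: str,
    *,
    fmt: Optional[str] = None,
    include: Optional[str] = None,
    exclude: Optional[str] = None,
    summarize: bool = False,
    extract_prompt: Optional[str] = None,
) -> list[str]:
    """Build an argv list using only flags shown in the usage examples."""
    args = ["webclaw", url]
    if fmt is not None:
        args += ["-f", fmt]
    if include is not None:
        args += ["--include", include]
    if exclude is not None:
        args += ["--exclude", exclude]
    if summarize:
        args.append("--summarize")
    if extract_prompt is not None:
        args += ["--extract-prompt", extract_prompt]
    return args


def run_webclaw(args: list[str]) -> str:
    """Run the command and return its stdout; requires webclaw on PATH."""
    if shutil.which(args[0]) is None:
        raise FileNotFoundError("webclaw executable not found on PATH")
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# Building the argv does not execute anything, so this runs without webclaw.
print(build_webclaw_args("https://example.com", extract_prompt="Get all prices"))
```

Passing a list to `subprocess.run` avoids shell quoting problems with selectors and prompts that contain spaces or commas.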

Optional Cloud Features

The cloud API at webclaw.io provides:

  • Antibot bypass (Cloudflare, DataDome, WAF)
  • JavaScript rendering for SPAs
  • Web search and deep research tools
  • Async crawl jobs

Local tools work without any API key. The cloud API is used as a fallback for bot-protected sites.
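
The local-first, cloud-fallback behaviour can be pictured with the sketch below. Everything in it is hypothetical: `BlockedError`, the `local` and `cloud` callables, and the way an API key is passed are illustrative stand-ins, since the listing does not describe how webclaw detects a blocked request or calls the cloud API.

```python
from typing import Callable, Optional


class BlockedError(Exception):
    """Hypothetical signal that a local fetch hit bot protection."""


def extract_with_fallback(
    url: str,
    local: Callable[[str], str],
    cloud: Callable[[str, str], str],
    api_key: Optional[str] = None,
) -> str:
    """Try local extraction first; use the cloud only if blocked and a key is set."""
    try:
        return local(url)
    except BlockedError:
        if api_key is None:
            raise  # no key configured, so surface the original failure
        return cloud(url, api_key)


# Stub extractors standing in for real local and cloud calls.
def fake_local(url: str) -> str:
    if "protected" in url:
        raise BlockedError(url)
    return f"local:{url}"


def fake_cloud(url: str, api_key: str) -> str:
    return f"cloud:{url}"


print(extract_with_fallback("https://example.com", fake_local, fake_cloud))
print(extract_with_fallback("https://protected.example.com", fake_local, fake_cloud, api_key="demo-key"))
```

Keeping the cloud call behind an explicit key check means that pages which extract locally never leave the machine.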

This server runs through your single 1Server connection. No extra config required.


Categories

Web, AI Tools, Data
