
PDF Reader MCP

sylphxai

About

Production-ready PDF processing server for AI agents with enterprise-grade capabilities. Extract text, images, and metadata; parallel processing runs 5-10x faster than sequential processing.

Features

Core Capabilities

  • Text Extraction - Full document or specific pages with intelligent parsing
  • Image Extraction - Base64-encoded images with complete metadata (width, height, format)
  • Content Ordering - Y-coordinate-based layout preservation for natural reading flow
  • Metadata Extraction - Author, title, creation date, and custom properties
  • Page Counting - Fast enumeration without loading full content
  • Dual Sources - Local files (absolute or relative paths) and HTTP/HTTPS URLs
  • Batch Processing - Multiple PDFs processed concurrently

Performance

  • 5-10x faster than sequential processing with automatic parallelization
  • 12,933 ops/sec error handling, 5,575 ops/sec text extraction
  • Process 50-page PDFs in seconds with multi-core utilization
  • Lightweight with minimal dependencies

Advanced Features

  • Smart Pagination - Extract ranges like "1-5,10-15,20"
  • Multi-Format Images - RGB, RGBA, Grayscale with automatic detection
  • Path Flexibility - Windows, Unix, and relative paths all supported
  • Error Resilience - Per-page error isolation with detailed messages
  • Large File Support - Efficient streaming and memory management
  • Type Safe - Full TypeScript with strict mode enabled

Usage

Basic Text Extraction

{
  "sources": [{
    "path": "documents/report.pdf"
  }],
  "include_full_text": true,
  "include_metadata": true,
  "include_page_count": true
}

Extract Specific Pages

{
  "sources": [{
    "path": "documents/manual.pdf",
    "pages": "1-5,10,15-20"
  }],
  "include_full_text": true
}
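
To make the page-range syntax above concrete, here is a minimal Python sketch of how a string such as "1-5,10,15-20" could be expanded into individual page numbers. The function name `parse_page_ranges` and its validation rules are assumptions for illustration; the server's own parser may behave differently (for example in how it treats duplicates or out-of-range pages).

```python
def parse_page_ranges(spec: str) -> list[int]:
    """Expand a spec like "1-5,10,15-20" into sorted, unique 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1-3,,7"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_page_ranges("1-5,10,15-20"))
# [1, 2, 3, 4, 5, 10, 15, 16, 17, 18, 19, 20]
```

Collecting pages in a set means overlapping ranges such as "1-5,3-8" yield each page once.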

Absolute Paths

{
  "sources": [{
    "path": "C:\\Users\\John\\Documents\\report.pdf"
  }],
  "include_full_text": true
}

Extract Images with Natural Ordering

{
  "sources": [{
    "path": "presentation.pdf",
    "pages": [1, 2, 3]
  }],
  "include_images": true,
  "include_full_text": true
}

Images and text are returned in document order based on Y-coordinates, preserving natural reading flow.
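
The ordering idea can be sketched in a few lines of Python. This is an illustration only, not the server's implementation: the `ContentItem` type and its fields are hypothetical, and the sketch assumes Y values follow the default PDF coordinate system, where the origin is at the bottom-left of the page and larger Y means higher on the page.

```python
from dataclasses import dataclass


@dataclass
class ContentItem:
    kind: str     # "text" or "image" (hypothetical labels)
    y: float      # vertical position; larger means higher on the page (assumed)
    content: str  # extracted text, or a placeholder for base64 image data


def order_for_reading(items: list[ContentItem]) -> list[ContentItem]:
    """Return items top-to-bottom by sorting on descending Y."""
    # sorted() is stable, so items sharing the same Y keep their input order.
    return sorted(items, key=lambda item: -item.y)


page = [
    ContentItem("text", 100.0, "Footer note"),
    ContentItem("image", 400.0, "<base64 image>"),
    ContentItem("text", 700.0, "Slide title"),
]
for item in order_for_reading(page):
    print(item.kind, item.content)
```

A real layout pass would also need a rule for items on the same line (for example, a secondary sort on X), which this sketch leaves out.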

Batch Processing

{
  "sources": [
    { "path": "C:\\Reports\\Q1.pdf", "pages": "1-10" },
    { "path": "/home/user/Q2.pdf", "pages": "1-10" },
    { "url": "https://example.com/Q3.pdf" }
  ],
  "include_full_text": true
}

All PDFs are processed in parallel automatically.
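
As a rough illustration of concurrent batch handling with per-source error isolation, the Python sketch below runs several sources at once and turns a failure in one source into an error entry instead of aborting the batch. Everything here is hypothetical: `read_one` is a stand-in for real PDF loading, the result shape is invented, and `asyncio` shows concurrency only, not the multi-core parallelism the server describes.

```python
import asyncio


async def read_one(source: dict) -> dict:
    """Hypothetical stand-in for loading and extracting a single PDF."""
    await asyncio.sleep(0)  # placeholder for real file or network I/O
    if "path" not in source and "url" not in source:
        raise ValueError("source needs a 'path' or a 'url'")
    return {"ok": True, "source": source.get("path") or source.get("url")}


async def read_all(sources: list[dict]) -> list[dict]:
    # return_exceptions=True keeps one failing source from cancelling the rest;
    # gather() returns results in the same order as the inputs.
    results = await asyncio.gather(
        *(read_one(s) for s in sources), return_exceptions=True
    )
    return [
        {"ok": False, "error": str(r)} if isinstance(r, Exception) else r
        for r in results
    ]


sources = [
    {"path": "/home/user/Q2.pdf"},
    {},  # malformed entry, used to show the isolated failure
    {"url": "https://example.com/Q3.pdf"},
]
results = asyncio.run(read_all(sources))
for result in results:
    print(result)
```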

From URL

{
  "sources": [{
    "url": "https://arxiv.org/pdf/2301.00001.pdf"
  }],
  "include_full_text": true
}
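
Every example above identifies a source by either "path" or "url". A client could check that before sending a request, as in this small Python sketch. The "exactly one of the two" rule and the `validate_sources` helper are assumptions drawn from the examples, not a documented constraint of the server.

```python
def validate_sources(sources: list[dict]) -> list[str]:
    """Return one message per source that does not have exactly one of 'path' or 'url'."""
    errors = []
    for index, source in enumerate(sources):
        has_path = "path" in source
        has_url = "url" in source
        if has_path == has_url:  # both present, or both missing
            errors.append(f"sources[{index}]: provide exactly one of 'path' or 'url'")
    return errors


request = {
    "sources": [
        {"url": "https://arxiv.org/pdf/2301.00001.pdf"},
        {"path": "documents/report.pdf", "url": "https://example.com/Q3.pdf"},
    ],
    "include_full_text": True,
}
print(validate_sources(request["sources"]))
```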

Security & Sandboxing

By default the server can read any local file and fetch any HTTP(S) URL. Use these options to restrict access:

Restrict Filesystem Access

npx @sylphx/pdf-reader-mcp --allow-dir=/srv/pdfs --allow-dir=/data/reports

Or via environment variable:

MCP_PDF_ALLOWED_DIRS="/srv/pdfs:/data/reports" npx @sylphx/pdf-reader-mcp
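
Conceptually, a directory allowlist comes down to resolving the requested path and checking that it sits inside one of the allowed roots. The Python sketch below shows that check under stated assumptions: the function names are hypothetical, the ":" separator is taken from the example above, and the server's actual enforcement logic is not shown in this document.

```python
from pathlib import Path


def parse_allowed_dirs(value: str) -> list[Path]:
    """Split a ':'-separated list (as in the MCP_PDF_ALLOWED_DIRS example) into resolved paths."""
    return [Path(p).resolve() for p in value.split(":") if p]


def is_allowed(path: str, allowed: list[Path]) -> bool:
    """True when the resolved path is an allowed root or lies beneath one."""
    candidate = Path(path).resolve()  # collapses ".." and follows symlinks
    return any(candidate == root or root in candidate.parents for root in allowed)


allowed = parse_allowed_dirs("/srv/pdfs:/data/reports")
print(is_allowed("/srv/pdfs/2024/q1.pdf", allowed))
print(is_allowed("/srv/pdfs/../secrets/key.pdf", allowed))
```

Comparing resolved path components, and not raw string prefixes, is what keeps "/srv/pdfs-old" and "/srv/pdfs/../secrets" from passing as "/srv/pdfs".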

Disable or Restrict HTTP

# Block all URL sources
npx @sylphx/pdf-reader-mcp --no-http

# Allowlist hosts
npx @sylphx/pdf-reader-mcp --allow-host=cdn.example.com --allow-host=files.internal
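
The effect of these flags can be pictured as a small predicate over the URL. The Python sketch below is an assumption about the semantics and not the server's code: `is_url_allowed` is a hypothetical helper, `http_enabled=False` models `--no-http`, and `allowed_hosts=None` models the unrestricted default.

```python
from typing import Optional
from urllib.parse import urlsplit


def is_url_allowed(
    url: str,
    allowed_hosts: Optional[set] = None,
    http_enabled: bool = True,
) -> bool:
    """Decide whether a URL source may be fetched under the given restrictions."""
    if not http_enabled:
        return False  # models --no-http: every URL source is rejected
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    if allowed_hosts is None:
        return True  # no allowlist configured: any HTTP(S) host is accepted
    # Exact host match; urlsplit lowercases the hostname and strips any userinfo.
    return (parts.hostname or "") in allowed_hosts


hosts = {"cdn.example.com", "files.internal"}
print(is_url_allowed("https://cdn.example.com/a.pdf", hosts))
print(is_url_allowed("https://cdn.example.com.evil.test/a.pdf", hosts))
```

Matching on the parsed hostname, and not on a substring of the URL, avoids lookalikes such as "cdn.example.com.evil.test" or "cdn.example.com@evil.test".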

HTTP Transport (Remote Access)

Run as an HTTP server for remote access:

MCP_TRANSPORT=http npx @sylphx/pdf-reader-mcp

Performance Benchmarks

Operation            Ops/sec   Use Case
Error handling       12,933    Validation & safety
Extract full text    5,575     Document analysis
Extract page         5,329     Single page ops
Multiple pages       5,242     Batch processing
Metadata only        4,912     Quick inspection

Parallel Processing Speedup

  • 10-page PDF: ~2s → ~0.3s (5-8x faster)
  • 50-page PDF: ~10s → ~1s (10x faster)
  • 100+ pages: Linear scaling with CPU cores
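
The speedup figures follow from dividing sequential time by parallel time. A short Python check, using only the approximate timings listed above:

```python
def speedup(sequential_s: float, parallel_s: float) -> float:
    """Speedup factor: how many times faster the parallel run is."""
    return sequential_s / parallel_s


print(round(speedup(2.0, 0.3), 1))   # 10-page PDF: 6.7, inside the stated 5-8x range
print(round(speedup(10.0, 1.0), 1))  # 50-page PDF: 10.0
```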

Quality

  • ✅ 103 tests passing
  • ✅ 94%+ test coverage
  • ✅ 98%+ function coverage
  • ✅ Strict TypeScript
  • ✅ Production ready

This server runs through your single 1Server connection. No extra config required.
