About
A production-ready PDF processing server for AI agents. It extracts text, images, and metadata from PDFs, and its parallel processing runs 5-10x faster than sequential processing.
Features
Core Capabilities
- Text Extraction - Full document or specific pages with intelligent parsing
- Image Extraction - Base64-encoded images with complete metadata (width, height, format)
- Content Ordering - Y-coordinate based layout preservation for natural reading flow
- Metadata Extraction - Author, title, creation date, and custom properties
- Page Counting - Fast enumeration without loading full content
- Dual Sources - Local files (absolute or relative paths) and HTTP/HTTPS URLs
- Batch Processing - Multiple PDFs processed concurrently
Performance
- 5-10x faster than sequential processing with automatic parallelization
- 12,933 ops/sec error handling, 5,575 ops/sec text extraction
- Process 50-page PDFs in seconds with multi-core utilization
- Lightweight with minimal dependencies
Advanced Features
- Smart Pagination - Extract ranges like "1-5,10-15,20"
- Multi-Format Images - RGB, RGBA, Grayscale with automatic detection
- Path Flexibility - Windows, Unix, and relative paths all supported
- Error Resilience - Per-page error isolation with detailed messages
- Large File Support - Efficient streaming and memory management
- Type Safe - Full TypeScript with strict mode enabled
Usage
Basic Text Extraction
{
  "sources": [{
    "path": "documents/report.pdf"
  }],
  "include_full_text": true,
  "include_metadata": true,
  "include_page_count": true
}
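For callers that build requests in TypeScript, the request shape can be written down as types. The sketch below is inferred only from the JSON examples in this document. The interface names, the `checkRequest` helper, and the rule that each source carries exactly one of `path` or `url` are assumptions, not the server's published schema.

```typescript
// Assumed request shape, inferred from the JSON examples in this document.
interface PdfSource {
  path?: string;             // local file, absolute or relative
  url?: string;              // HTTP/HTTPS URL
  pages?: string | number[]; // e.g. "1-5,10" or [1, 2, 3]
}

interface ReadPdfRequest {
  sources: PdfSource[];
  include_full_text?: boolean;
  include_metadata?: boolean;
  include_page_count?: boolean;
  include_images?: boolean;
}

// Hypothetical helper: returns a list of problems, empty when the request looks well-formed.
function checkRequest(req: ReadPdfRequest): string[] {
  const problems: string[] = [];
  if (req.sources.length === 0) {
    problems.push("sources must not be empty");
  }
  req.sources.forEach((source, i) => {
    const hasPath = typeof source.path === "string" && source.path.length > 0;
    const hasUrl = typeof source.url === "string" && source.url.length > 0;
    // Assumption: a source is either a local path or a URL, never both.
    if (hasPath === hasUrl) {
      problems.push(`sources[${i}] needs exactly one of "path" or "url"`);
    }
  });
  return problems;
}
```

Running a check like this before sending a request catches shape mistakes locally, although the server remains the authority on what it accepts.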
Extract Specific Pages
{
  "sources": [{
    "path": "documents/manual.pdf",
    "pages": "1-5,10,15-20"
  }],
  "include_full_text": true
}
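To make the range syntax concrete, here is a minimal sketch of how a string such as "1-5,10,15-20" could be expanded into individual page numbers. `parsePageRanges` is a hypothetical helper written for illustration. The server's own parser may differ, for example in how it treats whitespace, duplicates, or reversed ranges.

```typescript
// Hypothetical helper: expands "1-5,10,15-20" into sorted, de-duplicated 1-based page numbers.
function parsePageRanges(spec: string): number[] {
  const pages = new Set<number>();
  for (const rawPart of spec.split(",")) {
    const part = rawPart.trim();
    // Accept either a single page ("10") or an inclusive range ("15-20").
    const match = /^(\d+)(?:-(\d+))?$/.exec(part);
    if (!match) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    // Assumption: pages are 1-based and ranges must not run backwards.
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    for (let page = start; page <= end; page++) {
      pages.add(page);
    }
  }
  return Array.from(pages).sort((a, b) => a - b);
}
```

With this sketch, `parsePageRanges("1-5,10,15-20")` yields pages 1 to 5, page 10, and pages 15 to 20, in ascending order.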
Absolute Paths
{
  "sources": [{
    "path": "C:\\Users\\John\\Documents\\report.pdf"
  }],
  "include_full_text": true
}
Extract Images with Natural Ordering
{
  "sources": [{
    "path": "presentation.pdf",
    "pages": [1, 2, 3]
  }],
  "include_images": true,
  "include_full_text": true
}
Images and text are returned in document order based on Y-coordinates, preserving natural reading flow.
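The idea behind Y-coordinate ordering can be sketched as a sort over positioned items. The `PageItem` type and `orderByPosition` function are illustrative names, not the server's internals. The sketch assumes coordinates follow the default PDF user space, where Y grows upward from the bottom of the page, so a larger Y means nearer the top. The left-to-right tie-break on X is also an assumption.

```typescript
// Illustrative item type; the server's real output structure may differ.
interface PageItem {
  kind: "text" | "image";
  x: number;
  y: number; // assumed to grow upward, as in default PDF user space
  content: string;
}

// Returns a new array ordered top-to-bottom, then left-to-right for equal Y.
function orderByPosition(items: PageItem[]): PageItem[] {
  return items.slice().sort((a, b) => {
    if (a.y !== b.y) {
      return b.y - a.y; // larger Y first, i.e. nearer the top of the page
    }
    return a.x - b.x;
  });
}
```

Because text and images share one coordinate space, a single sort interleaves them, so an image that sits between two paragraphs is returned between them.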
Batch Processing
{
  "sources": [
    { "path": "C:\\Reports\\Q1.pdf", "pages": "1-10" },
    { "path": "/home/user/Q2.pdf", "pages": "1-10" },
    { "url": "https://example.com/Q3.pdf" }
  ],
  "include_full_text": true
}
All PDFs are processed in parallel automatically.
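As a rough illustration of concurrent batch processing with error isolation, the sketch below starts every source at once and records a failure for one source without discarding the others. `processAll`, `SourceResult`, and the `extract` callback are hypothetical. The document describes isolation per page, while this sketch isolates per source to keep the example short.

```typescript
type SourceResult =
  | { source: string; ok: true; text: string }
  | { source: string; ok: false; error: string };

// `extract` stands in for whatever reads one PDF; it is not part of the server's API.
async function processAll(
  sources: string[],
  extract: (source: string) => Promise<string>,
): Promise<SourceResult[]> {
  return Promise.all(
    sources.map(async (source): Promise<SourceResult> => {
      try {
        // Each extraction starts immediately, so the sources run concurrently.
        return { source, ok: true, text: await extract(source) };
      } catch (err) {
        // A failure is captured as data so the other sources still complete.
        const message = err instanceof Error ? err.message : String(err);
        return { source, ok: false, error: message };
      }
    }),
  );
}
```

Results come back in the same order as the input sources, which makes it straightforward to match each outcome to its request entry.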
From URL
{
  "sources": [{
    "url": "https://arxiv.org/pdf/2301.00001.pdf"
  }],
  "include_full_text": true
}
Security & Sandboxing
By default the server can read any local file and fetch any HTTP(S) URL. Use these options to restrict access:
Restrict Filesystem Access
npx @sylphx/pdf-reader-mcp --allow-dir=/srv/pdfs --allow-dir=/data/reports
Or via environment variable:
MCP_PDF_ALLOWED_DIRS="/srv/pdfs:/data/reports" npx @sylphx/pdf-reader-mcp
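A directory allowlist of this kind usually comes down to a containment check. The following is a minimal sketch under stated assumptions: paths are absolute POSIX paths, the list is colon-separated as in the environment variable above, and symlinks are not resolved. The function names are hypothetical and the server's actual check may be stricter.

```typescript
// Splits a colon-separated list such as MCP_PDF_ALLOWED_DIRS into directory paths.
function parseAllowedDirs(value: string): string[] {
  return value
    .split(":")
    .map((dir) => dir.trim())
    .filter((dir) => dir.length > 0);
}

// Collapses "." and ".." segments of an absolute POSIX path without touching the filesystem.
function normalizePosix(path: string): string {
  const segments: string[] = [];
  for (const segment of path.split("/")) {
    if (segment === "" || segment === ".") continue;
    if (segment === "..") segments.pop();
    else segments.push(segment);
  }
  return "/" + segments.join("/");
}

// True when `candidate` is one of the allowed directories or lies beneath one.
function isPathAllowed(candidate: string, allowedDirs: string[]): boolean {
  if (!candidate.startsWith("/")) return false; // this sketch handles absolute POSIX paths only
  const target = normalizePosix(candidate);
  return allowedDirs.some((dir) => {
    const root = normalizePosix(dir);
    const prefix = root === "/" ? "/" : root + "/";
    return target === root || target.startsWith(prefix);
  });
}
```

Normalizing before comparing is what stops a path like /srv/pdfs/../secret.pdf from passing, and comparing against the directory plus a trailing slash keeps /srv/pdfs-old from matching /srv/pdfs.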
Disable or Restrict HTTP
# Block all URL sources
npx @sylphx/pdf-reader-mcp --no-http
# Allowlist hosts
npx @sylphx/pdf-reader-mcp --allow-host=cdn.example.com --allow-host=files.internal
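Host allowlisting can be pictured as a check on the parsed URL rather than on the raw string. The sketch below uses the standard `URL` class. `isUrlAllowed` is a hypothetical helper, and exact, case-insensitive hostname matching is an assumption, since this document does not say whether the server supports wildcards or subdomain matching.

```typescript
// Hypothetical helper: true only for http(s) URLs whose hostname is on the allowlist.
function isUrlAllowed(rawUrl: string, allowedHosts: string[]): boolean {
  let parsed: URL;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return false; // not a parseable absolute URL
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return false;
  }
  const host = parsed.hostname.toLowerCase();
  // Assumption: exact match only, so "cdn.example.com.evil.test" does not match "cdn.example.com".
  return allowedHosts.some((allowed) => allowed.toLowerCase() === host);
}
```

Comparing the parsed hostname avoids the pitfalls of substring checks, where an attacker-controlled domain merely contains an allowed name.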
HTTP Transport (Remote Access)
Run as an HTTP server for remote access:
MCP_TRANSPORT=http npx @sylphx/pdf-reader-mcp
Performance Benchmarks
| Operation | Ops/sec | Use Case |
|---|---|---|
| Error handling | 12,933 | Validation & safety |
| Extract full text | 5,575 | Document analysis |
| Extract page | 5,329 | Single page ops |
| Multiple pages | 5,242 | Batch processing |
| Metadata only | 4,912 | Quick inspection |
Parallel Processing Speedup
- 10-page PDF: ~2s → ~0.3s (5-8x faster)
- 50-page PDF: ~10s → ~1s (10x faster)
- 100+ pages: Linear scaling with CPU cores
Quality
- ✅ 103 tests passing
- ✅ 94%+ test coverage
- ✅ 98%+ function coverage
- ✅ Strict TypeScript
- ✅ Production ready