---
name: ocr-and-documents
description: Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.
version: 2.3.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [PDF, Documents, Research, Arxiv, Text-Extraction, OCR]
    related_skills: [powerpoint]
---

# PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
This skill covers **PDFs and scanned documents**.

## Step 1: Remote URL Available?

If the document has a URL, **always try `web_extract` first**:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Only use local extraction when: the file is local, web_extract fails, or you need batch processing.

## Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---------|-----------------|---------------------|
| **Text-based PDF** | ✅ | ✅ |
| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
| **Tables** | ✅ (basic) | ✅ (high accuracy) |
| **Equations / LaTeX** | ❌ | ✅ |
| **Code blocks** | ❌ | ✅ |
| **Forms** | ❌ | ✅ |
| **Headers/footers removal** | ❌ | ✅ |
| **Reading order detection** | ❌ | ✅ |
| **Images extraction** | ✅ (embedded) | ✅ (with context) |
| **Images → text (OCR)** | ❌ | ✅ |
| **EPUB** | ✅ | ✅ |
| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
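
One way to apply this decision, sketched under assumptions: extract each page's text with pymupdf and fall back to marker-pdf when most pages come back nearly empty, which usually indicates a scanned document. The function names, the 20-character threshold, and the 50% cutoff are illustrative choices, not part of this skill's helper scripts.

```python
def looks_scanned(page_texts, min_chars=20, max_empty_ratio=0.5):
    """Return True when most pages have almost no extractable text.

    min_chars and max_empty_ratio are illustrative assumptions; tune them.
    """
    if not page_texts:
        return True  # nothing extractable: treat the file as needing OCR
    empty = sum(1 for text in page_texts if len(text.strip()) < min_chars)
    return empty / len(page_texts) > max_empty_ratio


def choose_extractor(pdf_path):
    """Pick 'pymupdf' or 'marker-pdf' for a local PDF (hypothetical helper)."""
    import pymupdf  # imported lazily so looks_scanned() works without it

    with pymupdf.open(pdf_path) as doc:
        texts = [page.get_text() for page in doc]
    return "marker-pdf" if looks_scanned(texts) else "pymupdf"
```

This only detects a missing text layer. Equations, forms, or complex layouts still call for marker-pdf even when the text layer is present.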

If the user needs marker capabilities but the system lacks ~5GB free disk:

> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."

---

## pymupdf (lightweight)

```bash
pip install pymupdf pymupdf4llm
```

**Via helper script**:

```bash
python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown   # Markdown
python scripts/extract_pymupdf.py document.pdf --tables     # Tables
python scripts/extract_pymupdf.py document.pdf --images out/  # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata   # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4  # Specific pages
```

**Inline**:

```bash
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```

---

## marker-pdf (high-quality OCR)

```bash
# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf
```

**Via helper script**:

```bash
python scripts/extract_marker.py document.pdf                   # Markdown
python scripts/extract_marker.py document.pdf --json            # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/ # Save images
python scripts/extract_marker.py scanned.pdf                    # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm         # LLM-boosted accuracy
```

**CLI** (installed with marker-pdf):

```bash
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
```

---

## Arxiv Papers

```
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```

## Split, Merge & Search

pymupdf handles these natively;
use `execute_code` or inline Python:

```python
# Split: extract pages 1-5 to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
for i in range(5):
    new.insert_pdf(doc, from_page=i, to_page=i)
new.save("pages_1-5.pdf")
```

```python
# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    result.insert_pdf(pymupdf.open(path))
result.save("merged.pdf")
```

```python
# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))
```

No extra dependencies needed: pymupdf covers split, merge, search, and text extraction in one package.

---

## Notes

- `web_extract` is always first choice for URLs
- pymupdf is the safe default: instant, no models, works everywhere
- marker-pdf is for OCR, scanned docs, equations, complex layouts; install only when needed
- Both helper scripts accept `--help` for full usage
- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
- For Word docs: `pip install python-docx` (better than OCR; parses actual structure)
- For PowerPoint: see the `powerpoint` skill (uses python-pptx)
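
The disk-space check that precedes a marker-pdf install can be sketched with the standard library. This is an independent illustration, not the implementation behind `extract_marker.py --check`; the function names and the 5 GB default (taken from the ~5GB estimate quoted earlier) are assumptions.

```python
import shutil
from pathlib import Path


def free_gb(path="~"):
    """Free disk space, in GB, on the filesystem that holds `path`."""
    usage = shutil.disk_usage(Path(path).expanduser())
    return usage.free / 1e9


def can_install_marker(free_space_gb, required_gb=5.0):
    """True when there is room for PyTorch plus the marker-pdf models.

    required_gb=5.0 mirrors the ~5GB estimate; it is an assumption, not a hard limit.
    """
    return free_space_gb >= required_gb


if __name__ == "__main__":
    available = free_gb()
    if can_install_marker(available):
        print(f"{available:.1f}GB free: enough room for marker-pdf")
    else:
        print(f"Only {available:.1f}GB free: marker-pdf needs ~5GB")
```

When the check fails, the fallback options are the ones listed earlier: free up space, use `web_extract` with a URL, or try pymupdf on text-based PDFs.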