skills/productivity/ocr-and-documents/SKILL.md

---
name: ocr-and-documents
description: Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.
version: 2.3.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [PDF, Documents, Research, Arxiv, Text-Extraction, OCR]
    related_skills: [powerpoint]
---

# PDF & Document Extraction

For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
This skill covers **PDFs and scanned documents**.

## Step 1: Remote URL Available?

If the document has a URL, **always try `web_extract` first**:

```
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
web_extract(urls=["https://example.com/report.pdf"])
```

This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.

Use local extraction only when the file is local, `web_extract` fails, or you need batch processing.

## Step 2: Choose Local Extractor

| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
|---------|-----------------|---------------------|
| **Text-based PDF** | ✅ | ✅ |
| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
| **Tables** | ✅ (basic) | ✅ (high accuracy) |
| **Equations / LaTeX** | ❌ | ✅ |
| **Code blocks** | ❌ | ✅ |
| **Forms** | ❌ | ✅ |
| **Header/footer removal** | ❌ | ✅ |
| **Reading order detection** | ❌ | ✅ |
| **Image extraction** | ✅ (embedded) | ✅ (with context) |
| **Images → text (OCR)** | ❌ | ✅ |
| **EPUB** | ✅ | ✅ |
| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |

**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
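When it is unclear whether a PDF is text-based or scanned, one rough heuristic is to count the extractable characters on each page and fall back to marker-pdf when most pages are nearly empty. The sketch below is illustrative only: `choose_extractor`, `chars_per_page`, and the `min_chars=20` threshold are assumptions, not part of either library or the helper scripts.

```python
def chars_per_page(path: str) -> list[int]:
    """Count extractable characters on each page using pymupdf."""
    import pymupdf  # imported lazily so the decision logic works without it

    with pymupdf.open(path) as doc:
        return [len(page.get_text().strip()) for page in doc]


def choose_extractor(counts: list[int], needs_layout: bool = False,
                     min_chars: int = 20) -> str:
    """Pick an extractor name from per-page character counts.

    needs_layout: set when equations, forms, or complex layout matter.
    min_chars: assumed threshold below which a page counts as "no text".
    """
    if needs_layout:
        return "marker-pdf"
    sparse = sum(1 for n in counts if n < min_chars)
    # Mostly empty pages suggest a scanned document that needs OCR.
    return "marker-pdf" if sparse > len(counts) / 2 else "pymupdf"
```

A call such as `choose_extractor(chars_per_page("document.pdf"))` returns either `"pymupdf"` or `"marker-pdf"`. Tune the threshold for documents with many figure-only pages.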

If the user needs marker capabilities but the system lacks ~5GB free disk:
> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
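The free-space check behind that message can be sketched with the standard library. This is a minimal illustration, not what `extract_marker.py --check` actually does: the function names and the 5 GB threshold (taken from the figure above) are assumptions.

```python
import shutil

REQUIRED_GB = 5.0  # assumed requirement, from the ~5GB figure above


def free_gb(path: str = ".") -> float:
    """Free disk space, in GiB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1024**3


def marker_space_message(free: float, required: float = REQUIRED_GB):
    """Return a user-facing warning, or None when there is enough space."""
    if free >= required:
        return None
    return (
        "This document needs OCR/advanced extraction (marker-pdf), which "
        f"requires ~{required:.0f}GB for PyTorch and models. "
        f"Your system has {free:.1f}GB free. Options: free up space, "
        "provide a URL so I can use web_extract, or I can try pymupdf, "
        "which works for text-based PDFs but not scanned documents or equations."
    )
```

Calling `marker_space_message(free_gb())` before `pip install marker-pdf` gives either `None` or the text to show the user.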

---

## pymupdf (lightweight)

```bash
pip install pymupdf pymupdf4llm
```

**Via helper script**:
```bash
python scripts/extract_pymupdf.py document.pdf              # Plain text
python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
python scripts/extract_pymupdf.py document.pdf --tables      # Tables
python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages
```

**Inline**:
```bash
python3 -c "
import pymupdf
doc = pymupdf.open('document.pdf')
for page in doc:
    print(page.get_text())
"
```

---

## marker-pdf (high-quality OCR)

```bash
# Check disk space first
python scripts/extract_marker.py --check

pip install marker-pdf
```

**Via helper script**:
```bash
python scripts/extract_marker.py document.pdf                # Markdown
python scripts/extract_marker.py document.pdf --json         # JSON with metadata
python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
python scripts/extract_marker.py scanned.pdf                 # Scanned PDF (OCR)
python scripts/extract_marker.py document.pdf --use_llm      # LLM-boosted accuracy
```

**CLI** (installed with marker-pdf):
```bash
marker_single document.pdf --output_dir ./output
marker /path/to/folder --workers 4    # Batch
```
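To drive the CLI from Python, for example inside `execute_code`, a thin `subprocess` wrapper is usually enough. This sketch uses only the `marker_single` invocation shown above. The wrapper function names are assumptions.

```python
import shutil
import subprocess


def marker_single_command(pdf_path: str, output_dir: str) -> list[str]:
    """Build the argv list for the marker_single call shown above."""
    return ["marker_single", pdf_path, "--output_dir", output_dir]


def run_marker_single(pdf_path: str, output_dir: str) -> None:
    """Run marker_single, failing early if the CLI is not installed."""
    if shutil.which("marker_single") is None:
        raise RuntimeError(
            "marker_single not found; run `pip install marker-pdf` first"
        )
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(marker_single_command(pdf_path, output_dir), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.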

---

## Arxiv Papers

```
# Abstract only (fast)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])

# Search
web_search(query="arxiv GRPO reinforcement learning 2026")
```
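The abstract and PDF URLs differ only in the path segment, so both can be derived from an arXiv ID or from either URL. The helper below is a hypothetical sketch. It assumes new-style IDs such as `2402.03300` (with an optional version suffix) and does not handle old-style IDs like `hep-th/9901001`.

```python
import re

# New-style arXiv ID: YYMM.NNNNN with an optional version suffix (v2, v3, ...)
ARXIV_ID = re.compile(r"\d{4}\.\d{4,5}(?:v\d+)?")


def arxiv_urls(ref: str) -> dict[str, str]:
    """Return the abstract and PDF URLs for an arXiv ID or URL."""
    match = ARXIV_ID.search(ref)
    if match is None:
        raise ValueError(f"no arXiv ID found in {ref!r}")
    paper_id = match.group(0)
    return {
        "abs": f"https://arxiv.org/abs/{paper_id}",
        "pdf": f"https://arxiv.org/pdf/{paper_id}",
    }
```

Either value can then be passed to `web_extract(urls=[...])`, depending on whether the abstract or the full paper is needed.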

## Split, Merge & Search

pymupdf handles these natively — use `execute_code` or inline Python:

```python
# Split: copy pages 1-5 (0-based indices 0-4) to a new PDF
import pymupdf
doc = pymupdf.open("report.pdf")
new = pymupdf.open()
new.insert_pdf(doc, from_page=0, to_page=4)  # from_page/to_page are inclusive
new.save("pages_1-5.pdf")
```

```python
# Merge multiple PDFs
import pymupdf
result = pymupdf.open()
for path in ["a.pdf", "b.pdf", "c.pdf"]:
    with pymupdf.open(path) as src:  # close each source after copying
        result.insert_pdf(src)
result.save("merged.pdf")
```

```python
# Search for text across all pages
import pymupdf
doc = pymupdf.open("report.pdf")
for i, page in enumerate(doc):
    results = page.search_for("revenue")
    if results:
        print(f"Page {i+1}: {len(results)} match(es)")
        print(page.get_text("text"))
```
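Printing the full page text for every hit gets noisy on long documents. A possible alternative is to show a short context window around each match. In this sketch, `snippets` and `search_pdf` are hypothetical helpers, the 40-character window is an arbitrary choice, and matching is done case-insensitively on the extracted text rather than through `search_for`.

```python
import re


def snippets(text: str, term: str, width: int = 40) -> list[str]:
    """Return a whitespace-normalized context window around each match."""
    if not term:
        return []
    found = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        lo = max(0, match.start() - width)
        hi = min(len(text), match.end() + width)
        found.append(" ".join(text[lo:hi].split()))
    return found


def search_pdf(path: str, term: str, width: int = 40) -> None:
    """Print page numbers and context snippets for each match in a PDF."""
    import pymupdf  # imported lazily so `snippets` works on plain strings

    with pymupdf.open(path) as doc:
        for number, page in enumerate(doc, start=1):
            for snippet in snippets(page.get_text(), term, width):
                print(f"Page {number}: ...{snippet}...")
```

For example, `search_pdf("report.pdf", "revenue")` prints one line per match instead of whole pages.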

No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.

---

## Notes

- `web_extract` is always the first choice for URLs
- pymupdf is the safe default: fast, no model downloads, ~25MB install
- marker-pdf is for OCR, scanned docs, equations, and complex layouts; install it only when needed
- Both helper scripts accept `--help` for full usage
- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
- For Word docs: `pip install python-docx` (better than OCR — parses actual structure)
- For PowerPoint: see the `powerpoint` skill (uses python-pptx)