Sync all skills and memories 2026-04-14 07:27

2026-04-14 07:27:20 +09:00
parent 516bb44fe6
commit 1eba2bca95
386 changed files with 167655 additions and 0 deletions
--- a/skills/productivity/ocr-and-documents/SKILL.md
+++ b/skills/productivity/ocr-and-documents/SKILL.md
@@ -0,0 +1,171 @@
+---
+name: ocr-and-documents
+description: Extract text from PDFs and scanned documents. Use web_extract for remote URLs, pymupdf for local text-based PDFs, marker-pdf for OCR/scanned docs. For DOCX use python-docx, for PPTX see the powerpoint skill.
+version: 2.3.0
+author: Hermes Agent
+license: MIT
+metadata:
+  hermes:
+    tags: [PDF, Documents, Research, Arxiv, Text-Extraction, OCR]
+    related_skills: [powerpoint]
+---
+
+# PDF & Document Extraction
+
+For DOCX: use `python-docx` (parses actual document structure, far better than OCR).
+For PPTX: see the `powerpoint` skill (uses `python-pptx` with full slide/notes support).
+This skill covers **PDFs and scanned documents**.
+
+## Step 1: Remote URL Available?
+
+If the document has a URL, **always try `web_extract` first**:
+
+```
+web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
+web_extract(urls=["https://example.com/report.pdf"])
+```
+
+This handles PDF-to-markdown conversion via Firecrawl with no local dependencies.
+
+Use local extraction only when the file is local, `web_extract` fails, or you need batch processing.
+
+## Step 2: Choose Local Extractor
+
+| Feature | pymupdf (~25MB) | marker-pdf (~3-5GB) |
+|---------|-----------------|---------------------|
+| **Text-based PDF** | ✅ | ✅ |
+| **Scanned PDF (OCR)** | ❌ | ✅ (90+ languages) |
+| **Tables** | ✅ (basic) | ✅ (high accuracy) |
+| **Equations / LaTeX** | ❌ | ✅ |
+| **Code blocks** | ❌ | ✅ |
+| **Forms** | ❌ | ✅ |
+| **Headers/footers removal** | ❌ | ✅ |
+| **Reading order detection** | ❌ | ✅ |
+| **Images extraction** | ✅ (embedded) | ✅ (with context) |
+| **Images → text (OCR)** | ❌ | ✅ |
+| **EPUB** | ✅ | ✅ |
+| **Markdown output** | ✅ (via pymupdf4llm) | ✅ (native, higher quality) |
+| **Install size** | ~25MB | ~3-5GB (PyTorch + models) |
+| **Speed** | Instant | ~1-14s/page (CPU), ~0.2s/page (GPU) |
+
+**Decision**: Use pymupdf unless you need OCR, equations, forms, or complex layout analysis.
+
+If the user needs marker capabilities but the system has less than ~5GB of free disk space, tell them:
+> "This document needs OCR/advanced extraction (marker-pdf), which requires ~5GB for PyTorch and models. Your system has [X]GB free. Options: free up space, provide a URL so I can use web_extract, or I can try pymupdf which works for text-based PDFs but not scanned documents or equations."
+
+---
+
+## pymupdf (lightweight)
+
+```bash
+pip install pymupdf pymupdf4llm
+```
+
+**Via helper script**:
+```bash
+python scripts/extract_pymupdf.py document.pdf              # Plain text
+python scripts/extract_pymupdf.py document.pdf --markdown    # Markdown
+python scripts/extract_pymupdf.py document.pdf --tables      # Tables
+python scripts/extract_pymupdf.py document.pdf --images out/ # Extract images
+python scripts/extract_pymupdf.py document.pdf --metadata    # Title, author, pages
+python scripts/extract_pymupdf.py document.pdf --pages 0-4   # Specific pages
+```
+
+**Inline**:
+```bash
+python3 -c "
+import pymupdf
+doc = pymupdf.open('document.pdf')
+for page in doc:
+    print(page.get_text())
+"
+```
+
+---
+
+## marker-pdf (high-quality OCR)
+
+```bash
+# Check disk space first
+python scripts/extract_marker.py --check
+
+pip install marker-pdf
+```
+
+**Via helper script**:
+```bash
+python scripts/extract_marker.py document.pdf                # Markdown
+python scripts/extract_marker.py document.pdf --json         # JSON with metadata
+python scripts/extract_marker.py document.pdf --output_dir out/  # Save images
+python scripts/extract_marker.py scanned.pdf                 # Scanned PDF (OCR)
+python scripts/extract_marker.py document.pdf --use_llm      # LLM-boosted accuracy
+```
+
+**CLI** (installed with marker-pdf):
+```bash
+marker_single document.pdf --output_dir ./output
+marker /path/to/folder --workers 4    # Batch
+```
+
+---
+
+## Arxiv Papers
+
+```
+# Abstract only (fast)
+web_extract(urls=["https://arxiv.org/abs/2402.03300"])
+
+# Full paper
+web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
+
+# Search
+web_search(query="arxiv GRPO reinforcement learning 2026")
+```
+
+## Split, Merge & Search
+
+pymupdf handles these natively — use `execute_code` or inline Python:
+
+```python
+# Split: extract pages 1-5 to a new PDF
+import pymupdf
+doc = pymupdf.open("report.pdf")
+new = pymupdf.open()  # empty in-memory PDF
+# from_page/to_page are 0-based and inclusive, so 0-4 covers pages 1-5
+new.insert_pdf(doc, from_page=0, to_page=4)
+new.save("pages_1-5.pdf")
+```
+
+```python
+# Merge multiple PDFs
+import pymupdf
+result = pymupdf.open()  # empty in-memory PDF
+for path in ["a.pdf", "b.pdf", "c.pdf"]:
+    with pymupdf.open(path) as src:  # close each source after copying
+        result.insert_pdf(src)
+result.save("merged.pdf")
+```
+
+```python
+# Search for text across all pages
+import pymupdf
+doc = pymupdf.open("report.pdf")
+for i, page in enumerate(doc):
+    results = page.search_for("revenue")  # list of match rectangles
+    if results:
+        print(f"Page {i+1}: {len(results)} match(es)")
+        print(page.get_text("text"))  # full text of the matching page
+```
+
+No extra dependencies needed — pymupdf covers split, merge, search, and text extraction in one package.
+
+---
+
+## Notes
+
+- `web_extract` is always the first choice for URLs
+- pymupdf is the safe default — instant, no models, works everywhere
+- marker-pdf is for OCR, scanned docs, equations, complex layouts — install only when needed
+- Both helper scripts accept `--help` for full usage
+- marker-pdf downloads ~2.5GB of models to `~/.cache/huggingface/` on first use
+- For Word docs: `pip install python-docx` (better than OCR — parses actual structure)
+- For PowerPoint: see the `powerpoint` skill (uses python-pptx)