---
name: arxiv
description: Search and retrieve academic papers from arXiv using their free REST API. No API key needed. Search by keyword, author, category, or ID. Combine with web_extract or the ocr-and-documents skill to read full paper content.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [Research, Arxiv, Papers, Academic, Science, API]
    related_skills: [ocr-and-documents]
---

# arXiv Research

Search and retrieve academic papers from arXiv via their free REST API. No API key, no dependencies, just curl.

## Quick Reference

| Action | Command |
|--------|---------|
| Search papers | `curl "https://export.arxiv.org/api/query?search_query=all:QUERY&max_results=5"` |
| Get specific paper | `curl "https://export.arxiv.org/api/query?id_list=2402.03300"` |
| Read abstract (web) | `web_extract(urls=["https://arxiv.org/abs/2402.03300"])` |
| Read full paper (PDF) | `web_extract(urls=["https://arxiv.org/pdf/2402.03300"])` |

## Searching Papers

The API returns Atom XML. Parse with `grep`/`sed` or pipe through `python3` for clean output.

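
The `QUERY` placeholder in the Quick Reference has to be URL-encoded before it goes into the request. The following is a minimal sketch of one way to do that with the Python standard library; `build_search_url` is a hypothetical helper name, not part of this skill or of the arXiv API.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_search_url(query, max_results=5):
    """Return an arXiv API search URL with the query URL-encoded.

    build_search_url is an illustrative helper, not an arXiv-provided function.
    """
    params = {"search_query": query, "max_results": max_results}
    # safe=":" keeps field prefixes such as "ti:" readable; spaces become "+"
    # and double quotes become %22.
    return f"{ARXIV_API}?{urlencode(params, safe=':')}"


print(build_search_url("all:GRPO reinforcement learning"))
print(build_search_url('ti:"chain of thought"'))
```

The resulting string can be passed to `curl -s` unchanged.
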
### Basic search

```bash
curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5"
```

### Clean output (parse XML to readable format)

```bash
curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5&sortBy=submittedDate&sortOrder=descending" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom'}
root = ET.parse(sys.stdin).getroot()
for i, entry in enumerate(root.findall('a:entry', ns)):
    title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
    arxiv_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
    published = entry.find('a:published', ns).text[:10]
    authors = ', '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
    summary = entry.find('a:summary', ns).text.strip()[:200]
    cats = ', '.join(c.get('term') for c in entry.findall('a:category', ns))
    print(f'{i+1}. [{arxiv_id}] {title}')
    print(f'   Authors: {authors}')
    print(f'   Published: {published} | Categories: {cats}')
    print(f'   Abstract: {summary}...')
    print(f'   PDF: https://arxiv.org/pdf/{arxiv_id}')
    print()
"
```

## Search Query Syntax

| Prefix | Searches | Example |
|--------|----------|---------|
| `all:` | All fields | `all:transformer+attention` |
| `ti:` | Title | `ti:large+language+models` |
| `au:` | Author | `au:vaswani` |
| `abs:` | Abstract | `abs:reinforcement+learning` |
| `cat:` | Category | `cat:cs.AI` |
| `co:` | Comment | `co:accepted+NeurIPS` |

### Boolean operators

```
# AND (default when using +)
search_query=all:transformer+attention

# OR
search_query=all:GPT+OR+all:BERT

# AND NOT
search_query=all:language+model+ANDNOT+all:vision

# Exact phrase
search_query=ti:"chain+of+thought"

# Combined
search_query=au:hinton+AND+cat:cs.LG
```

## Sort and Pagination

| Parameter | Options |
|-----------|---------|
| `sortBy` | `relevance`, `lastUpdatedDate`, `submittedDate` |
| `sortOrder` | `ascending`, `descending` |
| `start` | Result offset (0-based) |
| `max_results` | Number of results (default 10, max 30000) |

```bash
# Latest 10 papers in cs.AI
curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.AI&sortBy=submittedDate&sortOrder=descending&max_results=10"
```

## Fetching Specific Papers

```bash
# By arXiv ID
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300"

# Multiple papers
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300,2401.12345,2403.00001"
```

## BibTeX Generation

After fetching metadata for a paper, generate a BibTeX entry:

{% raw %}
```bash
curl -s "https://export.arxiv.org/api/query?id_list=1706.03762" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
root = ET.parse(sys.stdin).getroot()
entry = root.find('a:entry', ns)
if entry is None:
    sys.exit('Paper not found')
title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
authors = ' and '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
year = entry.find('a:published', ns).text[:4]
raw_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
cat = entry.find('arxiv:primary_category', ns)
primary = cat.get('term') if cat is not None else 'cs.LG'
last_name = entry.find('a:author', ns).find('a:name', ns).text.split()[-1]
print(f'@article{{{last_name}{year}_{raw_id.replace(\".\", \"\")},')
print(f'  title = {{{title}}},')
print(f'  author = {{{authors}}},')
print(f'  year = {{{year}}},')
print(f'  eprint = {{{raw_id}}},')
print(f'  archivePrefix = {{arXiv}},')
print(f'  primaryClass = {{{primary}}},')
print(f'  url = {{https://arxiv.org/abs/{raw_id}}}')
print('}')
"
```
{% endraw %}

## Reading Paper Content

After finding a paper, read it:

```
# Abstract page (fast, metadata + abstract)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper (PDF → markdown via Firecrawl)
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
```

For local PDF processing, see the
`ocr-and-documents` skill.

## Common Categories

| Category | Field |
|----------|-------|
| `cs.AI` | Artificial Intelligence |
| `cs.CL` | Computation and Language (NLP) |
| `cs.CV` | Computer Vision |
| `cs.LG` | Machine Learning |
| `cs.CR` | Cryptography and Security |
| `stat.ML` | Machine Learning (Statistics) |
| `math.OC` | Optimization and Control |
| `physics.comp-ph` | Computational Physics |

Full list: https://arxiv.org/category_taxonomy

## Helper Script

The `scripts/search_arxiv.py` script handles XML parsing and provides clean output:

```bash
python scripts/search_arxiv.py "GRPO reinforcement learning"
python scripts/search_arxiv.py "transformer attention" --max 10 --sort date
python scripts/search_arxiv.py --author "Yann LeCun" --max 5
python scripts/search_arxiv.py --category cs.AI --sort date
python scripts/search_arxiv.py --id 2402.03300
python scripts/search_arxiv.py --id 2402.03300,2401.12345
```

No dependencies; it uses only the Python stdlib.

---

## Semantic Scholar (Citations, Related Papers, Author Profiles)

arXiv doesn't provide citation data or recommendations. Use the **Semantic Scholar API** for that: it is free, needs no key for basic use (1 req/sec), and returns JSON.

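
When several Semantic Scholar lookups run in a loop, the 1 req/sec allowance mentioned above is easy to exceed. The following is a rough sketch of a client-side throttle, assuming a fixed minimum gap between calls is sufficient; `Throttle` and `paper_url` are hypothetical names introduced for illustration, not part of any Semantic Scholar client.

```python
import time

S2_GRAPH = "https://api.semanticscholar.org/graph/v1"


def paper_url(arxiv_id, fields=("title", "citationCount")):
    """Build a Graph API paper-details URL for an arXiv ID (illustrative helper)."""
    return f"{S2_GRAPH}/paper/arXiv:{arxiv_id}?fields={','.join(fields)}"


class Throttle:
    """Block so that successive wait() calls are at least min_interval seconds apart.

    The clock and sleep functions are injectable so the class can be tested
    without real delays.
    """

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In use, create one `Throttle(1.0)`, call `wait()` immediately before each request, and then fetch `paper_url(...)` with `curl` or `urllib.request.urlopen`.
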
### Get paper details + citations

```bash
# By arXiv ID
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300?fields=title,authors,citationCount,referenceCount,influentialCitationCount,year,abstract" | python3 -m json.tool

# By Semantic Scholar paper ID or DOI
curl -s "https://api.semanticscholar.org/graph/v1/paper/DOI:10.1234/example?fields=title,citationCount"
```

### Get citations OF a paper (who cited it)

```bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/citations?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool
```

### Get references FROM a paper (what it cites)

```bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/references?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool
```

### Search papers (alternative to arXiv search, returns JSON)

```bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/search?query=GRPO+reinforcement+learning&limit=5&fields=title,authors,year,citationCount,externalIds" | python3 -m json.tool
```

### Get paper recommendations

```bash
curl -s -X POST "https://api.semanticscholar.org/recommendations/v1/papers/" \
  -H "Content-Type: application/json" \
  -d '{"positivePaperIds": ["arXiv:2402.03300"], "negativePaperIds": []}' | python3 -m json.tool
```

### Author profile

```bash
curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=Yann+LeCun&fields=name,hIndex,citationCount,paperCount" | python3 -m json.tool
```

### Useful Semantic Scholar fields

`title`, `authors`, `year`, `abstract`, `citationCount`, `referenceCount`, `influentialCitationCount`, `isOpenAccess`, `openAccessPdf`, `fieldsOfStudy`, `publicationVenue`, `externalIds` (contains arXiv ID, DOI, etc.)

---

## Complete Research Workflow

1. **Discover**: `python scripts/search_arxiv.py "your topic" --sort date --max 10`
2. **Assess impact**: `curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID?fields=citationCount,influentialCitationCount"`
3. **Read abstract**: `web_extract(urls=["https://arxiv.org/abs/ID"])`
4. **Read full paper**: `web_extract(urls=["https://arxiv.org/pdf/ID"])`
5. **Find related work**: `curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID/references?fields=title,citationCount&limit=20"`
6. **Get recommendations**: POST to the Semantic Scholar recommendations endpoint
7. **Track authors**: `curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=NAME"`

## Rate Limits

| API | Rate | Auth |
|-----|------|------|
| arXiv | ~1 req / 3 seconds | None needed |
| Semantic Scholar | 1 req / second | None (100/sec with API key) |

## Notes

- arXiv returns Atom XML; use the helper script or the parsing snippet for clean output
- Semantic Scholar returns JSON; pipe it through `python3 -m json.tool` for readability
- arXiv IDs: old format (`hep-th/0601001`) vs new (`2402.03300`)
- PDF: `https://arxiv.org/pdf/{id}`; abstract: `https://arxiv.org/abs/{id}`
- HTML (when available): `https://arxiv.org/html/{id}`
- For local PDF processing, see the `ocr-and-documents` skill

## ID Versioning

- `arxiv.org/abs/1706.03762` always resolves to the **latest** version
- `arxiv.org/abs/1706.03762v1` points to a **specific** immutable version
- When generating citations, preserve the version suffix of the version you actually read to prevent citation drift (a later version may substantially change the content)
- The API `<id>` field returns the versioned URL (e.g., `http://arxiv.org/abs/1706.03762v7`)

## Withdrawn Papers

Papers can be withdrawn after submission. When this happens:

- The `<summary>` field contains a withdrawal notice (look for "withdrawn" or "retracted")
- Metadata fields may be incomplete
- Always check the summary before treating a result as a valid paper
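
The versioning and withdrawal checks above can be combined into one pass over an API response. This is a minimal sketch, assuming the versioned URL sits in the Atom `<id>` element and the withdrawal notice in `<summary>`. The sample feed is made up for illustration, and `split_version`, `looks_withdrawn`, and `check_entries` are hypothetical helper names. The keyword match is only a heuristic, so an abstract that discusses retractions as a topic would also be flagged.

```python
import re
import xml.etree.ElementTree as ET

NS = {"a": "http://www.w3.org/2005/Atom"}

# Made-up sample feed, trimmed to the two fields used below.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/1706.03762v7</id>
    <summary>An example abstract.</summary>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/0000.00000v2</id>
    <summary>This paper has been withdrawn by the author.</summary>
  </entry>
</feed>"""


def split_version(entry_id):
    """Split '.../abs/1706.03762v7' into ('1706.03762', 7); version is None if absent."""
    raw = entry_id.strip().split("/abs/")[-1]
    match = re.fullmatch(r"(.+?)v(\d+)", raw)
    if match:
        return match.group(1), int(match.group(2))
    return raw, None


def looks_withdrawn(summary):
    """Heuristic: flag summaries that mention 'withdrawn' or 'retracted'."""
    text = (summary or "").lower()
    return "withdrawn" in text or "retracted" in text


def check_entries(feed_xml):
    """Return (base_id, version, withdrawn) for each entry in an Atom feed string."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall("a:entry", NS):
        entry_id = entry.findtext("a:id", default="", namespaces=NS)
        summary = entry.findtext("a:summary", default="", namespaces=NS)
        base_id, version = split_version(entry_id)
        results.append((base_id, version, looks_withdrawn(summary)))
    return results


for row in check_entries(SAMPLE_FEED):
    print(row)
```

To run it on live data, pass the text returned by `curl -s` to `check_entries` in place of `SAMPLE_FEED`.
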