---
name: arxiv
description: Search and retrieve academic papers from arXiv using their free REST API. No API key needed. Search by keyword, author, category, or ID. Combine with web_extract or the ocr-and-documents skill to read full paper content.
version: 1.0.0
author: Hermes Agent
license: MIT
metadata:
  hermes:
    tags: [Research, Arxiv, Papers, Academic, Science, API]
    related_skills: [ocr-and-documents]
---

# arXiv Research

Search and retrieve academic papers from arXiv via its free REST API. No API key and no dependencies: just curl.

## Quick Reference

| Action | Command |
|--------|---------|
| Search papers | `curl "https://export.arxiv.org/api/query?search_query=all:QUERY&max_results=5"` |
| Get specific paper | `curl "https://export.arxiv.org/api/query?id_list=2402.03300"` |
| Read abstract (web) | `web_extract(urls=["https://arxiv.org/abs/2402.03300"])` |
| Read full paper (PDF) | `web_extract(urls=["https://arxiv.org/pdf/2402.03300"])` |

## Searching Papers

The API returns Atom XML. Parse with `grep`/`sed` or pipe through `python3` for clean output.

### Basic search

```bash
curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5"
```

### Clean output (parse XML to readable format)

```bash
curl -s "https://export.arxiv.org/api/query?search_query=all:GRPO+reinforcement+learning&max_results=5&sortBy=submittedDate&sortOrder=descending" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom'}
root = ET.parse(sys.stdin).getroot()
for i, entry in enumerate(root.findall('a:entry', ns)):
    title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
    arxiv_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
    published = entry.find('a:published', ns).text[:10]
    authors = ', '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
    summary = entry.find('a:summary', ns).text.strip()[:200]
    cats = ', '.join(c.get('term') for c in entry.findall('a:category', ns))
    print(f'{i+1}. [{arxiv_id}] {title}')
    print(f'   Authors: {authors}')
    print(f'   Published: {published} | Categories: {cats}')
    print(f'   Abstract: {summary}...')
    print(f'   PDF: https://arxiv.org/pdf/{arxiv_id}')
    print()
"
```

## Search Query Syntax

| Prefix | Searches | Example |
|--------|----------|---------|
| `all:` | All fields | `all:transformer+attention` |
| `ti:` | Title | `ti:large+language+models` |
| `au:` | Author | `au:vaswani` |
| `abs:` | Abstract | `abs:reinforcement+learning` |
| `cat:` | Category | `cat:cs.AI` |
| `co:` | Comment | `co:accepted+NeurIPS` |

### Boolean operators

```
# AND (default when using +)
search_query=all:transformer+attention

# OR
search_query=all:GPT+OR+all:BERT

# AND NOT
search_query=all:language+model+ANDNOT+all:vision

# Exact phrase (%22 is a URL-encoded double quote; a literal " would end the shell string in curl "...")
search_query=ti:%22chain+of+thought%22

# Combined
search_query=au:hinton+AND+cat:cs.LG
```
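
When queries are assembled in code rather than typed by hand, the encoding of spaces and quotes can be left to the standard library. The sketch below is one possible way to do that; `build_search_url` is a hypothetical helper name, not part of this skill's script, and keeping `:` unescaped is only a readability choice.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"

def build_search_url(search_query, max_results=5, start=0):
    """Return an arXiv API URL for a query written in readable form.

    urlencode() turns spaces into '+' and double quotes into '%22',
    so phrases and boolean operators need no manual escaping.
    """
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    # safe=":" keeps the field prefixes (ti:, cat:, ...) readable in the URL.
    return f"{ARXIV_API}?{urlencode(params, safe=':')}"

url = build_search_url('ti:"chain of thought" AND cat:cs.CL')
print(url)
```

The printed URL can be passed to `curl -s` exactly like the hand-written examples.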

## Sort and Pagination

| Parameter | Options |
|-----------|---------|
| `sortBy` | `relevance`, `lastUpdatedDate`, `submittedDate` |
| `sortOrder` | `ascending`, `descending` |
| `start` | Result offset (0-based) |
| `max_results` | Number of results (default 10; at most 2000 per request, 30000 in total across pages) |

```bash
# Latest 10 papers in cs.AI
curl -s "https://export.arxiv.org/api/query?search_query=cat:cs.AI&sortBy=submittedDate&sortOrder=descending&max_results=10"
```
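
To walk through more results than fit in one response, `start` can be advanced by the page size on each request. The following is a minimal sketch of that idea; `page_urls` is a hypothetical helper, and the page size of 100 is an arbitrary assumption rather than a recommended value.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"

def page_urls(search_query, total, page_size=100):
    """Yield one API URL per page, stepping `start` by `page_size`."""
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "sortBy": "submittedDate",
            "sortOrder": "descending",
            "start": start,
            # The last page may be shorter than page_size.
            "max_results": min(page_size, total - start),
        }
        yield f"{ARXIV_API}?{urlencode(params, safe=':')}"

urls = list(page_urls("cat:cs.AI", total=250, page_size=100))
for url in urls:
    # A real client would fetch `url` here and pause between requests.
    print(url)
```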

## Fetching Specific Papers

```bash
# By arXiv ID
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300"

# Multiple papers
curl -s "https://export.arxiv.org/api/query?id_list=2402.03300,2401.12345,2403.00001"
```

## BibTeX Generation

After fetching metadata for a paper, generate a BibTeX entry:

{% raw %}
```bash
curl -s "https://export.arxiv.org/api/query?id_list=1706.03762" | python3 -c "
import sys, xml.etree.ElementTree as ET
ns = {'a': 'http://www.w3.org/2005/Atom', 'arxiv': 'http://arxiv.org/schemas/atom'}
root = ET.parse(sys.stdin).getroot()
entry = root.find('a:entry', ns)
if entry is None: sys.exit('Paper not found')
title = entry.find('a:title', ns).text.strip().replace('\n', ' ')
authors = ' and '.join(a.find('a:name', ns).text for a in entry.findall('a:author', ns))
year = entry.find('a:published', ns).text[:4]
raw_id = entry.find('a:id', ns).text.strip().split('/abs/')[-1]
cat = entry.find('arxiv:primary_category', ns)
primary = cat.get('term') if cat is not None else 'cs.LG'
last_name = entry.find('a:author', ns).find('a:name', ns).text.split()[-1]
print(f'@article{{{last_name}{year}_{raw_id.replace(\".\", \"\")},')
print(f'  title = {{{title}}},')
print(f'  author = {{{authors}}},')
print(f'  year = {{{year}}},')
print(f'  eprint = {{{raw_id}}},')
print(f'  archivePrefix = {{arXiv}},')
print(f'  primaryClass = {{{primary}}},')
print(f'  url = {{https://arxiv.org/abs/{raw_id}}}')
print('}')
"
```
{% endraw %}

## Reading Paper Content

After finding a paper, read it:

```
# Abstract page (fast, metadata + abstract)
web_extract(urls=["https://arxiv.org/abs/2402.03300"])

# Full paper (PDF → markdown via Firecrawl)
web_extract(urls=["https://arxiv.org/pdf/2402.03300"])
```

For local PDF processing, see the `ocr-and-documents` skill.

## Common Categories

| Category | Field |
|----------|-------|
| `cs.AI` | Artificial Intelligence |
| `cs.CL` | Computation and Language (NLP) |
| `cs.CV` | Computer Vision |
| `cs.LG` | Machine Learning |
| `cs.CR` | Cryptography and Security |
| `stat.ML` | Machine Learning (Statistics) |
| `math.OC` | Optimization and Control |
| `physics.comp-ph` | Computational Physics |

Full list: https://arxiv.org/category_taxonomy

## Helper Script

The `scripts/search_arxiv.py` script handles XML parsing and provides clean output:

```bash
python scripts/search_arxiv.py "GRPO reinforcement learning"
python scripts/search_arxiv.py "transformer attention" --max 10 --sort date
python scripts/search_arxiv.py --author "Yann LeCun" --max 5
python scripts/search_arxiv.py --category cs.AI --sort date
python scripts/search_arxiv.py --id 2402.03300
python scripts/search_arxiv.py --id 2402.03300,2401.12345
```

No dependencies: the script uses only the Python standard library.

---

## Semantic Scholar (Citations, Related Papers, Author Profiles)

arXiv doesn't provide citation data or recommendations. Use the **Semantic Scholar API** for that: it is free, needs no key for basic use (1 req/sec), and returns JSON.

### Get paper details + citations

```bash
# By arXiv ID
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300?fields=title,authors,citationCount,referenceCount,influentialCitationCount,year,abstract" | python3 -m json.tool

# By Semantic Scholar paper ID or DOI
curl -s "https://api.semanticscholar.org/graph/v1/paper/DOI:10.1234/example?fields=title,citationCount"
```

### Get citations OF a paper (who cited it)

```bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/citations?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool
```

### Get references FROM a paper (what it cites)

```bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:2402.03300/references?fields=title,authors,year,citationCount&limit=10" | python3 -m json.tool
```

### Search papers (alternative to arXiv search, returns JSON)

```bash
curl -s "https://api.semanticscholar.org/graph/v1/paper/search?query=GRPO+reinforcement+learning&limit=5&fields=title,authors,year,citationCount,externalIds" | python3 -m json.tool
```
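
Because the search request above asks for `externalIds`, its results can be mapped back to arXiv IDs for the fetch commands earlier in this document. The sketch below assumes the response wraps results in a `data` list and that the arXiv ID sits under an `ArXiv` key; both are assumptions about the response shape, and the sample payload is made up for illustration.

```python
import json

# Hypothetical, trimmed response; real responses carry more fields.
sample = """
{"total": 2, "data": [
  {"paperId": "abc", "title": "Paper A", "citationCount": 12,
   "externalIds": {"ArXiv": "2402.03300", "DOI": "10.1234/example"}},
  {"paperId": "def", "title": "Paper B", "citationCount": 3,
   "externalIds": {"DOI": "10.1234/other"}}
]}
"""

def arxiv_ids(response_text):
    """Return (arxiv_id, title) pairs for results that have an arXiv ID."""
    papers = json.loads(response_text).get("data", [])
    found = []
    for paper in papers:
        external = paper.get("externalIds") or {}  # may be missing or null
        if "ArXiv" in external:
            found.append((external["ArXiv"], paper.get("title", "")))
    return found

print(arxiv_ids(sample))
```

Results without an arXiv ID are skipped rather than treated as errors, since not every indexed paper is on arXiv.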

### Get paper recommendations

```bash
curl -s -X POST "https://api.semanticscholar.org/recommendations/v1/papers/" \
  -H "Content-Type: application/json" \
  -d '{"positivePaperIds": ["arXiv:2402.03300"], "negativePaperIds": []}' | python3 -m json.tool
```

### Author profile

```bash
curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=Yann+LeCun&fields=name,hIndex,citationCount,paperCount" | python3 -m json.tool
```

### Useful Semantic Scholar fields

`title`, `authors`, `year`, `abstract`, `citationCount`, `referenceCount`, `influentialCitationCount`, `isOpenAccess`, `openAccessPdf`, `fieldsOfStudy`, `publicationVenue`, `externalIds` (contains arXiv ID, DOI, etc.)

---

## Complete Research Workflow

1. **Discover**: `python scripts/search_arxiv.py "your topic" --sort date --max 10`
2. **Assess impact**: `curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID?fields=citationCount,influentialCitationCount"`
3. **Read abstract**: `web_extract(urls=["https://arxiv.org/abs/ID"])`
4. **Read full paper**: `web_extract(urls=["https://arxiv.org/pdf/ID"])`
5. **Find related work**: `curl -s "https://api.semanticscholar.org/graph/v1/paper/arXiv:ID/references?fields=title,citationCount&limit=20"`
6. **Get recommendations**: POST to Semantic Scholar recommendations endpoint
7. **Track authors**: `curl -s "https://api.semanticscholar.org/graph/v1/author/search?query=NAME"`

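
Steps 2 to 5 all derive their URLs from a single arXiv ID, so they can be generated in one place. This is a small sketch of that substitution; `workflow_urls` and its dictionary keys are hypothetical names chosen for illustration.

```python
S2_GRAPH = "https://api.semanticscholar.org/graph/v1"

def workflow_urls(arxiv_id):
    """Fill the ID placeholder of workflow steps 2-5 for one paper."""
    paper = f"{S2_GRAPH}/paper/arXiv:{arxiv_id}"
    return {
        "impact": f"{paper}?fields=citationCount,influentialCitationCount",
        "abstract": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
        "references": f"{paper}/references?fields=title,citationCount&limit=20",
    }

for step, url in workflow_urls("2402.03300").items():
    print(f"{step}: {url}")
```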

## Rate Limits

| API | Rate | Auth |
|-----|------|------|
| arXiv | ~1 req / 3 seconds | None needed |
| Semantic Scholar | 1 req / second | None (100/sec with API key) |

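
One way to stay inside these limits from a script is to enforce a minimum gap between calls. The class below is a sketch under that assumption; `Throttle` is a hypothetical helper, and the injectable `clock` and `sleep` arguments exist only so the behaviour can be checked without real waiting.

```python
import time

class Throttle:
    """Block so that successive wait() calls are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

arxiv_throttle = Throttle(3.0)  # roughly 1 request per 3 seconds
s2_throttle = Throttle(1.0)     # 1 request per second without a key
```

Calling `arxiv_throttle.wait()` immediately before each arXiv request is enough; the first call returns without sleeping.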

## Notes

- arXiv returns Atom XML; use the helper script or the parsing snippet for clean output
- Semantic Scholar returns JSON; pipe it through `python3 -m json.tool` for readability
- arXiv IDs: old format (`hep-th/0601001`) vs new (`2402.03300`)
- PDF: `https://arxiv.org/pdf/{id}`; abstract: `https://arxiv.org/abs/{id}`
- HTML (when available): `https://arxiv.org/html/{id}`
- For local PDF processing, see the `ocr-and-documents` skill

## ID Versioning

- `arxiv.org/abs/1706.03762` always resolves to the **latest** version
- `arxiv.org/abs/1706.03762v1` points to a **specific** immutable version
- When generating citations, preserve the version suffix you actually read to prevent citation drift (a later version may substantially change content)
- The API `<id>` field returns the versioned URL (e.g., `http://arxiv.org/abs/1706.03762v7`)

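
Since the API returns a versioned URL, it can help to separate the base ID from the version before deciding which form to cite. The following sketch does that with a regular expression; `split_arxiv_id` is a hypothetical helper, and it assumes the version suffix always has the form `v` followed by digits.

```python
import re

# Base ID, optionally followed by a version suffix such as "v7".
_ID_PATTERN = re.compile(r"^(?P<base>.+?)(?:v(?P<version>\d+))?$")

def split_arxiv_id(id_or_url):
    """Return (base_id, version) where version is an int or None."""
    raw = id_or_url.strip().split("/abs/")[-1]
    match = _ID_PATTERN.match(raw)
    version = match.group("version")
    return match.group("base"), (int(version) if version else None)

print(split_arxiv_id("http://arxiv.org/abs/1706.03762v7"))
print(split_arxiv_id("hep-th/0601001"))
```

Keeping the pair lets a citation use the exact version that was read, while lookups that should follow the latest version use the base ID alone.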

## Withdrawn Papers

Papers can be withdrawn after submission. When this happens:
- The `<summary>` field contains a withdrawal notice (look for "withdrawn" or "retracted")
- Metadata fields may be incomplete
- Always check the summary before treating a result as a valid paper

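
The summary check described above can be sketched as a keyword test on each entry. This is a heuristic only: a legitimate abstract may mention these words, so a match is a prompt to look closer rather than proof. `looks_withdrawn` is a hypothetical helper, and the sample feed is invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {"a": "http://www.w3.org/2005/Atom"}
KEYWORDS = ("withdrawn", "retracted")

def looks_withdrawn(entry):
    """Return True if the entry's summary mentions a withdrawal keyword."""
    # findtext() returns the default when <summary> is missing entirely.
    summary = entry.findtext("a:summary", default="", namespaces=NS)
    return any(word in summary.lower() for word in KEYWORDS)

# Hypothetical two-entry feed standing in for a real API response.
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00001v2</id>
    <summary>This paper has been withdrawn by the author.</summary>
  </entry>
  <entry>
    <id>http://arxiv.org/abs/0000.00002v1</id>
    <summary>We study a reinforcement learning method.</summary>
  </entry>
</feed>"""

root = ET.fromstring(sample)
flags = [looks_withdrawn(entry) for entry in root.findall("a:entry", NS)]
print(flags)
```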