Deterministic arXiv batch downloader based on category.md. By default, only PDFs are downloaded; use --with-src if source packages are needed.
- arxiv_downloader.py: Main program
- category.md: Research fields and arXiv category codes table (drives the download)
- metadata.json: Download result index (generated after execution)
- Reads category codes for each research field from category.md
- Downloads --per-field papers for each field
- Verifies and downloads PDFs, optionally downloads source packages (src)
- Writes a unified index to metadata.json
- Python 3.8+ (standard library only, no third-party dependencies)
- Access to arXiv (use a proxy if necessary)
Run in the current directory:
    python arxiv_downloader.py --per-field 5
Download PDFs only (default):
    python arxiv_downloader.py --per-field 5
Download PDF + source package:
    python arxiv_downloader.py --per-field 5 --with-src
Self-test network connectivity:
    python arxiv_downloader.py --self-test

- --per-field: Number of papers to download per field (default: 5)
- --max-results-factor: Multiplier for API candidate count (default: 3)
- --timeout: Request timeout in seconds (default: 12.0)
- --workers: Number of concurrent threads (default: 6)
- --retries: Number of retry attempts on failure (default: 2)
- --backoff: Backoff multiplier (default: 1.7)
- --with-src: Download source package (src)
- --api-min-interval: Minimum interval between API requests (default: 3.2 seconds)
- --field-nos: Download only specified field numbers, comma-separated (e.g., 1,2,5)
- --max-fields: Process at most the first N fields
- --dry-run: Select only, without downloading
- --local-dir: Root download directory (default: .cache, relative to current directory)
- --quiet: Disable progress output
- --proxy: Force proxy (e.g., http://127.0.0.1:7890)
- --self-test: Connectivity test then exit
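For orientation, the options above could be declared with argparse roughly as follows. This is a sketch reconstructed from the documented flags and defaults only; the function names build_parser and parse_field_nos are hypothetical, and the real arxiv_downloader.py may define its CLI differently.

```python
import argparse
from typing import List


def parse_field_nos(value: str) -> List[int]:
    """Turn a comma-separated string such as '1,2,5' into [1, 2, 5]."""
    return [int(part) for part in value.split(",") if part.strip()]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction from the documented options;
    # not copied from arxiv_downloader.py.
    p = argparse.ArgumentParser(prog="arxiv_downloader.py")
    p.add_argument("--per-field", type=int, default=5)
    p.add_argument("--max-results-factor", type=int, default=3)
    p.add_argument("--timeout", type=float, default=12.0)
    p.add_argument("--workers", type=int, default=6)
    p.add_argument("--retries", type=int, default=2)
    p.add_argument("--backoff", type=float, default=1.7)
    p.add_argument("--with-src", action="store_true")
    p.add_argument("--api-min-interval", type=float, default=3.2)
    p.add_argument("--field-nos", type=parse_field_nos, default=None)
    p.add_argument("--max-fields", type=int, default=None)
    p.add_argument("--dry-run", action="store_true")
    p.add_argument("--local-dir", default=".cache")
    p.add_argument("--quiet", action="store_true")
    p.add_argument("--proxy", default=None)
    p.add_argument("--self-test", action="store_true")
    return p


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(vars(args))
```

argparse exposes each flag as an attribute with dashes replaced by underscores, so --per-field becomes args.per_field and --with-src becomes args.with_src.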
Default download directory is .cache (can be changed via --local-dir). Example structure:
.cache/
  01-artificial-intelligence/
    2401.01234/
      paper.pdf
      source.tar.gz
field_slug is generated from the number and field name; source.tar.gz is downloaded only when --with-src is enabled.
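One plausible way to derive such a slug and the per-paper directory is sketched below. The exact slug rules used by the program are not documented here, so the zero-padding and lowercasing are assumptions inferred from the example directory name, and make_field_slug and paper_dir are hypothetical helper names.

```python
import re
from pathlib import Path


def make_field_slug(field_no: int, field_name: str) -> str:
    """Build a slug like '01-artificial-intelligence' (assumed format)."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    name = re.sub(r"[^a-z0-9]+", "-", field_name.lower()).strip("-")
    return f"{field_no:02d}-{name}"


def paper_dir(root: str, field_no: int, field_name: str, arxiv_id: str) -> Path:
    """Directory that would hold paper.pdf (and source.tar.gz with --with-src)."""
    return Path(root) / make_field_slug(field_no, field_name) / arxiv_id


if __name__ == "__main__":
    print(paper_dir(".cache", 1, "Artificial Intelligence", "2401.01234"))
```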
After execution, metadata.json is generated. Example fields:
- field_no / field_name / field_slug
- primary_codes: Primary category codes for the field
- arxiv_id / title / authors / categories
- local_dir: Path relative to this directory
- files: Relative paths for pdf and src
- urls: abs / pdf / src links
- downloaded: Whether download succeeded
- status: ok / ok_cached / failed
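To make the shape of one index entry concrete, here is a sketch that assembles a record with the fields listed above. The overall layout of metadata.json (for example, whether entries sit in a top-level list) is not specified, so treat this as an assumption; make_entry is a hypothetical helper, the title and author values are placeholders, and deriving downloaded from status is a guess.

```python
import json


def make_entry(field_no, field_name, field_slug, primary_codes,
               arxiv_id, title, authors, categories,
               root=".cache", with_src=False, status="ok"):
    """Assemble one illustrative metadata record (assumed structure)."""
    local_dir = f"{root}/{field_slug}/{arxiv_id}"
    files = {"pdf": f"{local_dir}/paper.pdf"}
    if with_src:
        files["src"] = f"{local_dir}/source.tar.gz"
    return {
        "field_no": field_no,
        "field_name": field_name,
        "field_slug": field_slug,
        "primary_codes": primary_codes,
        "arxiv_id": arxiv_id,
        "title": title,
        "authors": authors,
        "categories": categories,
        "local_dir": local_dir,
        "files": files,
        "urls": {
            "abs": f"https://arxiv.org/abs/{arxiv_id}",
            "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
            "src": f"https://arxiv.org/e-print/{arxiv_id}",
        },
        # Assumption: both "ok" and "ok_cached" count as a successful download.
        "downloaded": status in ("ok", "ok_cached"),
        "status": status,
    }


if __name__ == "__main__":
    entry = make_entry(1, "Artificial Intelligence", "01-artificial-intelligence",
                       ["cs.AI"], "2401.01234", "Placeholder title",
                       ["A. Author"], ["cs.AI"])
    print(json.dumps(entry, indent=2))
```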
category.md maintains research fields and their category codes in a table format. The program parses the "Primary arXiv Codes" column where codes are wrapped in backticks.
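The parsing step described above might look like the following sketch. Only the "Primary arXiv Codes" column name comes from this README; the "No." and "Field" column names, the function name parse_category_table, and the pipe-delimited table layout are assumptions for illustration.

```python
import re

CODE_RE = re.compile(r"`([^`]+)`")   # text wrapped in backticks
CODES_COLUMN = "Primary arXiv Codes"  # column name taken from the README
NO_COLUMN = "No."                     # hypothetical column name
NAME_COLUMN = "Field"                 # hypothetical column name


def parse_category_table(text: str):
    """Extract field number, name, and backticked codes from a pipe table."""
    header = None
    fields = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue  # not a table row
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells  # first table row is the header
            continue
        if all(set(c) <= set("-: ") for c in cells):
            continue  # separator row such as |---|---|
        row = dict(zip(header, cells))
        codes = CODE_RE.findall(row.get(CODES_COLUMN, ""))
        if codes:
            fields.append({
                "field_no": int(row[NO_COLUMN]),
                "field_name": row[NAME_COLUMN],
                "primary_codes": codes,
            })
    return fields
```

Rows whose codes column contains no backticked text are skipped in this sketch; whether the real program skips or reports them is not stated.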
- If network access is restricted, configure system proxy or use --proxy.
- Setting --workers too high may trigger rate limiting; choose a moderate value.
- --with-src significantly slows down the process, and not all papers have available source packages.
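As background on the rate-limiting note, a minimum-interval throttle of the kind that --api-min-interval suggests can be sketched as below. This is not taken from arxiv_downloader.py; the class name MinIntervalThrottle and its injectable clock and sleep parameters are hypothetical, and the real program may space its requests differently.

```python
import threading
import time


class MinIntervalThrottle:
    """Ensure at least `min_interval` seconds between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None
        self._lock = threading.Lock()  # serialize callers from worker threads

    def wait(self):
        with self._lock:
            now = self._clock()
            if self._last is not None:
                remaining = self._min_interval - (now - self._last)
                if remaining > 0:
                    self._sleep(remaining)
                    now = self._clock()
            self._last = now


if __name__ == "__main__":
    throttle = MinIntervalThrottle(3.2)
    for i in range(2):
        throttle.wait()  # the second call blocks until 3.2 s have elapsed
        print("request", i)
```

Because every caller passes through the same lock, adding more workers does not raise the API request rate in this sketch; it only affects how many downloads proceed in parallel.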