A personal digital library toolkit for building a local, searchable corpus of scanned publications — magazines, books, and periodicals from online archives.
Once built, the library is navigable by grep, by hand, or by any AI assistant. An AI with access to the indexed corpus can act as an informed librarian immediately — searching, cross-referencing, and surfacing insights across thousands of pages.
See LIBRARIAN.md for the AI orientation guide.
- Download — scrape PDF links from an archive page and download them locally
- Convert — extract OCR text and render each page as a PNG, producing searchable Markdown
- Search — grep, browse, or open the library in an AI-assisted editor and query in plain English
- Record — write research findings to findings/ (gitignored; can sync via Dropbox, iCloud, or Google Drive)
The collections/ directory holds your library. PDFs and indexed output are gitignored so no
copyrighted material ever enters the repository. Collection metadata (COLLECTION.md) is tracked,
so the shape of your library is version-controlled even if the contents are not.
```
pip3 install -r requirements.txt
```

Or install individually:

```
pip3 install pymupdf            # required for convert.py
pip3 install internetarchive    # required for archive.org downloads
```

Requires Python 3.10+.
| Script | Purpose |
|---|---|
| download.py | Scrape all PDF links from an archive page and download them |
| convert.py | Convert a folder of PDFs to searchable Markdown with page images |
| search.py | Search across all indexed collections with grouped, formatted output |
| init-findings.sh | Scaffold the findings/ directory, with optional cloud-storage symlink |
| init-symlinks.sh | Recreate cloud-storage symlinks (auto-derived from collections/) |
| bootstrap.sh | Full reconstruction pipeline: symlinks → download → convert → catalogue |
The source is auto-detected from the URL. Both modes share --output-dir, --delay, and --dry-run.
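The detection logic isn't shown here, but it presumably keys off the hostname. A minimal sketch of that idea (illustrative only, not download.py's actual code):

```shell
# Hypothetical hostname-based source detection, mirroring the auto-detect behaviour
detect_source() {
  case "$1" in
    *archive.org/*)           echo "archive.org" ;;
    *worldradiohistory.com/*) echo "worldradiohistory" ;;
    *)                        echo "unknown" ;;
  esac
}

detect_source "https://archive.org/details/ElektorMagazine"          # archive.org
detect_source "https://www.worldradiohistory.com/ETI_Magazine.htm"   # worldradiohistory
```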
World Radio History — scrapes PDF links from an archive page:
```
# Preview what would be downloaded
python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" --dry-run

# Download everything
python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" \
    --output-dir collections/eti/pdfs

# Download a subset matching a string
python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" \
    --filter "1970" --output-dir collections/eti/pdfs
```

archive.org — downloads files from a single archive.org item by identifier.
Each issue typically has two PDF variants: a plain image PDF and a _text.pdf with an
Abbyy OCR text layer. The --pdf-format flag controls which variant is downloaded
(text is the default since convert.py extracts from the OCR layer):
```
# Download all OCR PDFs from an archive.org item
python3 download.py "https://archive.org/details/ElektorMagazine" \
    --output-dir collections/elektor/pdfs

# Download only issues from a specific decade
python3 download.py "https://archive.org/details/ElektorMagazine" \
    --output-dir collections/elektor/pdfs \
    --year-from 1974 --year-to 1989

# Download image-only PDFs (no OCR layer)
python3 download.py "https://archive.org/details/ElektorMagazine" \
    --pdf-format image --output-dir collections/elektor/pdfs

# Preview without downloading
python3 download.py "https://archive.org/details/ElektorMagazine" \
    --year-from 1980 --dry-run
```

| Flag | Description | Default |
|---|---|---|
| --pdf-format | text (_text.pdf, OCR), image (plain PDF), or both | text |
| --year-from | Only download files with a year >= this value | — |
| --year-to | Only download files with a year <= this value | — |
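The year bounds imply a four-digit year is parsed out of each filename. A minimal sketch of such a filter (an assumption about the implementation, not download.py's actual code):

```shell
# Hypothetical year extraction: first 19xx/20xx run found in the filename
year_of() { printf '%s\n' "$1" | grep -oE '(19|20)[0-9]{2}' | head -n 1; }

year_of "elektor-1984-05_text.pdf"   # 1984

# Apply --year-from / --year-to style bounds
in_range() {
  y="$(year_of "$1")"
  [ -n "$y" ] && [ "$y" -ge "$2" ] && [ "$y" -le "$3" ]
}
in_range "elektor-1984-05_text.pdf" 1974 1989 && echo "keep"
```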
```
python3 convert.py --analyze --input-dir collections/eti/pdfs
```

Reports OCR coverage, detected naming patterns, and page counts, and suggests a convert command.
```
python3 convert.py \
    --input-dir collections/eti/pdfs \
    --output-dir collections/eti/indexed
```

Each PDF becomes a directory containing:

- content.md — full OCR text with page images embedded inline
- index.md — page-by-page article and section headings
- pages/page-NNN.png — each page rendered at 200 DPI
A master index.md is written to the output root linking all publications.
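Put together, the indexed output for one collection looks like this (the publication directory name here is illustrative):

```
collections/eti/indexed/
├── index.md                 ← master index linking all publications
└── eti-1978-01/
    ├── content.md           ← full OCR text with inline page images
    ├── index.md             ← page-by-page headings
    └── pages/
        ├── page-001.png
        └── page-002.png
```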
```
# Find all publications mentioning a topic
grep -ril "VCA\|voltage controlled amplifier" collections/eti/indexed/

# Show matching lines with context
grep -in -A3 "fuzz box" collections/*/indexed/*/content.md
```

Open the collections/ directory in an AI-assisted editor and ask questions in plain English:
"There's a Guitar Fuzz Box project in Hobby Electronics magazine. Please find it for me."
"Are there any synthesiser projects across the whole collection?"
"What does the ETI Transcendent Polysynth series cover, and which issues should I read first?"
The AI reads LIBRARIAN.md and the collection index files to orient itself, then navigates freely.
Write research outputs to findings/ — topic references, cross-collection notes, article summaries.
This folder is gitignored so personal research never enters the repository.
Context files are included for all major AI coding assistants. When you open this project, your
assistant reads its context file and is directed to LIBRARIAN.md automatically — no prompting needed.
| Assistant | Context file |
|---|---|
| Claude Code | CLAUDE.md |
| OpenAI Codex CLI | AGENTS.md |
| Google Gemini CLI | GEMINI.md |
| GitHub Copilot | .github/copilot-instructions.md |
The library corpus and research findings can be stored in cloud storage and symlinked into the project, making everything available across multiple machines without committing copyrighted content.
collections/*/pdfs, collections/*/indexed, and findings/ are all gitignored, so symlinks
to cloud folders work seamlessly with version control.
The cloud storage layout mirrors the repo structure exactly. For example, using Dropbox:
```
~/Dropbox/my-library/
├── findings/                 ← research findings (symlinked to findings/)
└── collections/
    ├── collection-a/
    │   ├── pdfs/             ← PDFs for collection A (symlinked to collections/collection-a/pdfs)
    │   └── indexed/          ← converted output (symlinked to collections/collection-a/indexed)
    └── collection-b/
        ├── pdfs/
        └── indexed/
```
Mirroring the repo layout means relative links in findings/*.md resolve correctly when
apps resolve symlinks to their real filesystem path.
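In shell terms, the links init-symlinks.sh maintains boil down to ln -s calls of this shape. In this sketch LIBRARY_BASE and the repo path point at temporary directories; in practice LIBRARY_BASE would be your cloud root:

```shell
# Stand-ins for the cloud root and the cloned repo;
# in practice e.g. LIBRARY_BASE="${HOME}/Dropbox/my-library"
LIBRARY_BASE="$(mktemp -d)"
repo="$(mktemp -d)"

# Real directories live in cloud storage...
mkdir -p "${LIBRARY_BASE}/findings" \
         "${LIBRARY_BASE}/collections/collection-a/pdfs"

# ...and the repo holds symlinks mirroring the same relative paths
mkdir -p "${repo}/collections/collection-a"
ln -sfn "${LIBRARY_BASE}/findings" "${repo}/findings"
ln -sfn "${LIBRARY_BASE}/collections/collection-a/pdfs" \
        "${repo}/collections/collection-a/pdfs"

ls -l "${repo}/findings"   # shows the symlink into cloud storage
```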
After cloning on a new machine, bootstrap.sh rebuilds the entire library in one step:
```
cp .env.template .env

# Edit .env — set LIBRARY_BASE to your cloud storage root, e.g.:
# LIBRARY_BASE="${HOME}/Dropbox/my-library"

./bootstrap.sh
```

This creates the cloud directories, restores symlinks, downloads any missing PDFs (using the Source URL
from each COLLECTION.md), converts them to searchable Markdown, and regenerates CATALOGUE.md.
The script is idempotent — already-downloaded PDFs and already-converted output are skipped.
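The idempotence presumably comes down to an existence check before each expensive step. The pattern, sketched (not bootstrap.sh's actual code):

```shell
# Run a command only if its target file does not already exist
make_if_missing() {
  target="$1"; shift
  if [ -e "$target" ]; then
    echo "skip: $target"
  else
    "$@"
  fi
}

pdf="$(mktemp -d)/demo.pdf"
make_if_missing "$pdf" touch "$pdf"   # first call creates the file
make_if_missing "$pdf" touch "$pdf"   # second call prints "skip: ..."
```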
init-symlinks.sh restores symlinks without downloading or converting. Symlink targets are
auto-derived from collections/ using the naming convention above. To override, define a LINKS
array in .env:
```
# .env
LINKS=(
  "findings:${LIBRARY_BASE}/findings"
  "collections/collection-a/pdfs:${LIBRARY_BASE}/collections/collection-a/pdfs"
  "collections/collection-a/indexed:${LIBRARY_BASE}/collections/collection-a/indexed"
)
```

Then run:

```
./init-symlinks.sh
```

See CATALOGUE.md for all collections.
| Collection | Period | PDFs | Pages |
|---|---|---|---|
| Hobby Electronics | 1978–1984 | 67 | ~5,000 |
| ETI — Electronics Today International | 1972–1999 | 326 | 27,328 |
| Everyday Electronics | 1971–1999 | 332 | 24,430 |
| Bernards/Babani BP Books | Various | 111 | 16,153 |
| Electronics & Music Maker / Music Technology | 1981–1994 | 132 | 12,554 |
| Elektor | 1974–1989 | 160 | 9,165 |
| Practical Electronics | 1964–1992 | 341 | 27,145 |
| Electronic Musician / Polyphony | 1975–2023 | 443 | 10,389 |
| Polyphony | 1975–1985 | 45 | 1,761 |
| Popular Electronics / Poptronics | 1954–2003 | 595 | 69,656 |
| Radio Electronics | 1948–1999 | 636 | ~74,000 |
| Moritz Klein | Ongoing | 15 | 903 |
Click Use this template on GitHub to create your own library repository. The collections/ PDFs
and indexed output are excluded by .gitignore, but collection metadata (COLLECTION.md files) and
the project structure are tracked — so the shape of your library is preserved in version control.
To pull improvements from this template into your instance, add it as an upstream remote once:
```
git remote add upstream https://github.com/ali5ter/publication-library.git
```

Then merge whenever you want to pick up changes:
```
git fetch upstream
git merge upstream/main --no-edit
git push origin main
```

Git handles merging template changes with your instance-specific commits automatically. Conflicts
are unlikely since the template never modifies COLLECTION.md files or CATALOGUE.md.
Bug reports and enhancement requests are welcome via GitHub Issues.
PDFs and all converted output are derived from copyrighted material. This repository contains
only the scripts and metadata. Collection PDFs and indexed output are excluded via .gitignore
and must not be committed or redistributed.
Source PDFs can be downloaded from World Radio History for personal use.