ali5ter/electronics-publications-library

publication-library

A personal digital library toolkit for building a local, searchable corpus of scanned publications — magazines, books, and periodicals from online archives.

Once built, the library is navigable by grep, by hand, or by any AI assistant. An AI with access to the indexed corpus can act as an informed librarian immediately — searching, cross-referencing, and surfacing insights across thousands of pages.

See LIBRARIAN.md for the AI orientation guide.


How it works

  1. Download — scrape PDF links from an archive page and download them locally
  2. Convert — extract OCR text and render each page as a PNG, producing searchable Markdown
  3. Search — grep, browse, or open the library in an AI-assisted editor and query in plain English
  4. Record — write research findings to findings/ (gitignored; can sync via Dropbox, iCloud, or Google Drive)

The collections/ directory holds your library. PDFs and indexed output are gitignored so no copyrighted material ever enters the repository. Collection metadata (COLLECTION.md) is tracked, so the shape of your library is version-controlled even if the contents are not.


Requirements

pip3 install -r requirements.txt

Or install individually:

pip3 install pymupdf          # required for convert.py
pip3 install internetarchive  # required for archive.org downloads

Python 3.10+.


Tools

| Script | Purpose |
| --- | --- |
| download.py | Scrape all PDF links from an archive page and download them |
| convert.py | Convert a folder of PDFs to searchable Markdown with page images |
| search.py | Search across all indexed collections with grouped, formatted output |
| init-findings.sh | Scaffold the findings/ directory, with optional cloud storage symlink |
| init-symlinks.sh | Recreate cloud-storage symlinks (auto-derived from collections/) |
| bootstrap.sh | Full reconstruction pipeline: symlinks → download → convert → catalogue |

Workflow

1. Download a collection

The source (World Radio History or archive.org) is auto-detected from the URL. Both modes share --output-dir, --delay, and --dry-run.

World Radio History — scrapes PDF links from an archive page:

# Preview what would be downloaded
python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" --dry-run

# Download everything
python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" \
  --output-dir collections/eti/pdfs

# Download a subset matching a string
python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" \
  --filter "1970" --output-dir collections/eti/pdfs

archive.org — downloads files from a single archive.org item by identifier. Each issue typically has two PDF variants: a plain image PDF and a _text.pdf with an Abbyy OCR text layer. The --pdf-format flag controls which variant is downloaded (text is the default since convert.py extracts from the OCR layer):

# Download all OCR PDFs from an archive.org item
python3 download.py "https://archive.org/details/ElektorMagazine" \
  --output-dir collections/elektor/pdfs

# Download only issues from a specific decade
python3 download.py "https://archive.org/details/ElektorMagazine" \
  --output-dir collections/elektor/pdfs \
  --year-from 1974 --year-to 1989

# Download image-only PDFs (no OCR layer)
python3 download.py "https://archive.org/details/ElektorMagazine" \
  --pdf-format image --output-dir collections/elektor/pdfs

# Preview without downloading
python3 download.py "https://archive.org/details/ElektorMagazine" \
  --year-from 1980 --dry-run

| Flag | Description | Default |
| --- | --- | --- |
| --pdf-format | text (_text.pdf, OCR), image (plain PDF), or both | text |
| --year-from | Only download files with a year >= this value | |
| --year-to | Only download files with a year <= this value | |
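
The variant and year filters can be pictured as a pure function over a list of file names. The sketch below is an assumption about how such filtering might work, not a copy of download.py. The year-detection regex, the rule that files without a detectable year are dropped when a year bound is given, and the sample file names are all hypothetical.

```python
import re

# Assumption: the year is the first standalone 4-digit number starting with 19 or 20.
YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def select_files(names, pdf_format="text", year_from=None, year_to=None):
    """Filter file names by PDF variant (text, image, both) and an optional year range."""
    selected = []
    for name in names:
        lower = name.lower()
        if not lower.endswith(".pdf"):
            continue
        is_text = lower.endswith("_text.pdf")
        if pdf_format == "text" and not is_text:
            continue
        if pdf_format == "image" and is_text:
            continue
        match = YEAR_RE.search(name)
        year = int(match.group(0)) if match else None
        # Files with no detectable year are skipped once a bound is set.
        if year_from is not None and (year is None or year < year_from):
            continue
        if year_to is not None and (year is None or year > year_to):
            continue
        selected.append(name)
    return selected


# Hypothetical file names for illustration only.
names = [
    "Elektor-1975-03.pdf",
    "Elektor-1975-03_text.pdf",
    "Elektor-1991-07_text.pdf",
    "cover.jpg",
]
print(select_files(names, pdf_format="text", year_from=1974, year_to=1989))
```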

2. Probe the collection structure

python3 convert.py --analyze --input-dir collections/eti/pdfs

Reports OCR coverage, detected naming patterns, page counts, and suggests a convert command.
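
One way to think about OCR coverage is as the fraction of pages that yield a meaningful amount of extracted text. The following sketch assumes that definition; the 20-character threshold and both function names are illustrative, and convert.py may measure coverage differently. PyMuPDF is imported lazily so the coverage calculation itself has no dependencies.

```python
def ocr_coverage(page_texts, min_chars=20):
    """Return the fraction of pages whose stripped text has at least min_chars characters."""
    if not page_texts:
        return 0.0
    with_text = sum(1 for text in page_texts if len(text.strip()) >= min_chars)
    return with_text / len(page_texts)


def page_texts_from_pdf(path):
    """Extract plain text from every page of a PDF using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so ocr_coverage works without it installed

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]


# Four hypothetical pages: two with text, one empty, one whitespace only.
print(f"{ocr_coverage(['', 'x' * 100, 'y' * 50, ' ']):.0%}")
```

A low figure for a PDF suggests it is the image-only variant and would produce little searchable text.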

3. Convert to searchable Markdown

python3 convert.py \
  --input-dir collections/eti/pdfs \
  --output-dir collections/eti/indexed

Each PDF becomes a directory containing:

  • content.md — full OCR text with page images embedded inline
  • index.md — page-by-page article and section headings
  • pages/page-NNN.png — each page rendered at 200 DPI

A master index.md is written to the output root linking all publications.
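
To make the output layout concrete, here is a hedged sketch of how page images and a content.md file could be produced with PyMuPDF. The function names, the heading format, and the three-digit zero padding are assumptions based on the page-NNN.png pattern above; the actual Markdown written by convert.py may look different.

```python
from pathlib import Path


def build_content_md(title, page_texts):
    """Assemble Markdown with one section per page and an inline link to its page image."""
    parts = [f"# {title}", ""]
    for number, text in enumerate(page_texts, start=1):
        image = f"pages/page-{number:03d}.png"
        parts += [f"## Page {number}", "", f"![Page {number}]({image})", "", text.strip(), ""]
    return "\n".join(parts)


def render_pages(pdf_path, out_dir, dpi=200):
    """Render each page to pages/page-NNN.png and return the extracted page texts."""
    import fitz  # PyMuPDF; imported lazily so build_content_md has no dependencies

    pages_dir = Path(out_dir) / "pages"
    pages_dir.mkdir(parents=True, exist_ok=True)
    texts = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            page.get_pixmap(dpi=dpi).save(str(pages_dir / f"page-{number:03d}.png"))
            texts.append(page.get_text())
    return texts


# Hypothetical page texts, used here instead of a real PDF.
print(build_content_md("ETI 1979-05", ["First page text", "Second page text"]))
```

Keeping the image paths relative to content.md means the publication directory can be moved or symlinked without breaking the embedded images.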

4. Search and research

# Find all publications mentioning a topic
grep -ril "VCA\|voltage controlled amplifier" collections/eti/indexed/

# Show matching lines with context
grep -in -A3 "fuzz box" collections/*/indexed/*/content.md
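
The same kind of search can be done from Python when grouped results are more convenient than raw grep output. This is a minimal sketch rather than the implementation of search.py. The function name is hypothetical, and it assumes the collections/<name>/indexed/<publication>/content.md layout described above.

```python
import re
from collections import defaultdict
from pathlib import Path


def search_library(root, pattern):
    """Return {publication: [(line_number, line), ...]} for case-insensitive regex matches."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = defaultdict(list)
    for content in sorted(Path(root).glob("*/indexed/*/content.md")):
        publication = content.parent.name
        # errors="replace" keeps the search going if OCR output has stray bytes.
        with content.open(encoding="utf-8", errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits[publication].append((line_no, line.strip()))
    return dict(hits)
```

Calling search_library("collections", r"fuzz box") returns a dictionary keyed by publication directory name, which can then be printed one group at a time.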

Open the collections/ directory in an AI-assisted editor and ask questions in plain English:

"There's a Guitar Fuzz Box project in Hobby Electronics magazine. Please find it for me."

"Are there any synthesiser projects across the whole collection?"

"What does the ETI Transcendent Polysynth series cover, and which issues should I read first?"

The AI reads LIBRARIAN.md and the collection index files to orient itself, then navigates freely.

5. Record findings

Write research outputs to findings/ — topic references, cross-collection notes, article summaries. This folder is gitignored so personal research never enters the repository.


AI assistant support

Context files are included for the AI coding assistants listed below. When you open this project, your assistant reads its context file and is directed to LIBRARIAN.md automatically, with no prompting needed.

| Assistant | Context file |
| --- | --- |
| Claude Code | CLAUDE.md |
| OpenAI Codex CLI | AGENTS.md |
| Google Gemini CLI | GEMINI.md |
| GitHub Copilot | .github/copilot-instructions.md |

Sharing across devices

The library corpus and research findings can be stored in cloud storage and symlinked into the project, making everything available across multiple machines without committing copyrighted content.

collections/*/pdfs, collections/*/indexed, and findings/ are all gitignored, so symlinks to cloud folders work seamlessly with version control.

Recommended cloud folder layout

The cloud storage layout mirrors the repo structure exactly. For example, using Dropbox:

~/Dropbox/my-library/
├── findings/              ← research findings (symlinked to findings/)
└── collections/
    ├── collection-a/
    │   ├── pdfs/          ← PDFs for collection A (symlinked to collections/collection-a/pdfs)
    │   └── indexed/       ← converted output (symlinked to collections/collection-a/indexed)
    └── collection-b/
        ├── pdfs/
        └── indexed/

Mirroring the repo layout means relative links in findings/*.md still resolve correctly when apps follow symlinks to their real filesystem path.

One-command reconstruction with bootstrap.sh

After cloning on a new machine, bootstrap.sh rebuilds the entire library in one step:

cp .env.template .env
# Edit .env — set LIBRARY_BASE to your cloud storage root, e.g.:
# LIBRARY_BASE="${HOME}/Dropbox/my-library"

./bootstrap.sh

This creates cloud directories, restores symlinks, downloads any missing PDFs (using the Source URL from each COLLECTION.md), converts them to searchable Markdown, and regenerates CATALOGUE.md. The script is idempotent — already-downloaded PDFs and already-converted output are skipped.
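
The idempotent behaviour can be understood as a check for existing output before any work is done. The sketch below shows that idea in Python under two assumptions that the source does not confirm: that each PDF's output directory is named after the PDF's file stem, and that the presence of content.md marks a finished conversion. bootstrap.sh itself is a shell script and may decide differently.

```python
from pathlib import Path


def pending_conversions(pdf_dir, indexed_dir):
    """Return the PDFs that have no content.md under indexed_dir/<pdf stem>/."""
    pdf_dir, indexed_dir = Path(pdf_dir), Path(indexed_dir)
    return [
        pdf
        for pdf in sorted(pdf_dir.glob("*.pdf"))
        if not (indexed_dir / pdf.stem / "content.md").exists()
    ]
```

Running a pipeline over pending_conversions("collections/eti/pdfs", "collections/eti/indexed") would then touch only the issues that have not been converted yet.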

Symlinks only

init-symlinks.sh restores symlinks without downloading or converting. Symlink targets are auto-derived from collections/ using the naming convention above. To override, define a LINKS array in .env:

# .env
LINKS=(
    "findings:${LIBRARY_BASE}/findings"
    "collections/collection-a/pdfs:${LIBRARY_BASE}/collections/collection-a/pdfs"
    "collections/collection-a/indexed:${LIBRARY_BASE}/collections/collection-a/indexed"
)

Then run:

./init-symlinks.sh
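
For readers who want to see what restoring a link involves, here is a small Python sketch of the same idea. It is not a port of init-symlinks.sh. The function names are hypothetical, and it assumes each entry has the form link:target with any ${LIBRARY_BASE} reference already expanded.

```python
import os
from pathlib import Path


def parse_link(entry):
    """Split a 'link:target' entry at the first colon."""
    link, _, target = entry.partition(":")
    return link, target


def ensure_symlink(link, target):
    """Create or repair a directory symlink; return 'linked' or 'unchanged'."""
    link, target = Path(link), Path(target)
    target.mkdir(parents=True, exist_ok=True)  # make sure the cloud folder exists
    if link.is_symlink():
        if Path(os.readlink(link)) == target:
            return "unchanged"
        link.unlink()  # points somewhere else, so replace it
    elif link.exists():
        # Refuse to overwrite a real file or directory.
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.parent.mkdir(parents=True, exist_ok=True)
    link.symlink_to(target, target_is_directory=True)
    return "linked"
```

Returning a status rather than printing makes a second run easy to verify: every entry should report "unchanged".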

Library catalogue

See CATALOGUE.md for all collections.

| Collection | Period | PDFs | Pages |
| --- | --- | --- | --- |
| Hobby Electronics | 1978–1984 | 67 | ~5,000 |
| ETI — Electronics Today International | 1972–1999 | 326 | 27,328 |
| Everyday Electronics | 1971–1999 | 332 | 24,430 |
| Bernards/Babani BP Books | Various | 111 | 16,153 |
| Electronics & Music Maker / Music Technology | 1981–1994 | 132 | 12,554 |
| Elektor | 1974–1989 | 160 | 9,165 |
| Practical Electronics | 1964–1992 | 341 | 27,145 |
| Electronic Musician / Polyphony | 1975–2023 | 443 | 10,389 |
| Polyphony | 1975–1985 | 45 | 1,761 |
| Popular Electronics / Poptronics | 1954–2003 | 595 | 69,656 |
| Radio Electronics | 1948–1999 | 636 | ~74,000 |
| Moritz Klein | Ongoing | 15 | 903 |

Using as a template

Click Use this template on GitHub to create your own library repository. The collections/ PDFs and indexed output are excluded by .gitignore, but collection metadata (COLLECTION.md files) and the project structure are tracked — so the shape of your library is preserved in version control.

Staying up to date with the template

To pull improvements from this template into your instance, add it as an upstream remote once:

git remote add upstream https://github.com/ali5ter/publication-library.git

Then merge whenever you want to pick up changes:

git fetch upstream
git merge upstream/main --no-edit
git push origin main

Git handles merging template changes with your instance-specific commits automatically. Conflicts are unlikely since the template never modifies COLLECTION.md files or CATALOGUE.md.


Contributing

Bug reports and enhancement requests are welcome via GitHub Issues.


Copyright notice

PDFs and all converted output are derived from copyrighted material. This repository contains only the scripts and metadata. Collection PDFs and indexed output are excluded via .gitignore and must not be committed or redistributed.

Source PDFs can be downloaded from World Radio History for personal use.