A personal digital library toolkit for building a local, searchable corpus of scanned publications — magazines, books, and periodicals from online archives.
Once built, the library is navigable by grep, by hand, or by any AI assistant. An AI with access to the indexed corpus can act as an informed librarian immediately — searching, cross-referencing, and surfacing insights across thousands of pages.
See LIBRARIAN.md for the AI orientation guide.
- Download — scrape PDF links from an archive page and download them locally
- Convert — extract OCR text and render each page as a PNG, producing searchable Markdown
- Search — grep, browse, or open the library in an AI-assisted editor and query in plain English
- Record — write research findings to findings/ (gitignored; can sync via Dropbox, iCloud, or Google Drive)
The collections/ directory holds your library. PDFs and indexed output are gitignored so no
copyrighted material ever enters the repository. Collection metadata (COLLECTION.md) is tracked,
so the shape of your library is version-controlled even if the contents are not.
```
pip3 install -r requirements.txt
```

Or install individually:

```
pip3 install pymupdf            # required for convert.py
pip3 install internetarchive    # required for archive.org downloads
```

Requires Python 3.10+.
| Script | Purpose |
|---|---|
| download.py | Scrape all PDF links from an archive page and download them |
| convert.py | Convert a folder of PDFs to searchable Markdown with page images |
| search.py | Search across all indexed collections with grouped, formatted output |
| init-findings.sh | Scaffold the findings/ directory, with optional cloud-storage symlink |
| init-symlinks.sh | Recreate cloud-storage symlinks (auto-derived from collections/) |
| bootstrap.sh | Full reconstruction pipeline: symlinks → download → convert → catalogue |
The source is auto-detected from the URL. Both modes share --output-dir, --delay, and --dry-run.
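The detection logic isn't shown here, but it presumably keys off the hostname. A minimal sketch of that idea (illustrative only, not download.py's actual code):

```shell
# Hypothetical hostname-based source detection, mirroring the auto-detect behaviour
detect_source() {
  case "$1" in
    *archive.org/*)           echo "archive.org" ;;
    *worldradiohistory.com/*) echo "worldradiohistory" ;;
    *)                        echo "unknown" ;;
  esac
}

detect_source "https://archive.org/details/ElektorMagazine"          # archive.org
detect_source "https://www.worldradiohistory.com/ETI_Magazine.htm"   # worldradiohistory
```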
World Radio History — scrapes PDF links from an archive page:
```
# Preview what would be downloaded
python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" --dry-run

# Download everything
python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" \
    --output-dir collections/eti/pdfs

# Download a subset matching a string
python3 download.py "https://www.worldradiohistory.com/ETI_Magazine.htm" \
    --filter "1970" --output-dir collections/eti/pdfs
```

archive.org — downloads files from a single archive.org item by identifier.
Each issue typically has two PDF variants: a plain image PDF and a _text.pdf with an
Abbyy OCR text layer. The --pdf-format flag controls which variant is downloaded
(text is the default since convert.py extracts from the OCR layer):
```
# Download all OCR PDFs from an archive.org item
python3 download.py "https://archive.org/details/ElektorMagazine" \
    --output-dir collections/elektor/pdfs

# Download only issues from a specific decade
python3 download.py "https://archive.org/details/ElektorMagazine" \
    --output-dir collections/elektor/pdfs \
    --year-from 1974 --year-to 1989

# Download image-only PDFs (no OCR layer)
python3 download.py "https://archive.org/details/ElektorMagazine" \
    --pdf-format image --output-dir collections/elektor/pdfs

# Preview without downloading
python3 download.py "https://archive.org/details/ElektorMagazine" \
    --year-from 1980 --dry-run
```

| Flag | Description | Default |
|---|---|---|
| --pdf-format | text (_text.pdf, OCR), image (plain PDF), or both | text |
| --year-from | Only download files with a year >= this value | — |
| --year-to | Only download files with a year <= this value | — |
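The year bounds imply a four-digit year is parsed out of each filename. A minimal sketch of such a filter (an assumption about the implementation, not download.py's actual code):

```shell
# Hypothetical year extraction: first 19xx/20xx run found in the filename
year_of() { printf '%s\n' "$1" | grep -oE '(19|20)[0-9]{2}' | head -n 1; }

year_of "elektor-1984-05_text.pdf"   # 1984

# Apply --year-from / --year-to style bounds
in_range() {
  y="$(year_of "$1")"
  [ -n "$y" ] && [ "$y" -ge "$2" ] && [ "$y" -le "$3" ]
}
in_range "elektor-1984-05_text.pdf" 1974 1989 && echo "keep"
```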
```
python3 convert.py --analyze --input-dir collections/eti/pdfs
```

Reports OCR coverage, detected naming patterns, and page counts, and suggests a convert command.
```
python3 convert.py \
    --input-dir collections/eti/pdfs \
    --output-dir collections/eti/indexed
```

Each PDF becomes a directory containing:

- content.md — full OCR text with page images embedded inline
- index.md — page-by-page article and section headings
- pages/page-NNN.png — each page rendered at 200 DPI
A master index.md is written to the output root linking all publications.
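Put together, the indexed output for one collection looks like this (the publication directory name here is illustrative):

```
collections/eti/indexed/
├── index.md                 ← master index linking all publications
└── eti-1978-01/
    ├── content.md           ← full OCR text with inline page images
    ├── index.md             ← page-by-page headings
    └── pages/
        ├── page-001.png
        └── page-002.png
```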
```
# Find all publications mentioning a topic
grep -ril "VCA\|voltage controlled amplifier" collections/eti/indexed/

# Show matching lines with context
grep -in -A3 "fuzz box" collections/*/indexed/*/content.md
```

Open the collections/ directory in an AI-assisted editor and ask questions in plain English:
"There's a Guitar Fuzz Box project in Hobby Electronics magazine. Please find it for me."
"Are there any synthesiser projects across the whole collection?"
"What does the ETI Transcendent Polysynth series cover, and which issues should I read first?"
The AI reads LIBRARIAN.md and the collection index files to orient itself, then navigates freely.
Write research outputs to findings/ — topic references, cross-collection notes, article summaries.
This folder is gitignored so personal research never enters the repository.
Context files are included for all major AI coding assistants. When you open this project, your
assistant reads its context file and is directed to LIBRARIAN.md automatically — no prompting needed.
| Assistant | Context file |
|---|---|
| Claude Code | CLAUDE.md |
| OpenAI Codex CLI | AGENTS.md |
| Google Gemini CLI | GEMINI.md |
| GitHub Copilot | .github/copilot-instructions.md |
The library corpus and research findings can be stored in cloud storage and symlinked into the project, making everything available across multiple machines without committing copyrighted content.
collections/*/pdfs, collections/*/indexed, and findings/ are all gitignored, so symlinks
to cloud folders work seamlessly with version control.
The cloud storage layout mirrors the repo structure exactly. For example, using Dropbox:
```
~/Dropbox/my-library/
├── findings/                 ← research findings (symlinked to findings/)
└── collections/
    ├── collection-a/
    │   ├── pdfs/             ← PDFs for collection A (symlinked to collections/collection-a/pdfs)
    │   └── indexed/          ← converted output (symlinked to collections/collection-a/indexed)
    └── collection-b/
        ├── pdfs/
        └── indexed/
```
Mirroring the repo layout means relative links in findings/*.md resolve correctly when
apps resolve symlinks to their real filesystem path.
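In shell terms, the links init-symlinks.sh maintains boil down to ln -s calls of this shape. In this sketch LIBRARY_BASE and the repo path point at temporary directories; in practice LIBRARY_BASE would be your cloud root:

```shell
# Stand-ins for the cloud root and the cloned repo;
# in practice e.g. LIBRARY_BASE="${HOME}/Dropbox/my-library"
LIBRARY_BASE="$(mktemp -d)"
repo="$(mktemp -d)"

# Real directories live in cloud storage...
mkdir -p "${LIBRARY_BASE}/findings" \
         "${LIBRARY_BASE}/collections/collection-a/pdfs"

# ...and the repo holds symlinks mirroring the same relative paths
mkdir -p "${repo}/collections/collection-a"
ln -sfn "${LIBRARY_BASE}/findings" "${repo}/findings"
ln -sfn "${LIBRARY_BASE}/collections/collection-a/pdfs" \
        "${repo}/collections/collection-a/pdfs"

ls -l "${repo}/findings"   # shows the symlink into cloud storage
```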
After cloning on a new machine, bootstrap.sh rebuilds the entire library in one step:
```
cp .env.template .env

# Edit .env — set LIBRARY_BASE to your cloud storage root, e.g.:
# LIBRARY_BASE="${HOME}/Dropbox/my-library"

./bootstrap.sh
```

This creates the cloud directories, restores symlinks, downloads any missing PDFs (using the Source URL
from each COLLECTION.md), converts them to searchable Markdown, and regenerates CATALOGUE.md.
The script is idempotent — already-downloaded PDFs and already-converted output are skipped.
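The idempotence presumably comes down to an existence check before each expensive step. The pattern, sketched (not bootstrap.sh's actual code):

```shell
# Run a command only if its target file does not already exist
make_if_missing() {
  target="$1"; shift
  if [ -e "$target" ]; then
    echo "skip: $target"
  else
    "$@"
  fi
}

pdf="$(mktemp -d)/demo.pdf"
make_if_missing "$pdf" touch "$pdf"   # first call creates the file
make_if_missing "$pdf" touch "$pdf"   # second call prints "skip: ..."
```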
init-symlinks.sh restores symlinks without downloading or converting. Symlink targets are
auto-derived from collections/ using the naming convention above. To override, define a LINKS
array in .env:
```
# .env
LINKS=(
  "findings:${LIBRARY_BASE}/findings"
  "collections/collection-a/pdfs:${LIBRARY_BASE}/collections/collection-a/pdfs"
  "collections/collection-a/indexed:${LIBRARY_BASE}/collections/collection-a/indexed"
)
```

Then run:

```
./init-symlinks.sh
```

See CATALOGUE.md for all collections.
| Collection | Period | PDFs | Pages |
|---|---|---|---|
| Hobby Electronics | 1978–1984 | 67 | ~5,000 |
| ETI — Electronics Today International | 1972–1999 | 326 | 27,328 |
| Everyday Electronics | 1971–1999 | 332 | 24,430 |
| Bernards/Babani BP Books | Various | 111 | 16,153 |
| Electronics & Music Maker / Music Technology | 1981–1994 | 132 | 12,554 |
| Elektor | 1974–1989 | 160 | 9,165 |
| Practical Electronics | 1964–1992 | 341 | 27,145 |
| Electronic Musician / Polyphony | 1975–2023 | 443 | 10,389 |
| Polyphony | 1975–1985 | 45 | 1,761 |
| Popular Electronics / Poptronics | 1954–2003 | 595 | 69,656 |
| Radio Electronics | 1948–1999 | 636 | ~74,000 |
| Moritz Klein | Ongoing | 15 | 903 |
Click Use this template on GitHub to create your own library repository. The collections/ PDFs
and indexed output are excluded by .gitignore, but collection metadata (COLLECTION.md files) and
the project structure are tracked — so the shape of your library is preserved in version control.
To pull improvements from this template into your instance, add it as an upstream remote once:
```
git remote add upstream https://github.com/ali5ter/publication-library.git
```

Then merge whenever you want to pick up changes:
```
git fetch upstream
git merge upstream/main --no-edit
git push origin main
```

Git handles merging template changes with your instance-specific commits automatically. Conflicts
are unlikely since the template never modifies COLLECTION.md files or CATALOGUE.md.
Bug reports and enhancement requests are welcome via GitHub Issues.
PDFs and all converted output are derived from copyrighted material. This repository contains
only the scripts and metadata. Collection PDFs and indexed output are excluded via .gitignore
and must not be committed or redistributed.
Source PDFs can be downloaded from World Radio History for personal use.