OpenAIRE thesis filter: EDA and chip-design thesis database

This repository contains a complete pipeline for filtering, processing, and querying 6.6 million PhD/Master theses from OpenAIRE to build a specialized knowledge base for chip design, VLSI, and Electronic Design Automation (EDA) research.

Live demo: https://wiki.f-si.org/theses/

Project description: https://wiki.f-si.org/index.php?title=EDA_and_chip-design_thesis_database

Overview

This project processes the entire OpenAIRE thesis database (~6.6M records) to identify and extract theses relevant to semiconductor and chip design. The pipeline:

  1. Semantic Filtering — Uses a neural reranker (Qwen3-Reranker-4B) to score relevance (6.6M → ~13K theses)
  2. PDF Download — Resolves and downloads full-text PDFs
  3. Markdown Conversion — Converts PDFs to searchable Markdown
  4. Category Classification — Assigns theses to 53 specialized EDA/VLSI categories
  5. RAG/CAG System — Enables natural language querying across all thesis content

The full pipeline was developed by vibe coding with the support of ChatGPT and Claude.ai; for debugging, those assistants are a good first stop. The full process was tested on an RTX PRO 6000 (96 GB) and took ~80 hours of total GPU time. We used Ubuntu 24.04.3 because it supports CUDA 13.0.

Below is the schematic representation of the flow:

┌───────────────────────┐
│      OpenAIRE         │  Parallel download with aria2c
│      download         │
└───────────────────────┘
        │
        ▼ Full OpenAIRE database (203M publications + others, ~150GB of metadata)
┌───────────────────────┐
│      Metadata filter  │  Fast Python filter
│      for theses       │
└───────────────────────┘
        │
        ▼ OpenAIRE database/theses-only (6.6M theses, ~31GB of metadata)
┌───────────────────────┐
│      Semantic filter  │  Qwen3-Reranker-4B via vLLM
│      (29 hours)       │  Scores each thesis 0.0–1.0
└───────────────────────┘
        │
        ▼ ~13,000 relevant theses (~58MB of metadata)
┌───────────────────────┐
│      PDF resolution   │  Parallel URL resolver
│      Chromium Retry   │  Handles JS-heavy sites
└───────────────────────┘
        │
        ▼ ~5,000 downloaded PDFs (~50GB)
┌───────────────────────┐
│      PDF → Markdown   │  Marker + Docling
│      (40 hours)       │  Handles scanned PDFs
└───────────────────────┘
        │
        ▼ ~ 1.4GB of Markdown
┌───────────────────────┐
│      Classification   │  53 EDA/VLSI categories
│      (10 hours)       │  ~160 semantic queries
└───────────────────────┘
        │
        ▼
┌───────────────────────┐
│      HTML Viewer      │  Interactive web interface
└───────────────────────┘
        │
        ▼
┌───────────────────────┐
│      Chunk & Embed    │  ~1M chunks in ChromaDB (~15GB)
│      RAG Query        │  Natural language (Qwen3-14B) Q&A
└───────────────────────┘

Dependencies

# It is recommended to use a virtual environment:
sudo apt-get install python3-venv
python3 -m venv ~/my-venv  # create a venv
source ~/my-venv/bin/activate  # activate it

# Install aria2 for parallel downloads.
sudo apt install aria2

# Core dependencies
pip install vllm transformers tqdm numpy
pip install sentence-transformers chromadb langchain-text-splitters

# PDF processing
pip install pymupdf marker-pdf docling

# Download utilities
pip install requests beautifulsoup4 playwright
playwright install chromium

# Visualization
pip install matplotlib

Pipeline scripts

Scripts and data files are numbered XX_name to indicate sequential execution. Run or process them in order from lowest to highest number.

Stage 0: Database download and theses extraction

cd 01_OpenAIRE_full_dump
# Create download list based on last Zenodo dump
curl -s https://zenodo.org/api/records/17098012 \
| jq -r '.files[] | select(.key|test("^publication_.*\\.tar$")) | .links.self' > urls.txt
# Download files in parallel
aria2c -x 8 -s 8 -i urls.txt
# Extract publications marked as theses (including PhD, Master, and unclassified theses)
python3 04_filter_theses.py \
  --tars 01_OpenAIRE_full_dump/publication_*.tar \
  --out-doctoral 05_openaire_doctoral_theses.jsonl \
  --out-master 05_openaire_master_theses.jsonl \
  --out-thesis 05_openaire_unclassified_theses.jsonl \
  -j "$(nproc)" \
  --print-nonempty
# Concatenate all theses into a single file
cat 05_openaire_doctoral_theses.jsonl 05_openaire_master_theses.jsonl 05_openaire_unclassified_theses.jsonl > 06_all_theses.jsonl
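Under the hood, the metadata filter only needs to stream records out of the tar archives and bucket them by thesis type. A minimal sketch of that idea (the field names `instance` and `type`, and the exact type strings, are assumptions; the real 04_filter_theses.py may match on a different JSON layout):

```python
import json
import tarfile

# Assumed mapping from OpenAIRE instance types to output buckets; the
# actual script may use different fields or spellings.
THESIS_TYPES = {
    "Doctoral thesis": "doctoral",
    "Master thesis": "master",
    "Thesis": "unclassified",
}

def classify_record(record: dict):
    """Return a thesis bucket name, or None if the record is not a thesis."""
    for instance in record.get("instance", []):
        bucket = THESIS_TYPES.get(instance.get("type"))
        if bucket is not None:
            return bucket
    return None

def filter_tar(tar_path: str):
    """Yield (bucket, record) pairs for thesis records in one dump archive."""
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            for line in tar.extractfile(member):
                record = json.loads(line)
                bucket = classify_record(record)
                if bucket:
                    yield bucket, record
```

The real script additionally parallelizes over archives (`-j`) and writes one JSONL file per bucket.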

Stage 1: Semantic filtering

All theses are scored with Qwen3-Reranker-4B via vLLM, using the title, abstract, and other metadata. Each thesis receives a score between 0.0 and 1.0 indicating its relevance to chip design/EDA. This is a single-stage filtering approach; the alternative two-stage approach would first pre-filter with a fast embedding model such as Qwen3-Embedding-0.6B and then run the slower reranker only on the highest-scoring theses. The single-stage approach is slower but more accurate.
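Qwen3-Reranker models derive the relevance score from the logits of the tokens "yes" and "no" at the final position: the 0.0–1.0 score is a two-way softmax over those two logits. A sketch of just that final step (the logits themselves come from the vLLM run):

```python
import math

def reranker_score(yes_logit: float, no_logit: float) -> float:
    """Two-way softmax over the "yes"/"no" token logits.

    Equivalent to sigmoid(yes_logit - no_logit), so the score grows
    monotonically with how strongly the model prefers "yes".
    """
    e_yes = math.exp(yes_logit)
    e_no = math.exp(no_logit)
    return e_yes / (e_yes + e_no)
```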

Usage:

# Score all theses for chip design relevance (automatically resumes after crashes; runtime ~29 hours)
python3 40_Qwen3-Reranker-4B_vLLM_single_pass.py \
    06_all_theses.jsonl \
    41_output.jsonl \
    --keep-all \
    --batch-size 512

# Visualize distribution
python3 42_plot_score_histogram.py 41_output.jsonl -o histogram.png

# Extract high-scoring theses (score >= 0.1)
python3 46_extract_by_threshold.py \
    -i 41_output.jsonl \
    -o 47_output_score_0.1.jsonl \
    --threshold 0.1

Stage 2: PDF download

Usage:

# Resolve URLs and generate aria2c batch file (scans thesis landing pages for direct PDF links; runtime ~15 min)
python3 50_resolve_to_aria2c.py \
    47_output_score_0.1.jsonl \
    --out pdfs_selected \
    --workers 48 \
    --skip-existing

# Download PDFs with aria2c (parallel, resumable, runtime ~ 12 hours)
aria2c -i 51_aria2c_input.txt -x 8 -s 8 -j 8 \
  --continue=true \
  --auto-file-renaming=false \
  --check-certificate=false \
  --log=aria2c.log \
  --log-level=notice \
  --user-agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36"


# Retry failed downloads with Chromium (for JS-heavy sites, runtime ~2 hours)
python3 54_redownload_failed_with_chromium.py \
    --aria2c 51_aria2c_input.txt \
    --pdf-dir pdfs_selected \
    --workers 4
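The resolver's core job is turning a repository landing page into a direct PDF link. A stdlib-only sketch of two common heuristics (the actual logic in 50_resolve_to_aria2c.py, which can also use requests/beautifulsoup4, may differ): first look for a `citation_pdf_url` meta tag, then fall back to the first anchor ending in `.pdf`:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PdfLinkFinder(HTMLParser):
    """Collect PDF link candidates from a landing page."""
    def __init__(self):
        super().__init__()
        self.meta_pdf = None    # <meta name="citation_pdf_url" content=...>
        self.anchor_pdf = None  # first <a href="....pdf">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "citation_pdf_url":
            self.meta_pdf = self.meta_pdf or attrs.get("content")
        elif tag == "a" and self.anchor_pdf is None:
            href = attrs.get("href") or ""
            if href.lower().endswith(".pdf"):
                self.anchor_pdf = href

def extract_pdf_url(html: str, base_url: str):
    """Return an absolute PDF URL found in the page, or None."""
    finder = PdfLinkFinder()
    finder.feed(html)
    target = finder.meta_pdf or finder.anchor_pdf
    return urljoin(base_url, target) if target else None
```

Pages that render their download links with JavaScript defeat this kind of static scan, which is why the Chromium retry step exists.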

Stage 3: PDF to Markdown conversion

Features:

  • Crash-safe with SQLite state tracking
  • Handles scanned PDFs via OCR
  • Parallel processing with sharding support
  • Automatic PostScript conversion

Usage:

# Single process (most stable, but slower)
python3 60_pdf_to_markdown.py \
    --input pdfs_selected \
    --output markdown_selected

# Parallel sharding (4 processes, runtime ~40 hours)
python3 60_pdf_to_markdown.py -i pdfs_selected -o markdown_selected --shard 1/4 &
...
python3 60_pdf_to_markdown.py -i pdfs_selected -o markdown_selected --shard 4/4 &

# Check progress
python3 60_pdf_to_markdown.py -o markdown_selected --status

# Retry failed files
python3 60_pdf_to_markdown.py -o markdown_selected --retry-failed
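Sharding works because each process can decide deterministically which files belong to it, with no coordination between workers. A sketch of the idea (the actual assignment scheme inside 60_pdf_to_markdown.py is an assumption):

```python
import hashlib

def in_shard(filename: str, shard: str) -> bool:
    """True if this file belongs to shard "k/n" (1-based, as on the CLI).

    A stable hash (not Python's per-process randomized hash()) guarantees
    that all workers agree on the assignment across runs and restarts.
    """
    k, n = map(int, shard.split("/"))
    digest = hashlib.md5(filename.encode()).hexdigest()
    return int(digest, 16) % n == k - 1
```

Each worker simply skips any input PDF for which `in_shard()` is false, so the four processes partition the input set exactly once.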

Stage 4: Theses classification into EDA/VLSI categories

The categories are manually curated and listed in the file 70_categories_llm.jsonl. Scoring again uses Qwen3-Reranker-4B.

Usage (runtime ~10 hours):

python3 71_classify_by_category.py \
    47_output_score_0.1.jsonl \
    70_categories_llm.jsonl \
    72_classified_output.jsonl \
    --batch-size 512 \
    --max-model-len 3072
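Conceptually, classification scores each thesis against every category description and keeps the best matches. The selection rule below (top-k above a threshold) is an illustrative assumption, not necessarily what 71_classify_by_category.py implements:

```python
def top_categories(scores: dict, threshold: float = 0.1, k: int = 3):
    """Keep up to k categories whose Qwen3-Reranker-4B score clears the threshold.

    `scores` maps category name -> relevance score for one thesis.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked[:k] if score >= threshold]
```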

Stage 5: Generation of the HTML viewer

This step generates the interactive web viewer. The resulting HTML and JSON files can be served directly by any web server.

Usage:

# Generate server-based viewer (recommended for large datasets)
python3 80_make_html.py \
    -i 72_classified_output.jsonl \
    -m server \
    -o thesis_view

# Serve locally
cd thesis_view_server && python -m http.server 8000

Stage 6: RAG/CAG system

In this stage the theses are further processed and fed into a Retrieval-Augmented Generation (RAG) / Cache-Augmented Generation (CAG) system. The first two steps only need to be performed once.
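Chunking splits each Markdown thesis into overlapping windows so that retrieval can return self-contained passages. A minimal sketch of the idea (the real 90_chunk_theses.py uses langchain-text-splitters, with its own size and overlap settings):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200):
    """Fixed-size character windows with overlap, so a sentence cut at one
    chunk boundary still appears whole in the neighbouring chunk."""
    step = size - overlap
    return [text[start:start + size]
            for start in range(0, max(len(text) - overlap, 1), step)]
```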

Usage:

# 1. Chunk all theses (runtime ~1 min)
python3 90_chunk_theses.py
# Output: chunks.jsonl (~5000 Markdown theses turn into ~1M chunks)

# 2. Embed chunks into ChromaDB (runtime ~2 hours)
python3 91_embed_chunks.py
# Output: chroma_db/ (~15GB)

# 3. Start LLM server (in separate terminal)
./92_start_server.sh
# Serves Qwen3-14B on localhost:8000

# 4. Query the knowledge base
python3 93_query_rag.py

> Question: What algorithms are used for timing-driven placement?
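At query time, 93_query_rag.py has to turn the retrieved chunks plus the question into a single prompt for the Qwen3-14B server. A sketch of that assembly step (the prompt template and character budget are assumptions):

```python
def build_rag_prompt(question: str, chunks: list, max_chars: int = 8000) -> str:
    """Concatenate retrieved chunks, most relevant first, under a character
    budget, then append the user's question."""
    picked, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break
        picked.append(chunk)
        used += len(chunk)
    context = "\n\n---\n\n".join(picked)
    return (
        "Answer the question using only the thesis excerpts below.\n\n"
        + context
        + f"\n\nQuestion: {question}\nAnswer:"
    )
```

The resulting string is then sent to the local OpenAI-compatible endpoint started by 92_start_server.sh.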

Citation

If you use this pipeline or dataset, please cite:

EDA and chip-design thesis database, Free Silicon Foundation (2025).
https://wiki.f-si.org/index.php?title=EDA_and_chip-design_thesis_database

License

This project is released under the AGPL-3.0-or-later license. The processed thesis metadata follows OpenAIRE's licensing terms. Individual thesis PDFs retain their original licenses.

Acknowledgments

This work was co-funded by the Swiss State Secretariat for Education, Research and Innovation (SERI) under the NGI0 Commons Fund project. The NGI0 Commons Fund has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101135429.


Developed by the Free Silicon Foundation for open silicon research.