Project work: Olympia digital erschließen - A workflow for the systematic collection and analysis of Olympic venues
Authors: Felix Piasta, Jeremy Melpitz, Marvin Gels
Institution: Universität Leipzig, summer semester 2025
This thesis project develops a comprehensive workflow for systematic collection and analysis of Olympic venue data, combining web scraping, AI-powered PDF extraction, venue matching algorithms, and interactive web visualization.
Next.js interactive web application featuring Olympic venue maps, charts, and analytics dashboard with dark mode support.
Comprehensive data collection pipeline scraping venue information from Olympedia.org and converting to GeoJSON format.
Automated N8N workflow converting Olympic venue PDF reports into structured JSON using Claude 4 AI extraction.
The following proof-of-concept components have been moved to archive/:
POC venue matching system with 82.2% success rate, combining GeoJSON and PDF data sources using fuzzy matching algorithms.
POC web scraper for downloading official Olympic Games reports from IOC Olympic Library (54 reports, 1896-2024).
Desktop GUI application for splitting PDF documents using ToC, regex patterns, or fixed page counts.
Research materials, visualization concepts, and external datasets informing the project development.
See the Archive README for more archived resources.
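The archived matcher's exact algorithm lives in archive/; the idea behind name-based fuzzy matching can be sketched in a few lines using Python's standard-library difflib. The function name `match_venue` and the 0.8 threshold are illustrative, not the POC's actual implementation:

```python
from difflib import SequenceMatcher

def match_venue(pdf_name, geojson_names, threshold=0.8):
    """Return the best-matching GeoJSON venue name for a PDF venue name,
    or None if no candidate clears the similarity threshold."""
    best_name, best_score = None, 0.0
    for candidate in geojson_names:
        # Compare case-insensitively so spelling variants still score high
        score = SequenceMatcher(None, pdf_name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None

# A close spelling variant still matches:
match_venue("Olympiastadion Berlin",
            ["Olympic Stadium Munich", "Olympiastadion (Berlin)"])
# → "Olympiastadion (Berlin)"
```

In practice the archived POC also combines coordinates and metadata from both sources; a pure name ratio like this is only the first filter.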
- Python 3.8+ with pip for data processing scripts
- Node.js 18+ with npm for the web application
- Chrome browser (for web scraping)
The repository contains pre-processed Olympic venue data. You can jump straight to visualization:
# 1. Navigate to the web application
cd webapp
# 2. Install dependencies
npm install
# 3. Start the development server
npm run dev
# 4. Open http://localhost:3000 in your browser

The webapp automatically loads processed GeoJSON data from geojson_scraper/00_final_geojsons/.
Download Olympic Reports (Optional - PDFs provided):
cd archive/olympic_reports-poc
pip install selenium webdriver-manager requests
python reports_scrapper.py

Scrape Venue Coordinates from Olympedia:
cd geojson_scraper
pip install requests beautifulsoup4
python 01_scraper.py -n 100 -s 1 # Scrape 100 venues starting from ID 1

Download the Financial Review and venues PDF:
curl --request GET -sL \
--url 'https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/CPQEHN#' \
--output 'geojson_scraper/Growth dataset Olympic Games and Football World Cup.xlsx'
curl --request GET -sL \
--url 'https://stillmed.olympics.com/media/Documents/Olympic-Games/Olympic-legacy/Full-report-venues-post-games-use.pdf#_ga=2.261430735.317661095.1681111580-1043555523.1678197020' \
--output 'Full-report-venues-post-games-use.pdf'

Then split the full report into individual season PDFs and save them to pdfToJson/n8n/n8n_io/PDF_summery/venues_summer/ and pdfToJson/n8n/n8n_io/PDF_summery/venues_winter/.
You can use tools like PDF24 for this task.
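If you prefer scripting the split over a GUI tool, the (start, end) page ranges for a fixed-page-count split can be computed with a small helper and fed into any PDF splitter, such as the archived GUI application. This is a sketch; `fixed_ranges` is an illustrative helper, not a script in the repository:

```python
def fixed_ranges(total_pages, pages_per_part):
    """Split a page count into 1-indexed, inclusive (start, end) ranges.
    The last range is shortened if total_pages is not an even multiple."""
    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + pages_per_part - 1, total_pages)
        ranges.append((start, end))
        start = end + 1
    return ranges

fixed_ranges(10, 4)  # → [(1, 4), (5, 8), (9, 10)]
```

For ToC- or regex-based splitting, the archived desktop GUI (see archive/) already implements those modes.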
Set up N8N workflow automation:
cd pdfToJson/n8n
docker compose up -d # Start N8N on http://localhost:5678
# Setup Python environment for PDF processing
cd n8n_io
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
cd ..
# Import workflows into N8N
./n8n_workflows/docker-exec-import.sh
# Configure processing (add Anthropic API key)
cd n8n_io
cp config.json.summery-example config.json
# Edit config.json with your API key and settings

Process PDFs:
- Place Olympic venue PDFs in pdfToJson/n8n/n8n_io/PDF_summery/venues_summer/ or venues_winter/
- Open http://localhost:5678 in your browser
- Run the workflow "OnePdf--FullWorkflow" from the N8N interface
- Processed JSON files appear in *_chunked/ directories and combined JSON files
See pdfToJson/README.md for detailed workflow configuration and troubleshooting.
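The chunked outputs are plain JSON, so if you ever need to merge them outside the N8N workflow, a few lines of standard-library Python suffice. This is a sketch only: it assumes each chunk file holds a JSON array, which may differ from the workflow's actual schema (see pdfToJson/README.md):

```python
import json
from pathlib import Path

def combine_chunks(chunk_dir):
    """Merge all *.json chunk files in a directory into a single list.

    Assumes each chunk file contains a JSON array of records; adapt the
    loading logic if the actual chunk schema is an object instead.
    """
    combined = []
    for path in sorted(Path(chunk_dir).glob("*.json")):
        combined.extend(json.loads(path.read_text(encoding="utf-8")))
    return combined
```

Sorting the paths keeps the merged records in chunk order, matching how the workflow names its chunk files sequentially.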
Convert scraped data to GeoJSON:
cd geojson_scraper
python 02_geojson_templater.py # Creates basic GeoJSON files
python 03_duplicate_finder.py # Removes duplicates
python 04_renamer.py # Adds descriptive names
python 05_venue_combiner.py # Combines related venues

Build the web application:

cd webapp
npm install
npm run build # Production build
npm start # Production server
# Or: npm run dev # Development server

PDF Reports: Place Olympic venue PDFs in pdfToJson/ (see component README for details)
GeoJSON Files: Processed files go to geojson_scraper/00_final_geojsons/
Scraped Data: Raw venue JSON files stored in geojson_scraper/01_scraped_websites/
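The files in 00_final_geojsons/ are standard GeoJSON FeatureCollections, so any GeoJSON-aware tool can consume them. A minimal sketch of the shape of one venue feature (the property names "name" and "games" are illustrative, not the project's actual schema):

```python
import json

# Minimal GeoJSON FeatureCollection with one Point feature.
# Per the GeoJSON spec (RFC 7946), coordinates are [longitude, latitude].
venue_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [12.3731, 51.3397]},
            # Property names here are examples, not the repository's schema
            "properties": {"name": "Example venue, Leipzig", "games": "n/a"},
        }
    ],
}

# Serializes to valid JSON that mapping libraries such as MapLibre GL accept
geojson_text = json.dumps(venue_collection)
```

Checking files against this structure (type, geometry, lon/lat order) is a quick sanity test when the webapp fails to render a layer.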
- Webapp can't find data: Ensure GeoJSON files exist in geojson_scraper/00_final_geojsons/
- Scraping fails: Check internet connection and Olympedia.org availability
- Node.js issues: Verify Node.js 18+ is installed with node --version
- Python dependencies: Use virtual environments: python -m venv venv && source venv/bin/activate (on Windows: venv\Scripts\activate)
- Backend: Python (Selenium, BeautifulSoup, PyMuPDF), N8N automation
- AI Processing: Claude 4 Sonnet for PDF extraction and validation
- Frontend: Next.js 15, React 19, MapLibre GL, Tailwind CSS
- Data: GeoJSON, PDF reports, external datasets (Harvard, Kaggle, World Bank)
- Venue Coverage: 1000+ venues across 130+ years of Olympic history
- Match Rate: 70% successful venue matching between data sources
- Games Coverage: 54 Olympic Games (Summer 1896-2024, Winter 1924-2022)
- Web Interface: Interactive maps, charts, and analytics dashboard