WikiGraph Banner


WikiGraph

📖 Overview

This project scrapes Wikipedia using BeautifulSoup, starting from a seed page and following hyperlinks to related pages. The goal is to produce structured nodes (pages visited) and edges (links between pages), which can then be loaded into a graph database or analyzed directly. The original seed used in class was Artificial intelligence.

✨ Key Features

  • Node & edge extraction from Wikipedia articles (metadata + outbound links).
  • Polite crawling with delays; respects robots.txt disallow rules.
  • CSV outputs ready for downstream graph tooling or analytics.
  • Automation via Airflow and containerization via Docker for reproducible runs.

🤔 What’s Scraped

Nodes (pages visited)

  • page_title, url, categories, last_modified (from the page’s info tab).

Edges (hyperlinks on each page)

  • source_url, target_url (directed edge representing an outgoing link).

Default seed used during the course: /wiki/Artificial_intelligence. You can change the seed in your script/config.
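The node/edge split above can be sketched with BeautifulSoup. This is a minimal illustration against an inline HTML snippet — the real scraper parses live Wikipedia pages, and the repo's exact selectors and filtering rules may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical mini-page standing in for a fetched Wikipedia article.
html = """
<html><head><title>Artificial intelligence - Wikipedia</title></head>
<body>
  <h1 id="firstHeading">Artificial intelligence</h1>
  <a href="/wiki/Machine_learning">Machine learning</a>
  <a href="/wiki/Turing_test">Turing test</a>
  <a href="/wiki/Special:Random">Special page (skipped)</a>
</body></html>
"""

BASE = "https://en.wikipedia.org"
soup = BeautifulSoup(html, "html.parser")

# Node: page-level metadata for the visited page.
node = {
    "page_title": soup.find(id="firstHeading").get_text(strip=True),
    "url": BASE + "/wiki/Artificial_intelligence",
}

# Edges: outgoing /wiki/ links, skipping namespaced pages (e.g., Special:).
edges = [(node["url"], BASE + a["href"])
         for a in soup.find_all("a", href=True)
         if a["href"].startswith("/wiki/") and ":" not in a["href"]]
```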

🛠️ Tech Stack

  • Language: Python
  • Web Scraping: BeautifulSoup4
  • Data Manipulation: Pandas
  • Containerization: Docker
  • Automation: Apache Airflow
  • Version Control: Git/GitHub

🏁 Getting Started

A. Prerequisites

To set up this project locally, ensure you have:

  • Python 3.11+
  • Git 2.30+
  • Docker: Optional — Docker Desktop 4.0+ (Engine 20.10+)
  • Airflow: Optional — 2.7+ for scheduling/orchestration

B. Setup

The following is a quick step-by-step guide to running the project from the Windows Command Prompt.

REM 1) Clone the repo (pick one)
git clone git@github.com:KubangPawis/wikipedia-knowledge-graph.git
git clone https://github.com/KubangPawis/wikipedia-knowledge-graph.git

REM 2) Go into the project folder
cd /d C:\path\to\wiki_graph

REM 3) Create and activate a Python 3.11 virtual environment
py -3.11 -m venv venv
REM (in PowerShell, use venv\Scripts\Activate.ps1 instead)
venv\Scripts\activate.bat

REM 4) Install dependencies
python -m pip install -U pip
python -m pip install -r requirements.txt

C. Usage

  1. Run the scraper locally

Adjust paths/module names to match your repo layout (e.g., main.py).

python main.py

  • Configure your seed page (e.g., Artificial intelligence) inside main.py or a config file.
  • The scraper visits the seed, collects metadata and outgoing links, and writes CSVs.

  2. (Optional) Run with Apache Airflow

  • Use the provided DAG (e.g., wiki_scraper_dag.py) to schedule ETL runs.
  • Enable the DAG in the Airflow UI and trigger a run to orchestrate extract → transform → load.

  3. (Optional) Containerize with Docker

  • Build and run the image to ensure consistent environments across machines.

The course implementation used Pandas in-memory during runtime and exported CSVs at the end of a session.
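That in-memory flow can be sketched as follows — the column values here are made-up placeholders matching the schemas listed under "What's Scraped":

```python
import pandas as pd

# Hypothetical buffers accumulated in memory during a crawl session.
nodes = [{
    "page_title": "Artificial intelligence",
    "url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "categories": "Artificial intelligence; Cybernetics",
    "last_modified": "2024-01-01",
}]
edges = [{
    "source_url": "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "target_url": "https://en.wikipedia.org/wiki/Machine_learning",
}]

# One export per session, mirroring the course implementation.
pd.DataFrame(nodes).to_csv("wiki_nodes.csv", index=False)
pd.DataFrame(edges).to_csv("wiki_edges.csv", index=False)
```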

Pipeline

Below is an illustration of the project pipeline:

WikiGraph Pipeline

🎯 Outputs

CSV files are written under an output directory (e.g., extracted_data/):

  • wiki_nodes.csv — page-level metadata (example class run: 11 records, 4 columns)
  • wiki_edges.csv — directed links (example class run: ~4,266 records, 2 columns)

Note: any *_1.csv files in the repo were used for testing during development.

✅ Ethical Web Scraping

  • Respect robots.txt and Terms of Use; avoid disallowed paths.
  • Implement delays/backoff to reduce load on servers.
  • Collect only public, non-personal information.
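A stdlib-only sketch of the first two points, using urllib.robotparser plus a fixed delay. The disallow rules below are illustrative, not Wikipedia's real robots.txt, which the crawler would fetch once at startup:

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /w/
Disallow: /wiki/Special:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def polite_fetch_allowed(url, delay=1.0):
    """Return True if robots.txt permits the URL, sleeping first to throttle."""
    if not rp.can_fetch("*", url):
        return False
    time.sleep(delay)  # fixed delay between requests; back off further on errors
    return True
```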

⚠️ Limitations

Known limitations from the course version:

  • Always starts from the same seed (no persisted crawl frontier).
  • No incremental continuation between sessions (restarts from scratch).
  • CSV sink only; DB sink (e.g., PostgreSQL) was planned for future work.

🛣️ Roadmap

Future improvements:

  • Add database persistence (PostgreSQL) + resume capability.
  • Configurable crawl depth and domain constraints.
  • Graph export formats (e.g., GraphML) or direct load to Neo4j/NetworkX.
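For example, the edges CSV could be loaded into NetworkX and exported to GraphML roughly like this. File and column names are assumed from the Outputs section, and the CSV content is inlined for illustration:

```python
import csv
import io

import networkx as nx

# Hypothetical edge data matching the wiki_edges.csv schema (source_url,target_url).
edges_csv = io.StringIO(
    "source_url,target_url\n"
    "https://en.wikipedia.org/wiki/Artificial_intelligence,https://en.wikipedia.org/wiki/Machine_learning\n"
    "https://en.wikipedia.org/wiki/Artificial_intelligence,https://en.wikipedia.org/wiki/Turing_test\n"
)

# Directed graph: each row is one outgoing hyperlink.
G = nx.DiGraph()
G.add_edges_from((row["source_url"], row["target_url"])
                 for row in csv.DictReader(edges_csv))

# GraphML export readable by Gephi, yEd, or Neo4j import tooling.
nx.write_graphml(G, "wiki_graph.graphml")
```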

💁 Contributing

Contributions are welcome!

  1. Fork the repo
  2. Create a feature branch: git checkout -b feat/my-change
  3. Commit with clear messages
  4. Open a Pull Request describing your change and its rationale
