This project scrapes Wikipedia using BeautifulSoup, starting from a seed page and following hyperlinks to related pages. The goal is to produce structured nodes (pages visited) and edges (links between pages), which can then be loaded into a graph database or analyzed directly. The original seed used in class was *Artificial intelligence*.
- Node & edge extraction from Wikipedia articles (metadata + outbound links).
- Polite crawling with delays; respects robots.txt disallow rules.
- CSV outputs ready for downstream graph tooling or analytics.
- Automation via Airflow and containerization via Docker for reproducible runs.
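As a sketch of the politeness logic (the helper function, delay value, and disallow rules below are illustrative, not the repo's actual implementation), the crawler can check robots.txt with the standard library's `urllib.robotparser` and sleep between requests:

```python
import time
from urllib import robotparser

# Illustrative disallow rules; a real crawler would point
# RobotFileParser at https://en.wikipedia.org/robots.txt and call .read().
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wiki/Special:",
])

def polite_fetch_allowed(url: str, delay: float = 1.0) -> bool:
    """Check robots rules, then wait before the next request."""
    if not rp.can_fetch("*", url):
        return False
    time.sleep(delay)  # simple fixed delay between requests
    return True
```

A fixed delay is the simplest approach; exponential backoff on errors would further reduce server load.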
- `page_title`, `url`, `categories`, `last_modified` (from the page's info tab).
- `source_url`, `target_url` (directed edge representing an outgoing link).
Default seed used during the course: `/wiki/Artificial_intelligence`. You can change the seed in your script/config.
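To illustrate the node/edge extraction (the HTML snippet and function below are simplified stand-ins, not the repo's code), BeautifulSoup can pull the page title and the outbound `/wiki/` links from a fetched article:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a fetched Wikipedia article.
html = """
<html><head><title>Artificial intelligence - Wikipedia</title></head>
<body>
  <h1 id="firstHeading">Artificial intelligence</h1>
  <a href="/wiki/Machine_learning">Machine learning</a>
  <a href="/wiki/Special:Random">Random</a>
  <a href="https://example.com">External</a>
</body></html>
"""

def extract_node_and_edges(html: str, source_url: str):
    """Return (node, edges): page metadata plus directed link rows."""
    soup = BeautifulSoup(html, "html.parser")
    node = {"page_title": soup.find(id="firstHeading").get_text(),
            "url": source_url}
    edges = [
        {"source_url": source_url, "target_url": a["href"]}
        for a in soup.find_all("a", href=True)
        # keep article links only; skip Special:, Talk:, etc.
        if a["href"].startswith("/wiki/")
        and ":" not in a["href"].split("/wiki/")[1]
    ]
    return node, edges
```

Filtering out namespaced links (those containing a colon) keeps the edge list limited to article-to-article hyperlinks.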
- Language: Python
- Web Scraping: BeautifulSoup4
- Data Manipulation: Pandas
- Containerization: Docker
- Automation: Apache Airflow
- Version Control: Git/GitHub
To set up this project locally, ensure you have:
| Requirement | Version |
|---|---|
| Python | 3.11+ |
| Git | 2.30+ |
| Docker | Optional — Docker Desktop 4.0+ (Engine 20.10+) |
| Airflow | Optional — 2.7+ for scheduling/orchestration |
The following is a quick step-by-step guide to running the project from the Command Prompt.
REM 1) Clone the repo (pick one)
git clone git@github.com:KubangPawis/wikipedia-knowledge-graph.git
git clone https://github.com/KubangPawis/wikipedia-knowledge-graph.git
REM 2) Go into the project folder
cd /d C:\path\to\wiki_graph
REM 3) Create and activate a Python 3.11 virtual environment
py -3.11 -m venv venv
venv\Scripts\activate.bat
REM (In PowerShell, use venv\Scripts\Activate.ps1 instead)
REM 4) Install dependencies
python -m pip install -U pip
python -m pip install -r requirements.txt
- Run the scraper locally
Adjust paths/module names to match your repo layout (e.g., main.py).
python main.py
- Configure your seed page (e.g., Artificial intelligence) inside `main.py` or a config file.
- The scraper visits the seed, collects metadata and outgoing links, and writes CSVs.
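The CSV-writing step can be sketched with Pandas (the rows and output filenames here are illustrative; column names follow the node/edge schema above):

```python
import pandas as pd

# Rows collected during the crawl (illustrative examples).
nodes = [{"page_title": "Artificial intelligence",
          "url": "/wiki/Artificial_intelligence",
          "categories": "Artificial intelligence",
          "last_modified": "2024-01-01"}]
edges = [{"source_url": "/wiki/Artificial_intelligence",
          "target_url": "/wiki/Machine_learning"}]

# Write the two CSV sinks the scraper produces.
pd.DataFrame(nodes).to_csv("wiki_nodes.csv", index=False)
pd.DataFrame(edges).to_csv("wiki_edges.csv", index=False)
```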
- (Optional) Run with Apache Airflow
- Use the provided DAG (e.g., `wiki_scraper_dag.py`) to schedule ETL runs.
- Enable the DAG in the Airflow UI and trigger a run to orchestrate extract → transform → load.
- (Optional) Containerize with Docker
- Build and run to ensure consistent environments across machines.
The course implementation held data in memory with Pandas during a run and exported CSVs at the end of each session.
Below is an illustration of the project pipeline:
CSV files are written under an output directory (e.g., extracted_data/):
- `wiki_nodes.csv` — page-level metadata (example class run: 11 records, 4 columns)
- `wiki_edges.csv` — directed links (example class run: ~4,266 records, 2 columns)
Note: any `*_1.csv` files in the repo were used for testing during development.
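The edge CSV can be consumed directly for simple analytics, for example out-degree counts per page (the file contents below are illustrative; in practice you would open the file from your output directory):

```python
import csv
import io
from collections import Counter

# Illustrative wiki_edges.csv content; in practice, open
# extracted_data/wiki_edges.csv (or your configured output path).
edges_csv = io.StringIO(
    "source_url,target_url\n"
    "/wiki/Artificial_intelligence,/wiki/Machine_learning\n"
    "/wiki/Artificial_intelligence,/wiki/Deep_learning\n"
)

# Count outgoing links per source page.
out_degree = Counter(row["source_url"] for row in csv.DictReader(edges_csv))
```

The same two-column format also loads cleanly into graph tooling such as NetworkX or a graph database.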
- Respect robots.txt and Terms of Use; avoid disallowed paths.
- Implement delays/backoff to reduce load on servers.
- Collect only public, non-personal information.
Known limitations from the course version:
- Always starts from the same seed (no persisted crawl frontier).
- No incremental continuation between sessions (restarts from scratch).
- CSV sink only; DB sink (e.g., PostgreSQL) was planned for future work.
Future improvements:
- Add database persistence (PostgreSQL) + resume capability.
- Configurable crawl depth and domain constraints.
- Graph export formats (e.g., GraphML) or direct load to Neo4j/NetworkX.
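A minimal sketch of the resume capability listed above (the state filename and structure are hypothetical, not part of the course code): persist the visited set and pending frontier between sessions so a restart can continue where the last crawl stopped.

```python
import json
import os

STATE_FILE = "crawl_state.json"  # hypothetical state file

def save_state(visited, frontier, path=STATE_FILE):
    """Persist crawl progress so a later session can resume."""
    with open(path, "w") as f:
        json.dump({"visited": sorted(visited), "frontier": list(frontier)}, f)

def load_state(path=STATE_FILE):
    """Resume from a previous session, or start fresh from the seed."""
    if not os.path.exists(path):
        return set(), ["/wiki/Artificial_intelligence"]  # default seed
    with open(path) as f:
        state = json.load(f)
    return set(state["visited"]), state["frontier"]
```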
Contributions are welcome!
- Fork the repo
- Create a feature branch: git checkout -b feat/my-change
- Commit with clear messages
- Open a Pull Request describing your change and its rationale

