This project scrapes Wikipedia using BeautifulSoup, starting from a seed page and following hyperlinks to related pages. The goal is to produce structured nodes (pages visited) and edges (links between pages), which can then be loaded into a graph database or analyzed directly. The original seed used in class was *Artificial intelligence*.
- Node & edge extraction from Wikipedia articles (metadata + outbound links).
- Polite crawling with delays; respects robots.txt disallow rules.
- CSV outputs ready for downstream graph tooling or analytics.
- Automation via Airflow and containerization via Docker for reproducible runs.
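As a sketch of the politeness logic (the helper function, delay value, and disallow rules below are illustrative, not the repo's actual implementation), the crawler can check robots.txt with the standard library's `urllib.robotparser` and sleep between requests:

```python
import time
from urllib import robotparser

# Illustrative disallow rules; a real crawler would point
# RobotFileParser at https://en.wikipedia.org/robots.txt and call .read().
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /wiki/Special:",
])

def polite_fetch_allowed(url: str, delay: float = 1.0) -> bool:
    """Check robots rules, then wait before the next request."""
    if not rp.can_fetch("*", url):
        return False
    time.sleep(delay)  # simple fixed delay between requests
    return True
```

A fixed delay is the simplest approach; exponential backoff on errors would further reduce server load.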
- `page_title`, `url`, `categories`, `last_modified` (from the page's info tab).
- `source_url`, `target_url` (directed edge representing an outgoing link).
Default seed used during the course: `/wiki/Artificial_intelligence`. You can change the seed in your script/config.
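To illustrate the node/edge extraction (the HTML snippet and function below are simplified stand-ins, not the repo's code), BeautifulSoup can pull the page title and the outbound `/wiki/` links from a fetched article:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for a fetched Wikipedia article.
html = """
<html><head><title>Artificial intelligence - Wikipedia</title></head>
<body>
  <h1 id="firstHeading">Artificial intelligence</h1>
  <a href="/wiki/Machine_learning">Machine learning</a>
  <a href="/wiki/Special:Random">Random</a>
  <a href="https://example.com">External</a>
</body></html>
"""

def extract_node_and_edges(html: str, source_url: str):
    """Return (node, edges): page metadata plus directed link rows."""
    soup = BeautifulSoup(html, "html.parser")
    node = {"page_title": soup.find(id="firstHeading").get_text(),
            "url": source_url}
    edges = [
        {"source_url": source_url, "target_url": a["href"]}
        for a in soup.find_all("a", href=True)
        # keep article links only; skip Special:, Talk:, etc.
        if a["href"].startswith("/wiki/")
        and ":" not in a["href"].split("/wiki/")[1]
    ]
    return node, edges
```

Filtering out namespaced links (those containing a colon) keeps the edge list limited to article-to-article hyperlinks.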
- Language: Python
- Web Scraping: BeautifulSoup4
- Data Manipulation: Pandas
- Containerization: Docker
- Automation: Apache Airflow
- Version Control: Git/GitHub
To set up this project locally, ensure you have:
| Requirement | Version |
|---|---|
| Python | 3.11+ |
| Git | 2.30+ |
| Docker | Optional — Docker Desktop 4.0+ (Engine 20.10+) |
| Airflow | Optional — 2.7+ for scheduling/orchestration |
The following is a quick step-by-step guide to running the project from the Command Prompt.
REM 1) Clone the repo (pick one)
git clone git@github.com:KubangPawis/wikipedia-knowledge-graph.git
git clone https://github.com/KubangPawis/wikipedia-knowledge-graph.git
REM 2) Go into the project folder
cd /d C:\path\to\wiki_graph
REM 3) Create and activate a Python 3.11 virtual environment
py -3.11 -m venv venv
venv\Scripts\activate.bat
REM (In PowerShell, use venv\Scripts\Activate.ps1 instead)
REM 4) Install dependencies
python -m pip install -U pip
python -m pip install -r requirements.txt
- Run the scraper locally
Adjust paths/module names to match your repo layout (e.g., main.py).
python main.py
- Configure your seed page (e.g., Artificial intelligence) inside `main.py` or a config file.
- The scraper visits the seed, collects metadata and outgoing links, and writes CSVs.
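The CSV-writing step can be sketched with Pandas (the rows and output filenames here are illustrative; column names follow the node/edge schema above):

```python
import pandas as pd

# Rows collected during the crawl (illustrative examples).
nodes = [{"page_title": "Artificial intelligence",
          "url": "/wiki/Artificial_intelligence",
          "categories": "Artificial intelligence",
          "last_modified": "2024-01-01"}]
edges = [{"source_url": "/wiki/Artificial_intelligence",
          "target_url": "/wiki/Machine_learning"}]

# Write the two CSV sinks the scraper produces.
pd.DataFrame(nodes).to_csv("wiki_nodes.csv", index=False)
pd.DataFrame(edges).to_csv("wiki_edges.csv", index=False)
```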
- (Optional) Run with Apache Airflow
- Use the provided DAG (e.g., `wiki_scraper_dag.py`) to schedule ETL runs.
- Enable the DAG in the Airflow UI and trigger a run to orchestrate extract → transform → load.
- (Optional) Containerize with Docker
- Build and run to ensure consistent environments across machines.
The course implementation held data in memory with Pandas during a run and exported CSVs at the end of each session.
Below is an illustration of the project pipeline:
CSV files are written under an output directory (e.g., extracted_data/):
- `wiki_nodes.csv` — page-level metadata (example class run: 11 records, 4 columns)
- `wiki_edges.csv` — directed links (example class run: ~4,266 records, 2 columns)
Note: any `*_1.csv` files in the repo were used for testing during development.
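The edge CSV can be consumed directly for simple analytics, for example out-degree counts per page (the file contents below are illustrative; in practice you would open the file from your output directory):

```python
import csv
import io
from collections import Counter

# Illustrative wiki_edges.csv content; in practice, open
# extracted_data/wiki_edges.csv (or your configured output path).
edges_csv = io.StringIO(
    "source_url,target_url\n"
    "/wiki/Artificial_intelligence,/wiki/Machine_learning\n"
    "/wiki/Artificial_intelligence,/wiki/Deep_learning\n"
)

# Count outgoing links per source page.
out_degree = Counter(row["source_url"] for row in csv.DictReader(edges_csv))
```

The same two-column format also loads cleanly into graph tooling such as NetworkX or a graph database.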
- Respect robots.txt and Terms of Use; avoid disallowed paths.
- Implement delays/backoff to reduce load on servers.
- Collect only public, non-personal information.
Known limitations from the course version:
- Always starts from the same seed (no persisted crawl frontier).
- No incremental continuation between sessions (restarts from scratch).
- CSV sink only; DB sink (e.g., PostgreSQL) was planned for future work.
Future improvements:
- Add database persistence (PostgreSQL) + resume capability.
- Configurable crawl depth and domain constraints.
- Graph export formats (e.g., GraphML) or direct load to Neo4j/NetworkX.
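A minimal sketch of the resume capability listed above (the state filename and structure are hypothetical, not part of the course code): persist the visited set and pending frontier between sessions so a restart can continue where the last crawl stopped.

```python
import json
import os

STATE_FILE = "crawl_state.json"  # hypothetical state file

def save_state(visited, frontier, path=STATE_FILE):
    """Persist crawl progress so a later session can resume."""
    with open(path, "w") as f:
        json.dump({"visited": sorted(visited), "frontier": list(frontier)}, f)

def load_state(path=STATE_FILE):
    """Resume from a previous session, or start fresh from the seed."""
    if not os.path.exists(path):
        return set(), ["/wiki/Artificial_intelligence"]  # default seed
    with open(path) as f:
        state = json.load(f)
    return set(state["visited"]), state["frontier"]
```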
Contributions are welcome!
- Fork the repo
- Create a feature branch: git checkout -b feat/my-change
- Commit with clear messages
- Open a Pull Request describing your change and its rationale

