This project performs an end-to-end exploratory analysis of a healthcare data breach dataset using a reproducible Python pipeline executed entirely in WSL (Windows Subsystem for Linux). The analysis examines breach types, breach locations, individuals affected, business associate involvement, and temporal trends.
The repository is intentionally structured to mirror production-style data analysis and data engineering workflows, emphasizing environment isolation, version control, and automation.
- Linux-based development via WSL
- Python virtual environments (venv)
- Secure GitHub authentication using SSH
- Data cleaning and transformation using pandas
- Exploratory data analysis
- Command-line-driven analysis
data-breach-project/
├── data/
│ ├── raw/ # Original, immutable data
│ │ └── breach_report.csv
│ └── processed/
│ ├── breach_clean.parquet
│ ├── breach_clean.csv
│ └── tables/ # EDA summary tables
├── scripts/ # Executable analysis pipeline
│ ├── 01_load_clean.py
│ ├── 02_eda_tables.py
│ ├── 03_visualizations.py
│ ├── 04_build_report.py
│ └── run_all.sh
├── figs/ # Generated visualizations
├── reports/ # Generated markdown report
├── requirements.txt
├── .gitignore
└── README.md
All commands were run from a WSL (Ubuntu) terminal.
# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Python dependencies include: pandas, pyarrow, matplotlib, and tabulate.
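As a quick sanity check after installation, the sketch below reports which of the listed packages cannot be found in the active environment. The helper name `missing_packages` is hypothetical and is not part of the repository's scripts.

```python
from importlib.util import find_spec

# Top-level import names of the dependencies listed above.
REQUIRED = ["pandas", "pyarrow", "matplotlib", "tabulate"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are available.")
```

If anything is reported missing, re-running `pip install -r requirements.txt` inside the activated virtual environment is the likely fix.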
The analysis is organized as a numbered, linear pipeline to ensure clarity and reproducibility.
- Data loading and cleaning:
python scripts/01_load_clean.py
Output: Cleaned datasets (.parquet and .csv) saved to data/processed/
- Exploratory data analysis (Tables):
python scripts/02_eda_tables.py
Output: Tables saved to data/processed/tables/
- Visualizations:
python scripts/03_visualizations.py
Output: Figures saved to figs/
- Report generation:
python scripts/04_build_report.py
Output: Report saved to reports/breach_report.md
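To give a feel for what the cleaning and table steps above might involve, here is a minimal pandas sketch. It is not the code in `01_load_clean.py` or `02_eda_tables.py`: the column names, the sample rows, and the helper names `clean` and `breaches_by_type` are all assumptions made for illustration, and the data is read from an in-memory string rather than `data/raw/breach_report.csv`.

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for data/raw/breach_report.csv.
# The column names are assumptions, not the dataset's confirmed schema.
RAW_CSV = """\
Name of Covered Entity,Individuals Affected,Breach Submission Date,Type of Breach,Business Associate Present
Clinic A,"1,200",01/15/2020,Hacking/IT Incident,No
Clinic B,500,03/02/2021,Theft,Yes
Clinic B,500,03/02/2021,Theft,Yes
Clinic C,,07/09/2021,Hacking/IT Incident,No
"""


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize column names, drop exact duplicates, and fix dtypes."""
    out = df.copy()
    # Convert headers such as "Type of Breach" to snake_case.
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )
    out = out.drop_duplicates().reset_index(drop=True)
    # Remove thousands separators, then parse; unparseable values become NaN.
    out["individuals_affected"] = pd.to_numeric(
        out["individuals_affected"].str.replace(",", "", regex=False),
        errors="coerce",
    )
    out["breach_submission_date"] = pd.to_datetime(
        out["breach_submission_date"], format="%m/%d/%Y", errors="coerce"
    )
    # Map Yes/No text to a boolean flag.
    out["business_associate_present"] = (
        out["business_associate_present"].str.strip().str.lower().eq("yes")
    )
    return out


def breaches_by_type(df: pd.DataFrame) -> pd.DataFrame:
    """Count breaches and sum individuals affected per breach type."""
    return (
        df.groupby("type_of_breach")
        .agg(
            breaches=("individuals_affected", "size"),
            individuals_affected=("individuals_affected", "sum"),
        )
        .sort_values("breaches", ascending=False)
        .reset_index()
    )


# Read everything as text first so the cleaning rules are explicit.
raw = pd.read_csv(io.StringIO(RAW_CSV), dtype=str)
cleaned = clean(raw)
summary = breaches_by_type(cleaned)
print(summary)
```

In a real run, the cleaned frame could be written with `DataFrame.to_parquet` and `DataFrame.to_csv`, and a summary table could be rendered for the report with `DataFrame.to_markdown`, which relies on tabulate.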
A shell script is also provided to run all of the steps above with one command:
./scripts/run_all.sh
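For readers who prefer a Python entry point, the same idea can be sketched as a small driver that runs each numbered script in sequence and stops at the first failure. The script paths come from the repository layout; the fail-fast behavior and the function name `run_pipeline` are assumptions, not a description of what `run_all.sh` actually contains.

```python
import subprocess
import sys

# Pipeline steps in execution order, as listed in the repository layout.
PIPELINE = [
    "scripts/01_load_clean.py",
    "scripts/02_eda_tables.py",
    "scripts/03_visualizations.py",
    "scripts/04_build_report.py",
]


def run_pipeline(scripts, python=sys.executable):
    """Run each script with the given interpreter; stop on the first failure.

    Returns the list of scripts that completed successfully. A failing step
    raises subprocess.CalledProcessError, so later steps never run on
    stale or missing inputs.
    """
    completed = []
    for script in scripts:
        print(f"==> {script}")
        subprocess.run([python, str(script)], check=True)
        completed.append(str(script))
    return completed
```

Calling `run_pipeline(PIPELINE)` from the project root, with the virtual environment activated, would execute the four steps in order.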
- Git initialized and managed entirely within WSL.
- SSH-based GitHub authentication (no passwords or tokens).
- Clean .gitignore to prevent committing virtual environments or OS artifacts.
- Repository pushed and tracked on main.
- Raw data remains unchanged in data/raw/.
- All derived outputs were generated programmatically.
- The project can be reproduced on any Linux system, including WSL, with Python 3.
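One way to check that the raw data really has stayed unchanged is to record a checksum of the file and compare it after each run. The sketch below is an optional addition, not something the pipeline is known to do; the helper name `sha256_of` is hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Reading in chunks keeps memory use flat even for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of("data/raw/breach_report.csv")` could be computed before and after running the pipeline; identical digests indicate the raw file was not modified.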