GitHub - martinorkuma/data_breach_project: End-to-end Python analysis of healthcare data breaches, using a reproducible WSL-based workflow to explore breach types, impact, and trends over time.

Healthcare Data Breach Analysis (Python, WSL)

Project Overview

This project performs an end-to-end exploratory analysis of a healthcare data breach dataset using a reproducible Python pipeline executed entirely in WSL (Windows Subsystem for Linux). The analysis examines breach types, breach locations, individuals affected, business associate involvement, and temporal trends.

The repository is intentionally structured to mirror production-style data analysis and data engineering workflows, emphasizing environment isolation, version control, and automation.

Key Skills Demonstrated

Linux-based development via WSL
Python virtual environments (venv)
Secure GitHub authentication using SSH
Data cleaning and transformation using pandas
Exploratory data analysis
Command-line-driven analysis

data-breach-project/
├── data/
│   ├── raw/                  # Original, immutable data
│   │   └── breach_report.csv
│   └── processed/
│       ├── breach_clean.parquet
│       ├── breach_clean.csv
│       └── tables/            # EDA summary tables
├── scripts/                   # Executable analysis pipeline
│   ├── 01_load_clean.py
│   ├── 02_eda_tables.py
│   ├── 03_visualizations.py
│   ├── 04_build_report.py
│   └── run_all.sh
├── figs/                      # Generated visualizations
├── reports/                   # Generated markdown report
├── requirements.txt
├── .gitignore
└── README.md

Environment Setup (WSL)

All commands were run from a WSL (Ubuntu) terminal.

# Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Python dependencies include pandas, pyarrow, matplotlib, and tabulate.

Analysis Pipeline

The analysis is organized as a numbered, linear pipeline to ensure clarity and reproducibility.

Data loading and cleaning:

python scripts/01_load_clean.py

Output: Cleaned datasets saved to data/processed/ (breach_clean.parquet and breach_clean.csv)
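A cleaning step of this kind might look like the sketch below. The column names and the inline sample rows are assumptions for illustration only; the actual schema of breach_report.csv and the logic in 01_load_clean.py are not shown in this README.

```python
import io

import pandas as pd

# Hypothetical sample standing in for data/raw/breach_report.csv.
RAW_CSV = """Name of Covered Entity,Breach Submission Date,Individuals Affected,Type of Breach
Clinic A,01/15/2023,"1,200",Hacking/IT Incident
Clinic B,02/30/2023,500,Theft
Clinic A,01/15/2023,"1,200",Hacking/IT Incident
Clinic C,03/10/2023,,Unauthorized Access/Disclosure
"""


def load_and_clean(source) -> pd.DataFrame:
    """Read the raw CSV and return a tidied copy (assumed cleaning rules)."""
    df = pd.read_csv(source, dtype=str)

    # Normalize headers to snake_case so later steps can rely on them.
    df.columns = (
        df.columns.str.strip().str.lower().str.replace(r"[^a-z0-9]+", "_", regex=True)
    )

    # Drop exact duplicate rows.
    df = df.drop_duplicates()

    # Parse dates; invalid values (such as 02/30/2023) become NaT.
    df["breach_submission_date"] = pd.to_datetime(
        df["breach_submission_date"], format="%m/%d/%Y", errors="coerce"
    )

    # Strip thousands separators and convert counts to nullable integers.
    df["individuals_affected"] = pd.to_numeric(
        df["individuals_affected"].str.replace(",", "", regex=False), errors="coerce"
    ).astype("Int64")

    return df.reset_index(drop=True)


clean = load_and_clean(io.StringIO(RAW_CSV))
```

In the real script the source would be the raw CSV path, and the result could be written with `DataFrame.to_parquet` and `DataFrame.to_csv`.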

Exploratory data analysis (Tables):

python scripts/02_eda_tables.py

Output: Tables saved to data/processed/tables/
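As an illustration of what one EDA summary table could contain, the sketch below counts breaches and totals the individuals affected per breach type. The column names and sample values are hypothetical; the tables actually produced by 02_eda_tables.py may differ.

```python
import pandas as pd

# Hypothetical cleaned records; the real columns may differ.
clean = pd.DataFrame(
    {
        "type_of_breach": [
            "Hacking/IT Incident",
            "Theft",
            "Hacking/IT Incident",
            "Theft",
            "Loss",
        ],
        "individuals_affected": [1200, 500, 800, 300, 50],
    }
)


def breach_type_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count records and total individuals affected per breach type."""
    return (
        df.groupby("type_of_breach")
        .agg(
            breaches=("individuals_affected", "count"),
            individuals_affected=("individuals_affected", "sum"),
        )
        .sort_values("individuals_affected", ascending=False)
        .reset_index()
    )


summary = breach_type_summary(clean)
```

A table like `summary` could then be saved with `summary.to_csv(...)` under data/processed/tables/.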

Visualizations:

python scripts/03_visualizations.py

Output: Figures saved to figs/

Report generation:

python scripts/04_build_report.py

Output: Report saved to reports/breach_report.md
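Assembling a markdown report from summary tables can be sketched as follows. This is not the code in 04_build_report.py: the function names, report title, and sample rows are hypothetical, and plain string formatting is used here in place of tabulate to keep the example dependency-free.

```python
def markdown_table(headers, rows):
    """Render rows as a pipe-delimited markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def build_report(title, sections):
    """Assemble a report from (heading, headers, rows) sections."""
    parts = [f"# {title}", ""]
    for heading, headers, rows in sections:
        parts += [f"## {heading}", "", markdown_table(headers, rows), ""]
    return "\n".join(parts)


# Hypothetical content; real values would come from the EDA tables.
report = build_report(
    "Healthcare Data Breach Report",
    [("Breaches by type", ["Type", "Breaches"], [("Theft", 2), ("Loss", 1)])],
)
```

The resulting string could be written to a file such as reports/breach_report.md.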

A shell script, run_all.sh, runs the full pipeline with a single command:

./scripts/run_all.sh
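The contents of run_all.sh are not shown here. A Python equivalent of such a runner, assuming it executes the four numbered scripts in order and stops at the first failure, could be sketched like this (`run_pipeline` is a hypothetical helper, not part of the repository):

```python
import subprocess
import sys
from pathlib import Path

# Step names taken from the scripts/ directory listed in this README.
STEPS = [
    "01_load_clean.py",
    "02_eda_tables.py",
    "03_visualizations.py",
    "04_build_report.py",
]


def run_pipeline(scripts_dir, steps=STEPS):
    """Run each step with the current interpreter, stopping at the first failure."""
    completed = []
    for step in steps:
        script = Path(scripts_dir) / step
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run([sys.executable, str(script)], check=True)
        completed.append(step)
    return completed
```

From the repository root, `run_pipeline("scripts")` would run the steps with the interpreter of the active virtual environment.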

Version Control and Security

Git initialized and managed entirely within WSL.
SSH-based GitHub authentication (no passwords or tokens).
Clean .gitignore to prevent committing virtual environments or OS artifacts.
Repository pushed and tracked on main.

Reproducibility

Raw data remains unchanged in data/raw/.
All derived outputs are generated programmatically.
The project can be reproduced on any Linux system, including WSL, with Python 3.
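One way to check that a raw file has not been modified is to compare its checksum against a recorded value. The sketch below is an optional addition, not something the repository is known to include; both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def raw_data_unchanged(path, expected_digest):
    """Return True if the file's current digest matches the recorded one."""
    return sha256_of(path) == expected_digest
```

The digest of data/raw/breach_report.csv could be recorded once and rechecked before each pipeline run.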
