
# 📊 Superstore Data Engineering Pipeline – SQL, Star Schema, Airflow, Power BI, Python EDA

End-to-end data engineering pipeline project on the Superstore Sales dataset, covering SQL querying, Python EDA, a Power BI dashboard, star schema design, automated data quality checks, and an Apache Airflow pipeline DAG.


πŸ“ Project Structure

```
sales-analysis/
│
├── data/
│   └── superstore_sales.csv              # Dataset (Kaggle Superstore)
│
├── sql/
│   ├── 01_basic_exploration.sql
│   ├── 02_revenue_analysis.sql
│   ├── 03_top_products.sql
│   ├── 04_customer_segments.sql
│   ├── 05_regional_analysis.sql
│   └── star_schema_ddl.sql               # Star schema warehouse design
│
├── notebooks/
│   └── eda_superstore.ipynb              # Jupyter EDA notebook (11 sections)
│
├── data_engineering/
│   ├── load_staging.py
│   ├── quality/
│   │   ├── data_quality_checks.py        # 14 automated DQ checks
│   │   ├── dq_report.csv                 # Generated after running checks
│   │   └── dq_log.json                   # Run summary log
│   ├── reports/                          # Auto-generated CSVs from pipeline
│   └── airflow/
│       └── dags/
│           └── sales_pipeline_dag.py     # Airflow DAG (DQ → Load → Report)
│
├── powerbi/
│   └── sales_dashboard
│
├── docs/
│   └── insights_summary.md               # Key business insights
│
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md
```

## 🎯 Objectives

- Analyze revenue trends, top products, and customer segments using MySQL
- Perform exploratory data analysis (EDA) in a Jupyter notebook
- Design a star schema (one fact table + five dimension tables) for warehouse modeling
- Run automated data quality checks before any data is loaded
- Orchestrate the pipeline with an Apache Airflow DAG
- Visualize insights in a Power BI dashboard

πŸ—ƒοΈ Dataset

Source: Kaggle β€” Sample Superstore Sales Dataset

| Column | Description |
|---|---|
| Order ID / Date | Order identifier and date |
| Customer ID / Segment | Customer details (Consumer / Corporate / Home Office) |
| Region / State / City | Geographic data |
| Category / Sub-Category | Product classification |
| Sales / Profit / Discount | Financial measures |
| Quantity / Ship Mode | Order details |
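As a quick sanity check after downloading, the columns above can be read with Python's built-in `csv` module. A minimal sketch (the sample row and exact header names below are illustrative; verify them against the real Kaggle CSV header):

```python
import csv
import io

# Tiny in-memory sample in the Kaggle Superstore layout
# (values are made up for illustration).
sample = io.StringIO(
    "Order ID,Order Date,Segment,Region,Category,Sales,Profit,Discount,Quantity\n"
    "CA-2017-1001,2017-11-08,Consumer,West,Technology,261.96,41.91,0.0,2\n"
)

reader = csv.DictReader(sample)
row = next(reader)
print(row["Category"], float(row["Sales"]))  # Technology 261.96
```

The same pattern applies unchanged when reading `data/superstore_sales.csv` from disk.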

πŸ—„οΈ SQL Analysis (MySQL)

18 queries across 6 files
Concepts: GROUP BY, Aggregations, JOIN, Subqueries, HAVING, CASE WHEN, Stored Procedures

| File | Description |
|---|---|
| `01_basic_exploration.sql` | Row count, nulls, date range, dataset overview |
| `02_revenue_analysis.sql` | Monthly/yearly revenue, profit margins, discount impact |
| `03_top_products.sql` | Top/bottom products and sub-categories by revenue and profit |
| `04_customer_segments.sql` | Segment revenue, top customers, loyalty analysis |
| `05_regional_analysis.sql` | Region-, state-, and city-level performance + shipping |
| `star_schema_ddl.sql` ⭐ | Full star schema DDL: fact + 5 dimension tables + load queries |
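To give a flavour of the revenue queries, here is a self-contained sketch of a monthly-revenue aggregation. The project runs these against MySQL; `sqlite3` is used here only so the example runs anywhere, and the table and values are made up for illustration:

```python
import sqlite3

# In-memory stand-in for the MySQL staging table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_date TEXT, sales REAL, profit REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("2017-11-08", 261.96, 41.91),
     ("2017-11-21", 731.94, 219.58),
     ("2017-12-05", 14.62, 6.87)],
)

# Monthly revenue, in the spirit of 02_revenue_analysis.sql
# (MySQL would use DATE_FORMAT instead of strftime).
rows = con.execute(
    "SELECT strftime('%Y-%m', order_date) AS month, "
    "ROUND(SUM(sales), 2) AS revenue "
    "FROM sales GROUP BY month ORDER BY month"
).fetchall()
print(rows)  # [('2017-11', 993.9), ('2017-12', 14.62)]
```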

πŸ“ Star Schema Design

```
                    ┌──────────────┐
                    │  dim_date    │
                    │  (date_key)  │
                    └──────┬───────┘
                           │
┌──────────────┐    ┌──────┴───────┐    ┌──────────────┐
│ dim_customer │────│  fact_sales  │────│  dim_product │
│              │    │              │    │              │
│ customer_key │    │order_date_key│    │ product_key  │
│ customer_id  │    │ship_date_key │    │ product_id   │
│ segment      │    │ customer_key │    │ category     │
└──────────────┘    │ product_key  │    │ sub_category │
                    │ location_key │    └──────────────┘
┌──────────────┐    │ shipping_key │    ┌──────────────┐
│ dim_location │────│              │────│ dim_shipping │
│              │    │ sales        │    │              │
│ city / state │    │ profit       │    │ ship_mode    │
│ region       │    │ discount     │    └──────────────┘
└──────────────┘    │ quantity     │
                    └──────────────┘
```
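One detail a star schema load has to handle is surrogate keys: fact rows reference small integer keys rather than natural IDs. A minimal Python sketch of assigning `customer_key` values while building `dim_customer` (field names follow the diagram; the project may do this step in SQL inside `star_schema_ddl.sql` instead):

```python
def build_dim_customer(rows):
    """Assign a stable surrogate key per distinct customer_id."""
    dim, next_key = {}, 1
    for r in rows:
        natural_key = r["customer_id"]
        if natural_key not in dim:  # first sighting gets the next key
            dim[natural_key] = {
                "customer_key": next_key,
                "customer_id": natural_key,
                "segment": r["segment"],
            }
            next_key += 1
    return dim

# Illustrative rows; a repeat order must NOT create a second dimension row.
rows = [
    {"customer_id": "CG-12520", "segment": "Consumer"},
    {"customer_id": "DV-13045", "segment": "Corporate"},
    {"customer_id": "CG-12520", "segment": "Consumer"},
]
dim = build_dim_customer(rows)
print(dim["CG-12520"]["customer_key"], len(dim))  # 1 2
```

`fact_sales` rows would then store `customer_key` (the integer), joining back to `dim_customer` at query time.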

πŸ” Data Quality Checks

Script: data_engineering/quality/data_quality_checks.py
Output: dq_report.csv + dq_log.json

Runs 14 automated checks across 4 dimensions:

| Category | Check |
|---|---|
| Completeness | Null checks on all critical columns |
| Validity | Sales > 0, Quantity > 0, Discount in [0, 1], valid Segment/Region/Category |
| Consistency | Ship Date ≥ Order Date, dates in expected range, high-discount anomalies |
| Uniqueness | No duplicate rows; unique (Order ID + Product ID) |
| Outliers | IQR-based outlier flags on Sales and Profit |

The pipeline stops automatically if any check returns FAIL.

```bash
python data_engineering/quality/data_quality_checks.py
```
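For illustration, one IQR-based outlier check can be sketched with the standard library alone (the actual implementation and thresholds in `data_quality_checks.py` may differ):

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# Illustrative Sales values with one obvious outlier.
sales = [12.0, 15.5, 14.2, 13.8, 16.1, 15.0, 900.0]
print(iqr_outliers(sales))  # [900.0]
```

The real check would write each flagged row into `dq_report.csv` rather than printing it.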

## 🔄 Airflow Pipeline

DAG: `data_engineering/airflow/dags/sales_pipeline_dag.py`
Schedule: daily at 6:00 AM

```
data_quality_check → load_to_mysql → run_sql_reports → notify_completion
```

| Task | Description |
|---|---|
| `data_quality_check` | Validates the source CSV; pushes a DQ summary via XCom |
| `load_to_mysql` | Loads clean data into the MySQL staging table (INSERT IGNORE) |
| `run_sql_reports` | Runs 3 reporting queries and exports CSVs to `reports/` |
| `notify_completion` | Logs the final pipeline summary (extendable to Slack/email) |

Note: database credentials are stored as Airflow Variables, never hardcoded.
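The DAG's fail-fast behaviour can be sketched outside Airflow as plain functions; the real pipeline wires the same steps as Airflow tasks (task names match the table above, bodies here are placeholders):

```python
def data_quality_check(source_ok):
    # Stand-in for the real DQ script, which reports PASS/FAIL per check.
    return "PASS" if source_ok else "FAIL"

def run_pipeline(source_ok):
    if data_quality_check(source_ok) == "FAIL":
        # Mirrors "the pipeline stops automatically on any FAIL check":
        # downstream tasks never run.
        return ["data_quality_check: FAIL (pipeline stopped)"]
    executed = []
    for task in ("load_to_mysql", "run_sql_reports", "notify_completion"):
        executed.append(task)  # placeholders for the real task callables
    return executed

print(run_pipeline(True))   # ['load_to_mysql', 'run_sql_reports', 'notify_completion']
print(run_pipeline(False))  # ['data_quality_check: FAIL (pipeline stopped)']
```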


## 📊 Power BI Dashboard

Six visuals: KPI Cards · Monthly Trend · Top Sub-Categories · Regional Map · Segment Donut · Category Matrix
(Updating shortly)

## 🐍 Python EDA (Jupyter Notebook)

Notebook: `notebooks/eda_superstore.ipynb`

11 sections covering: data loading, quality checks, distributions, monthly trends, category analysis, segment analysis, regional performance, discount impact, and a correlation heatmap.

```bash
pip install -r requirements.txt
jupyter notebook notebooks/eda_superstore.ipynb
```

## 💡 Key Insights

1. Technology is the highest-revenue and highest-profit category
2. Tables & Bookcases consistently produce negative profit (high discount risk)
3. Q4 (Nov–Dec) sees the strongest sales spike across all years
4. The Consumer segment accounts for ~50% of total revenue
5. The West region leads in sales; the South has the highest average discount
6. Discounts above 30% almost always result in a loss-making transaction

πŸ› οΈ Tech Stack

MySQL Python Jupyter Airflow Power BI Pandas


## 🚀 How to Run

```bash
# 1. Clone the repo
git clone https://github.com/your-username/sales-analysis.git
cd sales-analysis

# 2. Download the dataset from Kaggle and place it at data/superstore_sales.csv

# 3. Install Python dependencies
pip install -r requirements.txt

# 4. Run the data quality checks
python data_engineering/quality/data_quality_checks.py

# 5. Load data into MySQL (run star_schema_ddl.sql in MySQL Workbench),
#    then run the SQL files in the sql/ folder

# 6. Open the EDA notebook
jupyter notebook notebooks/eda_superstore.ipynb

# 7. (Optional) Set up Airflow and copy the DAG
cp data_engineering/airflow/dags/sales_pipeline_dag.py ~/airflow/dags/
airflow dags trigger superstore_sales_pipeline
```

## 👤 Author

Kavish Rathod
B.E. Electronics & Telecommunications | Aspiring Data Analyst / Data Engineer
LinkedIn • GitHub


## 📄 License

MIT License; see `LICENSE`.
