End-to-end Data Engineering Pipeline project on the Superstore Sales dataset.
Covers SQL querying, Python EDA, Power BI dashboard, star schema design, automated data quality checks, and an Apache Airflow pipeline DAG.
```
sales-analysis/
│
├── data/
│   └── superstore_sales.csv          # Dataset (Kaggle Superstore)
│
├── sql/
│   ├── 01_basic_exploration.sql
│   ├── 02_revenue_analysis.sql
│   ├── 03_top_products.sql
│   ├── 04_customer_segments.sql
│   ├── 05_regional_analysis.sql
│   └── star_schema_ddl.sql           # Star schema warehouse design
│
├── notebooks/
│   └── eda_superstore.ipynb          # Jupyter EDA notebook (11 sections)
│
├── data_engineering/
│   ├── load_staging.py
│   ├── quality/
│   │   ├── data_quality_checks.py    # 14 automated DQ checks
│   │   ├── dq_report.csv             # Generated after running checks
│   │   └── dq_log.json               # Run summary log
│   ├── reports/                      # Auto-generated CSVs from pipeline
│   └── airflow/
│       └── dags/
│           └── sales_pipeline_dag.py # Airflow DAG (DQ → Load → Report)
│
├── powerbi/
│   └── sales_dashboard
│
├── docs/
│   └── insights_summary.md           # Key business insights
│
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md
```
- Analyze revenue trends, top products, and customer segments using MySQL
- Perform exploratory data analysis (EDA) in a Jupyter Notebook
- Design a Star Schema (fact + 5 dimension tables) for warehouse modeling
- Run automated Data Quality checks before any data is loaded
- Orchestrate the pipeline with an Apache Airflow DAG
- Visualize insights in a Power BI dashboard
Source: Kaggle – Sample Superstore Sales Dataset
| Column | Description |
|---|---|
| Order ID / Date | Order identifier and date |
| Customer ID / Segment | Customer details (Consumer / Corporate / Home Office) |
| Region / State / City | Geographic data |
| Category / Sub-Category | Product classification |
| Sales / Profit / Discount | Financial measures |
| Quantity / Ship Mode | Order details |
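As a minimal sketch of working with these columns in pandas (the column names follow the table above; a tiny inline CSV stands in for the real Kaggle file, so the values are made up):

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for data/superstore_sales.csv.
csv = io.StringIO(
    "Order ID,Order Date,Ship Date,Segment,Region,Category,Sales,Profit,Discount,Quantity\n"
    "CA-001,2017-11-08,2017-11-11,Consumer,West,Technology,261.96,41.91,0.0,2\n"
    "CA-002,2017-06-12,2017-06-16,Corporate,South,Furniture,731.94,-57.03,0.45,3\n"
)

# Parse the two date columns up front so date arithmetic works directly.
df = pd.read_csv(csv, parse_dates=["Order Date", "Ship Date"])

# Derived measure: days between order and shipment.
df["Shipping Days"] = (df["Ship Date"] - df["Order Date"]).dt.days
print(df[["Order ID", "Shipping Days"]])
```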
18 queries across 6 files
Concepts: GROUP BY, Aggregations, JOIN, Subqueries, HAVING, CASE WHEN, Stored Procedures
| File | Description |
|---|---|
| `01_basic_exploration.sql` | Row count, nulls, date range, dataset overview |
| `02_revenue_analysis.sql` | Monthly/yearly revenue, profit margins, discount impact |
| `03_top_products.sql` | Top/bottom products and sub-categories by revenue and profit |
| `04_customer_segments.sql` | Segment revenue, top customers, loyalty analysis |
| `05_regional_analysis.sql` | Region, state, city-level performance + shipping |
| `star_schema_ddl.sql` | Full star schema DDL: fact + 5 dim tables + load queries |
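To illustrate the kind of aggregation `02_revenue_analysis.sql` performs (this is a sketch against an in-memory SQLite table with made-up rows, not the project's actual MySQL queries):

```python
import sqlite3

# In-memory stand-in for the MySQL staging table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_date TEXT, sales REAL, profit REAL)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2017-01-05", 100.0, 20.0),
        ("2017-01-20", 50.0, 5.0),
        ("2017-02-02", 80.0, 8.0),
    ],
)

# Monthly revenue and profit margin via GROUP BY on a derived month key.
rows = con.execute(
    """
    SELECT strftime('%Y-%m', order_date)                AS month,
           ROUND(SUM(sales), 2)                         AS revenue,
           ROUND(SUM(profit) * 100.0 / SUM(sales), 1)   AS margin_pct
    FROM sales
    GROUP BY month
    ORDER BY month
    """
).fetchall()
print(rows)  # [('2017-01', 150.0, 16.7), ('2017-02', 80.0, 10.0)]
```

In MySQL the same idea would use `DATE_FORMAT(order_date, '%Y-%m')` instead of SQLite's `strftime`.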
```
                 ┌────────────────┐
                 │    dim_date    │
                 │   (date_key)   │
                 └────────┬───────┘
                          │
┌──────────────┐ ┌────────┴───────┐ ┌──────────────┐
│ dim_customer │─│   fact_sales   │─│ dim_product  │
│              │ │                │ │              │
│ customer_key │ │ order_date_key │ │ product_key  │
│ customer_id  │ │ ship_date_key  │ │ product_id   │
│ segment      │ │ customer_key   │ │ category     │
└──────────────┘ │ product_key    │ │ sub_category │
                 │ location_key   │ └──────────────┘
┌──────────────┐ │ shipping_key   │ ┌──────────────┐
│ dim_location │─│                │─│ dim_shipping │
│              │ │ sales          │ │              │
│ city / state │ │ profit         │ │ ship_mode    │
│ region       │ │ discount       │ └──────────────┘
└──────────────┘ │ quantity       │
                 └────────────────┘
```
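The split of the flat CSV into dimension and fact tables can be sketched in pandas (illustrative only; the project does this with the SQL DDL and load queries in `star_schema_ddl.sql`, and the data here is invented):

```python
import pandas as pd

# Flat stand-in for the source CSV: product attributes repeat per order row.
flat = pd.DataFrame(
    {
        "Product ID": ["P1", "P2", "P1"],
        "Category": ["Technology", "Furniture", "Technology"],
        "Sales": [100.0, 200.0, 50.0],
    }
)

# Dimension: one row per distinct product, with a surrogate key.
dim_product = (
    flat[["Product ID", "Category"]].drop_duplicates().reset_index(drop=True)
)
dim_product["product_key"] = dim_product.index + 1

# Fact: measures plus the surrogate key, natural keys dropped.
fact_sales = flat.merge(dim_product, on=["Product ID", "Category"], how="left")[
    ["product_key", "Sales"]
]
print(fact_sales)
```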
Script: data_engineering/quality/data_quality_checks.py
Output: dq_report.csv + dq_log.json
Runs 14 automated checks across 5 categories:
| Category | Check |
|---|---|
| Completeness | Null checks on all critical columns |
| Validity | Sales > 0, Quantity > 0, Discount in [0,1], valid Segment/Region/Category |
| Consistency | Ship Date ≥ Order Date, dates in expected range, high-discount anomalies |
| Uniqueness | No duplicate rows, unique (Order ID + Product ID) |
| Outliers | IQR-based outlier flags on Sales and Profit |
Pipeline stops automatically if any FAIL check is detected.
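Two of the checks above can be sketched as follows (a simplified illustration on made-up rows, not the project's actual implementation in `data_quality_checks.py`):

```python
import pandas as pd

# Tiny invented sample: one out-of-range discount, one extreme sale.
df = pd.DataFrame(
    {
        "Sales": [10.0, 12.0, 11.0, 13.0, 500.0],
        "Discount": [0.0, 0.2, 1.2, 0.1, 0.3],
    }
)

def check_discount_range(df):
    """Validity check: Discount must lie in [0, 1]."""
    bad = df[(df["Discount"] < 0) | (df["Discount"] > 1)]
    return ("PASS" if bad.empty else "FAIL", len(bad))

def flag_iqr_outliers(s, k=1.5):
    """Outlier check: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

print(check_discount_range(df))              # ('FAIL', 1)
print(flag_iqr_outliers(df["Sales"]).sum())  # 1  (the 500.0 row)
```

A FAIL result from any such check is what halts the pipeline before the load step.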
```bash
python data_engineering/quality/data_quality_checks.py
```

DAG: data_engineering/airflow/dags/sales_pipeline_dag.py
Schedule: Daily at 6:00 AM
data_quality_check → load_to_mysql → run_sql_reports → notify_completion
| Task | Description |
|---|---|
| `data_quality_check` | Validates source CSV; pushes DQ summary via XCom |
| `load_to_mysql` | Loads clean data into MySQL staging table (`INSERT IGNORE`) |
| `run_sql_reports` | Runs 3 reporting queries, exports CSVs to `reports/` |
| `notify_completion` | Logs final pipeline summary (extendable to Slack/email) |
Note: Airflow credentials are stored as Airflow Variables, never hardcoded.
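Outside Airflow, the same fail-fast task chain can be sketched in plain Python (illustrative stubs only; the real DAG wires these up as Airflow tasks with XCom and MySQL connections):

```python
def data_quality_check():
    # Stand-in: pretend all 14 checks ran and none failed.
    return {"failed": 0}

def load_to_mysql(dq_summary):
    # Fail fast: refuse to load if any DQ check failed upstream.
    if dq_summary["failed"] > 0:
        raise RuntimeError("DQ failures detected - aborting load")
    return "loaded"

def run_sql_reports():
    # Stand-in for the 3 reporting queries exported to reports/.
    return ["report1.csv", "report2.csv", "report3.csv"]

def notify_completion(reports):
    return f"pipeline OK, {len(reports)} reports written"

# Run the chain in DAG order.
summary = data_quality_check()
load_to_mysql(summary)
print(notify_completion(run_sql_reports()))  # pipeline OK, 3 reports written
```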
5+ visuals: KPI Cards · Monthly Trend · Top Sub-Categories · Regional Map · Segment Donut · Category Matrix
(Updating Shortly)
Notebook: notebooks/eda_superstore.ipynb
11 sections covering: data loading, quality checks, distributions, monthly trends, category analysis, segment analysis, regional performance, discount impact, and correlation heatmap.
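The notebook's discount-impact analysis boils down to bucketing discounts and comparing average profit per bucket; a sketch on invented rows (band edges chosen here for illustration):

```python
import pandas as pd

# Made-up rows: profit tends to fall as discount rises.
df = pd.DataFrame(
    {
        "Discount": [0.0, 0.1, 0.2, 0.4, 0.5],
        "Profit": [40.0, 25.0, 10.0, -15.0, -30.0],
    }
)

# Bucket discounts into bands, then compare mean profit per band.
df["Discount Band"] = pd.cut(
    df["Discount"],
    bins=[-0.01, 0.2, 0.3, 1.0],
    labels=["0-20%", "20-30%", ">30%"],
)
print(df.groupby("Discount Band", observed=True)["Profit"].mean())
```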
```bash
pip install -r requirements.txt
jupyter notebook notebooks/eda_superstore.ipynb
```

- Technology is the highest-revenue and highest-profit category
- Tables & Bookcases consistently produce negative profit → high discount risk
- Q4 (Nov–Dec) sees the strongest sales spike across all years
- Consumer segment accounts for ~50% of total revenue
- West region leads in sales; South has the highest avg discount
- Discounts above 30% almost always result in a loss-making transaction
```bash
# 1. Clone repo
git clone https://github.com/your-username/sales-analysis.git
cd sales-analysis

# 2. Download dataset from Kaggle and place at data/superstore_sales.csv

# 3. Install Python dependencies
pip install -r requirements.txt

# 4. Run Data Quality checks
python data_engineering/quality/data_quality_checks.py

# 5. Load data into MySQL (run star_schema_ddl.sql in MySQL Workbench)
#    Then run SQL files in sql/ folder

# 6. Open EDA notebook
jupyter notebook notebooks/eda_superstore.ipynb

# 7. (Optional) Set up Airflow and copy DAG
cp data_engineering/airflow/dags/sales_pipeline_dag.py ~/airflow/dags/
airflow dags trigger superstore_sales_pipeline
```

Kavish Rathod
B.E. Electronics & Telecommunications | Aspiring Data Analyst / Data Engineer
LinkedIn • GitHub
MIT License (see LICENSE)