Skip to content

DhruvB100/SQL-Server-ETL-Pipeline

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SQL Server ETL Pipeline & Automated Reporting Dashboard

SQL Server Python Power BI

Stack: T-SQL · SQL Server / SSMS · Python · Excel (PowerQuery / PivotTables) · Power BI


What This Project Does

This project builds a complete data pipeline starting from raw CSV files and ending with automated reports. Here's the short version:

  1. Raw CSVs (accounts, transactions, fees) are ingested by a Python ETL script
  2. The ETL cleanses the data (removes duplicates, handles nulls, validates values)
  3. Clean data is loaded into a SQL Server database with enforced schema constraints
  4. SQL stored procedures and views calculate monthly KPIs (balances, variances, fees)
  5. Excel connects via PowerQuery for live-refresh pivot reports
  6. Power BI provides a visual dashboard for trend and variance analysis
  7. A Python script auto-generates a PowerPoint summary, cutting manual reporting by ~80%

Project Structure

SQL-Server-ETL-Pipeline/
│
├── data/
│   ├── raw/                    # Original CSV files (source of truth)
│   │   ├── accounts.csv        # 38 rows (incl. duplicates + nulls for testing)
│   │   ├── transactions.csv    # 185 rows, Jan–Dec 2024
│   │   └── fees.csv            # 132 rows, monthly and ad-hoc fees
│   └── processed/              # Cleaned CSVs (generated, not in repo)
│
├── sql/
│   ├── 01_create_database.sql  # Creates FinancialReportingDB
│   ├── 02_create_tables.sql    # All tables with constraints + FK relationships
│   ├── 03_stored_procedures.sql # Monthly KPI rollup procedures
│   ├── 04_views.sql            # Reporting views (used by Excel + Power BI)
│   └── 05_sample_queries.sql   # Validation and spot-check queries
│
├── python/
│   ├── etl_pipeline.py         # Main ETL script: CSV → SQL Server
│   ├── generate_report.py      # Auto-generates monthly PowerPoint report
│   └── requirements.txt        # Python dependencies
│
├── docs/
│   ├── data_dictionary.md      # Every table, column, constraint documented
│   ├── data_flow.md            # End-to-end data flow and design decisions
│   ├── excel_powerquery_setup.md # How to connect Excel + M code for queries
│   └── powerbi_setup.md        # Power BI dashboard setup + DAX measures
│
├── output/                     # Generated PowerPoint reports (not committed)
└── README.md

Technologies Used

Technology Purpose
SQL Server Express / SSMS Database engine, query development
T-SQL DDL (tables, constraints), stored procedures, views
Python 3.11 ETL pipeline, PowerPoint generation
pandas Data cleansing and transformation
pyodbc SQL Server connectivity from Python
python-pptx Automated PowerPoint report generation
Excel (PowerQuery) Live-refresh pivot reports connected to SQL Server
Power BI Desktop Interactive dashboard for trend/variance visualization

Setup Instructions

Prerequisites

  • SQL Server Express (free): download from Microsoft
  • SQL Server Management Studio (SSMS)
  • Python 3.10+
  • ODBC Driver 17 for SQL Server
  • Microsoft Excel 2016+
  • Power BI Desktop (free from Microsoft Store)

1. Set Up the Database

Open SSMS, connect to localhost\SQLEXPRESS, and run the SQL scripts in order:

-- In SSMS, open and run each script:
sql/01_create_database.sql    -- creates the database
sql/02_create_tables.sql      -- creates tables with constraints
sql/03_stored_procedures.sql  -- creates KPI stored procedures
sql/04_views.sql              -- creates reporting views

2. Install Python Dependencies

pip install -r python/requirements.txt

3. Run the ETL Pipeline

python python/etl_pipeline.py

Expected output:

===================================================
  ETL Pipeline - FinancialReportingDB
  Started: 2025-02-XX XX:XX:XX
===================================================

Connecting to SQL Server...
Connected!

--- Loading Accounts ---
  Raw rows: 38
  Exact duplicate rows removed: 5
  Duplicate account_ids removed (kept first): 0
  Rows skipped (missing required fields): 3
  Clean rows ready to insert: 30
  Result: 30 inserted, 3 skipped (cleansing), 0 failed (DB error)

--- Loading Transactions ---
  ...

--- Loading Fees ---
  ...

After the ETL runs, you can verify with:

-- In SSMS:
EXEC dbo.usp_GetMonthlyKPISummary @Year = 2024, @Month = 12;
SELECT * FROM dbo.vw_ETLRunHistory;

4. Generate the PowerPoint Report

python python/generate_report.py

Output will be saved to output/monthly_report_2024_12.pptx.

To change the report period, edit these lines in generate_report.py:

REPORT_YEAR  = 2024
REPORT_MONTH = 12

5. Connect Excel and Power BI

See the docs folder:


Key Design Decisions

Why store procedures over inline SQL in Python? I wanted to keep business logic in the database layer so both Excel and Python report scripts produce identical results. If the KPI formula changes, I update one stored procedure and both reports automatically use the new logic.

Why allow nullable balance_after in Transactions? The source data doesn't always include this field. Rather than reject those rows, I accept the transactions and note that balance can be reconstructed as a running total from the amount column if needed.

Why row-by-row insert instead of bulk insert? For a project this size (< 200 rows) it doesn't matter. In a real production system I'd use BULK INSERT or the pandas to_sql() method with a SQLAlchemy engine for much better performance.


Sample Queries

-- Monthly KPI summary for a specific month
EXEC dbo.usp_GetMonthlyKPISummary @Year = 2024, @Month = 6;

-- Month-over-month balance variance for 2024
EXEC dbo.usp_GetBalanceVariance @Year = 2024;

-- Accounts currently in overdraft
SELECT * FROM dbo.vw_AccountBalanceSnapshot WHERE is_overdraft = 1;

-- Total fee revenue by type (2024)
SELECT fee_type, SUM(collected_amount) AS total_collected
FROM dbo.vw_MonthlyFeeSummary
WHERE fee_year = 2024
GROUP BY fee_type
ORDER BY total_collected DESC;

Data Quality Notes

The raw CSV data intentionally includes some quality issues to demonstrate the ETL cleansing process:

Issue Count How Handled
Duplicate account rows 5 Removed by ETL
Accounts missing required fields 3 Excluded, logged
Accounts with null optional fields 2 Defaulted to 0
Overdraft accounts (negative balance) 2 Valid — allowed by schema

All cleansing decisions are logged to dbo.ETL_Log and documented in docs/data_flow.md.

About

End-to-end financial data pipeline: Python ETL (CSV → SQL Server) → T-SQL stored procedures & views → live Excel/Power BI dashboards → auto-generated PowerPoint reports via python-pptx.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors