Stack: T-SQL · SQL Server / SSMS · Python · Excel (PowerQuery / PivotTables) · Power BI
This project builds a complete data pipeline starting from raw CSV files and ending with automated reports. Here's the short version:
- Raw CSVs (accounts, transactions, fees) are ingested by a Python ETL script
- The ETL cleanses the data (removes duplicates, handles nulls, validates values)
- Clean data is loaded into a SQL Server database with enforced schema constraints
- SQL stored procedures and views calculate monthly KPIs (balances, variances, fees)
- Excel connects via PowerQuery for live-refresh pivot reports
- Power BI provides a visual dashboard for trend and variance analysis
- A Python script auto-generates a PowerPoint summary, cutting manual reporting by ~80%
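The cleansing step above can be sketched with pandas. This is a minimal illustration, not the actual `etl_pipeline.py` logic; the column names (`account_id`, `account_name`, `balance`) are assumptions for the example:

```python
import pandas as pd

def cleanse_accounts(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleansing pass: dedupe, drop incomplete rows, default optionals."""
    df = df.drop_duplicates()                               # remove exact duplicate rows
    df = df.drop_duplicates(subset="account_id")            # keep first of duplicate IDs
    df = df.dropna(subset=["account_id", "account_name"])   # required fields must be present
    df = df.fillna({"balance": 0})                          # default optional fields to 0
    return df

raw = pd.DataFrame({
    "account_id":   [1, 1, 2, None],
    "account_name": ["A", "A", "B", "C"],
    "balance":      [100.0, 100.0, None, 50.0],
})
clean = cleanse_accounts(raw)
print(len(clean))  # 2 — one exact duplicate and one incomplete row removed
```

The real pipeline logs each of these decisions (rows removed, defaults applied) rather than dropping them silently.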
SQL-Server-ETL-Pipeline/
│
├── data/
│ ├── raw/ # Original CSV files (source of truth)
│ │ ├── accounts.csv # 38 rows (incl. duplicates + nulls for testing)
│ │ ├── transactions.csv # 185 rows, Jan–Dec 2024
│ │ └── fees.csv # 132 rows, monthly and ad-hoc fees
│ └── processed/ # Cleaned CSVs (generated, not in repo)
│
├── sql/
│ ├── 01_create_database.sql # Creates FinancialReportingDB
│ ├── 02_create_tables.sql # All tables with constraints + FK relationships
│ ├── 03_stored_procedures.sql # Monthly KPI rollup procedures
│ ├── 04_views.sql # Reporting views (used by Excel + Power BI)
│ └── 05_sample_queries.sql # Validation and spot-check queries
│
├── python/
│ ├── etl_pipeline.py # Main ETL script: CSV → SQL Server
│ ├── generate_report.py # Auto-generates monthly PowerPoint report
│ └── requirements.txt # Python dependencies
│
├── docs/
│ ├── data_dictionary.md # Every table, column, constraint documented
│ ├── data_flow.md # End-to-end data flow and design decisions
│ ├── excel_powerquery_setup.md # How to connect Excel + M code for queries
│ └── powerbi_setup.md # Power BI dashboard setup + DAX measures
│
├── output/ # Generated PowerPoint reports (not committed)
└── README.md
| Technology | Purpose |
|---|---|
| SQL Server Express / SSMS | Database engine, query development |
| T-SQL | DDL (tables, constraints), stored procedures, views |
| Python 3.11 | ETL pipeline, PowerPoint generation |
| pandas | Data cleansing and transformation |
| pyodbc | SQL Server connectivity from Python |
| python-pptx | Automated PowerPoint report generation |
| Excel (PowerQuery) | Live-refresh pivot reports connected to SQL Server |
| Power BI Desktop | Interactive dashboard for trend/variance visualization |
- SQL Server Express (free): download from Microsoft
- SQL Server Management Studio (SSMS)
- Python 3.10+
- ODBC Driver 17 for SQL Server
- Microsoft Excel 2016+
- Power BI Desktop (free from Microsoft Store)
Open SSMS, connect to localhost\SQLEXPRESS, and run the SQL scripts in order:
-- In SSMS, open and run each script:
sql/01_create_database.sql -- creates the database
sql/02_create_tables.sql -- creates tables with constraints
sql/03_stored_procedures.sql -- creates KPI stored procedures
sql/04_views.sql            -- creates reporting views

Then install the Python dependencies and run the ETL pipeline:

pip install -r python/requirements.txt
python python/etl_pipeline.py

Expected output:
===================================================
ETL Pipeline - FinancialReportingDB
Started: 2025-02-XX XX:XX:XX
===================================================
Connecting to SQL Server...
Connected!
--- Loading Accounts ---
Raw rows: 38
Exact duplicate rows removed: 5
Duplicate account_ids removed (kept first): 0
Rows skipped (missing required fields): 3
Clean rows ready to insert: 30
Result: 30 inserted, 3 skipped (cleansing), 0 failed (DB error)
--- Loading Transactions ---
...
--- Loading Fees ---
...
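For reference, the "Connecting to SQL Server..." step uses an ODBC connection string like the one below. This is a sketch assuming Windows authentication against the local Express instance; adjust `SERVER` and the auth fields for your environment:

```python
# Connection string for the local SQL Server Express instance used in this project.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    r"SERVER=localhost\SQLEXPRESS;"
    "DATABASE=FinancialReportingDB;"
    "Trusted_Connection=yes;"
)
# The ETL then connects with: pyodbc.connect(conn_str)
```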
After the ETL runs, you can verify with:
-- In SSMS:
EXEC dbo.usp_GetMonthlyKPISummary @Year = 2024, @Month = 12;
SELECT * FROM dbo.vw_ETLRunHistory;

Then generate the PowerPoint report:

python python/generate_report.py

The report is saved to output/monthly_report_2024_12.pptx.
To change the report period, edit these lines in generate_report.py:
REPORT_YEAR = 2024
REPORT_MONTH = 12

For detailed setup guides and design notes, see the docs folder.
Why stored procedures over inline SQL in Python?
I wanted to keep business logic in the database layer so the Excel and Python
report scripts produce identical results. If a KPI formula changes, I update one
stored procedure and both reports pick up the new logic automatically.
Why allow nullable balance_after in Transactions?
The source data doesn't always include this field. Rather than reject those rows,
I accept the transactions and note that balance can be reconstructed as a running
total from the amount column if needed.
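That reconstruction is just a per-account running sum of amount in date order. A minimal sketch, assuming each account starts from a zero opening balance (column names are illustrative):

```python
import pandas as pd

tx = pd.DataFrame({
    "account_id": [1, 1, 2, 1],
    "txn_date":   ["2024-01-05", "2024-01-10", "2024-01-07", "2024-02-01"],
    "amount":     [100.0, -30.0, 50.0, 20.0],
})

# Sort chronologically within each account, then cumulatively sum the amounts.
tx = tx.sort_values(["account_id", "txn_date"])
tx["balance_after"] = tx.groupby("account_id")["amount"].cumsum()
print(tx["balance_after"].tolist())  # [100.0, 70.0, 90.0, 50.0]
```

A real reconstruction would add each account's opening balance to the cumulative sum.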
Why row-by-row insert instead of bulk insert?
For a project this size (< 200 rows) it doesn't matter. In a real production
system I'd use BULK INSERT or the pandas to_sql() method with a SQLAlchemy
engine for much better performance.
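A sketch of that alternative, shown here against an in-memory SQLite database so it runs anywhere; in production the connection would instead be a SQLAlchemy engine built from an mssql+pyodbc URL:

```python
import sqlite3

import pandas as pd

df = pd.DataFrame({"account_id": [1, 2], "balance": [100.0, -25.0]})

# to_sql writes the whole frame in one call instead of row-by-row INSERTs.
conn = sqlite3.connect(":memory:")
df.to_sql("Accounts", conn, index=False, if_exists="append")

count = conn.execute("SELECT COUNT(*) FROM Accounts").fetchone()[0]
print(count)  # 2
```

For larger loads, `to_sql` also accepts a `chunksize` argument to batch the inserts.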
-- Monthly KPI summary for a specific month
EXEC dbo.usp_GetMonthlyKPISummary @Year = 2024, @Month = 6;
-- Month-over-month balance variance for 2024
EXEC dbo.usp_GetBalanceVariance @Year = 2024;
-- Accounts currently in overdraft
SELECT * FROM dbo.vw_AccountBalanceSnapshot WHERE is_overdraft = 1;
-- Total fee revenue by type (2024)
SELECT fee_type, SUM(collected_amount) AS total_collected
FROM dbo.vw_MonthlyFeeSummary
WHERE fee_year = 2024
GROUP BY fee_type
ORDER BY total_collected DESC;

The raw CSV data intentionally includes quality issues to demonstrate the ETL cleansing process:
| Issue | Count | How Handled |
|---|---|---|
| Duplicate account rows | 5 | Removed by ETL |
| Accounts missing required fields | 3 | Excluded, logged |
| Accounts with null optional fields | 2 | Defaulted to 0 |
| Overdraft accounts (negative balance) | 2 | Valid — allowed by schema |
All cleansing decisions are logged to dbo.ETL_Log and documented in
docs/data_flow.md.