The cornerstone of the project is Sparkify, a startup that streams music to its customers. The company wants to move its processes and the respective data from its song database onto a professional cloud solution. Key facts are the use of an AWS S3 bucket holding JSON-based log files: the JSON files capture the users' activity on the app as well as the metadata of the songs.
As a data engineer, the main tasks are:
- building an ETL pipeline that extracts their data from S3
- staging the data in AWS Redshift
- transforming the data into a set of dimensional tables for analytical reasons
- finding insights into what kind of songs the customers are listening to
The ETL pipeline consists of three Python files, which are described below.
The primary file is `etl.py`, the ETL pipeline builder. Its two main functions are:
- `load_staging_tables`: loads the raw data from the AWS S3 bucket into the staging tables hosted in Redshift
- `insert_tables`: runs the transformation process from the staging tables to the dimensional tables needed for the analytics
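A minimal sketch of how these two `etl.py` functions might look. The query list names `copy_table_queries` and `insert_table_queries` are assumptions (in the real project they would come from `sql_queries.py`); here they are placeholders so the loop structure is visible:

```python
# Hypothetical sketch of etl.py's core loop. The query lists are
# placeholders; in the project they would be imported from sql_queries.py.
copy_table_queries = [
    "COPY staging_events FROM 's3://...'",  # placeholder COPY statements
    "COPY staging_songs FROM 's3://...'",
]
insert_table_queries = [
    "INSERT INTO songplays ...",  # placeholder INSERT ... SELECT statements
    "INSERT INTO users ...",
]

def load_staging_tables(cur, conn):
    # Run each COPY statement to pull the raw S3 data into the staging tables
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()

def insert_tables(cur, conn):
    # Transform the staging tables into the fact and dimension tables
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()
```

In the real script, `cur` and `conn` would be a `psycopg2` cursor and connection opened against the Redshift cluster.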
`create_tables.py` focuses on the schema, creating the staging, fact, and dimension tables. Its two main functions are:
- `drop_tables`: drops any existing tables
- `create_tables`: creates the staging, fact, and dimension tables
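These two functions follow the same execute-and-commit pattern. A sketch, with the list names `drop_table_queries` and `create_table_queries` assumed by convention rather than taken from the source:

```python
# Hypothetical sketch of create_tables.py. The list names are assumptions;
# they would be assembled in sql_queries.py from the *_table_drop and
# *_table_create statements.
drop_table_queries = [
    "DROP TABLE IF EXISTS songplays",  # placeholder: one per table
    "DROP TABLE IF EXISTS users",
]
create_table_queries = [
    "CREATE TABLE songplays (songplay_id INT)",  # placeholder DDL
    "CREATE TABLE users (user_id INT)",
]

def drop_tables(cur, conn):
    # Remove existing tables so the schema can be rebuilt from scratch
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()

def create_tables(cur, conn):
    # Create the staging, fact, and dimension tables
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()
```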
`sql_queries.py` is a collection of the SQL query statements used by the aforementioned files `create_tables.py` and `etl.py`. The queries follow four naming patterns:
- `*_table_drop`
- `*_table_create`
- `staging_*_copy`
- `*_table_insert`
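As an illustration of the four patterns, the entries for the `users` table and the event staging copy might look like this. The S3 paths, IAM role placeholder, and region are hypothetical, not the project's actual values:

```python
# Hypothetical excerpts from sql_queries.py showing one query per
# naming pattern. Bucket paths, the IAM role ARN, and the region
# are placeholders.
user_table_drop = "DROP TABLE IF EXISTS users"

user_table_create = """
CREATE TABLE users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
)
"""

staging_events_copy = """
COPY staging_events FROM 's3://<bucket>/log_data'
CREDENTIALS 'aws_iam_role=<ARN>'
REGION '<region>'
FORMAT AS JSON 's3://<bucket>/log_json_path.json'
"""

user_table_insert = """
INSERT INTO users (user_id, first_name, last_name, gender, level)
SELECT DISTINCT userId, firstName, lastName, gender, level
FROM staging_events
WHERE userId IS NOT NULL
"""
```

The `SELECT DISTINCT` in the insert reflects that staging rows repeat users across sessions, while the dimension table should hold each user once.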
The database schema consists of staging tables, a fact table, and multiple dimension tables. In detail, the structure is as follows:
- `staging_events`: artist VARCHAR, auth VARCHAR, firstName VARCHAR, gender CHAR(1), itemInSession INT, lastName VARCHAR, length FLOAT, level VARCHAR, location TEXT, method VARCHAR, page VARCHAR, registration VARCHAR, sessionId INT, song VARCHAR, status INT, ts BIGINT, userAgent TEXT, userId INT
- `staging_songs`: artist_id VARCHAR, artist_latitude FLOAT, artist_location TEXT, artist_longitude FLOAT, artist_name VARCHAR, duration FLOAT, num_songs INT, song_id VARCHAR, title VARCHAR, year INT
- `songplays`: songplay_id INT IDENTITY(0,1), start_time TIMESTAMP, user_id INT, level VARCHAR, song_id VARCHAR, artist_id VARCHAR, session_id INT, location TEXT, user_agent TEXT
- `users`: user_id INT, first_name VARCHAR, last_name VARCHAR, gender CHAR(1), level VARCHAR
- `songs`: song_id VARCHAR, title VARCHAR, artist_id VARCHAR, year INT, duration FLOAT
- `artists`: artist_id VARCHAR, name VARCHAR, location TEXT, latitude FLOAT, longitude FLOAT
- `time`: start_time TIMESTAMP, hour INT, day INT, week INT, month INT, year INT, weekday VARCHAR
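As an example of how this schema maps to DDL, the fact table's create statement could look like the following sketch; the `NOT NULL` constraints are assumptions, while the columns and the Redshift `IDENTITY(0,1)` surrogate key come from the schema above:

```python
# Hypothetical *_table_create entry for the fact table. Redshift's
# IDENTITY(0,1) auto-generates the songplay_id surrogate key.
songplay_table_create = """
CREATE TABLE songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    TEXT,
    user_agent  TEXT
)
"""
```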
The following ER diagram gives an impression of how the tables are connected: https://github.com/KCvW/DataEng/blob/main/ER%20Staging%20Tables.jpeg